Today marks the inaugural blog post featuring interviews with some of the upcoming presenters for Hadoop World Europe. Kicking things off are Clemens Neudecker of the National Library of the Netherlands and Sven Schlarb of the Austrian National Library. Clemens and Sven will be presenting The Elephant in the Library, a unique look at the role Apache Hadoop is playing in the mass digitization of cultural heritage taking place in the library, museum and archives sector.
Tell us about your presentation.
Clemens: We will first introduce the activities of the libraries, museums and archives sector in the area of digitization of cultural heritage: since mass digitization started some years ago, millions of pages have been digitized each year. We then zoom in on the particular scalability challenges linked to digitization, such as the size and number, but also the complexity and diversity, of the digital objects we have to deal with. A short explanation of how the EU research project SCAPE contributes to addressing some of these challenges in the digital preservation realm will then bring us to Hadoop and how it is becoming of central importance to today’s digital library ecosystem. Finally, we will give some examples of where and how we aim to use Hadoop in addressing the scalability challenges, and some initial results.
Sven: Libraries are facing fundamental changes in their day-to-day business due to the growing amount of digital information they have to manage. Hadoop plays a key role in this, and we will give you an overview of what scalability means in this context and share insights into some typical library MapReduce data processing scenarios.
What do you expect will be the single biggest takeaway of your presentation for attendees?
Clemens: We like to think of the challenges we face and the solutions we are developing as a common issue in which memory organisations are only the forerunners. That’s why we immediately publish all our research on the web at http://www.scape-project.eu/downloads and maintain a public GitHub repository for the sources we develop: https://github.com/openplanets. If other people face similar issues and find that our research benefits their activities – that would be truly great! In turn, that might even lead to some patches and contributions to our software frameworks, or new projects and initiatives being spawned. But even just raising awareness and understanding of our aims and challenges – and of how and where Hadoop can help with these – would already be an excellent outcome.
Sven: Managing and preserving print publishing was traditionally the libraries’ main task. We will make clear that the library’s shift to the digital realm is absolutely relevant to all of us. People will see that open source solutions like Apache Hadoop and new business models based on open source software are real opportunities for developers and solution providers for the public sector.
Tell us about your current role and how you interact with Hadoop.
Clemens: Currently I work as a Technical Coordinator for EU projects in the Innovation & Development department of the National Library of the Netherlands. Our research mainly focuses on three things: making our (digital) content more accessible, making sure it remains accessible in the (very) long term, and experimenting with new software tools that can be used to work with our data. In particular, our digital archives will soon outgrow the terabyte scale, so we have been looking more and more into what Hadoop can bring to the table for addressing challenges linked to scaling up. At this point we are still in a phase where we experiment with Hadoop a lot to see its aptitude for several computing tasks, but I expect working with Hadoop will be pretty standard in digital libraries within the next five years.
Sven: I work as a researcher and software developer at the Austrian National Library. Together with people from other European libraries, I am developing open source solutions in the area of digitization and long-term preservation.
What are you most looking forward to at Hadoop Summit?
Clemens: Definitely meeting with the community of Hadoop practitioners, listening to some great minds and inspiring examples of Hadoop use, and just generally developing a closer relationship with the Hadoop world. We also love to share our ideas with others and hope to get a lot of valuable feedback on what we’re doing.
Sven: I am simply looking forward to diving as deep as possible into the Hadoop world for the two days of the conference!
What other presentations are you most looking forward to attending?
Clemens: There is really a great number of interesting talks at this Hadoop Summit, which makes it very hard to choose. I’m personally most looking forward to hearing about the future of Hadoop from some of the big names like Arun Murthy, but I’m also really keen on the talks on industry use of Hadoop, such as those from Twitter, LinkedIn and Facebook.
Sven: There is an amazing variety of industry practitioners, scientists, and Hadoop experts. Because of the setup of our Hadoop environment at the Austrian National Library, I have a special interest in the Pig and Hive talks. I’m also interested in some of the talks from the Cloudera presenters. It will be difficult to choose between so many exciting talks.