Continuing our series of interviews and viewpoints from speakers at Hadoop Summit Europe, this interview is with Claudio Martella, a researcher at VU University Amsterdam, who will be speaking on “Apache Giraph: large-scale graph processing on Hadoop” as part of the ‘Committer’ track on Day 2, at 4:20pm.
HS: Tell us a little about your session.
Graphs are very simple and neat data structures, but they can also be “tough”. Information is distributed across millions of entities and all their relationships, and you can only get a good grasp of this information if you look at the whole picture. This often means large and expensive computations.
Large-scale graph processing for graph analytics has been a hot topic for a while already, particularly in academia. In industry and the open-source community, however, graph databases have until now been the main tools for working with graphs.
We are starting to see more and more tools gaining momentum for running analytical workloads on graphs. I’m thinking of course of Apache Giraph, but also GraphLab and Hama. They fill quite a different spot in the graph-processing space compared to graph databases.
Whereas graph databases let you run queries on small portions of a graph and return results within milliseconds (mostly for transactional workloads), systems like Giraph are designed to run large computations on massive graphs across hundreds of machines.
In my session I will outline what makes graph computations distinctive, and how Giraph was designed to execute them efficiently. Giraph has reached a good state of maturity and includes a number of interesting optimisations and nifty features, including integrations with various data stores. I will show how Giraph makes large-scale graph processing easy, and how it can fit within Hadoop-based data processing pipelines.
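To give a flavour of the “think like a vertex” model that Giraph implements, here is a toy, single-machine Python sketch of single-source shortest paths written as synchronous supersteps: each vertex reads its incoming messages, possibly updates its value, and sends messages to its neighbours, halting when no messages remain. The names and structure here are purely illustrative, not Giraph’s actual Java API (which runs distributed on Hadoop).

```python
INF = float("inf")

def shortest_paths(edges, source):
    """Vertex-centric single-source shortest paths.

    edges: {vertex: [(neighbor, weight), ...]} adjacency lists.
    Returns {vertex: distance from source}.
    """
    # Each vertex holds a value (its current best-known distance)
    # and an inbox of messages received in the previous superstep.
    value = {v: INF for v in edges}
    inbox = {v: [] for v in edges}
    inbox[source] = [0]          # kick-start the source vertex
    active = set(edges)          # vertices that have not "voted to halt"

    # Supersteps: every active vertex processes its messages, updates
    # its value if it learned a shorter distance, and propagates the
    # improvement along its out-edges.
    while active:
        outbox = {v: [] for v in edges}
        for v in active:
            best = min([value[v]] + inbox[v])
            if best < value[v]:
                value[v] = best
                for (n, w) in edges[v]:
                    outbox[n].append(best + w)
        inbox = outbox
        # A halted vertex is reactivated only by an incoming message.
        active = {v for v in edges if inbox[v]}
    return value
```

The point of the paradigm is that the per-vertex logic above stays the same whether the graph has ten vertices or ten billion; the framework takes care of partitioning the vertices across workers and delivering messages between supersteps.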
HS: What made you want to talk about Graph Processing?
I’ve been working with graphs, graph databases, and graph analytics for a while now, both in my research and in my “night-life” as an open-source developer. I’ve been involved with Giraph as a PMC member and committer since its incubation in the Apache Foundation. I believe the landscape of tools for managing graph data is taking clearer shape every day, and Giraph is definitely a big player in this ecosystem. I think it’s time to make sure that practitioners understand the advantage of looking at their data as graphs (or should I say Big Graphs?) and that tools are out there, many of them production-ready, to help process Big Graphs as part of their workloads. Being part of the Giraph team, it comes naturally to me to talk about this particular instrument, but I believe the message is broader. In fact, many other tools out there, like Hama and GraphLab, share much with the paradigm implemented by Giraph.
HS: What sessions are you most interested in seeing?
Quite frankly, I’m very interested in the real-time side of Big Data. I think it is pushing toward a new paradigm shift, the first since the original introduction of offline processing with MapReduce. In my opinion, that’s where some of the current and next challenges are. This means I’ll attend sessions about Storm, but not only. We are starting to see people looking at Giraph from a Data Mining perspective. I am also involved in an open-source project, Okapi, that provides a Machine Learning library of graph algorithms. You can think of it as Apache Mahout for graphs. So I’ll be looking at the needs and limitations of current approaches in that field, and where we can help with Giraph. I’m thinking in particular of iterative computations, where Giraph shines.
HS: Thanks! Good luck with your session, and we’ll see you in Amsterdam.
Claudio Martella is a fetishist of graphs. He is a researcher at the Large-scale Distributed Systems group of the VU University Amsterdam. His topics of interest are large-scale distributed systems, graph processing, and complex networks. He has been a contributor to Apache Giraph since its incubation, where he is a committer and a member of the PPMC.