Continuing our series of interviews and viewpoints from speakers at Hadoop Summit Europe, this interview is with Andrew Wang, Software Engineer at Cloudera, who, along with Colin McCabe, will be speaking on “In-memory caching in HDFS: Lower latency, same great taste” as part of the ‘Committer’ track on Day 2 at 11:50am. You can see the detailed schedule here.
HS: Tell us a little about your session.
Colin and I will be talking about our recent work to add support for centralized cache management to HDFS. This project was motivated by issues we’ve seen crop up at customer sites related to data hotspots and to workloads that mix batch and interactive queries. Hotspots can lead to load imbalance and poor query performance, while mixed workloads often struggle with interactive query latency. In addition, a number of performance bottlenecks in the HDFS read path prevented applications from achieving true memory-speed reads on cached data.
We’ll talk about how we addressed these issues with the design and implementation of centralized cache management in HDFS, present experimental results for a number of MapReduce and Impala workloads, and discuss some current limitations and how future work can address them.
HS: What made you want to talk about in-memory caching for HDFS?
The rapidly declining price of RAM makes in-memory computation a really interesting space. It’s now relatively affordable to have clusters with terabytes of aggregate memory, which is large enough for many interesting working sets. Figuring out how best to exploit cluster memory has the potential to unlock 10-100x performance improvements, which is enough to get any engineer excited.
HS: What sessions are you most interested in seeing?
I always enjoy the applications talks; it’s great to hear the real-life experiences of users of our software. I’m also looking forward to the Hive+Tez and standalone block manager talks, since these are both recent efforts that I’d like to know more about.
HS: Thanks! Good luck with your session, and we’ll see you in Amsterdam.
Andrew Wang is a software engineer at Cloudera on the HDFS team. Previously, he was a PhD student in the AMP Lab at UC Berkeley, where he worked on problems related to distributed systems and warehouse-scale computing. He is a committer on the Apache Hadoop project, and holds master’s and bachelor’s degrees in computer science from UC Berkeley and UVa, respectively.
Colin McCabe is a Platform Software Engineer at Cloudera, where he works on HDFS and related technologies. Prior to joining Cloudera, he worked on the Ceph Distributed Filesystem and the Linux kernel, among other things. He studied Computer Science and Computer Engineering at Carnegie Mellon.