Continuing our ad hoc series of Hadoop Summit speaker interviews, this short interview is with Paco Nathan, Director of Data Science at Concurrent. You can register for Hadoop Summit here, and see the detailed schedule here.
HS: Tell us about your session. What is it about? Why does it excite you?
PN: My session is “Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop”. It is about an open source project for running large-scale apps based on machine learning algorithms on Hadoop. Of course, the obvious approach would be to use Mahout or simply implement the required algorithms directly in Java. That works for people who are expert at writing Java-based MapReduce apps. However, these days the demand for leveraging Hadoop has extended to a much broader audience. People who are expert at analytics frameworks — e.g., SAS, R, MicroStrategy, etc. — are generally not expert Hadoop programmers. What excites me about this project is that people who work with, say, SAS or R can leverage their expertise, create predictive models in their preferred tools, then export those models as PMML and run them at scale on Hadoop … without writing a single line of code. Alternatively, if an app developer does need to be involved, the PMML integration requires merely 1–2 lines of code — which is trivial. No ERP needed for that. So the “Pattern” project is about staffing, and about tearing down the walls that separate teams, so that complex analytics workflows can be built by leveraging expertise across an organization.
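To make the “export as PMML, score anywhere” idea concrete: PMML is an XML vocabulary, so a model trained in R or SAS can be evaluated by a completely different runtime that merely parses the document. Here is a minimal, hypothetical sketch (not the actual Pattern implementation) that scores records against a tiny PMML-style decision tree using only Python’s standard library; the document, field names, and scores below are illustrative, and real PMML supports far more predicate types than this.

```python
# Sketch: evaluating a tiny PMML-style decision tree against input records.
# The XML below imitates PMML's TreeModel/Node/SimplePredicate structure;
# it is a simplified, hypothetical example, not a full PMML implementation.
import xml.etree.ElementTree as ET

PMML_DOC = """
<PMML version="4.1">
  <TreeModel functionName="classification">
    <Node score="reject">
      <True/>
      <Node score="approve">
        <SimplePredicate field="income" operator="greaterThan" value="50000"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

def matches(predicate, record):
    """Evaluate one predicate element against an input record."""
    if predicate.tag == "True":
        return True
    field = predicate.get("field")
    value = float(predicate.get("value"))
    ops = {
        "greaterThan": lambda a, b: a > b,
        "lessThan": lambda a, b: a < b,
    }
    return ops[predicate.get("operator")](record[field], value)

def score(node, record):
    """Walk the tree, descending into the first child whose predicate matches."""
    for child in node.findall("Node"):
        if matches(child[0], record):   # a Node's first child is its predicate
            return score(child, record)
    return node.get("score")

root = ET.fromstring(PMML_DOC).find("TreeModel/Node")
print(score(root, {"income": 72000.0}))  # approve
print(score(root, {"income": 30000.0}))  # reject
```

The point of the sketch is portability: nothing above depends on the tool that trained the model, which is exactly what lets Pattern run SAS- or R-built models on a Hadoop cluster.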
PMML is, formally speaking, a means of describing the business process of a predictive modeling workflow. That workflow abstraction is important. For example, PMML has excellent features for ensembles and other complex patterns encountered in the more competitive areas of industry. While you won’t find an enormous amount of ensemble support in SAS or R today, that’s changing. For example, the lessons of the Netflix Prize certainly have shown much about the power of ensembles in machine learning. We believe that it’s going to become an important area of innovation. Based on the Pattern project, the Cascading API now makes ensembles quite simple to implement.
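As a quick illustration of why ensembles matter here: combining several weak models by a simple majority vote can outperform any single member, and this combination step is exactly what an ensemble-capable workflow has to express. The toy “models” below are hypothetical lambdas standing in for independently trained classifiers; the function name is illustrative and not from any library.

```python
# Sketch: a majority-vote ensemble over several toy classifiers.
# Each "model" here is just a function from a record to a label,
# standing in for an independently trained classifier.
from collections import Counter

def majority_vote(models, record):
    """Score a record with each model and return the most common label."""
    votes = [model(record) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Three toy classifiers with different decision boundaries.
models = [
    lambda r: "spam" if r["links"] > 3 else "ham",
    lambda r: "spam" if r["caps_ratio"] > 0.5 else "ham",
    lambda r: "spam" if r["links"] > 1 and r["caps_ratio"] > 0.3 else "ham",
]

print(majority_vote(models, {"links": 5, "caps_ratio": 0.6}))  # spam
print(majority_vote(models, {"links": 0, "caps_ratio": 0.1}))  # ham
```

PMML can describe this pattern declaratively (via its segmentation features for model ensembles), which is what makes it possible to express Netflix-Prize-style ensembles as a single portable workflow description.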
Another important aspect of workflows is that the flow planner and the compiler have a full view of the entire workflow represented within an app. That’s a full view for debugging, optimization, exception handling, utilization monitoring, notifications, etc. In other words, all the components of an app — the data preparation, model scoring, loading results into production use, etc. — can be built into a single JAR file. Next time you’re troubleshooting a mission-critical app on a large cluster, using a large data set, ask yourself: would it be simpler to debug the problem if it were N different apps patched together, or if it were a single app?
So the staffing benefits, the operationalizing benefits, the support for leading edge research in machine learning — these are all the exciting parts of Pattern. I’m also grateful to be working alongside several brilliant collaborators from different companies, who’ve been contributing code into the Pattern open source project.
HS: What other sessions are you most excited about?
PN: “Should I be using Scalding or Scoobi or Scrunch?” by Christopher Severs at eBay and “Hadoop – Enabling Expanded Financial Market Analysis Techniques while Improving Investment Performance” by Kevin Coogan at AmalgaMood.
HS: What has changed in the world of Hadoop compared to last year?
PN: Frankly, I handle much of the customer use case analysis for Cascading, and over the past year we’ve seen an interesting shift. Whereas a couple of years ago most of the use cases tended to be in the “expected” areas of ecommerce — ETL, sessionization, marketing funnels, anti-fraud, etc. — more recently we’ve seen a shift into genomics, agronomics, geospatial, climatology, etc. These growth areas tend to be much more sophisticated in terms of applying machine learning, complex optimization problems, etc. Along with that, we’re seeing the “Internet of Things” grow, which also tends toward optimization problems. I think that the world of Big Data and Data Science will pivot accordingly, since these problems require very different kinds of math than, say, advertising. All for the best. I’m also seeing a huge uptick in the use of functional programming languages — Clojure, Scala, Python, etc. — primarily for the software engineering aspects of maintaining large, complex machine learning apps. That’s exciting. It’s especially pronounced when I lecture at universities, where “functional programming” and “optimization” are guaranteed hot topics.
HS: Thanks! And best of luck with your session.
Paco Nathan is the Director of Data Science at Concurrent in SF and a committer on the Cascading (http://cascading.org/) open source project. He has expertise in Hadoop, R, AWS, machine learning, and predictive analytics — with 25+ years in the tech industry overall. For the past 10+ years Paco has led innovative Data Science teams, building large-scale apps. He is the author of the O’Reilly book “Enterprise Data Workflows with Cascading”.