Continuing our series of interviews and viewpoints from speakers at Hadoop Summit Europe, this interview is with Florian Douetteau, CEO, Dataiku, who will be speaking on “Semi-Supervised Learning on User Web Sessions with Hadoop” as part of the ‘Data Science’ track on Day 2, at 11:50am. See the full schedule here.
HS: Tell us a little about your session.
I will talk about a practical approach to understanding customer journeys on websites, using machine learning on Hadoop.
At Dataiku we’ve encountered several use cases where a website editor wanted to optimise a large, content-oriented website. Imagine journeys on a large online newspaper. Some users come every day, others arrive through Google News, and yet others come from Facebook. Some users look specifically for one topic, while others are interested in many different sections. And finally, some will spend hours commenting on articles! Because journeys are so diverse, it’s complex to optimise and run A/B tests for a particular metric, such as engagement, time spent, or advertising revenue.
We will present a semi-supervised learning approach that helps the website editor find their way through the maze of web logs. In this approach, the editor can discover clusters of similar journeys, tag them, and then follow a specific metric for a specific kind of journey. This way the editor can start monitoring and optimising a particular sub-population, e.g. “new users who come from social websites and like to comment a little”.
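The cluster-then-tag idea can be sketched in a few lines of plain Python. This is a minimal, stdlib-only illustration, not Dataiku’s actual method: the session features, the editor tags, and the naive k-means initialisation are all invented for the example. Sessions are clustered, the editor hand-tags a couple of them, and each cluster inherits the majority tag of its tagged members.

```python
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain k-means over lists of floats (naive init with the first k points)."""
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        for i, p in enumerate(points):
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(col) for col in zip(*members)]
    return assign

# Hypothetical session features: [pages_viewed, comments_posted, came_from_social]
sessions = [
    [1, 0, 0], [2, 0, 0], [1, 1, 0],   # casual readers
    [8, 5, 1], [9, 6, 1], [7, 4, 1],   # heavy social commenters
]
assign = kmeans(sessions, k=2)

# The editor tags a couple of sessions by hand (session index -> tag)...
editor_tags = {0: "casual", 3: "social commenter"}

# ...and each cluster inherits the majority tag of its tagged members.
cluster_tags = {}
for c in set(assign):
    votes = Counter(tag for i, tag in editor_tags.items() if assign[i] == c)
    cluster_tags[c] = votes.most_common(1)[0][0] if votes else "untagged"

labels = [cluster_tags[c] for c in assign]
```

With two hand-tagged sessions out of six, every session ends up labelled, which is the semi-supervised payoff: the editor tags a handful of journeys and can then follow a metric across the whole sub-population.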
HS: What made you want to talk about Machine Learning for Web Journeys?
At Dataiku we are very interested in use cases where it’s not just about optimising one specific business metric or area. Recommender systems and advertising display optimisation are complex problems subject to continuous research, but, on the other hand, there are a lot of other unique problems left out there, untackled! Machine learning with Hadoop can help deliver unique solutions to each of these unique problems.
From a technical point of view, it’s quite interesting to realise the amount of work in data integration, preparation, and extraction required to make sense of web log data. In the solution, different languages are used depending on the stage of the analytics pipeline. Start with some Pig, continue with Hive, finish with Python (even if this sequence of animals makes little sense from the perspective of the biological food chain!). The Hadoop ecosystem today is like a Tower of Babel. And the tower will continue to grow with the deployment of YARN, which Hortonworks shipped starting with HDP 2.0. At Dataiku we’ve built a Studio that serves as a Swiss Army knife where Hive, Python, and R are the blades. It helps connect the growing Hadoop ecosystem with existing statistical and machine learning tools.
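To make the preparation work concrete, here is a minimal, stdlib-only sketch of the kind of step that typically sits at the Python end of such a pipeline, after Pig or Hive has extracted the raw events: grouping page-view events into sessions by splitting on a 30-minute inactivity gap. The log format and field names are invented for illustration.

```python
from itertools import groupby

SESSION_GAP = 30 * 60  # seconds of inactivity that ends a session

def sessionize(events):
    """events: iterable of (user_id, unix_timestamp, url), in any order.
    Returns a list of (user_id, [urls]) sessions."""
    sessions = []
    ordered = sorted(events)  # sorts by user, then timestamp
    for user, user_events in groupby(ordered, key=lambda e: e[0]):
        current, last_ts = [], None
        for _, ts, url in user_events:
            # A long silence closes the current session and opens a new one.
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                sessions.append((user, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((user, current))
    return sessions

# Hypothetical log extract (user, timestamp, url):
log = [
    ("u1", 0,    "/home"),
    ("u1", 120,  "/article/42"),
    ("u1", 4000, "/home"),        # more than 30 minutes later: a new session
    ("u2", 50,   "/sports"),
]
```

On this toy log, `sessionize(log)` yields two sessions for `u1` (the third page view arrives after the gap) and one for `u2`; at scale, the same grouping logic would run over the output of the Pig/Hive stages rather than an in-memory list.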
HS: What other sessions are you most interested in seeing?
- “Finding Allelic Frequencies Using MapReduce/Hadoop” by Mahmoud Parsian
- “Leveraging an All-Access Pass to BI Tools on Hadoop with Cascading Lingual” by Chris Wendel
HS: Thanks! Good luck with your session, and we’ll see you in Amsterdam.
Florian is Dataiku’s Chief Executive Officer. Florian started his career at Exalead, an innovative search engine technology company. There, he led an R&D team of 50 brilliant data geeks, until the company was bought by Dassault Systèmes in 2010 for $150 million. Florian was then CTO at IsCool, a European leader in social gaming, where he managed game analytics and one of the biggest European cloud setups. Florian has also served as a freelance lead data scientist at various companies, such as Criteo, the European advertising leader. Florian speaks regularly at technical groups such as the Open World Forum and the Paris Java User Group.