Print Program

Schedule

Keynotes

June 3
Waiting for Hadoop (w/ apologies to Samuel Beckett)

Merv Adrian
Research VP, Information Management, Gartner, Inc.

During 2013, Hadoop made a fundamental and long-promised transition beyond its origins as a highly scalable blunt instrument for cost-effective processing of massive volumes of data in a file system. Today it's much more, and promises for what can be built on the foundation of Hadoop 2.0 are already raising the bar dramatically on expectations. Some will not be met for a while. What's being delivered now, and how will you leverage it? Gartner research examines Hadoop's changing role, use cases already deployed, the evolution of the commercial distribution landscape and the potential for the technology.

Transforming data into action using Hadoop, Excel, and the Cloud

Quentin Clark
Corporate Vice President, Microsoft

Big data. Small data. All data. You have access to mountains of data from inside the walls of your business and across the web. We’ve all seen the potential in data – from predicting election results, to preventing the spread of epidemics. The question is… How can you use your data to help move your business forward? Join Quentin Clark, Microsoft Corporate Vice President, to learn how to transform data into action using Hadoop, Excel and the Cloud.

Unlocking Hadoop’s Potential

Arun Murthy
Founder and Architect at Hortonworks, Former VP Apache Hadoop at Apache Software Foundation

Hadoop has come a long way from its humble roots as a web-scale, batch-only data processing system. In 2013, the release of Hadoop 2 and its YARN-based architecture unleashed a wave of interactive and realtime use cases capable of running natively in Hadoop. YARN, acting as Hadoop's data operating system, enables you to say goodbye to data silos and hello to an era where Hadoop acts as a shared, multi-tenant service supporting a wide range of batch, interactive, and realtime use cases aimed at delivering business value from a centralized and shared data set. In this talk Arun Murthy will share the very latest innovation from the community aimed at accelerating the interactive and realtime capabilities of enterprise Hadoop.

How Big Data Boosts the Network (and other stories from AT&T)

Victor Nilson
Senior VP for Big Data, AT&T

Running one of the world's largest networks is a complex task. AT&T analyzes millions of data points per minute, triggering hundreds of thousands of adjustments to optimize our network and services. In the past, when Big Data technology did not exist to support these operations, we created it. Now we are embracing Hadoop, and we are adding to its ecosystem by contributing some of our technology into open source, including tools for easier visualization and analytics. The network may have been the start, but it's not the whole picture. We are using our expertise to support internal and external use cases that can be creative and surprising.

June 4

Enterprise Hadoop for Pools, Ponds, Lakes, Clouds and Beyond

Shaun Connolly
VP Corporate Strategy, Hortonworks

Data has disrupted the datacenter, driving an architectural shift in which Apache Hadoop plays a key role. As more and more applications are created that derive value from new types of data from sensors/machines, server logs, clickstreams, and other sources, the enterprise "Data Lake" forms, with Hadoop acting as a shared service for efficiently delivering deep insight across a large and diverse set of data at scale. While these Data Lakes are important, a broader lifecycle needs to be considered that spans development, test, production, and archival and that is deployed across a hybrid cloud architecture. During this session Shaun Connolly, VP of Strategy for Hortonworks, will talk through a vision of enterprise Hadoop that powers a connected lifecycle of workloads and of data clouds, ponds, lakes, glaciers and more, enabling businesses to democratize sets of data for their user communities in a flexible and prescriptive manner.

Hadoop and Analytics 3.0

Thomas H. Davenport
President’s Distinguished Professor of Information Technology and Management, Babson College

Online startups often have no legacy infrastructure to wrestle with and can adopt Hadoop as their primary data management platform from the beginning. But large, established organizations have to integrate Hadoop with an existing architecture. In this presentation, Tom Davenport will describe a "Big Data in Big Companies" study of 20 large organizations that are pursuing big data projects, most with the aid of Hadoop. He'll discuss "Analytics 3.0," a collection of analytical activities that involve pursuing both big and small data on Hadoop and other platforms. He will describe the technology, culture, and processes that these organizations employ, and will provide examples of several organizations that have already achieved substantial business value from Hadoop-enabled Analytics 3.0.

Panel: Looking to the Past to Define a Future for Hadoop

Moderator:  Jeff Kelly
Principal Research Contributor, Wikibon

The rate of innovation in and around Apache Hadoop is astounding, and at times the result is complicated and unwieldy. Where is this headed long term? What's at the edge of the horizon that we can expect? We've asked industry pundit Jeff Kelly from Wikibon to hold a discussion with two of the original visionaries for Apache Hadoop, Doug Cutting and Arun Murthy, about what they see on the horizon for Hadoop.

Panelists:
Doug Cutting, Founder Apache Hadoop / Chief Architect, Cloudera
Arun Murthy, Apache Hadoop PMC member / Founder, Hortonworks

Big Data Analytics at Massive Scale

Oliver Ratzesberger
Senior Vice President, Software, Teradata Labs

The expanding diversity of analytic requirements has driven advancements in Hadoop, Analytics, SQL and NoSQL engines. According to Gartner, the Logical Data Warehouse is the new data management architecture for analytics, combining the strengths of data warehouses with alternative data management and access strategies, and will form the new best practice by the end of 2015. In order to orchestrate analytical platforms across a layered data architecture, the industry needs greater abstractions for the user community that isolate them from the growing complexity of specialized engines. Big Data Analytics must be executed in an agile, scalable and frictionless environment with minimal programming.

Enterprise Hadoop and the Open Hybrid Cloud

Tim Yeaton
Senior Vice President, Infrastructure Business Group, Red Hat

Enterprises expect Hadoop to provide faster time to results and lower total cost of ownership while maintaining the rapid innovation and flexibility of an open source community. Red Hat and Hortonworks are integrating our products into big data solutions for a seamless customer experience. Come learn about several modern data architectures that bring Apache Hadoop to the open hybrid cloud.

June 5

Hadoop Intelligence – Scalable machine learning

Amotz Maimon
Chief Architect, Yahoo!

This talk will cover how Yahoo is leveraging Hadoop to solve complex computational problems with a large, cross-product feature set that needs to be computed quickly. We will share the challenges we face, the approaches we're taking to address them, and how Hadoop can be used to support these types of operations at massive scale.

Leveraging Big Data to change your world

John Schitka
Solution Marketing, Big Data, SAP

Big Data is changing our world, society, businesses and lives unlike anything before. New insights transform the way we do business, work and live our lives. To compete you need the instant results, infinite storage and data exploration you get with SAP and Hadoop. This powerful, highly distributed, elastic platform, delivered by the SAP HANA platform for Big Data together with Hadoop, enables business and changes lives, for example by powering genomic analysis and research. Hadoop and SAP technologies are complementary in the modern data architecture and together amplify the value of Big Data, helping you gain unprecedented business value.

Panel: Hadoop in the Enterprise

In this session, join our panelists as they discuss their journey with Hadoop from proof of concept to production. They will explore the business impact of Hadoop, share the challenges they experienced and the solutions they found, and highlight key lessons you can use in your organization.

Panelists:
Venkat Achanta, SVP/Global Head of Analytics and Big Data at AIG
David Gleason, Head of Data Strategy at BNY Mellon
Ratnakar Lavu, SVP of Commerce at Kohl’s
Chris Dingle, Sr. Director of Audience Solutions at Rogers Media Inc.
Anu Jain, Group Manager of Strategy and Architecture at Target
John Williams, SVP of Platform Operations at TrueCar
Daljit Rehal, Strategic Systems Director, British Gas


Day 1 » Tuesday, June 3
Tracks:
Deployment and Operations
Hadoop for Business Apps
Future of Hadoop
Data Science
Committer
Hadoop Driven Business
8:00 - 8:45am Breakfast
8:45 - 10:45am Keynotes and Plenary
10:45 - 11:15am Break
11:15 - 11:55am

During one of our epic parties, Martin Lorentzon (chairman of Spotify) agreed to help me arrange a dinner for me and Timbuktu (my favourite Swedish rap and reggae artist), if I prove somehow that I am the biggest fan of Timbuktu in my home country. Because at Spotify we attack all problems using data-driven approaches, I decided to implement a Hive query that processes real datasets to figure out who streams Timbuktu the most frequently in my country. Although this problem seems to be well-defined, one can find many challenges in implementing this query efficiently, and they relate to sampling, testing, debugging, troubleshooting, optimizing and executing it over terabytes of data on the Hadoop-YARN cluster that contains hundreds of nodes. During my talk, I will describe all of them, and share how to increase your (and the cluster's) productivity by following tips and best practices for analyzing large datasets with Hive on YARN. I will also explain how newly-added Hive features (e.g. join optimizations, the ORC file format and the Tez integration that is coming soon) can be used to make your query extremely fast.

A Perfect Hive Query For A Perfect Meeting
Adam Kawa
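
A minimal sketch of the kind of Hive query the session describes, submitted through the standard hive -e command line; the table and column names are assumptions for illustration, not Spotify's actual schema:

```python
import subprocess

# Hypothetical table: streams(user_id STRING, artist_name STRING, country_code STRING)
query = """
SELECT user_id, COUNT(*) AS plays
FROM streams
WHERE artist_name = 'Timbuktu'
GROUP BY user_id
ORDER BY plays DESC
LIMIT 10
"""

# The Hive CLI compiles the query into MapReduce (or Tez) jobs on the YARN cluster.
subprocess.run(["hive", "-e", query], check=True)
```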

Bloomberg needed to deploy a full Hadoop stack across a variety of environments, from dev to test to production, but they wanted repeatable tools to accomplish this task… so they built some recipes.  In this session, the Bloomberg team will review how they used open source deployment tools to support a variety of Linux distributions and deploy out-of-the-box full stack high availability.  You will also see how they were able to automate deployment of these clusters based on a stable vendor distribution but with the ability to mix in new package versions.

Open source recipes for Chef deployments of Hadoop
Clay Baenziger

Maintaining large scale distributed systems is a herculean task and Hadoop is no exception. The scale and velocity at which we operate at Rocket Fuel present a unique challenge for maintaining large and complex Hadoop clusters. We observed five-fold growth in petabytes of data and a six-fold increase in the number of machines, all in just a year's time. As Hadoop became a critical infrastructure at Rocket Fuel, we had to ensure scale and high availability so our reporting, data mining, and machine learning could continue to excel. We also had to ensure business continuity with disaster recovery plans in the face of this hockey-stick growth. In this talk we will discuss what worked well for us and what we learned the hard way. Specifically we will (a) describe how we automated installation and dynamic configuration using Puppet and InfraDB, (b) describe the performance tuning for scaling Hadoop, (c) talk about the good, bad and ugly of scheduling and multi-tenancy, (d) detail some of the hard-fought issues, (e) brief you on our Business Continuity and Disaster Recovery plans, (f) touch upon how we monitor our monster Hadoop cluster and, finally, (g) share our experience of YARN at scale at Rocket Fuel.

Hado"OPS" or Had "oops"
Kishore Yellamraju, Abhijit Pol

Commercial software vs. open source. Each side has its share of passionate believers. Ultimately, both communities depend on one another. In this presentation, we'll discuss how SAS leverages Hadoop and other open source projects to address some of the biggest challenges businesses face. Plus, we will show how SAS "plays well" and contributes to shared success with open source communities.

With the Rise of Open Source, Why Organizations Still Turn to SAS
Brian Garrett

Hadoop is a very flexible platform for storing various disparate types of data – the variety dimension of the famous 3 V's. We will discuss how we use Hadoop to combine audio, video, and social media data sources to analyze Caltrain activity and provide real time insight into variance from the regular schedule. This is an instance of the general problem of combining disparate data sources to reason about the current operational state of a business system. We will cover how we store raw sensor and social media data in HDFS and then use various processing frameworks to refine that data and store it in HBase. Specific examples of using Hive, Flume, Python, SerDe, and Avro to take data from various inputs and perform the necessary transformations to make the data suitable for analysis will be explained. Finally, we will discuss how we develop and integrate the analytical components using Python's NumPy, scikit-learn, OpenCV, and pandas libraries. The goal of the analyses is to recognize train sounds in audio streams, detect trains in video streams, and combine that with data from social media. Ultimately, we aim to determine what train is where, and how it is running relative to the schedule.

Railroad Modeling at Hadoop Scale
Tatsiana Maskalevich, John Akred
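
As an illustration of the final step (comparing detections against the published timetable), here is a small pandas sketch; the data frames, column names and timestamps are invented for the example:

```python
import pandas as pd

# Hypothetical outputs of the audio/video classifiers and the published schedule.
detections = pd.DataFrame({
    "train_id": ["101", "103"],
    "detected_at": pd.to_datetime(["2014-06-03 08:05", "2014-06-03 08:41"]),
})
schedule = pd.DataFrame({
    "train_id": ["101", "103"],
    "scheduled_at": pd.to_datetime(["2014-06-03 08:02", "2014-06-03 08:37"]),
})

# Join detections to the timetable and compute lateness in minutes.
status = detections.merge(schedule, on="train_id")
status["delay_min"] = (status["detected_at"] - status["scheduled_at"]).dt.total_seconds() / 60
print(status[["train_id", "delay_min"]])
```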

What to do with all that memory in a Hadoop cluster? Should we load all of our data into memory to process it? The goal should be to put memory into its right place in the storage hierarchy, alongside disk and solid-state drives (SSD). Data should reside in the right place for how it is being used, and should be organized appropriately for where it resides. This proposed solution requires a new kind of data set called the Discardable, In-Memory, Materialized Query (DIMMQ).  In this session we will talk through how we can build on existing Hadoop facilities to deliver three key underlying concepts that enable this approach.

Discardable, In-Memory, Materialized Query for Hadoop
Julian Hyde

Currently, users of the Apache Falcon system are forced to define their applications as Apache Oozie workflows. While Falcon hides the scheduler and its workings from users, they still end up learning about Oozie because applications are to be defined as Oozie workflows. The objective of this work is to provide a pipeline designer user interface through which users can author their processes and provision them on Falcon. This should make building applications on Falcon over Hadoop fairly trivial. Falcon has the ability to operate with HCatalog tables natively. This means that there is a one-to-one correspondence between a Falcon feed and an HCatalog table. Between the feed definition in Falcon and the underlying table definition in HCatalog, there is adequate metadata about the data stored underneath. These datasets can then be operated on by a collection of transformations to extract more refined datasets/feeds. This logic (currently represented via Oozie workflows, Pig scripts or MapReduce jobs) is typically represented through the Falcon process. In this talk we walk through the details of the pipeline designer and the current state of this feature.

Hadoop First ETL Authoring using Apache Falcon
Srikanth Sundarrajan, Naresh Agarwal
12:05 - 12:45pm

In the last eight years, the Hadoop grid infrastructure has allowed us to move towards a unified source of truth for all data at Yahoo that now accounts for over 450 petabytes of raw HDFS and 1.1 billion data files. Managing data location, schema knowledge and evolution, fine-grained business-rules-based access control, and audit and compliance needs have become critical with the increasing scale of operations. In this talk, we will share our approach to tackling the above challenges with Apache HCatalog, a table and storage management layer for Hadoop. We will explain how to register existing HDFS files into HCatalog, provide broader but controlled access to data through a data discovery tool, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data into HCatalog. In addition, the approach allows ever-improving Hive performance to open up easy ad-hoc access to analyze and visualize data through SQL on Hadoop and popular BI tools. As we discuss our approach, we will also highlight how it minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage and integrity.

Data Discovery On Hadoop - Realizing the Full Potential of Your Data
Thiruvel Thirumoolan, Sumeet Singh

Symantec today owns the largest known security metadata store in the world. We were faced with a challenge to scale even larger to meet the exploding volume of data we need to collect, analyze and use to provide our customers with best-of-breed protection. Symantec set out on a journey to build a new multi-petabyte big data platform from the ground up to meet our batch and low-latency analytics needs. The biggest challenge was to stand up the platform in a matter of a few months. The talk will concentrate on how we built the platform and the criteria we used in our decisions. We will share our thoughts on the cluster build (disk-to-compute ratio, network bandwidth, leaf-to-spine ratio, etc.) and Hadoop distro selection (a weighted matrix ranging from operational management to DR, security and ingestion management). We will also share our experiences in hardening the platform, including best practices to ramp up operations in a short time.

Lessons Learned from Building an Enterprise Big Data Platform from Ground Up in a Matter of Months
Roopesh Varier, Srinivas Nimmagadda

Large enterprises are great places for Hadoop, but their strict technical requirements (often centered around relational databases) make adoption difficult.  In this session, we’ll discuss diverse technologies used to make Hadoop enterprise friendly.  We will cover specific hardware architectures, the relative strengths of Hadoop and a data warehouse, common coexistence design patterns, technology used to quickly and flexibly transfer data, and how to use this technology to meet enterprise security requirements.  Along the way, we’ll uncover how coexisting Hadoop with an enterprise’s existing data warehouse is a strong path for furthering enterprise Hadoop adoption.

Hadoop is not an island in the enterprise: Understanding deployment practices that merge the strengths of Hadoop and the Data Warehouse
Joe Rao

Apache Giraph is a powerful tool for graph-based analytics and insights on massive datasets. It is highly performant compared to Hadoop and Hive for graph and/or iterative computations and can scale beyond a trillion edges on real social networks. Facebook, along with other companies and researchers, has been running Apache Giraph in production since late 2012. Since then, the community has expanded the problem space of the project beyond simple graph algorithms (e.g. PageRank) into complex data transformations around a vertex-centric programming model. The Giraph framework has migrated toward a more flexible bulk synchronous parallel (BSP) computing framework. We support a master-worker model of computation where the master can select the computation, message types, and message combiners of each superstep according to runtime information (similar to a SIMD computer). Global state is recorded and accessible via scalable aggregators. In this talk, we describe our modified API with examples of how to use it in practice. We will explain how to handle applications that have large message transfers via "superstep splitting". Finally, we will also detail how Facebook runs Apache Giraph in production and the types of applications and teams that we support.

Dynamic Graph/Iterative Computation on Apache Giraph
Avery Ching

Data is very important to everything we do at Yahoo. Grid services at Yahoo span Hadoop core components, related services and grid applications. The Yahoo grid services team manages high-performance, distributed and fault-tolerant implementations of Hadoop clusters that scale to tens of petabytes of data with reduced operational cost and meet the requirements of a myriad of Yahoo business units for online properties and advertising systems that serve upwards of 800 million customers. In this talk, learn from Yahoo principal architect Sagi Zelnick and Splunk principal architect Ledion Bitincka about how to provide exploratory self-service analytics using a combination of Hadoop and Hunk: Splunk Analytics for Hadoop and NoSQL Data Stores.

Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Sagi Zelnick, Ledion Bitincka

In this talk, we present BlinkDB, a parallel computing framework that is being built on Hive, Shark and Facebook Presto. BlinkDB uses a radically different approach where queries are always processed in near real time, regardless of the size of the underlying dataset. This is enabled by not looking at all the data, but rather operating on statistical samples of the underlying datasets. More precisely, BlinkDB gives the user the ability to trade between the accuracy of the results and the time it takes to compute queries. The challenge is to ensure that approximate query results are still meaningful, even though only a subset of the data has been processed. Here we leverage recent advances in statistical machine learning and query processing. Using statistical bootstrapping, we can resample the data in parallel to compute confidence intervals that precisely indicate the quality of the sampled results. BlinkDB is being integrated in Hive/Shark and in Facebook Presto, and is also in the process of being deployed at a number of companies. This talk will feature an overview of the BlinkDB architecture and its design philosophy. We will also cover how the audience can leverage this new technology to gain insights in real-time using a variety of real-world use cases from our early adopters.

BlinkDB: Querying Petabytes of Data in Seconds Using Sampling
Sameer Agarwal
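
BlinkDB's engine is not reproduced here; the sketch below only illustrates the statistical bootstrap it relies on, estimating a mean and a confidence interval from a sample alone (the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=3.0, size=10_000)   # pretend this is a small sample of a huge column

# Resample the sample with replacement to gauge how uncertain the sample mean is.
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean() for _ in range(1000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean ~ {sample.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```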

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. In addition, it provides for extensibility by way of User Defined Functions. There are some third-party libraries for Pig geared for use by data scientists. In this talk, I will explore how to integrate popular libraries with Apache Pig to provide a robust environment for doing data science. I will explore gaps and potential improvements based on our experience using Pig as a tool for data science. In particular, we will focus on the role of Pig as a data aggregation tool as well as a platform to evaluate machine learning models at scale.

Apache Pig for Data Scientists
Casey Stella
12:45 - 1:45pm Break
1:45 - 2:25pm

Summingbird is an open source library that enables portable, type-safe, map-reduce style programming. Summingbird allows you to write business logic one time, and run that logic on Hadoop, Storm, and potentially other platforms such as Spark or Akka. In addition to being a portable layer for writing streaming map/reduce jobs, Summingbird also supports hybrid, or so-called lambda-architecture, designs. Such hybrid systems use realtime layers to deliver fast, but potentially slightly lossy, results, and then replace those with transactional and exact results from Hadoop as they become available. By using the Summingbird API, you avoid much of the complexity of designing such systems. This talk will show some example computations that serve user-facing features at Twitter. We will see how we model some interesting computations in a streaming setting, such as unique user counts, heavy-hitter/top-K rankings, as well as classic data cubes for analytics dashboards.

Summingbird: Streaming, Portable, Map/reduce
Oscar Boykin
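
Summingbird itself is a Scala library; the Python sketch below only illustrates the underlying idea: one pure function feeds both a batch pass over historical events and an incremental pass over new ones, and the results merge because the aggregation (counting) is associative. The event data is made up:

```python
from collections import Counter

def tweet_to_counts(tweet):
    # The single piece of "business logic", written once and reused by both layers.
    return Counter(word.lower() for word in tweet.split())

batch_events = ["Hadoop summit keynote", "summit hallway track"]      # historical
stream_events = ["keynote replay", "Hadoop YARN talk"]                # arriving now

batch_view = sum((tweet_to_counts(t) for t in batch_events), Counter())
realtime_view = sum((tweet_to_counts(t) for t in stream_events), Counter())

# Hybrid/lambda merge: the exact batch view and the fresh realtime view combine cleanly.
merged = batch_view + realtime_view
print(merged.most_common(3))
```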

The advertising industry is no stranger to Hadoop given the scale it has to operate at and the complexity of the processing tasks. At BrightRoll we deployed Hadoop along with Flume NG and HBase to deliver billing reports and business metrics accurately, with no duplication, and with a high degree of confidence. In this talk, we will cover the approach we used to develop a data processing pipeline currently handling 600+ billion (and growing) events every month, and some of our most pressing data challenges, such as handling duplicates at large scale, enabling low end-to-end processing latency, and guaranteeing time-sequence processing. In addition, we will talk about the ease of adding new processing, either via configuration or via pluggable components, and how we run Hadoop in continuous mode while maintaining idempotent characteristics.

Processing Complex Workflows in Advertising Using Hadoop
Bernardo de Seabra, Rahul Ravindran
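
A toy sketch of the duplicate-handling idea: processing is keyed on an event ID so replayed deliveries do not change the output. The event IDs and fields are invented; the production pipeline applies the same keep-first-occurrence logic at a scale of hundreds of billions of events per month:

```python
def dedupe(events):
    """Yield each event once, keyed on event_id; replayed deliveries are skipped."""
    seen = set()
    for event in events:
        if event["event_id"] in seen:
            continue
        seen.add(event["event_id"])
        yield event

events = [
    {"event_id": "e1", "bid": 0.10},
    {"event_id": "e2", "bid": 0.07},
    {"event_id": "e1", "bid": 0.10},   # duplicate delivery
]
print(list(dedupe(events)))            # e1 appears exactly once
```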

Data must be harnessed in actionable time to gain value and have positive results for your business. You need the instant results, infinite storage and data exploration you get with SAP and Hadoop. Process large data sets in Hadoop, analyze in real time with the SAP HANA platform and visualize with tools like SAP Lumira. Learn how these two powerful and complementary technologies connect and interact to provide enterprises with a future-proof modern data architecture. This session will demonstrate how solutions from SAP and our Hadoop partners are helping organizations seize the future and gain unprecedented insight from Big Data.

Harnessing Big Data in Real-Time
John Schitka

Apache Hadoop YARN is the modern distributed operating system. It provides a generic data processing engine beyond MapReduce and enables the Hadoop compute layer to be a common resource-management platform that can host a multitude of applications. Multiple organizations are able to share the same cluster and build their applications in Hadoop without worrying about resource management, isolation, multi-tenancy issues, etc. In this talk, we'll first cover the current status of Apache Hadoop YARN – how it is faring today in deployments large and small, and how it stands apart from the batch-processing-only world of Hadoop 1.0. We'll then move on to the exciting future of YARN – features that are making YARN a first-class resource-management platform for enterprise Hadoop. We'll discuss the current status as well as the future promise of features like rolling upgrades with no or minimal service interruption, high availability, native support for long-running services (alongside applications) like Apache HBase and Apache Storm without any changes, fine-grained isolation for multi-tenancy using cgroups, powerful scheduling features like application priorities, preemption and application-level SLAs, and other usability tools like the application history server, client submission web services, better queue management, and developer tools for easier application authoring.

Apache Hadoop YARN: Present and Future
Vinod Kumar Vavilapalli, Jian He

Hadoop is a distributed framework which provides an economical way to store and process huge volumes of data. The system-level details are abstracted away nicely by the framework. Storage and processing components are co-located, giving a data-centric approach to processing that performs far better at scale. Cloud computing offers IaaS (Infrastructure as a Service) for storage and processing, PaaS (Platform as a Service) as a leased OS, and SaaS (Software as a Service) as leased software. Its core advantage is the ability to get resources "on demand" with a decoupled architecture and achieve economies of scale. Through horizontal scalability of the infrastructure, cloud IaaS enables a vertically scalable system for storing and processing. Hadoop made this possible with a horizontally scalable architecture and a data-centric approach. The market claims that the data-centric approach can fit with the process-centric method. This talk reveals the trade-offs and suggests whether it is advisable to run Hadoop in the cloud. With the question "Is Cloud a right companion for Hadoop?" in mind, let's take a journey to explore the challenges and developments happening in and around the marriage between Hadoop and the cloud.

Is Cloud a Right Companion for Hadoop?
Chintan Bhatt, Saravanan Prabhagaran

Cascading Pattern is an open source project that takes models trained in popular analytics frameworks, such as SAS, MicroStrategy, SQL Server, etc., and runs them at scale on Apache Hadoop. This machine-learning library works by translating PMML – an established XML standard for predictive model markup – into data workflows based on the Cascading API in Java. PMML models can be run in a pre-defined JAR file with no coding required. PMML can also be combined with other flows based on ANSI SQL (Cascading Lingual), Scala (Scalding), Clojure (Cascalog), etc. Multiple companies have collaborated to implement parallelized algorithms: Random Forest, Logistic Regression, K-Means, Hierarchical Clustering, etc., with more machine-learning support being added. Benefits include greatly reduced development costs and fewer licensing issues at scale – while leveraging a combination of Apache Hadoop clusters, existing intellectual property in predictive models, and the core competencies of analytics staff. In this presentation, Concurrent, Inc.'s Alexis Roos will provide sample code for applications using predictive models built in SAS and R, such as anti-fraud classifiers. Additionally, Alexis will compare variations of models for enterprise-class customer experiments.

Pattern: An Open Source Project for Migrating Predictive Models from SAS, etc., onto Hadoop
Alexis Roos

Apache Hadoop systems are exploding in popularity and are being widely adopted by organizations to process enormous amounts of data. What is often missing from these architectures is an understanding of the security implications and how to protect data at the cell level. In addition, many Hadoop guides do not outline good protections for these shared-nothing architectures; protecting the data requires a different method for these types of architectures. This talk will discuss the security challenges with information storage and retrieval in Hadoop, outline a detailed security design pattern for combining Apache technologies to enable data protection, provenance and lineage, and give an overview of real-world case studies where these patterns have been implemented.

Lessons Learned on How to Secure Petabytes of Data
Peter Guerra
2:35 - 3:15pm

Storm has become a popular solution for low-latency big data processing at Yahoo. Our Storm clusters represent one of the largest Storm installations in the world. To support larger-scale use cases and to reduce operating costs, we recently created a multi-tenant Storm service hosted on the Hadoop grid. Our multi-tenant Storm service supports several key objectives: each user can have a number of dedicated compute nodes, ensuring Storm applications are isolated from those of other users; Storm applications have access to Hadoop datasets on HDFS that are monitored and managed as Hadoop services; the multi-tenant service supports hundreds of Storm applications, each of which may use several hundred workers; and events are injected into Storm applications using either a push or a pull model. In this talk, we will present the system architecture of our multi-tenant Storm service and our enhancements to Storm and Hadoop. Our architecture design and enhancements enable the Storm service and Hadoop services to unify infrastructure for authentication, service registry, data storage and operation. We plan to release key enhancements to Apache Hadoop and Apache Storm. This talk will also share our experiences of operating Storm services alongside Hadoop services.

Multi-tenant Storm Service on Hadoop Grid
Bobby Evans

This talk covers practical considerations in using and operating Hadoop in private and public clouds and the tradeoffs between Hadoop's own resource virtualization and using VMs. We start with considerations in using Hadoop in public clouds, where clusters are spun up and down, and issues related to efficiently accessing data that persists. We then focus on shared multi-tenant Hadoop clusters in private clouds. Dedicated cluster infrastructure for each tenant is not practical. While use cases such as on-demand development or test clusters using VMs are a good idea, one needs to avoid resource fragmentation; we highlight subtle storage fragmentation problems and solutions. For production clusters, we discuss the benefits of shared Hadoop clusters with built-in resource virtualization and multi-tenancy features instead of per-tenant Hadoop clusters using virtual machines. If you are using VMs for production Hadoop in private clouds, we describe best practices. We conclude with related recent and upcoming Hadoop features.

Hadoop in Clouds, Virtualization and Virtual machines
Sanjay Radia, Suresh Srinivas

Twitter has been an early adopter of Hadoop since its beginning and now scales to multiple clusters of thousands of nodes crunching petabytes of data every day. We quickly evaluated and deployed new features of Hadoop 2 in production. Hadoop HDFS Federation along with HA helped in building non-stop clusters available to users even during software upgrades. We evaluated Hadoop 2.0 on multiple benchmarks and user jobs on a new hardware platform before going to production. We also deploy application platforms such as Pig, Cascading, Scalding and more on top of Hadoop. These platforms were evaluated and improved to work with Hadoop 2.0. Production jobs and workflows written for these platforms were validated and modified to take advantage of what's great in Hadoop 2.0. The transition from Hadoop 2.0 to Hadoop 2.3 involved multiple usability, stability and scalability fixes in both the HDFS and YARN layers. Twitter has contributed those fixes back to open source and is working closely with the community on new features. Twitter also used its open source tool hRaven extensively to compare performance gains and plan for capacity growth of clusters. Learn about the challenges Twitter faced and the lessons it learned during its journey to move thousands of production jobs to a new multi-petabyte scale system.

Hadoop 2 @Twitter, Elephant Scale
Lohit Vijayarenu, Gera Shegalov

The most efficient analytical query engine is now accessible on Hadoop. Peter Boncz, architect of the transformational MonetDB and X100 VectorWise databases, explains design goals and features of this offering on Hadoop, codenamed “Project Vortex.” It uses HDFS native storage, and uses Hadoop clusters for storing, updating and querying huge databases. Introduced in 2010, the transformational columnar, vectorized analytical database inspired other analytical databases but continues to dominate performance charts. Project Vortex brings industrial-grade SQL (optimizer, APIs, access control, etc.) to Vector technology: HDFS performance instrumentation, elastic scale-out, maintenance-free failover and resource optimization via YARN. This talk officially announces Vortex.

“Project Vortex”: The Debut of an Industrial-Strength DBMS to Truly Leverage Hadoop
Peter Boncz

This talk will introduce Hivemall, a new open-source machine learning library for Apache Hive. Hivemall provides a number of machine learning functionalities across classification, regression, ensemble learning, and feature engineering through Hive UDFs/UDAFs/UDTFs. Hivemall is very easy to use, as every machine learning step is done within HiveQL, and it is designed to be scalable in the number of training instances as well as the number of training features. We present a series of experiments using the KDD Cup 2012 click-through rate prediction task and compare Hivemall to state-of-the-art scalable machine learning frameworks, including Vowpal Wabbit and Mahout. Moreover, we will share our experience and findings on using Hivemall for conversion rate prediction over terabytes of internet advertising data. Though the project is young, it has already gotten a lot of attention, as seen in 124+ stars and 31+ forks on https://github.com/myui/hivemall. This talk should be particularly interesting and relevant to people already familiar with Hive and working on big data analytics.

Hivemall: Scalable Machine Learning Library for Apache Hive
Makoto Yui

Remember the promise of a centralized data warehouse, able to store all your company's data in one place? The idea was great, but the technology couldn't deliver. Over the years, data growth has exploded and data has scattered, making it difficult to access and use. With the emergence of Hive and interactive query engines like Hortonworks' Stinger, Cloudera Impala and Apache Shark, we have the opportunity to bring data back together again. However, many people deploying Hadoop for business intelligence leverage the same techniques that they've used for their traditional BI and data warehousing technologies. To truly exploit Hadoop's capabilities, enterprises need to abandon the tendency to prematurely normalize, structure and aggregate their data. Rather, by exploiting Hadoop's schema-on-read capabilities, enterprises can focus on capturing all their data in a raw, unaltered form and apply structure "just in time" to support the ever-changing needs of the business. In this session, the audience will see how Klout built one of the world's largest data services by leveraging Hadoop's schema-on-read capabilities with no commercial software. The audience will also learn how to leverage Hive's non-scalar data types to create a flexible dimensional data model that adapts to new data without complicated ETL, data transformations or pre-aggregation.

BI on Hadoop: How Hadoop and the New Interactive Query Engines Change the Game Forever
David Mariani
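
A small sketch of the schema-on-read idea using Hive's non-scalar types: raw, nested records are registered as an external table and structure is applied at query time. The table name, columns, storage format and location are assumptions for illustration, not Klout's actual model:

```python
import subprocess

ddl = """
CREATE EXTERNAL TABLE user_activity (
  user_id      BIGINT,
  scores       MAP<STRING, DOUBLE>,
  interactions ARRAY<STRUCT<network:STRING, ts:BIGINT>>
)
STORED AS ORC
LOCATION '/data/raw/user_activity'
"""

# Register the existing data with Hive; queries can then unpack the maps/arrays lazily.
subprocess.run(["hive", "-e", ddl], check=True)
```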

PayPal operates a huge payment network for millions of merchants around the world. To help the vast number of merchants succeed in a world dominated by ecommerce giants like Amazon, PayPal is continuously offering new capabilities and services. This talk will describe a Hadoop-powered data pipeline providing unique merchant insights. Data at the Oracle transaction tier is collected and streamed to Hadoop in low-latency batches. Raw data is further converted into an analytics-ready data set and flows into data processing workflows that span both traditional reporting and modern predictive analytics domains. The resulting data set is transported to a live production tier exposed as a group of RESTful web services. The talk will focus on how we integrate various traditional and leading-edge Big Data technologies to build a raw-data-to-actionable-business-insight pipeline. We will also discuss unique challenges encountered when migrating from a Hadoop 1.0-based distribution to the latest YARN-based Hadoop while working on this project.

Building A Hadoop Powered Commerce Data Pipeline
Jay Tang
3:25 - 4:05pm

The business of selling new cars is fragmented across many parties, each with limited information about the transaction. This makes the market inefficient, to the detriment of all participants. Buyers, dealers and manufacturers cannot make the best decisions because they lack transparent data about prices, incentives, and allotments. Worse yet, each party assumes that the other has complete information (even though they don’t). We will share how our company built a business to provide data transparency and improve trust during car purchases. After years in business, we accelerated our growth by re-architecting on Hadoop and migrating production data from other platforms, without the wheels flying off. This was not a POC. We went all in with Apache Hadoop. One of us is a storage guy and the other is a processing guy, so together we will cover the entire story of how our team jumped into the deep end of the Hadoop pool (and how we pushed in our colleagues while they weren’t looking). We’re running Hadoop 2 with YARN on almost 2 petabytes, and that’s all we’ve ever known of Hadoop. If you run a growth company or write code, we believe we have a lot to share.

Hadoop Changes What Drives the Car Business: Moving from Anecdotes to Data
John Williams, Russell Foltz-Smith

This talk discusses the current status of Hadoop security and some exciting new security features that are coming in the next release. First, we provide an overview of current Hadoop security features across the stack, covering authentication, authorization and auditing. Hadoop takes a "defense in depth" approach, so we discuss security at multiple layers: RPC, file system, and data processing. We provide a deep dive into the use of tokens in the security implementation. The second and larger portion of the talk covers the new security features. We discuss the motivation, use cases and design for authorization improvements in HDFS, Hive and HBase. For HDFS, we describe two styles of ACLs (access control lists) and the reasons for the choice we made. In the case of Hive, we compare and contrast two approaches for Hive authorization, and show how our approach lends itself to an initial implementation in which the Hive Server owns the data, while a more general implementation remains possible down the road. In the case of HBase, cell-level authorization is explained. The talk will be fairly detailed, targeting a technical audience, including Hadoop contributors.

Improvements in Hadoop Security
Sanjay Radia, Chris Nauroth
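
The HDFS ACLs discussed above are managed with the hdfs dfs -setfacl and -getfacl commands; the path and user below are placeholders, and this sketch simply shells out to them:

```python
import subprocess

path = "/data/warehouse/sales"     # placeholder path
acl = "user:analyst1:r-x"          # placeholder user and permissions

subprocess.run(["hdfs", "dfs", "-setfacl", "-m", acl, path], check=True)   # grant read/execute
subprocess.run(["hdfs", "dfs", "-getfacl", path], check=True)              # show the resulting ACL
```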

Hadoop is known to scale well to hundreds of nodes and users. However, it's sometimes too good at hiding problematic nodes and jobs behind layers of retry mechanisms. Separate from the standard procedures an ops team performs daily (e.g. burn-in tests, Hadoop health-check scripts, validation jobs, etc.), I'd like to share a couple of tips for further stabilizing clusters, including the daily log analysis and checks we perform for finding 1) slow nodes, 2) misconfigured nodes, 3) CPU-eating jobs, and 4) HDFS-wasting users. I hope people will find them useful for managing their clusters at scale.

Collection of Small Tips on Further Stabilizing your Hadoop Cluster
Koji Noguchi
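
A tiny sketch of the kind of daily log check described, run against a hypothetical CSV extracted from job history; the file name, columns and the 2x threshold are all assumptions:

```python
import csv
import statistics
from collections import defaultdict

# Expected columns: node,task_seconds
durations = defaultdict(list)
with open("task_durations.csv") as f:
    for row in csv.DictReader(f):
        durations[row["node"]].append(float(row["task_seconds"]))

cluster_median = statistics.median(d for ds in durations.values() for d in ds)
for node, ds in sorted(durations.items()):
    node_median = statistics.median(ds)
    if node_median > 2 * cluster_median:   # crude flag for a "slow node"
        print(f"{node}: median {node_median:.0f}s vs cluster {cluster_median:.0f}s")
```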

Today's leading companies need to be information-driven to succeed. The Enterprise Data Hub is emerging as the first place to put data in the enterprise that offers multiple integrated forms of analysis and processing, and that can ingest data from and deliver data on demand to existing systems for special-purpose handling. It is the emerging and necessary center of enterprise data management in the era of Big Data.

The Future of Data Management - the enterprise data hub
Matt Brandwein

SociaLite is a Hadoop-compatible high-level query language for data analysis. It makes big data analysis simple, yet achieves fast performance with its compiler optimizations, often more than three orders of magnitude faster than Hadoop MapReduce programs. For example, the PageRank algorithm can be implemented in just two lines of SociaLite query, which runs nearly as fast as a C implementation. High-level abstractions in SociaLite help implement data analysis algorithms. For example, its distributed in-memory tables allow data to be stored and partitioned across machines, and with simple annotations different partitioning schemes can be applied. SociaLite supports relational operations as well as built-in functions and aggregations to simply implement well-known algorithms, such as K-Means clustering using the built-in ArgMin aggregation. Furthermore, queries can be extended with Java and Python functions, which allows arbitrary algorithms to be expressed in SociaLite. The high-level queries achieve fast performance with various optimizations. The queries are compiled into Java code with compiler optimizations applied, such as prioritization or pipelined evaluation. We also use a smart memory allocator to minimize Java garbage collection time as well as object allocation time. We will discuss all the optimization techniques that are used in SociaLite to achieve its high performance.

SociaLite: High-level Query Language for Big Data Analysis
Jiwon Seo
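
SociaLite's own syntax is not reproduced here; the plain-Python loop below just spells out the PageRank iteration the abstract refers to, on a three-node toy graph:

```python
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
nodes = list(edges)
rank = {n: 1.0 / len(nodes) for n in nodes}
damping = 0.85

for _ in range(20):
    incoming = {n: 0.0 for n in nodes}
    for src, outs in edges.items():
        for dst in outs:
            incoming[dst] += rank[src] / len(outs)   # spread rank over out-links
    rank = {n: (1 - damping) / len(nodes) + damping * incoming[n] for n in nodes}

print(rank)   # ranks sum to ~1.0
```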

Real-time network packet analysis is extremely critical for any network security. The ability to collect all the network traffic and analyze it in real time is an elephantine challenge. Especially during denial-of-service attacks, there could be millions of packets travelling over the network per second. Not many systems can capture, analyze, store and provide alerts/insights at this rate. We at Cisco and Hortonworks together built a solution named OpenSOC that can capture, deeply inspect and analyze these packets at the rate of 1.2 million packets per second in real time. This talk covers the use case and our use of Kafka, Storm, HBase and Elasticsearch to ingest 1.2 million network packets per second in real time. Specifically, we discuss how we started with just 5K packets per second and scaled the system to handle 1.2 million packets per second, the solution choices, the different techniques and strategies, and the traditional and innovative approaches that made the performance jump through the roof. Attendees can take away learnings from our real-life experience that can help them understand various tuning methods and their tradeoffs and apply them in their solutions. And one last thing… OpenSOC is being open sourced. Find out how you can take advantage of it.

Analyzing 1.2 Million Network Packets per Second in Real Time
James Sirota, Sheetal Dolas

Enterprises need large-scale security analytics and intelligence, through Hadoop and algorithms, that will provide them the security threat protection they don't currently get from tools like SIEMs, firewalls, IPSes, etc. All such tools today are deterministic and are based on signatures or static rules. In order to transcend static, manual methods, enterprises need to adopt Hadoop and machine learning in order to continuously process data and learn the hidden patterns within it. In order to correlate billions of key events from various sources across the enterprise and find threat patterns, enterprises first need Hadoop clusters to store and process the data and perform correlations, and then apply machine learning algorithms that detect patterns. Further, applying the right machine learning methods is key, as different data sets contain different data/parameters and the patterns detected will depend on the algorithms chosen. In fact, multiple algorithms may need to be applied in order to get fine-grained results. Once the infrastructure has been successfully deployed to perform the above, network security organizations may extend the deployment further to provide predictive results that prevent breaches proactively.

Using Hadoop and Machine Learning to Detect Security Risks and Vulnerabilities, and Predict Breaches in your Enterprise Environment
Karthik Kannan
4:05 - 4:35pm Break
4:35 - 5:15pm

The metrics team at Hulu is responsible for processing and analyzing a very large volume of data (gigabytes of data per hour). Our data analysis platform is built on Hadoop. The needs of our consumers change constantly, but the flow of data never stops – our challenge is providing support and analysis for our existing platform while building the next generation of data tools. We have a passion for and commitment to open-source technology, which we've leveraged to build a number of custom interfaces that cross-cut our particular concerns, especially in regard to reporting, data quality, batch and ad-hoc queries, and scheduling. Even as we finish this set of tools, we always have our eye on the next generation. At each iteration, we attempt to identify anything that prohibits us from scaling, from error alerting and email noise to the amount of time it takes a new developer to learn our systems. This session gives an overview and demonstration of our data platform tools, in particular the tools we use to monitor our data pipeline and how we use these tools to research and resolve any issues that arise.

Lessons Learned - Monitoring the Data Pipeline at Hulu
Tristan Reid

This presentation looks at two methodologies trialed by the Apache Tajo team that promise to improve SQL-on-Hadoop data processing performance significantly. The first methodology involves novel ways of collecting statistical data in order to overcome the computational burden imposed by progressive cost-based query optimization. The second methodology concerns the general challenges of processing data on Solid State Drive systems, with JIT query compilation and vectorized engines put forward as potential solutions.

Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Hyunsik Choi

As the data world undergoes its Cambrian explosion phase, our data tools need to become more advanced to keep pace. Deep Learning has emerged as a key tool in the non-linear arms race of machine learning. In this session we will take a look at how we parallelize Deep Belief Networks in Deep Learning on Hadoop's next-generation YARN framework with Iterative Reduce. We'll also look at some real-world examples of processing data with Deep Learning, such as image classification and natural language processing.

Introduction to Deep Learning on Hadoop
Josh Patterson, Adam Gibson
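
The Iterative Reduce internals are not shown here; the sketch below only mimics the parameter-averaging pattern that style of training builds on, using a toy linear model and simulated workers (all data and settings are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=1200)

shards = np.array_split(np.arange(1200), 4)   # four simulated workers, each with a data shard
w = np.zeros(5)
for _ in range(30):                           # "supersteps"
    local_models = []
    for idx in shards:
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        local_models.append(w - 0.2 * grad)   # one local gradient step per worker
    w = np.mean(local_models, axis=0)         # master averages the worker models

print(np.round(w, 2))   # close to true_w
```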

We describe the motivation, design and challenges faced while building the cloud-based transcoding service which processes all videos before they go online at Yahoo. This system is unique in the sense that it is the first time the Hadoop grid, a batch processing system, has been used to build a transactional system requiring predictable response times. At Yahoo we receive lots of video content on a daily basis from our premium partners, as well as UGC content in the case of Flickr. Incoming video may be in various formats and codecs which may not be streaming-ready. We also need to support various bit rates and resolutions to support various screen form factors and network conditions. Incoming videos need to be transcoded to standard formats such as MP4 and WebM to support universal playback. Transcoding is very compute and I/O intensive. Yahoo has a shared infrastructure of Hadoop clusters which are used across multiple products and allow scalable and elastic computing. In general Hadoop grids are used for textual processing, and using them for processing binary files such as video required us to use Hadoop in innovative ways. In order to address the inherent latency issues in Hadoop, a batch processing platform, we had to explore newer paradigms such as YARN. This presentation highlights some of the challenges we faced and how we overcame them.

Video Transcoding on Hadoop
Shital Mehta

Hadoop has gained tremendous traction in the past few years as a massively scalable, enterprise-wide analytical system. Its flexibility and cost advantage make it an attractive platform for running batch-oriented and interactive analytics on large data sets. The growing popularity of in-Hadoop databases like HBase lets businesses consolidate their operational and analytical workloads into a single cluster. This overcomes the traditional model of copying data from the operational system to the analytical system, which often creates significant latency and introduces additional complexity and cost. Tomer Shiran, VP of Product Management at MapR Technologies, will describe how businesses can begin to integrate operational data into their analytics clusters. He will describe the tools businesses should use, including SQL-on-Hadoop technologies and NoSQL databases like HBase. The combination of these tools offers the familiarity of SQL with the speed and scalability of NoSQL databases.

Delivering on the Hadoop/HBase Integrated Architecture
Tomer Shiran

Finding relevant information fast has always been a challenge, even more so in today’s growing “oceans” of data. This talk explores the area of real-time analytics and anomalies detection (in particular credit card fraud) using Apache Hadoop as a data platform, Apache Storm for real-time computation, data ingestion and orchestration and Elasticsearch for performing advanced real-time searches. This session will focus on the architectural challenges of bridging batch and real-time systems and how to overcome them, keeping a close eye on performance and scalability. We will cover the architectural topics such as partition strategies, data locality, integration patterns and multi-tenancy.

Real-time Analytics and Anomalies Detection using Elasticsearch, Hadoop and Storm
Costin Leau

In this talk, we propose Pig as the primary language for expressing realtime stream processing logic and provide a working prototype on Storm. We also illustrate how legacy code written for MapReduce in Pig can run, with minimal to no changes, on Storm. This includes running existing Pig UDFs seamlessly on Storm. Though neither Pig nor Storm takes any position on state, this system also provides built-in support for advanced state semantics like sliding windows, global mutable state, etc., which are required in real-world applications. Finally, we propose a "Hybrid Mode" where a single Pig script can express logic for both realtime streaming and batch jobs.

Pig on Storm
Kapil Gupta, Mridul Jain
5:25 - 6:05pm

The business need for real-time analytics at large scale has focused attention on the use of Apache Storm, but an approach that is sometimes overlooked is the use of Storm and R together. This novel combination of real-time processing with Storm and the practical but powerful statistical analysis offered by R substantially extends the usefulness of Storm as a solution to a variety of business critical problems. By architecting R into the Storm application development process, Storm developers can be much more effective. The aim of this design is not necessarily to deploy faster code but rather to deploy code faster. Just a few lines of R code can be used in place of lengthy Storm code for the purpose of early exploration – you can easily evaluate alternative approaches and quickly make a working prototype. In this presentation, I will build a bridge from basic real-time business goals to the technical design of solutions. We will take an example of a real-world use case, compose an implementation of the use case as Storm components (spouts, bolts, etc.) and highlight how R can be an effective tool in prototyping a solution.

Architecting R into the Storm Application Development Process
Allen Day

When I speak with other companies on advanced analytics proof of concepts, the focus of their questions skips quickly past the “what” onto the “how” – how did we gain support, how did we find success, how did we decide which technology to select. I will share with you some of the lessons we learned as well as answer many of these questions. This discussion will showcase how Sprint, a major telecommunications company, went from issuing a research challenge to enabling the entire enterprise in the area of analytics. I’ll walk you through how we repurposed an existing team and started with our first Proof of Concept on Hadoop. We are now in the midst of setting up a multi-petabyte enterprise supported Hadoop system with multiple funded projects, are augmenting our research facilities, and have a long list of use case trials in the works.

Gaining Support for Hadoop in a Large Corporate Environment
Jennifer Lim

In this talk I’ll describe use cases for both batch & real-time similarity, and discuss my experience using Hadoop and Solr to generate high quality results at scale for several different clients. I’ll be covering entity resolution (people & places), false identity detection (fraud), real-time recommendation systems and automatic document linking. These are all based on past projects for real customers. Techniques I’ll discuss are feature extraction from text, distributed SimHash, and Solr-based real time similarity scoring.
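
As a deliberately simplified illustration of the SimHash technique mentioned above (not the production pipeline), the standalone class below fingerprints a bag of features and compares two near-duplicate records by Hamming distance; mostly-overlapping feature sets yield fingerprints that differ in only a few bits, which is what makes the approach amenable to large-scale, distributed similarity joins.

import java.util.Arrays;
import java.util.List;

public class SimHashSketch {

  // Build a 64-bit SimHash fingerprint from a bag of features (tokens, shingles, etc.).
  static long simhash(List<String> features) {
    int[] counts = new int[64];
    for (String feature : features) {
      long h = fnv64(feature);
      for (int bit = 0; bit < 64; bit++) {
        counts[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
      }
    }
    long fingerprint = 0L;
    for (int bit = 0; bit < 64; bit++) {
      if (counts[bit] > 0) fingerprint |= (1L << bit);
    }
    return fingerprint;
  }

  // Similarity is measured by the Hamming distance between fingerprints.
  static int hammingDistance(long a, long b) {
    return Long.bitCount(a ^ b);
  }

  // Simple 64-bit FNV-1a hash so the example has no external dependencies.
  static long fnv64(String s) {
    long hash = 0xcbf29ce484222325L;
    for (byte b : s.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
      hash ^= (b & 0xffL);
      hash *= 0x100000001b3L;
    }
    return hash;
  }

  public static void main(String[] args) {
    long a = simhash(Arrays.asList("john", "smith", "94105", "market", "street"));
    long b = simhash(Arrays.asList("jon", "smith", "94105", "market", "street"));
    System.out.println("Hamming distance: " + hammingDistance(a, b));
  }
}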

Similarity at Scale
Ken Krugler

OpenStack is the de-facto standard for open source private clouds. In this talk we will discuss bringing Hadoop and OpenStack together to facilitate Big Data processing with operational agility and efficient infrastructure utilization. You will learn about OpenStack and the many considerations that must be taken into account when deploying Hadoop on OpenStack, such as the transient nature of the clusters, data movement and storage (ephemeral, Cinder, Swift), virtualization vs. bare metal, intelligent VM scheduling and multi-tenancy. We will discuss this in the context of the OpenStack Savanna project, which provides simple, repeatable and scalable deployment of Hadoop clusters. We will discuss the current status of the project and where it is headed, and will demonstrate the Elastic Data Processing feature by provisioning a Hadoop cluster on demand and executing a Pig job. We will also discuss some common use cases for Savanna, such as allowing QA teams to easily provision replicas of deployed Hadoop clusters in minutes, as well as the sharing of data across clusters via Swift.

Hadoop and OpenStack
Matthew Farrellee, John Speidel

Understanding the future of big data is crucial in the early stages of decision making around big data architectures. In the enterprise, what stands out is the need to integrate Hadoop smoothly into your existing data warehouse architecture, while taking advantage of existing skills and investments. In this session we’ll present a strategy for enabling integrated data management using both Hadoop and relational technologies. In particular, we’ll look at how SQL, long the standard for the data warehouse, is increasingly being used on Hadoop. The real prize, though, is Smart SQL processing, seamlessly integrating the data warehouse and Hadoop into a single, Big Data Management System.

Big Data Management System: Smart SQL Processing Across Hadoop and Your Data Warehouse
Dan McClary

For a utility such as EDF, real-time data analytics can address smart-grid issues (demand-response, dynamic pricing, real-time demand and production forecasting). It can also strengthen customer relationships, for example by analyzing tweets for sentiment. CEP (Complex Event Processing) tools emerged about 15 years ago, allowing users to query streaming data, join data in motion with data at rest, and run online machine-learning algorithms. Storm has gained a lot of interest lately as it has joined the Hadoop ecosystem. EDF Lab and OCTO Technology conducted a Proof of Concept to address the complexity of processing billions of metrics coming from millions of smart meters while keeping performance requirements in sight. In detail, we will go through: – The challenge of streaming data analytics for utilities that accompanies the deployment of smart grids. – How real-time metrics can be implemented using Storm to address needs like aggregates, scoring and advanced analytics; we also made use of R on top of Storm for specific algorithms. – The performance achieved on an 8-node commodity hardware cluster as well as on a Hadoop cluster running the HDP 2 distribution, which introduces YARN. – How to quickly build an agile application using iterative development methods, from data source to end-user UI.

Real-time Energy Data Analytics with Storm
Remy Saissy, Marie-Luce Picard

YARN, the next generation compute framework for Apache Hadoop, has a single point of failure in the form of its master – the ResourceManager (RM). The RM keeps track of all the slaves, handles all client interactions, and schedules work in the cluster. Unplanned events like node crashes and planned events like upgrades may reduce the availability of this central service and YARN itself. This session details our recent work on Highly Available Resource Manager (HARM) in YARN.

Highly Available Resource Management for YARN
Karthik Kambatla, Xuan Gong
6:05 - 7:30pm Exhibitor Reception


Day 2 » Wednesday, June 4
Tracks:
Deployment and Operations
Hadoop for Business Apps
Future of Hadoop
Data Science
Committer
Hadoop Driven Business
5:30 - 8:00am Bike Meet-up
8:00 - 8:45am Breakfast
8:45 - 10:45am Keynotes and Plenary
10:45 - 11:15am Break
11:15 - 11:55am

Providing interactive analytics over all of Yahoo!’s advertising data, across the many dimensions and metrics that span advertising, has been a huge challenge. From returning results in under a second in a concurrent system, to computing non-additive cardinality estimations, to audience segmentation analytics, the problem space is computationally expensive and has resulted in large systems in the past. We have attempted to solve this problem in many different ways, with systems built on everything from traditional RDBMSs to NoSQL stores to commercially licensed distributed stores. With our current implementation, we look into how we have evolved a data tech stack that includes Hadoop and in-memory technologies. We will detail and contrast the strengths of each of these systems and how they complement each other for some of the use cases we see in advertising. We will describe how we have customized these technologies to work in interesting patterns that have helped us get to data and analytics more quickly than ever before at Yahoo! The talk will include a couple of use-case deep dives, such as how we compute recursive unique counts on the fly and how we compute segment overlap analytics dynamically.

Interactive Analytics in Human Time - Lightning Fast Analytics using a Combination of Hadoop and In-memory Computation Engines at Yahoo!
Supreeth Rao, Sunil Gupta

In this panel, we will be joined by leading industry analysts to discuss the Hadoop market.  We will attempt to define Hadoop, cover market size, market growth, adoption trends and the future of Hadoop. The audience is encouraged to come armed with questions for this lively and interactive discussion.

Tony Baer – Ovum Research

Mike Gualtieri – Forrester Research

Jeff Kelly – Wikibon

Panel: Analyzing the Hadoop Market

Apache Hive is the most widely used SQL interface for Hadoop. Over the past year the Hive community has been very busy. Hive 0.13 delivers on the promise of Hive running queries in interactive time while maintaining the ability to run at scale on terabytes of data. It also adds significant extensions to Hive’s SQL, including better datatype support, support for subqueries in the WHERE clause, and standard SQL security with GRANT, REVOKE, and ROLEs. Hive 0.13 begins the integration with the Optiq cost-based optimizer. The execution engine has been reworked to take full advantage of YARN and Tez, moving it beyond the confines of MapReduce. It has also been re-written to run optimally on modern processors, using well-known techniques from the research literature. In addition it now takes advantage of features in HDFS such as in-memory caching of files and zero-copy reads from cached files. New file formats such as ORC are providing tighter compression and faster read times. This talk will discuss the feature and performance improvements that have been done, show benchmarks that measure performance improvements so far, look at ongoing work, and discuss future directions for the project.
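
For readers who want to see what two of those SQL extensions look like in practice, here is a small, hedged example against HiveServer2's JDBC driver; the host, database, and the flagged_customers table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Hive13FeaturesSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default");
         Statement stmt = conn.createStatement()) {

      // ORC gives tighter compression and faster reads than plain text files;
      // DECIMAL with precision/scale is part of the improved datatype support.
      stmt.execute("CREATE TABLE IF NOT EXISTS orders_orc " +
                   "(id BIGINT, customer STRING, total DECIMAL(10,2)) STORED AS ORC");

      // Hive 0.13 adds support for subqueries in the WHERE clause.
      ResultSet rs = stmt.executeQuery(
          "SELECT customer, total FROM orders_orc " +
          "WHERE customer IN (SELECT customer FROM flagged_customers)");
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getString(2));
      }
    }
  }
}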

Making Hive Suitable for Analytics Workloads
Alan Gates

Companies that harness major transformations early in the game can make serious headway. This is true in the world of Big Data as well. Attend this session to see how your organization can gain competitive advantage by leveraging the scalability, processing power, speed and lower costs that Hadoop 2.0/YARN offers. Learn how YARN has opened the floodgates for a new generation of data quality software that can help you easily put Big Data at the forefront for your organization – and do it without the added long-term expense of hiring MapReduce programmers.

YARN: The Key to Overcoming the Challenges of Broad-based Hadoop Adoption
George Corugedo

Previously, upgrading a large Hadoop cluster in a production environment required cluster downtime, coordination with users, catch-up processing, etc. – in one word “overhead”. This overhead creates significant friction for the deployment of enhancements and bug fixes, which in turn leads to greater support costs and lower productivity. Hadoop 2.x provides the foundation required to support rolling upgrades with features such as: High availability, wire compatibility, YARN, etc. This talk will describe the challenges with getting to transparent rolling upgrades, and discuss how these challenges are being addressed in both YARN and HDFS. Specific topics that will be discussed in this talk include: Namenode features to support rollback/downgrade, Datanode features to support fast restart and pipeline recovery, Work-preserving RM restart – RM re-acquires state from running NMs and AMs, and Work-preserving NM restart – NM re-acquires running containers.

Hadoop Rolling Upgrades - Taking Availability to the Next Level
Suresh Srinivas, Jason Lowe

Apache Falcon is a platform that simplifies managing data jobs on Hadoop. We delve into the motivation behind Falcon, its use cases, and how it aims to simplify standard functions such as data motion (import, export), lifecycle (replication, eviction, DR/BCP) and process orchestration (data pipelines, late data handling, etc.). The presentation covers detailed design and architecture along with case studies on the use of Falcon in production. We also look at how this compares against siloed alternatives. User intent is systematically collected and used for seamless management, alleviating much of the pain for those operating or developing data processing applications on Hadoop.

Apache Falcon - Simplifying Managing Data Jobs on Hadoop
Venkatesh Seetharam, Srikanth Sundarrajan

Apache Hadoop YARN brings us a step closer to realizing the vision of Hadoop as a single grid for running all data processing applications. The challenges posed by different applications such as batch computation, interactive queries, stream processing, iterative computation, etc. vary widely. While simple stateless services require features like dynamic (re)configuration, service registry, and discovery, more complex stateful fault-tolerant systems like HOYA (HBase on YARN) typically require partition management, failure handling, and scale up/scale down. Helix makes it easier to write distributed data applications on top of YARN by providing a generic application master. Helix is a cluster management framework that orchestrates the constraint-based assignment of distributed tasks in a cluster. While YARN allows a one-to-one mapping of container to task, Helix enables a many-to-one mapping of tasks to a container. Helix monitors the state of each task, lets one declare the behavior of each task using a state machine, and enforces constraints. A distributed system’s life cycle consists of building, provisioning, deploying, configuring, handling failures, and scaling with workload; each stage is affected by app constraints. We will explain how Helix with YARN can be leveraged to tackle the challenges involved in each stage with minimal configuration.

One Grid to Rule them All: Building a Multi-tenant Data Cloud with YARN
Kishore Gopalakrishna
12:05 - 12:45pm

Apache Tez is a modern data processing engine designed for YARN on Hadoop 2. Tez aims to provide high performance and efficiency out of the box, across the spectrum of low latency queries and heavy-weight batch processing. With a clear separation between the logical app layer and the physical data movement layer, Tez is designed from the ground up to be a platform on top of which a variety of domain specific applications can be built. Tez has pluggable control and data planes that allow users to plug in custom data transfer technologies, concurrency-control and scheduling policies to meet their exact requirements. The talk will elaborate on these features via real use cases from early adopters like Hive, Pig and Cascading. We will show examples of using the Tez API for targeting new & existing applications to the Tez engine. Finally we will provide data to show the robustness and performance of the Tez platform so that users can get on-board with high confidence. We envisage that almost all existing applications, that were written on top of MapReduce for Hadoop 1, will migrate to Tez for performance and efficiency on top of Hadoop 2. Built by engineers who have developed and operated large scale systems like Apache MapReduce and Microsoft Dryad, Apache Tez builds on the robustness of the past with an eye on agility for the future.

Apache Tez - A New Chapter in Hadoop Data Processing
Bikas Saha, Hitesh Shah

HBase has ACID semantics within a row, which makes it a perfect candidate for many real-time serving workloads. However, single-homing a region to a server implies some period of unavailability for that server's regions after a crash. Although the mean time to recovery has improved a lot recently, for some use cases it is still preferable to be able to do possibly stale reads while the region is recovering. In this talk, we will give an overview of our design and implementation of region replicas in HBase, which provide eventually consistent reads even when the primary region is unavailable or busy. Regions of replicated tables are opened in primary or secondary mode in different region servers, and the client writes only against the primary region. Reads are sent to all region replicas if the primary does not respond within a tight timeout. Secondary regions get updates from the primary by referring to the same data files and tailing the WAL. After this talk, attendees will have a better understanding of the exact CAP and ACID semantics in HBase, and will learn about the new region replicas feature and how it can be used for high-percentile, low-latency reads.
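
A rough sketch of how a client might opt in to these timeline-consistent reads is shown below. API details vary by HBase release (the names follow the 0.98/1.0-era client), and the table, column family, and row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionReplicaReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Create a table with three region replicas: one primary plus two secondaries.
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("profiles"));
    desc.addFamily(new HColumnDescriptor("d"));
    desc.setRegionReplication(3);
    new HBaseAdmin(conf).createTable(desc);

    // TIMELINE consistency allows a secondary replica to serve the read if the
    // primary is slow or unavailable; the result may then be slightly stale.
    HTable table = new HTable(conf, "profiles");
    Get get = new Get(Bytes.toBytes("user-42"));
    get.setConsistency(Consistency.TIMELINE);
    Result result = table.get(get);
    System.out.println("served from a (possibly stale) secondary? " + result.isStale());
  }
}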

HBase Read High Availability Using Eventually Consistent Region Replicas
Devaraj Das, Enis Soztutar

As more companies use Big Data, the need to simplify, harden for enterprise use, and streamline Hadoop analytic applications will be critical. Limited talent and a fast-evolving ecosystem make it difficult for new entrants to perform on par with veterans, and this situation is expected to persist for some time. AT&T and ecosystem collaborators are bringing new innovations to market and placing several technologies into open source to alleviate this problem. The outcome: easier big data infrastructure, easier visualization of large data sets, easier streaming analytics of complex events, easier management and maintenance of distributed system applications, and easier ways to deliver and maintain big data analytics within the enterprise. This will be a panel discussion on how to Hadoop-enable your enterprise for the long-term. Panelists will include AT&T’s chief of big data capabilities, data scientists from AT&T Labs, the CEO of Continuuity, and the CTO of Hortonworks.

Simplifying Hadoop
Jonathan Gray, Ari Zilka

The past year has seen the advent of several “low latency” solutions for querying big data. The basic premise of Shark, Presto and Impala has been: Hive on MR is too slow for use in interactive queries. At Yahoo, we’d like our low-latency use-cases to be handled within the same framework as our larger queries, if viable. We’ve spent several months benchmarking various versions of Hive (including 13 on Tez), file-formats, compression and query techniques, at Yahoo scale. Here, we present our tests, results and conclusions, alongside suggestions for real-world performance tuning.

Hive on Apache Tez: Benchmarked at Yahoo! Scale
Mithun Radhakrishnan

Symantec today owns the largest known security metadata store in the world. A few years ago we built an analytics platform on top of a commercial MPP database engine and addressed several gaps, including building an in-house job processing and ingestion management system. In the last few years we have faced the challenge of ever-increasing data sets and growing computational complexity in processing the security, storage, and business metadata events we collect every day. To address these scale challenges, we reevaluated our technology needs and decided to adopt Hadoop as the core platform. In this presentation we share our experiences transitioning our ingestion pipelines, data lifecycle management, and job management to Hadoop YARN to address the scale problems. We also share our experiences operating Hadoop YARN clusters and optimizing our scale and performance requirements along multiple dimensions: cluster sharing across multiple workloads; performance optimizations for near-interactive Hive queries while meeting the throughput needs of batch analytics jobs; and effective resource utilization while combining multi-domain processing across MapReduce, HBase, and stream analytics jobs in the same clusters. The lessons learned here are applicable to operating large clusters as well as tuning YARN scheduling and configurations.

Lessons Learned from Migration of a Large-analytics Platform from MPP Databases to Hadoop YARN
Srinivas Nimmagadda, Roopesh Varier

One of the promises of Apache Spark is to let users build unified data analytic pipelines that combine diverse processing types. In this talk, we’ll demo this live by building a machine learning pipeline with 3 stages: ingesting JSON data from Hive; training a k-means clustering model; and applying the model to a live stream of tweets. Typically this pipeline might require a separate processing framework for each stage, but we can leverage the versatility of the Spark runtime to combine Shark, MLlib, and Spark Streaming and do all of the data processing in a single, short program. This allows us to reuse code and memory between the components, improving both development time and runtime efficiency. Spark as a platform integrates seamlessly with Hadoop components, running natively in YARN and supporting arbitrary Hadoop InputFormats, so it brings the power to build these types of unified pipelines to any existing Hadoop user. This talk will be a fully live demo and code walkthrough where we’ll build up the application throughout the session, explain the libraries used at each step, and finally classify raw tweets in real-time.
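
The full demo combines Shark, MLlib, and Spark Streaming; the standalone fragment below illustrates only the middle stage, training a k-means model through Spark's Java API, with a few synthetic points standing in for featurized Hive records. The class and data are illustrative, not part of the talk.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansPipelineSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("kmeans-sketch").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // In the real pipeline these vectors would be featurized records read from Hive.
    JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
        Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
        Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9)));

    // Train a 2-cluster model; the same model object could then score a live stream.
    KMeansModel model = KMeans.train(points.rdd(), 2, 20);
    System.out.println("cluster for (0.1, 0.1): " + model.predict(Vectors.dense(0.1, 0.1)));

    sc.stop();
  }
}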

Demo: Building a Unified Data Pipeline in Apache Spark
Aaron Davidson

See how a national Retailer gets value across all parts of the business from its initiative to transform legacy applications using big data technologies. By implementing Hadoop and Cassandra into its traditional environment, Business Intelligence and Hadoop Teams are able to provide more accurate and real-time inventory, pricing, sales and return data as well as predicting ideal floor plans. Get an inside look at the infrastructure and use cases that are achieving business wins for this retailer operating 1,250 stores across all 50 states as well as in Puerto Rico and Bermuda. A variety of Business Intelligence applications will be covered – from inventory reporting to website data archiving – as well as behind-the-scenes projects to improve the performance of the Hadoop environment.

Big Data Business Wins: Real-Time Inventory Tracking with Hadoop
Ankur Gupta
12:45 - 1:45pm Lunch
1:45 - 2:25pm

Mihaly Csikszentmihalyi describes Flow Theory in his book “Beyond Boredom and Anxiety” as “…a state of peak enjoyment, energetic focus, and creative concentration experienced by people engaged in adult play, which has become the basis of a highly creative approach to living.” This is the ideal state every company wants their customer to be in when using their product. For gaming companies, their goal is to increase customer acquisition, retention and monetization. This means getting more users to play, play more often and longer, and pay, and the gamers need to be in a state of “flow” for that to happen. However, Flow is a very difficult state to achieve because it requires a fine tuning of perceived reward and challenge. Citing actual customer use cases of several online gaming companies, Stefan Groschupf, the CEO of Datameer, will discuss in depth what exactly Flow Theory is, what kind of data is necessary for proper flow analysis, and how big data analytics can be applied to help products and websites achieve actual “Flow” for optimal user experience.

Applying Big Data to Flow Theory
Stefan Groschupf

R is a widely used statistical programming language with a number of extensions that support many machine learning tasks. However, interactive data analysis in R is usually limited as the run-time is single threaded and can only process data sets that fit in a single machine’s memory. To enable large scale data analysis from R, we will present SparkR, an open source R package developed at UC Berkeley, that allows data scientists to analyze large data sets and interactively run jobs on them from the R shell leveraging the power of their Hadoop clusters. SparkR exports Spark’s RDD API in R and makes it easy for programmers to use existing R packages while parallelizing their applications. This talk will introduce SparkR and will discuss the benefits of combining R’s interactive console and extension packages with Spark’s low latency distributed run-time. To highlight these benefits, we will present use cases that show how users can perform advanced analytics on large data sets in Hadoop clusters.

SparkR: Enabling Interactive Data Science at Scale on Hadoop
Shivaram Venkataraman

It used to be black and white. If you needed MapReduce processing, you chose Hadoop; if you needed standard query and reporting, you chose a SQL data warehouse. The decision is no longer clear-cut. With YARN clearing the way for Hadoop to accept multiple workloads, Hadoop is no longer your father’s MapReduce machine – frameworks are rapidly emerging for interactive SQL, search, streaming and other workloads. We are on the path toward a federated world of analytic and operational decision stores, but as the boundaries between platform types grow fuzzier, deciding which platforms to use and where to run which workloads grows trickier.

Hadoop, SQL and NoSQL – No longer an either-or question
Tony Baer

This talk will examine how Apache Phoenix differentiates itself from other SQL solutions in the Hadoop ecosystem by focusing solely on HBase. It will start by exploring some of the fundamental concepts in Phoenix and how these lead to dramatically better performance. Then we’ll discuss how these fundamental concepts are built upon to support more advanced features such as secondary indexing. Finally, we’ll provide an overview of some of the newer Phoenix features, such as shared physical tables surfaced as SQL global and multi-tenant views, and how leveraging this capability overcomes some inherent limitations in HBase. The talk will conclude with an overview of our roadmap and how we plan to take Phoenix to the next level by building a cost-based query optimizer for your big data.
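
For readers unfamiliar with Phoenix, the small sketch below shows the SQL-over-HBase model it provides: DDL with an explicit row-key constraint, UPSERT writes that become HBase puts, and standard aggregate queries, all through a JDBC connection pointed at the cluster's ZooKeeper quorum. The quorum address and the metrics table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSketch {
  public static void main(String[] args) throws Exception {
    // Phoenix connections point at the HBase cluster's ZooKeeper quorum.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
         Statement stmt = conn.createStatement()) {

      // The PRIMARY KEY constraint defines the HBase row key.
      stmt.execute("CREATE TABLE IF NOT EXISTS metrics " +
                   "(host VARCHAR NOT NULL, ts DATE NOT NULL, value DOUBLE, " +
                   "CONSTRAINT pk PRIMARY KEY (host, ts))");

      // Phoenix uses UPSERT rather than INSERT; writes become HBase puts.
      stmt.execute("UPSERT INTO metrics VALUES ('web01', TO_DATE('2014-06-04 10:00:00'), 0.75)");
      conn.commit();

      ResultSet rs = stmt.executeQuery("SELECT host, AVG(value) FROM metrics GROUP BY host");
      while (rs.next()) {
        System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
      }
    }
  }
}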

Apache Phoenix: Transforming HBase into a SQL database
James Taylor

We present our proposal for a Big Data stdlib: the Retainable Evaluator Execution Framework (REEF), which provides a reusable control-plane for scheduling and coordinating task-level work on cluster resource managers. The REEF design enables sophisticated optimizations, such as container re-use and data caching, and facilitates work-flows that span multiple frameworks. Examples include pipelining data between different operators in a relational system, retaining state across iterations in iterative data flow, and more. REEF has been released to open-source under Apache License.

REEF: Towards a Big Data Stdlib
Tyson Condie, Markus Weimer

With big data processing geared towards low latency, Pig on Tez aims to make ETL faster by using Tez as the execution engine instead of MapReduce in a Hadoop cluster. Tez, an Apache Incubator project, is a distributed execution framework that executes computations as a dataflow graph, which is a more natural fit for the query plan produced by Pig. Pig-on-Tez is an Apache community-driven effort led by Hortonworks, Yahoo, Netflix and LinkedIn. With optimized and shorter query plan graphs, Pig-on-Tez delivers huge performance improvements by executing the entire script within one YARN application as a single DAG and avoiding intermediate storage in HDFS. It also employs many other optimizations made feasible by the Tez framework, such as session and container reuse, custom inputs/outputs/processors, multiple inputs and outputs of a vertex, broadcasting of intermediate outputs, unordered outputs to save on sorting, caching smaller tables in JVM memory during container reuse, custom memory and resource configuration for vertices and edges, and much more. Initial tests have shown a 2x-3x speedup over the MR framework for simple join and group-by queries. More complex queries are expected to bring even higher speedups.

Pig on Tez - Low Latency ETL with Big Data
Daniel Dai, Rohini Palaniswamy

Performance is an important but hard-to-measure concept in HBase. Even with well-known tools such as YCSB, it is difficult to obtain reliable results that translate to production deployments. Previously published HBase performance numbers often inadvertently contain apples-to-oranges comparisons and are not easy to reproduce. It remains difficult to control the impact of hardware, file systems, the JVM, auxiliary tasks, and benchmarking tools. Finally, the performance of HBase in the context of concurrently running frameworks such as MapReduce or SolrCloud is little understood and poorly defined. This talk discusses recent advances in measuring HBase performance. We describe test procedures, such as controlled cache-warming and the use of pre-generated data, which are essential to obtaining reliable and reproducible performance numbers. We further discuss how to define and measure HBase performance in the context of concurrent MapReduce and SolrCloud workloads. Such scenarios present system configuration and test method challenges that extend beyond standalone HBase performance considerations.

Rigorous and Multi-tenant HBase Performance Measurement
Yanpei Chen, Govind Kamat
2:35 - 3:15pm

Apache Hive is the de-facto standard for SQL-in-Hadoop today, with more enterprises relying on this open source project than any alternative. Apache Tez is a general-purpose data processing framework on top of YARN. Tez provides high performance out of the box across the spectrum of low latency queries and heavy-weight batch processing. In this talk you will learn how interactive query performance is achieved by bringing the two together. We will explore how techniques like container-reuse, re-localization of resources, sessions, pipelined splits, ORC stripe indexes, PPD, vectorization and more work and contribute to dramatically faster start-up and query execution.

Hive + Tez: A Performance Deep Dive
Jitendra Pandey, Gopal Vijayaraghavan

Conventional columnar RDBMS systems lend themselves well to interactive SQL queries over reasonably small datasets on the order of 10-100s of GB, while Hadoop-based warehouses operate well over large datasets on the order of TBs and PBs and scale fairly linearly. Though there have been recent improvements in storage structures in Hadoop warehouses, such as ORC, queries over Hadoop still typically adopt a full-scan approach. Choosing between these different data stores based on cost of storage, concurrency, scalability and performance is fairly complex and not easy for most users. In this talk we introduce Grill, a system we have built at InMobi to solve precisely this problem. We have built a simple metadata layer which provides an abstract view over tiered data stores. The system is automatically able to pick the right data store based on the cost of the query and its latency goals.

HQL over Tiered Data Warehouse
Amareshwari S, Suma Shivaprasad

Anomaly detection is the art of automating surprise. To do this, we have to be able to define what we mean by normal and recognize what it means to be different from that. The basic ideas of anomaly detection are simple. You build a model and you look for data points that don’t match that model. The mathematical underpinnings of this can be quite daunting, but modern approaches provide ways to solve the problem in many common situations. I will describe these modern approaches with particular emphasis on several real use-cases including: a) rate shifts to determine when events such as web traffic, purchases or process progress beacons shift rate b) time series generated by machines or biomedical measurements c) topic spotting to determine when new topics appear in a content stream such as Twitter d) network flow anomalies to determine when systems with defined inputs and outputs act strangely. In building a practical anomaly detection system you have to deal with practical details starting with algorithm selection, data flow architecture, anomaly alerting, user interfaces and visualizations. I will show how to deal with each of these aspects of the problem with an emphasis on realistic system design.
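
To ground the "build a model, score the surprise" idea, here is a toy, self-contained rate-shift detector that tracks an exponentially weighted mean and variance and flags observations that land several standard deviations from what the model expects. The production approaches described in the talk are considerably more robust, but the shape of the solution is the same.

public class RateShiftDetectorSketch {
  private final double alpha;      // how quickly the model forgets the past
  private final double threshold;  // how many standard deviations count as surprising
  private final int warmup = 5;    // observations to absorb before flagging anything
  private double mean;
  private double var;
  private int count;

  public RateShiftDetectorSketch(double alpha, double threshold) {
    this.alpha = alpha;
    this.threshold = threshold;
  }

  // Scores the observation against the current model, then folds it into the model.
  public boolean observe(double x) {
    boolean anomalous = false;
    if (count > warmup) {
      anomalous = Math.abs(x - mean) > threshold * Math.sqrt(var);
    }
    if (count == 0) {
      mean = x;
    } else {
      double diff = x - mean;
      mean += alpha * diff;
      var = (1 - alpha) * (var + alpha * diff * diff);
    }
    count++;
    return anomalous;
  }

  public static void main(String[] args) {
    RateShiftDetectorSketch detector = new RateShiftDetectorSketch(0.1, 4.0);
    double[] eventsPerMinute = {100, 103, 98, 101, 99, 102, 100, 240, 101};
    for (double rate : eventsPerMinute) {
      System.out.println(rate + (detector.observe(rate) ? "   <-- possible rate shift" : ""));
    }
  }
}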

How to Find What You Didn't Know to Look For, Practical Anomaly Detection
Ted Dunning

Data analysis is a complex process with frequent shifts among data formats and models. Distributed data processing systems such as Hadoop present both new opportunities and challenges for agile data analysis. How can we better support the life cycle of analysis by identifying critical bottlenecks and developing new methods at the intersection of data visualization, machine learning, and computer systems? Can we empower users to transform and clean data without programming? How might we enable domain experts to guide machine learning methods to produce effective models? This talk will present example projects from selected open source and commercial projects like Vega and Trifacta that attempt to address these challenges for interactive analysis.

Transforming Data for Agile & Interactive Analysis
Jeffrey Heer

Governments and the private sector, especially when they interact with each other, exchange a high volume of data in different formats based upon internal requirements. Entities exchanging data need to provide proper insight into that data to address legal and regulatory needs, especially when legal counsel is involved. While the management and synchronization of third-party databases are essential to amalgamate data into a single database, entities face business rules issues. These business rules depend on a variety of factors, such as advice from legal counsel, regulatory compliance, and the international exchange of data. To address these factors, entities have to adopt solutions, such as Hadoop, to meet the legal and regulatory requirements set forth by either their legal counsel or a government entity. While adoption of Hadoop is seamless with third-party connectors, a platform that provides this cloud hosting with business rules designed to meet legal needs is rare. Organizations are starting to adopt Big Data, but the essential question remains – how can Big Data address legal, regulatory, and compliance needs, especially for reporting?

Government and Private Sector Regulatory Compliance: Using Hadoop to Address Legal and Regulatory Requirements
Timothy Los, Stephanie Caprini

The primary objective of Ambari is to simplify administration of Hadoop clusters. This entails addressing concerns like fault tolerance, security, availability, replication and zero-touch automation, and achieving all of this at enterprise scale. This talk focuses on key features that have been introduced in Ambari for cluster administration, and goes on to discuss existing concerns as well as the future roadmap for the project. We will demonstrate new features essential to managing enterprise-scale clusters, like rolling upgrade and rolling restart, which give admins the ability to manage clusters with zero downtime. Other important features are heterogeneous cluster configuration using config groups; maintenance mode for services, components and hosts; bulk operations like decommission and recommission; and the ability to visualize Hive queries that use Tez in the Ambari Web UI. Additionally we will show how Ambari facilitates adding new services and extending the existing stack, with Storm and Falcon as examples. Lastly, we share how Ambari has been integrated with Microsoft System Center Operations Manager, Teradata Viewpoint, and Red Hat GlusterFS.

Managing 2000 Node Cluster with Ambari
Siddharth Wagle, Srimanth Gunturi

This leading vendor of electronic consumer peripherals, which has many millions of customers and many thousands of products, was faced with increasing costs for customer support. Conflicting reporting systems and support for custom geo-focused solutions inhibited business agility and inflated costs. Data existed across multiple systems, including customer support web sites, call centers and web forums. It was obvious that replacing multiple reporting systems with a cohesive, cost-effective solution was the only answer. Enter Big Data. The once cumbersome solution was quickly transformed into an efficient reporting system that reduced cycle times and costs. Using a solution based on Hadoop, they were able to get actionable insights and indisputable answers to questions they could not answer before: * Which troubleshooting guides are not effective? * What parts of the end-to-end incident resolution process are most time-consuming? * Which geography uses which mode of support? These insights are just the beginning. In this case study, we will cover the methodology, the various actionable metrics developed, the data preparation, and the big data analytics techniques used to achieve and maintain these improvements. In addition, we will provide insights on the unique challenges that B2C enterprise companies face when implementing Big Data solutions in customer care.

Customer Support in the Big Data Era
Tanya Shastri
3:25pm - 4:05pm

Understanding customer behavior across multiple channels such as online, retail, email and social is becoming increasingly vital to succeeding in today's competitive environment. This paper presents our experience in helping the digital transformation of the world's oldest travel company by designing, developing and implementing a 360-degree view of the customer base across multiple sales channels. This includes the structural and cultural challenges in defining the single customer view, implementation challenges using Hadoop as our platform to collect and process data, and how we created a modeling layer on the Hadoop platform to deliver insight back to the business and embed a data-driven culture. The paper also presents two case studies. The first is how we transformed the email CRM program by capturing and processing web behavioral data using Hadoop. The second is how we designed and implemented a novel social commerce channel on Twitter, and how we processed the resultant social data in Hadoop to deliver insight back into the business.

Using Hadoop to Drive Data and Analytics Transformation in the Travel Industry
Saran Subramanian

When Cardinal Health started the journey to provide a secure analytics platform for healthcare data, experienced Big Data professionals told us that it couldn’t be done, that Big Data platforms are incompatible with segmented security models. Fortunately, we didn’t listen, and have created a platform with a tight security model that exceeds the expectations of business partners and provides a segregated and secure environment for analytics of logistics, patient records, and global ordering and manufacturing. Two principals from the project will provide a detailed and interactive case study of how platform security, identity, tokenization, segmentation, network security, governance and OpSec were combined to create a holistic and robust security platform.

Securing Big Data: Lock it Down, or Liberate?
Jeff Graham, Mark Tomallo

HBase is an online database; response latency is critical. This talk will examine sources of latency in HBase, detailing steps along the read and write paths. We’ll examine the entire request lifecycle, from client to server and back again. We’ll look at the different factors that impact latency, including GC, cache misses, and system failures. We’ll also highlight some of the work done in 0.96+ to improve the reliability of HBase. The talk is for people who want to know what to expect out of HBase when it comes to applications with low latency requirements.

HBase: Where Online Meets Low Latency
Nicolas Liochon, Nick Dimiduk

The Hadoop ecosystem is rapidly expanding with many emerging SQL-on-Hadoop solutions.  The use of SQL over Hadoop enables business users to reuse their existing SQL knowledge to analyze the data stored in Hadoop and port their SQL-based applications and reporting tools, which helps accelerate the pace and improve the efficiency of adopting Hadoop. The relational database world has decades of research and experience optimizing SQL access and providing the needed capabilities for mission critical environments. In this session the IBM development team will discuss how they leveraged that experience and assets from the relational database world and applied them to Big SQL 3.0, focusing on the challenges that the Hadoop world placed on traditional data warehouse design.

Challenges of Implementing an Advanced SQL Engine on Hadoop
Adriana Zubiri, Scott C Gray

Anyone that has used Hadoop knows that jobs sometimes get stuck. Hadoop is powerful, and it’s experiencing a tremendous rate of innovation, but it also has many rough edges. As Hadoop practitioners, we all spend a lot of effort dealing with these rough edges in order to keep Hadoop and Hadoop jobs running well for our customers and/or organizations. For this session, we will look at a typical problem encountered by a Hadoop user, and discuss its implications for the future of Hadoop development. We will also go through the solution to this kind of problem using step-by-step instructions and the specific code we used to identify the issue. As a community, we need to work together to improve this kind of experience for our industry. Now that Hadoop 2 has been shipped, we believe the Hadoop community will be able to focus its energies on rounding off rough edges like these, and this session should provide advanced users with some tools and strategies to identify issues with jobs and how to keep these running smoothly.

De-Bugging Hive with Hadoop-in-the-Cloud
David Chaiken

As the volume of data and number of applications moving to Apache Hadoop has increased, so has the need to secure that data and those applications. In this presentation, we’ll take a brief look at where Hadoop security is today and then peer into the future to see where Hadoop security is headed. Along the way, we’ll visit new projects such as Apache Sentry (incubating) and Apache Knox (incubating) as well as initiatives such as Project Rhino. We’ll see how all of this activity is making good on the promise of Hadoop as the future of data management.

The Future of Hadoop Security
Joey Echeverria

This presentation will compare the pros and cons of Hadoop implementations in the cloud (such as Hortonworks on AWS), Hadoop as a service from companies like Amazon EMR and Altiscale, and on-premise installations. It will talk about the total cost of ownership for each category of Hadoop implementation and share a TCO calculator. There are multiple categories of cost: 1. hardware/infrastructure, 2. network/communication, 3. license/software, 4. application development/training, 5. ongoing support cost. The focus will be to bring all hidden and non-hidden costs to visibility. Using the calculator, participants will be able to find the cost of ownership for their own Hadoop cluster and plan better for project implementation and support. It will also talk about managing risks around vendor viability, loss of intellectual property and control over technical architecture.
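
The sketch below shows the kind of arithmetic such a calculator automates, summing the five cost categories listed above over a cluster's lifetime. All of the figures are made-up placeholders, not numbers from the session.

public class HadoopTcoSketch {
  public static void main(String[] args) {
    int nodes = 20;
    int years = 3;

    double hardwarePerNode    = 6_000;   // 1. hardware/infrastructure (one-time)
    double networkPerNodeYear = 300;     // 2. network/communication
    double licensePerNodeYear = 1_000;   // 3. license/software subscription
    double appDevAndTraining  = 120_000; // 4. application development/training (one-time)
    double opsSupportPerYear  = 80_000;  // 5. ongoing support cost

    double total = nodes * hardwarePerNode
                 + years * nodes * (networkPerNodeYear + licensePerNodeYear)
                 + appDevAndTraining
                 + years * opsSupportPerYear;

    System.out.printf("%d-node cluster, %d-year TCO: $%,.0f ($%,.0f per node per year)%n",
        nodes, years, total, total / (nodes * years));
  }
}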

Cost of Ownership for Hadoop Implementation
Santosh Jha, Steve Ackley
4:05 - 4:35pm Break
4:35 - 5:15pm

Predicting the most relevant ad at any point in time for every individual user is the essence of how Rocket Fuel optimizes ROI for an advertiser. One of the factors influencing this prediction is the user’s online interactions and behavioral profile. With more than 50 billion user interactions being processed daily, this data runs into several petabytes in our Hadoop warehouse. Running machine learning algorithms and AI at this vast scale requires many practical issues to be addressed. First, behavioral patterns are short-lived, and to accurately reflect the tendencies of the user we need to curate and refresh the user profiles as fast as possible: for example, avoiding multiple scans over the raw data, dealing with transient system outages and so on. Second, there is the difficulty of building models utilizing behavioral profiles without overwhelming our Hadoop cluster: at this scale, frequent refreshes of several models can place an undue burden on even a thousand-node cluster. In this talk, we will dive into (a) the practical challenges involved in designing a highly scalable and efficient solution to build behavioral profiles using the Hadoop framework and (b) techniques for ensuring reliability and availability of mission-critical machine learning pipelines.

How Did You Know this Ad Will be Relevant for Me?
Savin Goyal, Sivasankaran Chandrasekar

Discover the benefits of deploying Hadoop in the cloud (no hardware to acquire, no hardware maintenance, unlimited elastic scale, and instant time to value). If you have already deployed Hadoop on-premise, this session will also provide an overview of the key scenarios and benefits of joining your on-premise Hadoop implementation with the cloud (by doing backup/archive, dev/test or bursting). Learn how you can get the benefits of an on-premise Hadoop deployment that seamlessly scales with the power of the cloud.

Extending your Hadoop implementations to the cloud
Matt Winker

Last year at Hadoop Summit Europe, Allen Wittenauer gave one of the most attended talks in Hadoop Summit history. It remains one of the most viewed sessions on Youtube. This year, he’ll be on stage for the first time in Hadoop Summit North America history answering your questions about all things operations.

Let's Talk Operations!
Allen Wittenauer

Hadoop comprises the core of LinkedIn’s data analytics infrastructure and runs a vast array of our data products, including People You May Know, Endorsements, and Recommendations. To schedule and run the Hadoop workflows that drive our data products, we rely on Azkaban, an open-source workflow manager developed and used at LinkedIn since 2009. Azkaban is designed to be scalable, reliable, and extensible, and features a beautiful and intuitive UI. Over the years, we have seen tremendous growth, both in the scale of our data and our Hadoop user base, which includes over a thousand developers, data scientists, and analysts. We evolved Azkaban to not only meet the demands of this scale, but also support query platforms including Pig and Hive and continue to be an easy to use, self-service platform. In this talk, we discuss how Azkaban’s monitoring and visualization features allow our users to quickly and easily develop, profile, and tune their Hadoop workflows.

Building a Self-Service Hadoop Platform at LinkedIn with Azkaban
David Chen

This session will discuss the Hadoop lifecycle when considered in a virtualized context, with a specific use case. In order to enable their Hadoop developers, internal IT can facilitate the self-service deployment, customization, and resizing capabilities built in to VMware vCloud Automation Center (vCAC). With vSphere Big Data Extensions and vCAC, this solution simplifies the process of deploying a virtual Hadoop cluster on virtualized infrastructure. For Adobe, this solution reduces the wait time for developers to experiment with Big Data applications and to quickly iterate on a clean and multi-tenant hardware environment. This solution also allows for logical segmentation of the compute and data layers for independent elastic scaling.

Hadoop-as-a-Service for Lifecycle Management Simplicity
Andrew Nelson, Chris Mutchler

Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive 0.13 introduces cost-based optimization for the first time, based on the Optiq framework. Optiq’s lead developer Julian Hyde shows the improvements that CBO is bringing to Hive 0.13. For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive.

Cost-based query optimization in Hive
Julian Hyde

Luminar is the first big data analytics provider focused specifically on U.S. Latino consumers. The company offers analysis based on empirical insights, rather than a sample-based approach. Apache Hadoop makes this empirical approach work at scale. In 2012, Luminar began collaborating with Hortonworks to deploy a fully-integrated big data architecture. Luminar’s predictive modeling runs on Hortonworks Data Platform v2.0 and empowers companies to make real-time decisions that connect their products with Latino consumers for measurable results. Luminar’s VP of Strategy, Oscar Padilla, and Justin Sears, Hortonworks Product Marketing Manager, will review Luminar’s analysis for the California Milk Processing Board.

GOT DATA? How Hadoop Market Analysis Helped the California Milk Processing Board Better Serve the California Latino Market
Oscar Padilla, Justin Sears
5:25 - 6:05pm

With the advent of YARN as part of Apache Hadoop 2, Hadoop clusters evolved from running only MapReduce jobs to a whole new world of running a variety of applications, ranging from Apache Tez for interactive/batch applications to Apache Storm for stream processing. To make the best use of a YARN cluster, there are questions that need to be addressed at various levels. For an administrator managing a YARN cluster, how does one go from configuring Map/Reduce slots to configuring resources and containers? Operations teams now have to deal with a new range of metrics when managing YARN clusters. A YARN application developer has to understand how to write an efficient application that makes the best use of YARN while degrading gracefully on a busy cluster. In this talk, we’ll cover YARN best practices from various perspectives – administrators and developers. We’ll describe how administrators can configure a YARN cluster to optimally use resources depending on the kind of hardware and the types of applications being run. We’ll focus on managing a cluster shared across numerous users, how to manage queues, and how to do capacity allocation across different business units. For developers, we’ll cover how to interact with the various components of YARN and focus on the implicit features that all applications need to be built with, such as security and failure handling.

Apache Hadoop YARN: Best Practices
Zhijie Shen, Varun Vasudev

Apache Hadoop excels at large computations. R excels at static visualization. When used together, R and Hadoop can be used to produce compelling videos that can explain difficult relationships even more effectively than static pictures. This isn’t easy, however, and even if a picture is worth a thousand words and a video a few more, you need to have tools that make it easier to produce a video than it is to write an essay. I will demonstrate how standard tools like R can produce videos (slowly) and how Hadoop can be used to do this quickly. Sample code will be provided that illustrates just how this can be done. I will also demonstrate how you can adapt these examples to your own needs.

Hadoop and R Go to the Movies, Visualization in Motion
Ted Dunning

Apache Hive provides a convenient SQL query engine and table abstraction for data stored in Hadoop. Hive uses Hadoop to provide highly scalable bandwidth to the data, but does not support updates, deletes, or transaction isolation. This has prevented many desirable use cases, such as updating dimension tables or doing data clean-up. We are implementing the standard SQL commands insert, update, and delete, allowing users to insert new records as they become available, update changing dimension tables, repair incorrect data, and remove individual records. Additionally, we will add ACID-compliant snapshot isolation between queries so that queries see a consistent view of the committed transactions when they are launched, regardless of whether they run as one MapReduce job or several. This talk will cover the intended use cases, the architectural challenges of implementing updates and deletes in a write-once file system, the performance of the solution, as well as details of changes to the file storage formats and the transaction management system.
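
The SQL surface being described looks roughly like the sketch below. The table and column names are hypothetical, and the exact table properties and server-side transaction settings depend on the Hive release that ships these features.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveAcidSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default");
         Statement stmt = conn.createStatement()) {

      // ACID tables are stored as ORC, bucketed, and flagged transactional.
      // HiveServer2 must also be configured with a transaction manager for this to run.
      stmt.execute("CREATE TABLE IF NOT EXISTS customer_dim (id BIGINT, name STRING, city STRING) " +
                   "CLUSTERED BY (id) INTO 8 BUCKETS STORED AS ORC " +
                   "TBLPROPERTIES ('transactional'='true')");

      // Dimension updates and data clean-up become single statements.
      stmt.execute("INSERT INTO TABLE customer_dim VALUES (1, 'Ada', 'London')");
      stmt.execute("UPDATE customer_dim SET city = 'San Jose' WHERE id = 1");
      stmt.execute("DELETE FROM customer_dim WHERE id = 1");
    }
  }
}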

Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Owen O'Malley, Alan Gates

Low-latency SQL is the Holy Grail of Hadoop platforms, enabling new use cases and better insights. A number of open-source projects have sprung up to provide fast SQL querying; which one is best for your cluster? This session will present results of our in-depth research and benchmarks of Facebook Presto, Cloudera Impala and Databricks Shark. We look at performance across multiple storage formats, query profiles and cluster configurations to find the best engine for a variety of use cases. This session will help you to pick the right query engine for new cluster or get most out of your existing Hadoop deployment.

Low-Latency SQL on Hadoop - What's Best for Your Cluster?
Danil Zburivsky

In past years, Cisco has validated Big Data clusters in both open source and vendor-specific environments. We also explored questions often encountered during the deployment of Hadoop clusters in the enterprise – including the impact of multiple simultaneous jobs, mixing a variety of workloads, visibility and monitoring. This session will take a giant step forward, showcasing next-generation technologies that bring unprecedented simplicity, programmability and massive scalability to Big Data clusters. The topics discussed will help you take a holistic approach to Big Data cluster design, enabling the full potential of the cluster for Big Data workloads.

BigData Clusters Redefined
Samuel Kommu

Big Data is changing analytics, while Hadoop is becoming an attractive platform for the distributed storage and processing of structured and unstructured big data sets. But enterprises are realizing the mere storage of data does not offer any inherent value — it is the analysis of the data which transforms it into actionable insights. Predictive analytics is one of the most exciting and beneficial types of analysis for Big Data, going beyond identifying what has happened or why, to help determine what can happen. This session will review In-Hadoop, In-Database, and In-Memory processing for predictive analytics, including the advantages and disadvantages of each, and benchmark runtimes on a predictive analytics platform with a Hadoop connector.

In-Hadoop, In-Database, and In-Memory Processing for Predictive Analytics
Ingo Mierswa
6:05 - 10:00pm Hadoop Summit Party


Day 3 » Thursday, June 5
Tracks:
Deployment and Operations
Hadoop for Business Apps
Future of Hadoop
Data Science
Committer
Hadoop Driven Business
8:00 - 8:45am Breakfast
8:45 - 10:30am Keynotes and Plenary
10:30 - 11:00am Break
11:00 - 11:40am

Ancestry’s direct-to-consumer DNA test, AncestryDNA, has been using Hadoop and other open source projects (HBase, Azkaban, etc.) to handle, at scale, the ethnicity predictions and processing behind its genetic cousin matching algorithm. As our DNA pool continues to grow to hundreds of thousands of users, and we continue to improve our algorithms, we are faced with new scientific and technical problems that need to be overcome. This talk will cover: – The principles followed with this project. The team measures every step in the pipeline for every run, and this has proved invaluable. – The science behind the processing steps. – How we scaled the matching step using HBase and Hadoop. We’ll walk you through a matching example to show exactly how science and technology merge to solve a business problem. – We’ll dig in and show how the science and development teams collaborated on this project, through various updates, to deliver a significantly improved user experience. – What’s next? Why is this project disruptive? Companies have been trying to figure out how to create a business with DNA testing for quite some time with limited success. Ancestry.com is delivering a unique product that provides meaningful family history discoveries for its users.

Scaling AncestryDNA with the Hadoop Ecosystem
Bill Yetman

Apache HBase powers multiple mission-critical applications and is one of the most significant HDFS use cases at Facebook. However, providing a highly reliable online storage system on top of a single HDFS cluster has been challenging. Large-scale disruptive events, such as power failures or network partitions, can make HDFS unavailable or leave it in a degraded mode with large numbers of under-replicated blocks. Our current solution is to provide eventual consistency by asynchronously replicating HBase transactions between HDFS clusters in a master-slave model. This model, however, is prone to data inconsistency issues and places a burden on application developers to deal with those inconsistencies. HydraBase, an iteration of HBase being developed at Facebook, is built to provide a highly reliable and strongly consistent storage service across multiple geographically dispersed HDFS clusters using a quorum-based synchronous replication protocol. HydraBase is also designed to decouple HBase region-level logical replication from HDFS block-level physical replication. As a result, HydraBase can provide higher region-level availability by increasing the logical replication factor without introducing additional storage cost. This presentation will cover the design of HydraBase, including the replication protocol, and present an analysis of failure scenarios and benchmarking results.

HydraBase: a Highly Reliable and Strongly Consistent Storage Service Based on Replicated HBase Instances
Rishit Shroff

As organizations begin to make use of large data sets, approaches to understanding and managing the true costs of big data become increasingly important with scale of operations. Whether an on-premise or cloud-based platform is used for storing, processing and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run big data operations with their own P&L, full transparency in costs, and metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm, due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively. As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile the required data and its typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
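
As a rough illustration of the unit-costing idea described above (not the speakers' actual model), the sketch below amortizes hypothetical hardware and operations costs into a per-TB-month and per-compute-hour rate; every figure and the 40/60 cost split are made-up assumptions.

    # Illustrative unit-cost model for a Hadoop cluster (all numbers are hypothetical).
    ANNUAL_CAPEX = 1200000.0            # amortized hardware cost per year (USD)
    ANNUAL_OPEX = 800000.0              # power, datacenter space, support staff (USD)
    USABLE_STORAGE_TB = 4000.0          # usable HDFS capacity after replication
    COMPUTE_HOURS_PER_YEAR = 8000000.0  # container hours actually consumed per year

    total_annual_cost = ANNUAL_CAPEX + ANNUAL_OPEX

    # Split the total cost between storage and compute (assumed 40/60 split).
    storage_share, compute_share = 0.4, 0.6
    cost_per_tb_month = (total_annual_cost * storage_share) / (USABLE_STORAGE_TB * 12)
    cost_per_compute_hour = (total_annual_cost * compute_share) / COMPUTE_HOURS_PER_YEAR

    print("Cost per TB-month    : $%.2f" % cost_per_tb_month)      # $16.67
    print("Cost per compute-hour: $%.4f" % cost_per_compute_hour)  # $0.1500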

Costing Your Big Data Operations
Sumeet Singh, Amrit Lal

One of the most common use cases on Hadoop is dealing with changed data and delta detection between files in HDFS. As organizations shift more workloads from traditional databases to Hadoop, one of the first problems they encounter is determining what data has changed since the last load to HDFS and Hive. This is important because data scientists often use Hive and other SQL-like interfaces to query Hadoop data for analysis. In this session you will learn, step by step, how to easily process changed data and detect specific attribute changes between files to implement an effective update strategy on Hadoop.
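
To make the delta-detection idea concrete, here is a minimal hypothetical Python sketch (not the presenter's implementation; file names and the key column are placeholders) that compares yesterday's and today's extracts by primary key and classifies rows as inserted, updated or deleted. The same logic is commonly expressed in Hive as a full outer join on the key.

    import csv

    def load_snapshot(path, key_field):
        """Load a CSV extract into a dict keyed by the primary key column."""
        with open(path) as f:
            return {row[key_field]: row for row in csv.DictReader(f)}

    def detect_deltas(old_path, new_path, key_field="id"):
        old = load_snapshot(old_path, key_field)
        new = load_snapshot(new_path, key_field)
        inserts = [new[k] for k in new if k not in old]
        deletes = [old[k] for k in old if k not in new]
        updates = [new[k] for k in new if k in old and new[k] != old[k]]
        return inserts, updates, deletes

    if __name__ == "__main__":
        ins, upd, dels = detect_deltas("customers_yesterday.csv", "customers_today.csv")
        print("inserted=%d updated=%d deleted=%d" % (len(ins), len(upd), len(dels)))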

Dealing with Changed Data in Hadoop – an old warehouse problem in a new world
Kunal Jain

YARN is increasingly popular for running “jobs” other than MapReduce jobs, and we are seeing a number of long-running services being run under YARN. YARN provides a flexible resource sharing model that makes it attractive for many services to co-exist on a single cluster. We will present the state of the union for YARN with an emphasis on long-running services: what has been done to support them better and what remains to be done. We will then dive into what we are doing to easily, and in an operationally viable way, run existing services (like HBase) on top of YARN. Managing YARN applications, due to their dynamic nature, brings its own challenges. Our goal is to provide parity between managing a standalone service and the same service hosted as a YARN application, while making the administrator's experience richer by leveraging YARN capabilities for dynamic resource monitoring and scheduling. We will present a blueprint for cluster administrators to run their services on a YARN cluster and to manage and monitor those services, especially as they go through failure handling, capacity expansion and reduction.

Bring your Service to YARN
Sumit Mohanty

The intention of this session is to take a light-hearted look at a very technical topic. Following in the footsteps of the highly popular Java Puzzlers talks from Josh Bloch, Neal Gafter, and Bill Pugh, the structure of the talk will be to dissect a series of code samples that look innocuous, but whose behavior is anything but obvious. After presenting the code and explaining its apparent function, we will present a multiple-choice list of possible results and ask the audience to vote for the right answer by show of hands. After the audience has ruled, we'll reveal the actual behavior, talk through why it happens, and draw lessons from the example that can be put to practical use. The target audience is Hadoop developers who have at least a basic understanding of HDFS, MapReduce, and how to develop Hadoop jobs. Attendees will learn a series of best practices that will hopefully save them hours of debugging time and frustration further down the road.

Hadoop Puzzlers
Aaron Myers, Daniel Templeton

80% of the effort behind a big data project is devoted to getting data in and out of the Hadoop platform: This is an old problem with a new name! Join this session for a technical overview of the 5 essential integration patterns that any Big Data integration solution must support, Big Data Governance requirements, and the immediate benefits of deploying a comprehensive Big Data Integration solution.

Faster, Cheaper, Easier…and Successful! Best Practices for Big Data Integration
Tony Curcio
11:50 - 12:30pm

Hadoop YARN enables a variety of application paradigms to be executed on a shared cluster of computers. At Yahoo, we have been working with open source communities to bring MapReduce and additional applications onto YARN. In this talk, we will cover an effort to empower Spark applications via Spark-on-YARN. Spark-on-YARN enables Spark clusters and applications to be deployed onto your existing Hadoop hardware without creating a separate cluster, so Spark applications can directly access Hadoop datasets on HDFS. In Spark-on-YARN, Spark applications are launched in either standalone mode (which executes the Spark master in a YARN container) or client mode (which executes the Spark master within the user's launcher environment). Spark-on-YARN has been enhanced to support authentication, secure HDFS access, the Hadoop distributed cache and linking the Spark UI to the YARN UI. In this talk, we will provide a technical overview of the Spark-on-YARN design and key features, and explain standalone mode vs. client mode via use cases. Spark-on-YARN is already used in production by companies such as Taobao, and various Yahoo teams have found it invaluable. We will share some Yahoo stories of Spark-on-YARN adoption and discuss potential enhancements to Hadoop to optimize the execution of Spark applications.
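
As a minimal illustration of running a Spark application against YARN in client mode (not Yahoo's code; the app name, HDFS path and memory setting are placeholders, and the "yarn-client" master value follows the Spark 1.x conventions of the era), a PySpark job might look roughly like this:

    # Minimal PySpark word count run against a YARN cluster (illustrative only).
    # Assumes HADOOP_CONF_DIR points at the cluster configuration and that
    # hdfs:///data/events.log is a placeholder path.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-client")          # driver runs locally, executors in YARN containers
            .setAppName("spark-on-yarn-demo")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)

    lines = sc.textFile("hdfs:///data/events.log")
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.take(10))
    sc.stop()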

Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Thomas Graves, Andy Feng

The early adopters of Hadoop demonstrated that the platform provided an economically viable alternative to storing and processing data. The first enterprise IT processes migrated to the Hadoop platform were for extracting data from existing systems, processing that data, and organizing it to be consumed by business stakeholders (Extract, Transform, Load). But as Hadoop continues to see adoption and is quickly becoming a standard enterprise technology across industries, early adopters have already moved beyond ETL to deliver on use cases with more direct business value: collecting omni-channel data, targeting offers and recommendations, and optimizing customer experience — all in real time. This talk will explore some of the technical challenges in supporting these use cases, and the successful (and not so successful) architecture patterns we’ve seen in industry early adopters.

Move Beyond ETL: Tapping the True Business Value of Hadoop
Garrett Wu

This talk highlights Hadoop case studies that illustrate real-life improvements achieved by telecom service providers and media distribution companies in functions such as:
Network infrastructure
- Optimization of mobile and fixed network infrastructure investment
- Real-time bandwidth allocation for quality of service
- Security analytics for intrusion detection
Service and billing
- CDR analytics and archiving for compliance, billing, and congestion monitoring
- Contact center log analytics
- Fraud reduction for pre-paid mobile services
Marketing and sales
- Cross-channel 360-degree view of the customer
- Personalized marketing campaigns, notably for upselling and cross-selling
- Next-best-offer predictive recommendations at the point of sale
- Targeted, individualized TV advertising
- Big data as a product for retailers and advertisers
- New product development

Hadoop Boosts Profits in the Media and Telecom Industry
Juergen Urbanski

Hadoop is the Big Data system of choice for storing and processing vast quantities of structured and unstructured data. Hadoop has grown beyond the initial Internet-industry use case of web log processing and has found its way into mission-critical transaction processing applications. However, there are critical Hadoop services that are still implemented using a single active server. Paxos is a mathematically proven, if esoteric, family of protocols used for achieving state machine replication. This presentation discusses the use of Paxos to create multiple active, shared-nothing, replicated servers that are continuously available. A thin layer of Paxos-based distributed coordination is installed in front of the HDFS NameNode, a well-known single point of failure. The result is a continuously available HDFS that runs over the WAN. Similarly, replicating HBase regions by placing a thin distributed coordination layer in front of Region Servers results in an HBase that never goes down as a result of Region Server failure. A strong contrast is drawn between the Paxos-based active-active server technology presented here and the active-standby high-availability solutions included in base Apache Hadoop.
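
To illustrate the quorum idea behind Paxos-style replication (a simplified sketch of majority quorums, not the product's actual protocol; replica names are invented), the snippet below shows why a cluster of 2F+1 replicas keeps serving as long as a majority acknowledges each update.

    # Simplified majority-quorum check behind Paxos-style replication (illustrative only).
    def quorum_size(num_replicas):
        """Smallest number of acknowledgements that forms a majority."""
        return num_replicas // 2 + 1

    def committed(acks, num_replicas):
        """An update is durable once a majority of replicas has acknowledged it."""
        return len(acks) >= quorum_size(num_replicas)

    # With 5 replicas, any 2 can fail and writes still commit.
    replicas = ["dc1-a", "dc1-b", "dc2-a", "dc2-b", "dc3-a"]
    acks = {"dc1-a", "dc2-a", "dc3-a"}          # responses received so far
    print(quorum_size(len(replicas)))           # -> 3
    print(committed(acks, len(replicas)))       # -> True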

NonStop Hadoop - Applying the Paxos family of protocols to make critical Hadoop services Continuously Available and Resilient to DataCenter Failure
Jagane Sundar

In modeling intelligent systems for real-world applications, one inevitably has to deal with uncertainty. Bayesian networks are well established as a modeling tool for expert systems in domains with uncertainty, mainly because of their powerful yet simple representation of probabilistic models as a network or graph. They are widely used in fields such as genetic research, healthcare, robotics, document classification, image processing and gaming. Working with large-scale Bayesian networks is a computationally intensive endeavor. In this talk we will describe our experience working with R and Hadoop to implement large-scale Bayesian networks:
1. A quick overview of Bayesian networks and example applications
2. Building a Bayesian network: by hand and with R
3. Inferring data from a Bayesian network
4. Dealing with large-scale networks with R and Hadoop
You will learn about some of the challenges of dealing with large-scale Bayesian networks and how to address them using R and Hadoop.
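
The talk itself uses R; purely as a language-neutral illustration of what inference in a Bayesian network means, here is a tiny hand-coded two-node network (Rain influences WetGrass) with made-up probabilities, computing the posterior P(Rain | WetGrass) by enumeration.

    # Tiny Bayesian network: Rain -> WetGrass, with made-up probabilities.
    P_rain = {True: 0.2, False: 0.8}
    P_wet_given_rain = {True: 0.9, False: 0.1}   # P(WetGrass=True | Rain)

    def p_wet_and_rain(rain):
        """Joint probability P(Rain=rain, WetGrass=True)."""
        return P_rain[rain] * P_wet_given_rain[rain]

    # Posterior by enumeration: P(Rain=True | WetGrass=True)
    numerator = p_wet_and_rain(True)
    evidence = p_wet_and_rain(True) + p_wet_and_rain(False)
    print("P(Rain | WetGrass) = %.3f" % (numerator / evidence))   # 0.18 / 0.26 = 0.692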

Bayesian Networks with R and Hadoop
Ofer Mendelevitch

The broader Hadoop Ecosystem continues to see the emergence of alternative file-systems and computational runtimes that are inter-operable with Hadoop Core. This talk describes new features being added to the Apache Ambari project that allow for simpler and faster inclusion of service stacks, as well as the ability to deploy stacks on a variety of alternative Hadoop File Systems.

Managing Hadoop Ecosystem Interoperability with Apache Ambari
Scott McClellan

Administering a Hadoop cluster isn't easy. With vast and sometimes confusing config/tuning options, it can be tempting (and scary) for a cluster administrator to make changes to Hadoop when cluster performance isn't as expected. There can also be organizational pressure from job developers to fine-tune esoteric settings. After more than a year in the field with numerous customers, the speaker has had hands-on time with dozens of Hadoop clusters. Many Hadoop clusters suffer from common Linux configuration issues that can negatively impact performance. Attendees, ideally cluster administrators and power users, will learn how to improve Hadoop cluster performance and eliminate common problem areas, applicable across use cases, using a handful of simple Linux configuration changes. Problem areas include proper name resolution and caching, swapping, file access time, reclaiming disk space, overcoming file handle limits, and disk management techniques. Bonus material may be included depending on time and the changing nature of the Hadoop ecosystem.
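
As a small hypothetical companion to the problem areas listed above (not the speaker's tooling; thresholds are arbitrary assumptions), the following Python snippet checks two of the usual suspects on a Linux worker node: vm.swappiness and the open-file-handle limit.

    # Quick health check for two common Linux issues on Hadoop nodes (Linux only, illustrative).
    import resource

    def check_swappiness(threshold=10):
        with open("/proc/sys/vm/swappiness") as f:
            value = int(f.read().strip())
        ok = value <= threshold
        print("vm.swappiness = %d (%s)" % (value, "ok" if ok else "consider lowering it"))

    def check_open_files(minimum=32768):
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        ok = soft >= minimum
        print("open file limit = %d soft / %d hard (%s)" %
              (soft, hard, "ok" if ok else "raise nofile in /etc/security/limits.conf"))

    if __name__ == "__main__":
        check_swappiness()
        check_open_files()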

Improving Hadoop Cluster Performance via Linux Configuration
Alex Moundalexis
12:30 - 1:30pm Lunch
1:30 - 2:10pm

Apache Hadoop YARN is the default platform for running distributed apps: batch and interactive apps as well as long-running services. A YARN cluster may run many apps of different frameworks and from different users, groups and organizations. It is of significant value to monitor and visualize what has happened to these apps, i.e., application history, to glean important insights: how their performance changes over time, how queues get utilized, changes in workload patterns, and so on. It's also useful to ensure application history remains accessible whether apps finished or failed for reasons such as master restart, crash or memory pressure. In this talk, we'll describe how YARN enables storage of all sorts of historical information, both generic and framework-specific, for any kind of app, and how YARN exposes the historical information and provides users the tools to view it, conduct analysis, and understand various dimensions of YARN clusters over time. We'll cover a number of technical highlights, such as persisting information into a pluggable and reliable store like HDFS, establishing a history server that users can access securely via command-line tools, web and REST interfaces, and enabling apps to define and publish framework-specific information. The talk will also brief developers and administrators on how to make use of this new YARN feature.
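
As an illustrative sketch of pulling application history over REST (the host, port and endpoint path below are placeholders; check the exact paths exposed by your Hadoop version), a few lines of Python are enough to fetch and inspect what the history server returns:

    # Fetch YARN application history via the history server's REST interface (illustrative;
    # host, port and path are placeholders, and the JSON layout varies by version).
    import json
    import urllib.request

    HISTORY_URL = "http://historyserver.example.com:8188/ws/v1/applicationhistory/apps"

    with urllib.request.urlopen(HISTORY_URL) as response:
        apps = json.load(response)

    # Pretty-print whatever the server returned for a first look at the data.
    print(json.dumps(apps, indent=2)[:2000])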

Analyzing Historical Data of Applications on Hadoop YARN: for Fun and Profit
Zhijie Shen, Mayank Bansal

Providing pricing and valuation for public infrastructure products and services is one of Penton Media's high-value businesses, and our current data processing architecture providing this insight was struggling to keep up with new demands:
- Customers were increasingly demanding near real-time price updates
- The data sources used to compute the prices were getting more diverse, and the volume of data was ever-increasing and arriving at a higher velocity than before
- The algorithms used to calculate prices were getting more complex and resource-intensive
- The business unit wanted more flexibility to monetize the pricing data in new ways
We realized that we needed a transformative architecture that could meet these current demands and also be future-proof. The traditional approach of SQL-only mapping and transformation was not going to cut it. Hadoop and HBase, in combination with SQL, provided the perfect platform for harnessing the data. In this talk, we will present our experience in transforming a SQL-based architecture into a hybrid (SQL and Hadoop) architecture. We will discuss the role of specific Hadoop components (Flume, Sqoop, Oozie, and HBase), their interaction with SQL, and the performance and scalability characteristics of the platform. We will close the talk by assessing Hadoop's impact in addressing the business problems we set out to solve.

A Scalable Data Transformation Framework using the Hadoop Ecosystem
Raj Nair, Kiru Pakkirisamy

Many data science experiments over the last few years have led to the conclusion that more data beats better algorithms. Hadoop, the elephant (HDFS to be specific), provides a distributed, scalable and fault-tolerant storage platform to capture ever-increasing data. Building predictive models that leverage this Big Data is a challenging task, because many machine learning algorithms are iterative in nature and have large memory requirements. Several open-source toolkits have taken different approaches to address this challenge. Apache Mahout is a library containing parallel implementations of machine learning algorithms leveraging the MapReduce framework in Java. MLlib, from the Berkeley Data Analytics Stack (BDAS), leverages the distributed in-memory Spark framework to reduce IO dependencies. The MADlib library leverages the concepts of MPP shared-nothing technology (like Impala) to provide parallel and scalable machine learning on Hadoop. The H2O math algorithms run on top of H2O's own highly optimized MapReduce implementation inside H2O nodes. Cloudera ML is a collection of Java libraries and command-line tools to aid data scientists in performing common data preparation and model evaluation tasks. In this session, we would like to share our experience and findings on the strengths and weaknesses of each toolkit, as well as the scenarios in which one outperforms the others.

Big Data Analytics - Open Source Toolkits - How Do They Fare Against Each Other?
Prafulla Wani, Snehalata Deorukhkar

During this session we would like to introduce you to provisioning a Hadoop cluster using Docker containers in different environments and on different cloud providers. Our goal is to automate the whole process and allow users to spin up an arbitrary number of cluster nodes without any pre-configuration. We use Docker, Serf and dnsmasq to solve the usual service discovery issues and end up with an installed (but not yet configured) Hadoop cluster, and we rely on Apache Ambari to configure, manage and monitor the cluster. We leverage new Apache Ambari features such as blueprints, and let users add, remove and change Hadoop services dynamically. The end result is a Hadoop-as-a-Service API, built exclusively on open source components.
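
For a sense of how Ambari blueprints drive automated provisioning (a minimal sketch, not the presenters' code; the host names, credentials, stack version and blueprint contents below are placeholder assumptions), registering a blueprint and requesting a cluster are both plain REST calls:

    # Register an Ambari blueprint and create a cluster from it (illustrative sketch;
    # host, credentials and the blueprint body are placeholders).
    import requests

    AMBARI = "http://ambari.example.com:8080/api/v1"
    AUTH = ("admin", "admin")
    HEADERS = {"X-Requested-By": "ambari"}

    blueprint = {
        "Blueprints": {"blueprint_name": "single-node-hdfs", "stack_name": "HDP", "stack_version": "2.1"},
        "host_groups": [{
            "name": "master",
            "cardinality": "1",
            "components": [{"name": "NAMENODE"}, {"name": "DATANODE"}, {"name": "HDFS_CLIENT"}],
        }],
    }

    cluster_template = {
        "blueprint": "single-node-hdfs",
        "host_groups": [{"name": "master", "hosts": [{"fqdn": "node1.example.com"}]}],
    }

    # 1) Register the blueprint, then 2) instantiate a cluster from it.
    r1 = requests.post(AMBARI + "/blueprints/single-node-hdfs", json=blueprint, auth=AUTH, headers=HEADERS)
    r2 = requests.post(AMBARI + "/clusters/demo", json=cluster_template, auth=AUTH, headers=HEADERS)
    print("blueprint:", r1.status_code, "cluster request:", r2.status_code)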

Docker based Hadoop provisioning
Janos Matyas

This session will introduce the Apache Knox Gateway. The Apache Knox Gateway is a REST API gateway that provides perimeter security for many of Hadoop’s most important REST APIs such as those for WebHDFS, WebHCat, Oozie and HBase. It also supports non-REST HTTP interfaces such as Hive’s JDBC access over HTTP. By securing these APIs at the perimeter, the Gateway provides a central point of authentication, authorization and auditing of these interactions.
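
To show what REST access through the gateway looks like in practice (a hypothetical sketch: the gateway host, topology name, credentials and certificate handling are placeholders for your deployment, not part of the talk), a WebHDFS directory listing routed through Knox is an ordinary authenticated HTTPS request:

    # List an HDFS directory through the Knox gateway (illustrative; host, topology
    # name and credentials are placeholders).
    import requests

    KNOX = "https://knox.example.com:8443/gateway/default"   # "default" is the topology name
    AUTH = ("guest", "guest-password")

    resp = requests.get(
        KNOX + "/webhdfs/v1/tmp",
        params={"op": "LISTSTATUS"},
        auth=AUTH,
        verify=False,   # demo only; supply the gateway's CA certificate in real use
    )
    resp.raise_for_status()
    for entry in resp.json()["FileStatuses"]["FileStatus"]:
        print(entry["type"], entry["pathSuffix"])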

Hadoop REST API Security with the Apache Knox Gateway
Kevin Minder, Larry McCay

Are you leveraging your data to make real-time decisions? You may think that Hadoop cannot run a MapReduce job in milliseconds to seconds. However, by combining Hadoop with distributed, in-memory storage, organizations in diverse industries, including retail, finance, telecommunications, and transportation, can gain operational intelligence on “live” data and immediately respond to incoming events. Using a spectrum of real-world use cases, this session will explain how Hadoop MapReduce can be integrated with the technology of in-memory data grids for real-time analytics. The unique benefits of this approach will be compared to streaming architectures such as Spark streaming and Storm.

Operational Intelligence Using Hadoop
William Bain

Installing a Hadoop cluster is complex, and simple automation can help you avoid some of the most common challenges of the task. Best practices have emerged for installation, and patterns have been developed for some of the most common cluster configurations, such as Kerberos and lights-out management. Join this presentation, where we will present a working solution for automated Hadoop cluster installation, with a live demo walking through how to complete it step by step.

A Fully-automated Hadoop Installation
Leonid Fedotov, Oleg Checherin
2:10 - 3:25pm

One of the most commonly asked questions about Storm is how to properly size and scale a cluster for a given use case. While there is no magic bullet when it comes to capacity planning for a Storm cluster, there are many operational and development techniques that can be applied to eke out the maximum throughput for a given application. In this session we'll cover capacity planning, performance tuning and optimization from both an operational and a development perspective. We will discuss the basics of scaling, common mistakes and misconceptions, how different technology decisions affect performance, and how to identify and scale around the bottlenecks in a Storm deployment.
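
As a back-of-the-envelope illustration of the sizing questions discussed above (the numbers, safety factor and per-worker packing are made-up assumptions, not recommendations from the talk), one can estimate executor parallelism from a target throughput and a measured per-tuple latency:

    # Rough Storm sizing arithmetic (all numbers are hypothetical assumptions).
    target_throughput = 200000.0      # tuples per second the topology must sustain
    per_tuple_latency_ms = 5.0        # measured average processing time in one executor
    safety_factor = 1.5               # headroom for spikes, GC pauses, acking overhead

    tuples_per_executor_per_sec = 1000.0 / per_tuple_latency_ms          # ~200 tuples/s
    executors_needed = target_throughput / tuples_per_executor_per_sec * safety_factor

    slots_per_worker = 4              # executors we are comfortable packing per worker process
    workers_needed = executors_needed / slots_per_worker

    print("executors needed: %d" % round(executors_needed))   # 1500
    print("workers needed  : %d" % round(workers_needed))     # 375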

Scaling Storm - Cluster Sizing and Performance Optimization
P. Taylor Goetz

Figuring out what really matters in data science can be very hard. The set of algorithms that matter theoretically is very different from the ones that matter commercially. Commercial importance often hinges on ease of deployment, robustness against perverse data and conceptual simplicity. Often, even accuracy can be sacrificed against these other goals. Commercial systems also often live in a highly interacting environment so off-line evaluations may have only limited applicability. I will show how to tell which algorithms really matter and go on to describe several commercially important algorithms such as Thompson sampling (aka Bayesian Bandits), result dithering, on-line clustering and distribution sketches and will explain what makes these algorithms important in industrial settings.
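
Since Thompson sampling (Bayesian Bandits) is called out above, here is a minimal self-contained sketch of the idea on simulated data, assuming Bernoulli rewards and Beta priors; it is an illustration of the technique in general, not code from the talk, and the click-through rates are invented.

    # Thompson sampling for a 3-armed Bernoulli bandit (illustrative simulation).
    import random

    true_rates = [0.04, 0.05, 0.07]          # hidden click-through rates (simulation only)
    successes = [0] * len(true_rates)        # Beta posterior parameter alpha - 1 per arm
    failures = [0] * len(true_rates)         # Beta posterior parameter beta - 1 per arm

    for _ in range(10000):
        # Sample a plausible rate for each arm from its Beta posterior; play the best sample.
        samples = [random.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(len(true_rates))]
        arm = samples.index(max(samples))
        if random.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    plays = [successes[i] + failures[i] for i in range(len(true_rates))]
    print("plays per arm:", plays)           # most traffic should flow to the 0.07 arm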

How to Determine Which Algorithms Really Matter
Ted Dunning

Presto is a new open source distributed SQL query engine for interactive analytics on big data. In this talk we will discuss why we built Presto and how we designed it to perform at interactive speeds. We will also dig into how we take advantage of the Presto plugin system to work with data in stores such as HBase, Cassandra and MySQL. Finally, we will discuss upcoming improvements from Facebook and other community members. Presto is an open source project released under the Apache License, with an active community outside of Facebook using and contributing to it. It is also used at many other companies, such as Airbnb and Dropbox.

Presto @ Facebook: Past, Present and Future
Martin Traverso

If business intelligence and analytics users are going to successfully leverage Hadoop, then now is the time to address the challenges. This presentation reviews real-world scenarios in which organizations using Hadoop experience data center downtime or access problems, examines the impact on business, and discusses how these can be avoided. It will also take a look at how enterprise vendors in 2014 can move the visibility and value of Hadoop beyond the data center and into the business unit.

Hadoop: Making it Work for the Business Unit
Jim Vogt

Hadoop makes it relatively easy to store petabytes of data. However, storing data is not enough; it is important that the data format can be queried quickly and efficiently. For interoperability, row-based encodings (CSV, Thrift, Avro) combined with a general-purpose compression algorithm to reduce storage cost (GZip, LZO, Snappy) are very common, but they are not efficient to query. As discussed extensively in the database literature, a columnar layout with statistics on optionally sorted data provides vertical and horizontal partitioning, thus keeping IO to a minimum. Understanding modern CPU architecture is critical to designing fast, data-specific encodings enabled by a columnar layout (dictionary, bit-packing, prefix coding) that provide great compression at a fraction of the cost of general-purpose algorithms. The 2.0 release of Parquet brings new features enabling faster query execution. We'll dissect and explain the design choices made to achieve all three goals of interoperability, space efficiency and query efficiency.
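
To make the encoding discussion concrete (a toy illustration of dictionary encoding on a low-cardinality column, not Parquet's actual implementation; the column values are invented), consider how few bits a repetitive column really needs once values are replaced by small dictionary indices:

    # Toy dictionary encoding of a low-cardinality column (concept illustration only).
    import math

    column = ["US", "US", "DE", "US", "FR", "DE", "US", "US", "FR", "US"]

    dictionary = sorted(set(column))                 # ['DE', 'FR', 'US']
    codes = [dictionary.index(v) for v in column]    # small integers instead of strings

    bits_per_code = max(1, math.ceil(math.log2(len(dictionary))))
    encoded_bits = len(codes) * bits_per_code        # bit-packed indices
    raw_bits = sum(len(v) for v in column) * 8       # naive string storage

    print("dictionary:", dictionary)
    print("codes     :", codes)
    print("raw ~%d bits vs encoded ~%d bits (+ a small dictionary)" % (raw_bits, encoded_bits))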

Efficient Data Storage for Analytics with Parquet 2.0
Julien Le Dem, Nong Li

Apache Spark provides primitives for in-memory cluster computing, which are well suited to large-scale machine learning. MLlib is a Spark subproject providing machine learning primitives on top of Spark, initially created at UC Berkeley and contributed to Spark. Along with the rapid adoption of Spark, MLlib has received more and more attention and contributions from the open source machine learning community. This talk summarizes recent developments in MLlib, including improvements to existing machine learning algorithms and newly added algorithms and functions, for example sparse data support, principal component analysis, tree-based models, and performance testing. We will show benchmark results on its performance and scalability and demonstrate how to create a practical machine learning pipeline with MLlib. We will also share some key statistics about the open source community involved in its development, and reveal the roadmap of planned features collected from the community and the general direction we are heading.
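
As a small hypothetical example of calling MLlib from Python (API names follow the Spark 1.x MLlib Python interface of the era; the data points are made up and this is not code from the talk), training a k-means model takes only a few lines:

    # Minimal MLlib k-means example (Spark 1.x style Python API; data is made up).
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local", "mllib-kmeans-demo")

    points = sc.parallelize([
        [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],      # cluster near the origin
        [9.0, 9.0], [9.1, 8.8], [8.9, 9.2],      # cluster near (9, 9)
    ])

    model = KMeans.train(points, k=2, maxIterations=10)
    print("centers:", model.clusterCenters)
    print("cluster of [0.05, 0.05]:", model.predict([0.05, 0.05]))
    sc.stop()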

Recent Developments in Spark MLlib and Beyond
Xiangrui Meng

Integrating online transaction processing systems with Hadoop at scale is critical to the success of LinkedIn's data-driven products and business insights. Lumos is a home-grown production system at LinkedIn that integrates various source-of-truth data stores, such as Espresso, MySQL and Oracle, spread across multiple data centers, by ingesting and organizing terabytes of data in the Hadoop data warehouse at low latency and with SLAs for effective query processing.

Bridging OLTP with OLAP: Lumos on Hadoop
Venkatesan Ramachandran