Schedule

Keynote

Matt Aslett
Research Director, Data Management and Analytics
451 Research

What is the point of Hadoop?

The flexibility of Apache Hadoop is one of its biggest assets – enabling businesses to generate value from data that was previously considered too expensive to be stored and processed in traditional databases – but also results in Hadoop meaning different things to different people. In this session 451 Research’s Matt Aslett will explore the impact that Hadoop is having on the traditional data processing landscape, examining the expanding ecosystem of vendors and their relationships with Apache Hadoop, investigating the increasing variety of Hadoop use-cases, and exploring adoption trends around the world.

  

Eric Baldeschwieler
Chief Technical Officer and Founder
Hortonworks

Hadoop Now, Next and Beyond

With the rise of Apache Hadoop, a next-generation enterprise data architecture is emerging that connects the systems powering business transactions and business intelligence. Hadoop is uniquely capable of storing, aggregating, and refining multi-structured data sources into formats that fuel new business insights. Apache Hadoop is fast becoming the de facto platform for processing Big Data.

Hadoop started from a relatively humble beginning as a point solution for small search systems. Its growth into an important technology for the broader enterprise community dates back to Yahoo’s 2006 decision to evolve Hadoop into a system for solving its internet-scale big data problems. Eric will discuss the current state of Hadoop and what is coming from a development standpoint as Hadoop evolves to meet more workloads.

 

Shaun Connolly
VP Corporate Strategy
Hortonworks

Hadoop’s Role in the Enterprise Architecture

With the rise of Apache Hadoop, a next-generation enterprise data architecture is emerging that connects the systems powering business transactions and business intelligence. Hadoop is uniquely capable of storing, aggregating, and refining multi-structured data sources into formats that fuel new business insights. Organizations that embrace solution architectures focused on maximizing the value from ALL data will put themselves in a position to drive more business, enhance productivity, or discover new and lucrative business opportunities. Over the coming years, Hadoop could be in a position to process more than half the world’s data. There is still much work to be done, however, if Hadoop is to achieve this lofty goal. In this talk Shaun Connolly, VP Corporate Strategy for Hortonworks, will look at Hadoop’s role in the enterprise architecture and how it complements existing enterprise systems.

 

Panel: Panelists from HSBC, eBay, Neustar, and more

Real-world Insight into Hadoop in the Enterprise

How are organizations using Hadoop today? What problems were they looking to solve when they selected this open source infrastructure? What have the challenges been in using Hadoop? What new insights were gained? What will Hadoop’s role be going forward? These questions and more will be discussed live and onstage. We’ve assembled a panel of experts from across a range of industries to provide real-world insight into the hows and whys behind Hadoop. This not-to-be-missed session will be moderated by Hortonworks’ President Herb Cunitz.

Day 1 » Wednesday, March 20
Tracks:
Applied Hadoop
Operating Hadoop
Hadoop Futures
Integrating Hadoop
12:00-1:00pm Exhibit Hall Opens with Lunch
1:00-2:30pm Plenary Sessions
2:30-3:00pm Break in exhibit hall
3:00-3:40pm

Apache Hadoop MapReduce has undergone a complete overhaul to emerge as Apache Hadoop YARN, a generic compute fabric that supports MapReduce and other application paradigms. This changes the game, recasting Hadoop as a much more powerful data-processing system. As a result, Hadoop looks very different from how it looked 12 months ago. Ever wonder what it might look like in 12 months, 24 months or longer? This talk will take you through some ideas for YARN itself and the myriad ways it’s moving the needle for MapReduce, Pig, Hive, Cascading and other data-processing tools in the Hadoop ecosystem.

Past, Present and Future of Data Processing in Apache Hadoop
Arun Murthy

As any good engineer knows, “if you cannot measure it, you cannot improve it”; benchmarking is the quantitative foundation of any computer system design and research. As Hadoop-based big data frameworks grow in pervasiveness and scale, realistically benchmarking Hadoop systems becomes critically important to the Hadoop community and industry. In this session, we will discuss our recent work on HiBench (an open source Hadoop benchmark suite widely used by Hadoop users) to further improve its representativeness (e.g., a complete Hadoop-based ETL pipeline, an Iometer-style benchmark tool for HDFS, etc.). In addition, we will review different existing benchmarking approaches for Hadoop (e.g., trace-based benchmarks such as GridMix3 vs. HiBench) to understand their tradeoffs, and share our thoughts and efforts on fostering a community for Hadoop benchmarking.

How to Weigh an Elephant
Jason Dai, Vin Sharma

Online learning techniques, such as Stochastic Gradient Descent (SGD), are powerful when applied to risk minimization and convex games on large problems. However, their sequential design prevents them from taking advantage of newer distributed frameworks such as Hadoop/MapReduce. In this session, we will take a look at how we parallelized linear regression parameter optimization on the next-gen YARN framework Iterative Reduce.

Parallel Linear Regression on Iterative Reduce and YARN
Josh Patterson

This talk will detail the HSBC Big Data journey to date, walking through the genesis of the Big Data initiative, which was triggered by continual challenges in delivering data-driven products. The global scale, diversity and legacy of an organization like HSBC present challenges for Hadoop adoption not typically faced by younger companies. Big Data technologies are by their very nature disruptive to the established Enterprise IT environment. Hadoop and the peripheral toolsets in the big data ecosystem do not fit comfortably into an Enterprise Data Centre or IT operational processes, and can even prove disruptive to current organizational structures. Alasdair will focus on the steps that HSBC has taken to mitigate concerns about Hadoop and raise awareness of the game-changing benefits a successful adoption of the technology will bring. HSBC has taken an innovative approach to proving out the value of the technology, engaging developers with a brakes-off opportunity to use the platform and placing Hadoop in a competitive scenario with traditional technologies. The Hadoop journey in HSBC was initiated in Scotland, blessed in London and proved out in China.

Enterprise Integration of Disruptive Technologies
Alasdair Anderson

Apache Drill [1] is a distributed system for interactive analysis of large-scale datasets. It is inspired by Google’s Dremel technology. Its design goal is to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds. Since its inception in mid-2012, Apache Drill has gained widespread interest in the community. This session first introduces Apache Drill and its use cases. We will outline its relation to Hadoop and provide details on the progress of the project. Then, we will delve deeper into the Apache Drill architecture, the data flow and the query languages. Last but not least, we will look into the data formats supported and help the audience understand the value of Apache Drill. [1] http://incubator.apache.org/drill/

Understanding the value and architecture of Apache Drill
Michael Hausenblas
3:50-4:30pm

The current major release, Hadoop 2.0, offers several significant HDFS improvements including a new append pipeline, federation, wire compatibility, NameNode HA, and performance improvements. We describe how to take advantage of these new features and their benefits. We cover some architectural improvements in detail, such as HA and Federation. The second half of the talk describes the features that are currently under development for the next HDFS release. This includes much-needed data management features such as Snapshots and Disaster Recovery. We add support for different classes of storage devices such as SSDs and open interfaces such as NFS; together these extend HDFS into a more general storage system. Hadoop has recently been extended to run first-class on Windows, which expands its enterprise reach and allows integration with the rich tool-set available on Windows. As with every release, we will continue improvements to the performance, diagnosability and manageability of HDFS. To conclude, we discuss reliability, the state of HDFS adoption and some of the misconceptions and myths about HDFS.

HDFS - What is New and Future
Sanjay Radia, Suresh Srinivas

Many people have talked about how to deploy Hadoop, but few have presented a view of how it fits into the larger picture of a data center. This presentation aims to give a complete overview of how we go from bare metal to a secure, working grid and all the components (LDAP, Kerberos, Cobbler, bcfg2, etc.) in play.

Hadoop Operations at LinkedIn
Allen Wittenauer

Search has quickly evolved from being an extension of the data warehouse to being run as a real-time decision processing system. Search is increasingly being used to gather intelligence on multi-structured data, leveraging distributed platforms such as Hadoop in the background. This session will provide details on how search engines can be abused to use not text, but mathematically derived tokens to build models that implement reflected intelligence. In such a system, intelligent or trend-setting behavior of some users is reflected back at other users. More importantly, the mathematics of evaluating these models can be hidden in a conventional search engine like Solr, making the system easy to build and deploy. The session will describe how to integrate Apache Solr/Lucene with Hadoop. Then we will show how crowd-sourced search behavior can be looped back into analysis and how constantly self-correcting models can be created and deployed. Finally, we will show how these models can respond with intelligent behavior in real time.

Crowd-sourced intelligence built into Search over Hadoop
Ted Dunning, Ivan Provalov

Libraries collect books, magazines and newspapers. Yes, that’s what they always did. But today, the amount of digital information resources is growing at dizzying speed. Facing the demand for digital information resources available 24/7, there has been a significant shift regarding a library’s core responsibilities. Today’s libraries are curating large digital collections, indexing millions of full-text documents, preserving terabytes of data for future generations, and at the same time exploring innovative ways of providing access to their collections. This is exactly where Hadoop comes into play. Libraries have to process a rapidly increasing amount of data as part of their day-to-day business, and computing tasks like file format migration, text recognition, linguistic processing, etc., require significant computing resources. Many data processing scenarios emerge where Hadoop might become an essential part of the digital library’s ecosystem. Hadoop is sometimes referred to as a hammer where you have to throw away everything that is not a nail. To remain in that metaphor: we will present some actual use cases for Hadoop in libraries, how we determine what the nails in a library are and what are not, and some initial results.

The Elephant in the Library
Clemens Neudecker, Sven Schlarb

Journaling is a common technique used to guarantee recoverability and it has been used extensively with file systems and databases; the HDFS Namenode and the HBase WAL are examples of such systems. We designed Apache BookKeeper to be a building block for such recoverable systems. A BookKeeper storage server, called bookie, is able to serve concurrently tens of thousands of journals, called ledgers, without dropping aggregate throughput. The design of BookKeeper also includes a number of desirable features, such as: replicating and striping journal entries for fault tolerance and performance; using a pool of bookies that can expand and contract online; automatically recovering from bookie crashes. These features have been exercised in production at Yahoo! with a platform that serves push notifications. The platform is built on top of Hedwig, a topic-based pub-sub system that uses BookKeeper to persist messages and guarantee delivery. Our notifications platform requires a very large number of topics (tens to hundreds of millions) and consequently a large number of ledgers. Other applications, such as the HDFS Namenode Journal, are currently being evaluated, and we expect to use the ability of BookKeeper to serve a large number of ledgers to share a pool of bookies across applications.

Serving millions of journals with Apache BookKeeper
Flavio Junqueira
4:40-6:00pm Lightning Talks
6:00-7:30pm Exhibitor Reception


Day 2 » Thursday, March 21
Tracks:
Applied Hadoop
Operating Hadoop
Hadoop Futures
Integrating Hadoop
8:00-9:00am Coffee and continental breakfast in exhibit hall
9:00-10:30am Plenary Sessions
10:30-11:00am Break
11:00-11:40am

Apache Hadoop and its ecosystem projects Hive and Pig support interactions with data sets of enormous sizes. Petabyte-scale data warehouse infrastructures are built on top of Hadoop to provide access to data of massive and small sizes. Hadoop has always excelled at large-scale data processing; however, running smaller queries has been problematic due to the batch-oriented nature of the system. With the advent of Hadoop YARN, which is a far more general-purpose system, we have made tremendous improvements to Hadoop MapReduce. Through the enhancements we have made to the resource management system (YARN), to the MapReduce framework and to Hive and Pig themselves, we are making the Hadoop ecosystem much more powerful, performant and user-friendly. This talk will cover the improvements we have made to YARN, MapReduce, Pig and Hive. We will also walk through the future enhancements we have planned.

Innovations In Apache Hadoop MapReduce, Pig and Hive for improving query performance
Vinod Kumar Vavilapalli, Gopal Vijayaraghavan

This talk will address valuable lessons learned with the current versions of HBase. There are inherent architectural features that warrant careful evaluation of the data schema and of how to scale out a cluster. The audience will get a best-practices summary of where the design of HBase has limitations and how to avoid them. In particular, we will discuss issues like proper memory tuning (for reads and writes), optimal flush file sizing, compaction tuning, and the number of write-ahead logs required. Further, there is a discussion of the theoretical write performance in comparison to that observed on real clusters. A collection of cheat sheets and example calculations for cluster sizing rounds out the talk.

HBase Sizing Notes
Lars George

The greater promise of Big Data lies not in doing old things in slightly new ways. Instead, it lies in doing new things that were previously not possible. One major class of new things is adding intelligence to large-scale systems. In this session I will present a survey of how machine learning can be applied to real-life situations without having to get a PhD in advanced mathematics. These systems can be built today from open source components to increase business revenues by understanding what customers need and want. I will provide real world examples of best practices and pitfalls in machine learning including practical ways to build maintainable, high performance systems.

Revenue Growth through Machine Learning
Ted Dunning

Hadoop represents a critical new component in understanding the data most organizations have amassed. Rather than being a challenger to current data warehouse and business intelligence platforms, Hadoop is a new tool in the ecosystem. HDInsight brings the power of Hadoop to Microsoft-based enterprises. This session covers how HDInsight fits into the Microsoft-centric environment and works with other tools like SQL Server, Power Pivot, Power View, and System Center. A specific implementation aggregating and analyzing customer information will be presented, along with a review of the experience from both a technical and a business owner standpoint.

Hadoop in the Microsoft Enterprise
Dan Rosanova

How do you make big data accessible, usable and valuable for everyone? And mine your data for intelligence in minutes and hours, not weeks and months? What about getting real-time insights from your data, even before you persist and replicate it? In this talk, we’ll examine compelling, real-world examples that offer a blueprint for integrating big data technologies (Splunk, Hadoop, RDBMS, Cassandra, HBase), that deliver rapid visibility and insights to IT professionals, data analysts and business users, and that accelerate the adoption of big data in the enterprise.

Implementing Big Data at the Speed of Business
Raanan Dagan
11:50-12:20pm

Apache HBase is the Hadoop open-source, distributed, versioned storage manager well suited for random, realtime read/write access. This talk will give an overview of how HBase achieves random I/O, focusing on the storage layer internals:
– Master and Region Servers
– MemStore and Write Ahead Log (WAL)
– HFiles (HBase on-disk format)
– Compression and Data Block Encoding
– LSM Trees and Compactions
– Future improvements
We start from how the client interacts with Region Servers and the Master, then go into WAL, MemStore, Compactions and on-disk format details. We also look at how the storage is used by features like snapshots, and how it can be improved to gain flexibility, performance and space efficiency.

HBase Storage Internals, Present and Future
Matteo Bertozzi

Deploying, configuring, and managing large Apache Hadoop and HBase clusters can be quite complex. Upgrading one Hadoop component on a 2000-node cluster can take a lot of time and expertise, and there have been few tools specialized for Hadoop cluster administrators. Apache Ambari is an Apache incubator project to deliver Management and Monitoring functionality for Hadoop clusters. This session presents an overview of Ambari covering how its central master and distributed agents can help in deploying, managing and monitoring multiple Hadoop clusters and scaling to handle 1000+ node clusters. This talk will cover the Ambari Web UI for non-expert usage and future roadmap of Ambari.

Managing your Hadoop clusters with Apache Ambari
Pramod Thangali, Mahadev Konar

Linear models are some of the most successful methods for predictive analytics. In this talk we give a tutorial on how to learn and apply linear models on big data in practice. We start with a short background on how linear models work. We then show how linear models can be implemented on top of Hadoop: for standard applications we demonstrate how stochastic gradient descent can be implemented easily with MapReduce, while advanced applications require some more sophisticated gradient descent functions. We also give tips and tricks on how to improve your models, and we illustrate the learning and application process in a live demo.

Learning Linear Models with Hadoop
Ulrich Rueckert
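
As a rough illustration of the MapReduce-style approach this abstract mentions, the sketch below runs plain stochastic gradient descent on each data partition (the "map" step) and averages the resulting weight vectors (the "reduce" step). Parameter averaging is only one common way to parallelize SGD and is an assumption here; the abstract does not say which scheme the talk uses. The function names, learning rate, epoch count and synthetic data are all hypothetical.

```python
import random


def sgd_on_partition(rows, dim, lr=0.05, epochs=50):
    """'Map' step (assumed scheme): fit a least-squares linear model
    to one partition with plain SGD and return its weight vector."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in rows:
            # Prediction error for this single example.
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            # Gradient step for squared loss on one example.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def average_weights(weight_vectors):
    """'Reduce' step (assumed scheme): average the per-partition weights."""
    n = len(weight_vectors)
    return [sum(ws) / n for ws in zip(*weight_vectors)]


# Synthetic, noiseless data for y = 2 + 3 * x1 (illustrative only).
rng = random.Random(42)
data = []
for _ in range(400):
    x1 = rng.uniform(-1.0, 1.0)
    data.append(((1.0, x1), 2.0 + 3.0 * x1))

# Four hypothetical partitions standing in for mapper inputs.
partitions = [data[i::4] for i in range(4)]
weights = average_weights([sgd_on_partition(p, dim=2) for p in partitions])
print(weights)  # close to [2.0, 3.0] on this noiseless data
```

On a real cluster each call to `sgd_on_partition` would run in a separate task, and the averaging could be repeated over several rounds.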

Statistical sampling has established itself in all facets of our lives, from physics to medical research to presidential elections. Still, when it comes to Big Data we most frequently favor a brute-force approach and attempt to process our entire data set – it’s all or nothing. However, we don’t really need to count every single grain of sand at the beach to conclude that it will be a great holiday destination. When we analyze our business performance, do we compare every digit of last week’s 365,514,134 visitors to this week’s 366,364,615, or do we want to know that one is 0.2% bigger than the other? Or maybe we can say there is no difference? Properly posing questions to Big Data is the key to reducing the overall costs of data systems and getting information faster, while preserving brute-force crunching for tasks that really have to count every penny and every drop in the ocean. We will present sampling methodologies useful for Hadoop environments, show how to properly structure the data for export to non-Hadoop systems, and discuss establishing a proper sampling rate for different tasks, emphasizing its application to digital marketing and variable sampling rates for properly tracking valuable needles in unimportant haystacks.

Big Data Sampling: a Way to Make all of Your Data Useful Again
Mikhail Petrenko
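
The visitor-count comparison in this abstract can be checked with a few lines of arithmetic, and the same snippet sketches the basic idea of estimating a total from a uniform sample. This is a minimal illustration under stated assumptions, not the speaker's methodology: the function names, the 1% sampling rate and the fixed seed are hypothetical.

```python
import random


def relative_change_pct(previous, current):
    """Percentage change from `previous` to `current`."""
    return (current - previous) / previous * 100.0


def estimate_total(records, rate, seed=0):
    """Keep each record with probability `rate` (uniform Bernoulli
    sampling) and scale the kept count back up by 1 / rate."""
    rng = random.Random(seed)
    kept = sum(1 for _ in records if rng.random() < rate)
    return kept / rate


# Figures quoted in the abstract.
last_week = 365_514_134
this_week = 366_364_615
change = relative_change_pct(last_week, this_week)
print(f"{change:.1f}%")  # the abstract's "0.2% bigger"

# Estimating the size of a 100,000-record stream from a 1% sample.
print(estimate_total(range(100_000), rate=0.01))
```

The estimate from the sample is noisy; a higher sampling rate narrows the error at the cost of processing more records, which is the tradeoff the abstract describes.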

The cloud reduces the barrier to entry into analytics for many small and medium-size enterprises. Hadoop and related frameworks like Hive, Oozie and Sqoop are becoming tools of choice for deriving insights from data. However, these frameworks were designed for in-house datacenters, which have different tradeoffs from a cloud environment, and making them run well in the cloud presents some challenges. For example, the cloud makes it relatively easy to commission and decommission nodes. We’ve extended Hadoop to scale cluster size depending on workload. Furthermore, the scaling algorithm exploits different pricing models offered by cloud providers. Cloud storage is extremely reliable but has higher latency. We’ve also included I/O optimizations to reduce or eliminate some of these I/O costs. In this talk, we describe how we’ve extended Hadoop and Hive to exploit these new tradeoffs and offer them as part of the Qubole Data Service (QDS). We will also present use cases that show how QDS is making it extremely easy for an end user to use these technologies in the cloud.

Cloud-friendly Hadoop and Hive
Joydeep Sen Sharma, Sivaramakrishnan Narayanan
12:20-1:30pm Lunch in Exhibit Hall
1:30-2:10pm

The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this talk, we’ll discuss the evolution of our infrastructure and the development of capabilities for data mining on “big data”. One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life “in the trenches” is occupied by much preparatory work that precedes the application of data mining algorithms, and is followed by substantial effort to turn preliminary models into robust solutions. In this context, we’ll discuss two topics. First, schemas play an important role in helping data scientists understand petabyte-scale data stores, but they’re insufficient to provide an overall “big picture” of the data available to generate insights. Second, we observe that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated into production workflows; we refer to this as “plumbing”. We’ll share our experiences as a case study, make recommendations for best practices, and point out opportunities for future work.

Scaling Big Data Mining Infrastructure: The Twitter Experience
Jimmy Lin

Apache Hadoop is clearly one of the fastest growing big data platforms to store and analyze arbitrarily structured data in search of business insights. However, applicable commodity infrastructures have advanced greatly in the last few years, and there is a dearth of accurate, current information to assist the community in optimally designing and configuring Hadoop platforms (Infrastructure and O/S). For example, how many disks and controllers should you use? Should you buy processors with 4 or 6 cores? Do you need a 1GbE or 10GbE network? Should you use SATA or MDL SAS? Small or Large Form Factor disks? How much memory do you need? How do you characterize your Hadoop workloads to figure out whether you are I/O, CPU, network or memory bound? How does one optimize Linux performance, reliability and availability for Hadoop? In this talk we’ll present guidance on Linux and infrastructure deployment, configuration and optimization from both Red Hat and HP (derived from actual performance data) for clusters optimized for single workloads or balanced clusters that host multiple concurrent workloads.

Optimizing your Infrastructure and Operating Systems for Hadoop
Steve Watt

This session provides details on how comScore uses Hadoop to process over 1.4 trillion internet and mobile events per day to understand, analyze and produce information on what is happening on the Web worldwide. The talk will highlight the use of Hadoop to determine how activities at web sites translate into real user behaviors. Attendees will gain insight into how comScore has used Hadoop to handle the scalability needs of its Validated Campaign Essentials product. The talk will also detail how algorithms running on top of Hadoop combine information to develop broader insights into Internet usage.

Analyzing 1.4 Trillion events with Hadoop
Michael Brown

Extracting value from Big Data is not easy. The field of technologies and vendors is fragmented and rapidly evolving. End-to-end, general-purpose solutions that work out of the box don’t exist yet, and Hadoop is no exception. And most companies lack Big Data specialists. The key to unlocking real value lies in mapping the business requirements smartly against the emerging and imperfect ecosystem of technology and vendor choices.

There is a long list of crucial questions to think about.  How fast is the data flying at you?  Are your Big Data analyses tightly integrated with existing systems?  Or parallel and complex?   Can you tolerate a minute of latency?  Do you accept data loss or generous SLAs?  Is imperfect security good enough?
The answer to Big Data ROI lies somewhere between the herd and nerd mentality. Thinking hard and being smart about each use case as early as possible avoids costly mistakes.

This talk will illustrate how Deutsche Telekom follows this segmentation approach to make sure every individual use case drives architecture design and technology selection.

Don’t be Hadooped When Looking for Big Data ROI: How Use Case Segmentation drives Target Architectures and Technology Selection at Deutsche Telekom
Juergen Urbanski

HDInsight is Microsoft’s 100% Apache-compatible Hadoop distribution, supported by Microsoft. HDInsight, available both on Windows Server and as a Windows Azure service, empowers organizations with new insights on data, while connecting to the most widely used Business Intelligence (BI) tools on the planet, such as Excel and PowerPivot. This presentation looks at the core components of HDInsight and its integration with Microsoft BI tools.

Introduction to Microsoft HDInsights and BI tools
Abhijit Lele, Rohit Bakshi
2:20-3:00pm

Hadoop Distributed Filesystem (HDFS) is one of the core storage solutions in use at Facebook. One of the most notable use cases of HDFS at Facebook is our Hive data warehouse, used for collecting Facebook users’ behaviors from the front-end. The warehouse cluster stores more than 100PB of data, with 500+ terabytes of data entered into the clusters every day. To meet the capacity requirements of future data growth, storing data in a cost-effective way becomes a top priority, because a petabyte of disk space saved translates to hundreds of thousands of dollars of savings. This talk will present various solutions we use to reduce our warehouse cluster’s data footprint: (1) smart retention: suggest Hive table retention modifications automatically based on partition access history; (2) sorting Hive partitions using selective columns to increase the RCFile compression ratio; (3) HDFS file-level raiding to reduce the replication factor of warm and cold large files from 3 to a much lower ratio using XOR Code and Reed Solomon Code; (4) raiding millions of warm small files at the directory level; (5) compacting cold small files into large files in a raid-aware way to achieve the most replication factor reduction from file-level raiding. We will discuss in detail how each technique works, the challenges faced, lessons learned during deployment, and finally the results we have achieved.

Facebook's approach to big data storage challenge
Weiyan Wang
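
To make the idea behind XOR-based raiding in this abstract concrete, the sketch below computes a parity block over a stripe of data blocks, rebuilds a lost block from the survivors, and shows how parity changes the effective storage factor. It is a toy illustration only: the block contents, the stripe length and the copy counts are hypothetical, and the abstract does not state the parameters Facebook actually uses.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def storage_factor(data_blocks, parity_blocks, copies=1):
    """Bytes stored per byte of user data when each data and parity
    block of a stripe is kept `copies` times."""
    return copies * (data_blocks + parity_blocks) / data_blocks


# A hypothetical stripe of three equally sized blocks plus one parity block.
stripe = [b"blk0data", b"blk1data", b"blk2data"]
parity = xor_blocks(stripe)

# Lose the middle block, then rebuild it from the survivors and the parity.
recovered = xor_blocks([stripe[0], stripe[2], parity])
print(recovered == stripe[1])

# Illustrative numbers: plain 3x replication versus a 10-block stripe
# with one parity block, every block kept twice.
print(storage_factor(1, 0, copies=3))
print(storage_factor(10, 1, copies=2))
```

A single XOR parity block tolerates the loss of one block per stripe; Reed Solomon codes, also mentioned in the abstract, add several parity blocks so that more simultaneous losses can be repaired.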

This presentation will discuss best practices for designing and building a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn about the flexibility and operational advantages of Virtual Machines, such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotioning, increased stabilization of the entire software stack, High Availability and Fault Tolerance. This is a can’t-miss presentation for anyone wanting to understand the design, configuration and deployment of Hadoop in virtual infrastructures.

Best Practices for Virtualizing Hadoop
George Trujillo

Big Data hype is everywhere – and to take some of the more breathless commentary at face value is to believe that over 30 years of information management best-practice has been rendered obsolete and irrelevant, almost overnight.

The reality is, of course, more complex; whilst new technologies and new types of analysis are already and demonstrably creating incredible new sources of value and competitive advantage for leading organisations, “traditional” Business Intelligence and Analytics also continue to evolve apace and are no less important and no less critical.  And all technologies, whether shiny and new or mature and established, have strengths – and weaknesses.

In this presentation, we will argue that as IT professionals charged with charting a course through the hype our goal must be to enable ordinary end-users in our organizations to run any analytic, on any data at any time – and that to realise this goal will require us to deploy and transparently integrate multiple data management and analytic technologies in a “Unified Data Architecture”.  We will present real-world use-cases that illustrate both how the new technologies are already creating value – and how they can be successfully combined with existing technology assets to even greater effect.  Lastly, we will present a “big data manifesto” that summarises the challenges that the industry will need to embrace if we are to industrialise “big data analytics” over the course of the next several decades in the same way that we have successfully industrialised “traditional analytics” during the last three decades.

Dancing With The Elephant
Martin Willcox, Chris Hillman

There are many Big Data problems whose output is also Big Data. In this presentation we will show Splout SQL, which allows serving an arbitrarily big dataset by partitioning it. Splout serves partitioned SQL views which are generated and indexed by Hadoop. Splout is to Hadoop + SQL what Voldemort or Elephant DB are to Hadoop + Key/Value. Hadoop is nowadays the de facto open-source solution for Big Data batch processing. When the output of a Hadoop process is big, there isn't a satisfying solution for serving it. Think of pre-computed recommendations, for example, where the whole dataset may vary from one day to another. Splout decouples database creation from database serving and makes it efficient and safe to deploy Hadoop-generated datasets. There are many databases that allow serving Big Data, such as NoSQL solutions, but they don't have a rich query language like SQL. You generally can't aggregate data in real time as you would with a GROUP BY clause. Because you can't precompute everything, SQL is a very convenient feature to have in a Big Data serving solution. Splout is not a “fast analytics” engine. Splout is made for demanding web or mobile applications where query performance is critical. Arbitrary real-time aggregations should be done in less than 200 milliseconds under high traffic load. On top of that, Splout is scalable, flexible, RESTful and open-source.

Splout SQL: When Big Data output is also Big Data - A richer, open-source database "spout" for Hadoop
Ivan Prado Alonso
Watch: Video, Slides

Many data processing tasks can be thought of as small mutations to a large database triggered by events. In contrast to batch processing, the incremental processing model achieves very low delay between the reception of an event and the application of the mutation. In the absence of transaction support, developers have to use ad hoc mechanisms to ensure atomic execution of mutations despite failures and concurrent accesses to the database by other clients. Most NoSQL data stores, like HBase, BigTable and Cassandra, lack support for transactions, which makes them unsuitable for a whole range of applications. In this talk we present Omid, an open source tool for transactional support and incremental processing on top of HBase (https://github.com/yahoo/omid). Due to the centralized nature of its client-replicated status oracle, Omid (i) avoids distributed locks, (ii) scales up to 60,000 TPS and a thousand clients, (iii) requires no changes to HBase, and (iv) adds a negligible overhead to data servers.

Omid: Efficient Transaction Management and Incremental Processing for HBase
Daniel Gomez Ferro
3:10-3:50pm
Watch: Video, Slides

eBay has grown into one of the largest online marketplaces on the internet today, serving more than 100 million active users. The number of items listed on eBay, and the fact that these items are sold across different channels, make analytics a challenging proposition at this scale. In this session, hear about the aspects of analytics that present challenges for performance and scalability, and the core architectural components and design principles that eBay has used to address these challenges. In addition, learn how eBay plans to use Hadoop to build cost-effective, high-performance and scalable analytics applications.

Powerful Analytics Apps Fueled by Hadoop for High Performance and Scalability
Amit Rustagi
Watch: Video, Slides

Windows Azure HDInsight Service lets you embrace Hadoop, enabling you to seamlessly manage data of any type or size. Discover how to provision a Hadoop cluster on Windows Azure in minutes with easy management and monitoring. Take advantage of the elastic scale of Windows Azure. Explore a variety of developer tools, from Java to JavaScript, to develop for HDInsight Service. Finally, learn how everyone can easily glean insights from all their data, whether structured or unstructured, through familiar tools like Excel.

Drive Smarter Decisions with Hadoop and Windows Azure HDInsight Service
Matt Winkler
Watch: Video, Slides

Apache Hive is a data warehousing system for large volumes of data stored in Hadoop. However, the data is useless unless you can use it to add value to your company. Hive provides a SQL-based query language that dramatically simplifies the process of querying your large data sets. That is especially important while your data scientists are developing and refining their queries to improve their understanding of the data. In many companies, such as Facebook, Hive accounts for a large percentage of the total MapReduce queries that are run on the system. Although Hive makes writing large data queries easier for the user, there are many performance traps for the unwary. Many of them are artifacts of the way Hive has evolved over the years and of the requirement that the default behavior must be safe for all users. This talk will present examples of mistakes by Hive users that made their queries run much longer than necessary. It will also present guidelines for getting better performance from your queries and for reading the query plan to understand what Hive is doing.

Optimizing Hive Queries
Owen O'Malley
Watch: Slides

The Data Warehouse has been a staple in data-driven organizations for decades. As a result, the ecosystem, architecture, processes and methodologies around data warehousing are extremely mature. The arrival of Hadoop and Big Data has brought new life into traditional data warehousing by proposing new architectures and processes that upend existing norms. This presentation goes over several variants of how Hadoop interplays with existing data warehouses to solve modern problems.

Hadoop and the Enterprise Data Warehouse
Patrick Angeles
Watch: Video, Slides

Apache Hive is Hadoop’s SQL-like interface, used for reporting and analysis over huge volumes of data. Hive was released by Facebook in 2009 and is now used there to run more than 60,000 queries per day over more than 100 petabytes of data. Hundreds of companies use Hive in production for its reliable data processing and unmatched scale. Community activity in Hive is greater than ever before and 2013 is full of exciting new developments for Hive in both performance and analytics capabilities.

Come to this session to:
* Learn about how “Project Stinger” will achieve its goal to make Hive 100x faster than it has been in the past, enabling both more scalable analytics and human-time query
* Learn about Hive’s new analytical capabilities, windowing functions and standard SQL datatypes

What's New and What's Next in Apache Hive
Gunther Hagleitner