| 8:00-9:00am |
Coffee and continental breakfast in exhibit hall |
| 9:00-10:30am |
Plenary Sessions |
| 10:30-11:00am |
Break |
| 11:00-11:40am |
Apache Hadoop and its ecosystem projects Hive and Pig support interactions with data sets of enormous sizes. Petabyte-scale data warehouse infrastructures are built on top of Hadoop for providing access to data of massive and small sizes. Hadoop always excelled at large-scale data processing; however, running smaller queries has been problematic due to the batch-oriented nature of the system. With the advent of Hadoop YARN, which is a far more general-purpose system, we have made tremendous improvements to Hadoop MapReduce. Taken together, the enhancements we have made to the resource management system (YARN), to the MapReduce framework, and to Hive and Pig themselves are elevating the Hadoop ecosystem to be much more powerful, performant and user-friendly. This talk will cover the improvements we have made to YARN, MapReduce, Pig and Hive. We will also walk through the future enhancements we have planned.
Innovations In Apache Hadoop MapReduce, Pig and Hive for improving query performance | Vinod Kumar Vavilapalli, Gopal Vijayaraghavan |
This talk will address valuable lessons learned with the current versions of HBase. There are inherent architectural features that warrant careful evaluation of the data schema and of how to scale out a cluster. The audience will get a best-practices summary of where there are limitations in the design of HBase and how to avoid those. In particular, we will discuss issues like proper memory tuning (for reads and writes), optimal flush file sizing, compaction tuning, and the number of write ahead logs required. Further, there is a discussion of the theoretical write performance, in comparison to that observed on real clusters. A collection of cheat sheets and example calculations for cluster sizing rounds out the talk towards the end.
HBase Sizing Notes | Lars George |
The greater promise of Big Data lies not in doing old things in slightly new ways. Instead, it lies in doing new things that were previously not possible. One major class of new things is adding intelligence to large-scale systems. In this session I will present a survey of how machine learning can be applied to real-life situations without having to get a PhD in advanced mathematics. These systems can be built today from open source components to increase business revenues by understanding what customers need and want. I will provide real world examples of best practices and pitfalls in machine learning including practical ways to build maintainable, high performance systems.
Revenue Growth through Machine Learning | Ted Dunning |
Hadoop represents a critical new component in understanding the data most organizations have amassed. Rather than being a challenger to current data warehouse and business intelligence platforms, Hadoop is a new tool in the ecosystem. HDInsight brings the power of Hadoop to Microsoft-based enterprises. This session covers how HDInsight fits into the Microsoft-centric environment and works with other tools like SQL Server, Power Pivot, Power View, and System Center. A specific implementation aggregating and analyzing customer information will be presented along with a review of the experience from both a technical and business owner standpoint.
Hadoop in the Microsoft Enterprise | Dan Rosanova |
How do you make big data accessible, usable and valuable for everyone? And mine your data for intelligence in minutes and hours, not weeks and months? What about getting real-time insights from your data, even before you persist and replicate it? In this talk, we’ll examine compelling, real-world examples that offer a blueprint for integrating big data technologies (Splunk, Hadoop, RDBMS, Cassandra, HBase), delivering rapid visibility and insights to IT professionals, data analysts and business users, and that accelerate the adoption of big data in the enterprise.
Implementing Big Data at the Speed of Business | Raanan Dagan |
| 11:50-12:20pm |
Apache HBase is the Hadoop open-source, distributed, versioned storage manager well suited for random, realtime read/write access. This talk will give an overview of how HBase achieves random I/O, focusing on the storage layer internals: Master and Region Servers; MemStore and Write Ahead Log (WAL); HFiles (HBase on-disk format); Compression and Data Block Encoding; LSM Trees and Compactions; future improvements. We start from how the client interacts with Region Servers and the Master, then go into WAL, MemStore, compactions and on-disk format details. We also look at how the storage is used by features like snapshots, and how it can be improved to gain flexibility, performance and space efficiency.
HBase Storage Internals, Present and Future | Matteo Bertozzi |
Deploying, configuring, and managing large Apache Hadoop and HBase clusters can be quite complex. Upgrading one Hadoop component on a 2000-node cluster can take a lot of time and expertise, and there have been few tools specialized for Hadoop cluster administrators. Apache Ambari is an Apache incubator project to deliver Management and Monitoring functionality for Hadoop clusters. This session presents an overview of Ambari covering how its central master and distributed agents can help in deploying, managing and monitoring multiple Hadoop clusters and scaling to handle 1000+ node clusters. This talk will cover the Ambari Web UI for non-expert usage and future roadmap of Ambari.
Managing your Hadoop clusters with Apache Ambari | Pramod Thangali, Mahadev Konar |
Linear models are some of the most successful methods for predictive analytics. In this talk we give a tutorial on how to learn and apply linear models on big data in practice. We start with a short background on how linear models work. We then show how linear models can be implemented on top of Hadoop: for standard applications we demonstrate how stochastic gradient descent can be implemented easily with MapReduce, while advanced applications require some more sophisticated gradient descent functions. We also give tips and tricks on how to improve your models and we illustrate the learning and application process in a live demo.
Learning Linear Models with Hadoop | Ulrich Rueckert |
Statistical sampling has established itself in all facets of our lives, from physics to medical research to presidential elections. Still, when it comes to Big Data we most frequently favor a brute-force approach and attempt to process our entire data set: it's all or nothing. However, we don't really need to count every single grain of sand at the beach to conclude that it will be a great holiday destination. When we analyze our business performance, do we compare every digit of last week's 365,514,134 visitors to this week's 366,364,615, or do we want to know that one is 0.2% bigger than the other? Or maybe we can say there is no difference? Properly posing questions to Big Data is the key to reducing overall costs of the data systems and getting information faster while preserving brute-force crunching for tasks that really have to count every penny and every drop in the ocean. We will present sampling methodologies useful for Hadoop environments, properly structuring the data for export to non-Hadoop systems, discuss establishing a proper sampling rate for different tasks, emphasizing its application to digital marketing, and variable sampling rates for properly tracking valuable needles in unimportant haystacks.
Big Data Sampling: a Way to Make all of Your Data Useful Again | Mikhail Petrenko |
The cloud reduces the barrier to entry for many small and medium-sized enterprises into analytics. Hadoop and related frameworks like Hive, Oozie, and Sqoop are becoming tools of choice for deriving insights from data. However, these frameworks were designed for in-house datacenters, which have different tradeoffs from a cloud environment, and making them run well in the cloud presents some challenges. For example, the cloud makes it relatively easy to commission and decommission nodes. We've extended Hadoop to scale cluster size depending on workload. Furthermore, the scaling algorithm exploits different pricing models offered by cloud providers. Cloud storage is extremely reliable but has higher latency. We've also included I/O optimizations to reduce or eliminate some of these I/O costs. In this talk, we describe how we've extended Hadoop and Hive to exploit these new tradeoffs and offer them as part of the Qubole Data Service (QDS). We will also present use-cases that show how QDS is making it extremely easy for an end user to use these technologies in the cloud.
Cloud-friendly Hadoop and Hive | Joydeep Sen Sharma, Sivaramakrishnan Narayanan |
| 12:20-1:30pm |
Lunch in Exhibit Hall |
| 1:30-2:10pm |
The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this talk, we'll discuss the evolution of our infrastructure and the development of capabilities for data mining on “big data”. One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life “in the trenches” is occupied by much preparatory work that precedes the application of data mining algorithms and is followed by substantial effort to turn preliminary models into robust solutions. In this context, we'll discuss two topics: First, schemas play an important role in helping data scientists understand petabyte-scale data stores, but they're insufficient to provide an overall “big picture” of the data available to generate insights. Second, we observe that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated together into production workflows, which we refer to as “plumbing”. We'll share our experiences as a case study, but also make recommendations for best practices and point out opportunities for future work.
Scaling Big Data Mining Infrastructure: The Twitter Experience | Jimmy Lin |
Apache Hadoop is clearly one of the fastest growing big data platforms to store and analyze arbitrarily structured data in search of business insights. However, applicable commodity infrastructures have advanced greatly in the last number of years and there is a dearth of accurate, current information to assist the community in optimally designing and configuring Hadoop platforms (Infrastructure and O/S). For example, how many disks and controllers should you use? Should you buy processors with 4 or 6 cores? Do you need a 1GbE or 10GbE Network? Should you use SATA or MDL SAS? Small or Large Form Factor Disks? How much memory do you need? How do you characterize your Hadoop workloads to figure out whether you are I/O, CPU, Network or Memory bound? How does one optimize Linux performance, reliability and availability for Hadoop? In this talk we'll present guidance on Linux and Infrastructure deployment, configuration and optimization from both Red Hat and HP (derived from actual performance data) for clusters optimized for single workloads or balanced clusters that host multiple concurrent workloads.
Optimizing your Infrastructure and Operating Systems for Hadoop | Steve Watt |
This session provides details on how comScore uses Hadoop to process over 1.4 trillion internet and mobile events per day to understand, analyze and produce information on what is happening on the Web worldwide. The talk will highlight the use of Hadoop to determine how activities at web sites translate into real user behaviors. Attendees will gain insight into how comScore has used Hadoop to handle the scalability needs of its Validated Campaign Essentials product. The talk will also detail how algorithms running on top of Hadoop combine information to develop broader insights into Internet usage.
Analyzing 1.4 Trillion events with Hadoop | Michael Brown |
Extracting value from Big Data is not easy. The field of technologies and vendors is fragmented and rapidly evolving. End-to-end, general purpose solutions that work out of the box don’t exist yet, and Hadoop is no exception. And most companies lack Big Data specialists. The key to unlocking real value lies with mapping the business requirements smartly against the emerging and imperfect ecosystem of technology and vendor choices.
There is a long list of crucial questions to think about. How fast is the data flying at you? Are your Big Data analyses tightly integrated with existing systems? Or parallel and complex? Can you tolerate a minute of latency? Do you accept data loss or generous SLAs? Is imperfect security good enough?
The answer to Big Data ROI lies somewhere between the herd and nerd mentality. Thinking hard and being smart about each use case as early as possible avoids costly mistakes.
This talk will illustrate how Deutsche Telekom follows this segmentation approach to make sure every individual use case drives architecture design and technology selection.
Don’t be Hadooped When Looking for Big Data ROI: How Use Case Segmentation drives Target Architectures and Technology Selection at Deutsche Telekom | Juergen Urbanski |
HDInsight is Microsoft’s 100% Apache-compatible Hadoop distribution, supported by Microsoft. HDInsight, available both on Windows Server and as a Windows Azure service, empowers organizations with new insights on data, while connecting to the most widely used Business Intelligence (BI) tools on the planet such as Excel and PowerPivot. This presentation looks at core components of HDInsight and integration with Microsoft BI tools.
Introduction to Microsoft HDInsights and BI tools | Abhijit Lele, Rohit Bakshi |
| 2:20-3:00pm |
Hadoop Distributed Filesystem (HDFS) is one of the core storage solutions in use at Facebook. One of the most notable use cases of HDFS at Facebook is our Hive data warehouse, used for collecting Facebook users' behaviors from the front-end. The warehouse cluster stores more than 100PB of data, with 500+ terabytes of data entered into the clusters every day. To meet the capacity requirement of future data growth, storing data in a cost-effective way becomes a top priority because a petabyte of disk space saved translates to hundreds of thousands of dollars of savings. This talk will present various solutions we use to reduce our warehouse cluster's data footprint: (1) Smart retention: suggest Hive table retention modification automatically based on partition access history; (2) Sort Hive partitions using selective columns to increase RCFile compression ratio; (3) HDFS file-level raiding to reduce the replication factor of warm and cold large files from 3 to a much lower ratio using XOR Code and Reed-Solomon Code; (4) Raiding millions of warm small files at the directory level; (5) Compact cold small files into large files in a raid-aware way to achieve the most replication factor reduction from file-level raiding. We will discuss in detail how each technique works, the challenges faced, lessons learned during deployment, and finally the results we have achieved.
Facebook's approach to big data storage challenge | Weiyan Wang |
This presentation will discuss best practices for designing and building a solid, robust and flexible Hadoop platform on an enterprise virtual infrastructure. Attendees will learn the flexibility and operational advantages of Virtual Machines such as fast provisioning, cloning, high levels of standardization, hybrid storage, vMotioning, increased stabilization of the entire software stack, High Availability and Fault Tolerance. This is a can't-miss presentation for anyone wanting to understand design, configuration and deployment of Hadoop in virtual infrastructures.
Best Practices for Virtualizing Hadoop | George Trujillo |
Big Data hype is everywhere – and to take some of the more breathless commentary at face value is to believe that over 30 years of information management best-practice has been rendered obsolete and irrelevant, almost overnight.
The reality is, of course, more complex; whilst new technologies and new types of analysis are already and demonstrably creating incredible new sources of value and competitive advantage for leading organisations, “traditional” Business Intelligence and Analytics also continue to evolve apace and are no less important and no less critical. And all technologies, whether shiny and new or mature and established, have strengths – and weaknesses.
In this presentation, we will argue that as IT professionals charged with charting a course through the hype our goal must be to enable ordinary end-users in our organizations to run any analytic, on any data at any time – and that to realise this goal will require us to deploy and transparently integrate multiple data management and analytic technologies in a “Unified Data Architecture”. We will present real-world use-cases that illustrate both how the new technologies are already creating value – and how they can be successfully combined with existing technology assets to even greater effect. Lastly, we will present a “big data manifesto” that summarises the challenges that the industry will need to embrace if we are to industrialise “big data analytics” over the course of the next several decades in the same way that we have successfully industrialised “traditional analytics” during the last three decades.
Dancing With The Elephant | Martin Willcox, Chris Hillman |
There are many Big Data problems whose output is also Big Data. In this presentation we will show Splout SQL, which allows serving an arbitrarily big dataset by partitioning it. Splout serves partitioned SQL views which are generated and indexed by Hadoop. Splout is to Hadoop + SQL what Voldemort or Elephant DB are to Hadoop + Key/Value. Hadoop is nowadays the de-facto open-source solution for Big Data batch-processing. When the output of a Hadoop process is big, there isn't a satisfying solution for serving it. Think of pre-computed recommendations, for example, where the whole dataset may vary from one day to another. Splout decouples database creation from database serving and makes it efficient and safe to deploy Hadoop-generated datasets. There are many databases that allow serving Big Data such as NoSQL solutions, but they don't have a rich query language like SQL. You generally can't aggregate data in real-time like you would do with a GROUP BY clause. Because you can't precompute everything, SQL is a very convenient feature to have in a Big Data serving solution. Splout is not a “fast analytics” engine. Splout is made for demanding web or mobile applications where query performance is critical. Arbitrary real-time aggregations should be done in less than 200 milliseconds under high traffic load. On top of that, Splout is scalable, flexible, RESTful & open-source.
Splout SQL: When Big Data output is also Big Data - A richer, open-source database "spout" for Hadoop | Ivan Prado Alonso |
Many data processing tasks can be thought of as small mutations to a large database triggered by events. Contrary to batch processing, the incremental processing model achieves very low delay from the reception of an event to the application of the mutation. In the absence of transaction support, developers have to use ad-hoc mechanisms to ensure atomic execution of mutations despite failures and concurrent accesses to the database by other clients. Most NoSQL data stores, like HBase, BigTable and Cassandra, lack support for transactions, which makes them unsuitable for a whole range of applications. In this talk we present Omid, an open source tool for transactional support and incremental processing on top of HBase (https://github.com/yahoo/omid). Due to the centralized nature of its client-replicated status oracle, Omid (i) avoids distributed locks, (ii) scales up to 60,000 TPS and a thousand clients, (iii) requires no changes to HBase, and (iv) adds a negligible overhead to data servers.
Omid: Efficient Transaction Management and Incremental Processing for HBase | Daniel Gomez Ferro |
| 3:10-3:50pm |
eBay has grown into one of the largest online marketplaces on the internet today, serving more than 100 million active users. The number of items listed on eBay and the fact that these items are sold across different channels make analytics a challenging proposition for this level of scale. In this session, hear about the aspects of analytics that present challenges for performance and scalability, and the core architectural components and design principles that eBay has used to address these challenges. In addition, learn about how Hadoop is planned to be used for building cost-effective, high-performance and scalable analytics applications.
Powerful Analytics Apps Fueled by Hadoop for High Performance and Scalability | Amit Rustagi |
Windows Azure HDInsight Service lets you embrace Hadoop, enabling you to seamlessly manage data of any type or size. Discover how to provision a Hadoop cluster on Windows Azure in minutes with easy management and monitoring. Take advantage of the elastic scale of Windows Azure. Explore a variety of developer tools from Java to JavaScript to develop for HDInsight Service. Finally, learn how everyone can easily glean insights from all their data, whether structured or unstructured, through familiar tools like Excel.
Drive Smarter Decisions with Hadoop and Windows Azure HDInsight Service | Matt Winkler |
Apache Hive is a data warehousing system for large volumes of data stored in Hadoop. However, the data is useless unless you can use it to add value to your company. Hive provides a SQL-based query language that dramatically simplifies the process of querying your large data sets. That is especially important while your data scientists are developing and refining their queries to improve their understanding of the data. In many companies, such as Facebook, Hive accounts for a large percentage of the total MapReduce queries that are run on the system. Although Hive makes writing large data queries easier for the user, there are many performance traps for the unwary. Many of them are artifacts of the way Hive has evolved over the years and the requirement that the default behavior must be safe for all users. This talk will present examples of how Hive users have made mistakes that made their queries run much, much longer than necessary. It will also present guidelines for how to get better performance for your queries and how to look at the query plan to understand what Hive is doing.
Optimizing Hive Queries | Owen O'Malley |
The Data Warehouse has been a staple in data-driven organizations for decades. As a result, the ecosystem, architecture, processes and methodologies around data warehousing are extremely mature. The arrival of Hadoop and Big Data has brought new life into traditional data warehousing by proposing new architectures and processes that upend existing norms. This presentation goes over several variants of how Hadoop interplays with existing data warehouses to solve modern problems.
Hadoop and the Enterprise Data Warehouse | Patrick Angeles |
Apache Hive is Hadoop’s SQL-like interface, used for reporting and analysis over huge volumes of data. Hive was released by Facebook in 2009 and is now used there to run more than 60,000 queries per day over more than 100 petabytes of data. Hundreds of companies use Hive in production for its reliable data processing and unmatched scale. Community activity in Hive is greater than ever before and 2013 is full of exciting new developments for Hive in both performance and analytics capabilities.
Come to this session to:
* Learn about how “Project Stinger” will achieve its goal to make Hive 100x faster than it has been in the past, enabling both more scalable analytics and human-time query
* Learn about Hive’s new analytical capabilities, windowing functions and standard SQL datatypes
What's New and What's Next in Apache Hive | Gunther Hagleitner |