| Day 1 » Wednesday, June 26thGo to Day 2 | |||||||
|---|---|---|---|---|---|---|---|
|
Tracks:
Hadoop Driven Business / Business Intelligence Reference Architectures Hadoop (Disruptive) Economics Future of Apache Hadoop Enterprise Data Architecture Deployment and Operations Applications and Data Science |
|||||||
| 7:30am - 8:30am | Breakfast | ||||||
| 8:30am - 10:30am | Keynote & Plenary | ||||||
| 10:30am - 11:00am | Break - Coffee and pastries | ||||||
| 11:00am - 11:40am | Session Abstract× Close Storm-on-YARN: Convergence of Low-Latency and Big-DataAndrew FengHadoop plays a central role for Yahoo! to provide personalized experiences for our users and create value for our advertisers. In this talk, we will discuss the convergence of low-latency processing and Hadoop platform. To enable the convergence, we have developed Storm-on-YARN to enable Storm streaming/microbatch applications and Hadoop batch applications hosted in a single cluster. Storm applications could leverage YARN for resource management, and apply Hadoop style security to Hadoop datasets on HDFS and HBase. In Storm-on-YARN, YARN is used to launch Storm application master (Nimbus), and enable Nimbus to request resources for Storm workers (Supervisors). YARN resource manager and Storm scheduler work together to support multi-tenancy and high availability. HDFS enables Storm to achieve higher availability of Nimbus itself. We are introducing Hadoop style security into Storm through JAAS authentication (Kerberos and Digest). Storm servers (Nimbus and DRPC) will be configured with authorization plugins for access control and audit. The security context enables Storm applications to access authorized datasets only (including those created by Hadoop applications). Yahoo! is making our contribution on Storm and YARN available as open source. We will work with industry partners to foster the convergence of low-latency processing and big-data. |
Session Abstract× Close Genie - Hadoop Platform as a Service at NetflixSriram KrishnanRecently in our tech-blog, we discussed the architecture of our petabyte-scale data warehouse in the cloud (http://nflx.it/XoySYR). Salient features include the use of Amazon`s Simple Storage Service (S3) as our “source of truth”, leveraging the elasticity of the cloud to run multiple dynamically-resizable Hadoop clusters to support various workloads, and our implementation of a horizontally-scalable Hadoop Platform as a Service called ?Genie?. In this presentation, we will focus on Genie, which provides job and resource management for the Hadoop ecosystem in the cloud, and is the core service that the various components of the enterprise ecosystem at Netflix use to integrate with Hadoop in the cloud. From the perspective of the end-user, Genie abstracts away the physical details of various (potentially transient) Hadoop resources in the cloud, and provides REST-ful APIs to submit and monitor Hadoop, Hive and Pig jobs without having to install any Hadoop clients. We will describe how Genie is used in production at Netflix for processing 100s of terabytes of data everyday, running thousands of ETL (extract, transform, load) jobs, plus hundreds of ad-hoc jobs from our visualization tools and our web interface. Finally, we will discuss our plans for open sourcing Genie. |
Session Abstract× Close Top Ten things to get the most out of your Hadoop clusterSanjay Radia, Suresh SrinivasThis talk describes top ten things that make it easier to run and manage your Hadoop system in production. We start with configurations, best practices in planning and setting up Hadoop clusters for reliability and efficiency. We include typical machine sizing and the tradeoffs of big vs small servers relative to cluster size. We cover how to implement a cluster for multi-tenancy with an eye on isolation and sharing cluster resources. Next we describe the tools available for managing the cluster, such as decommissioning, balancer, and metrics. We include best practices for monitoring a cluster and dealing with different kinds of failures. In particular we emphasise differences from traditional data center server management especially when dealing with failures of disks and nodes. We go over how to use the tools available for backup, Disaster Recovery and Archiving. We concluded with how to cope with storage and computation growth that Hadoop production clusters typically see. These lessons and tips have been derived from our extensive experience in running production Hadoop clusters and supporting customers over the last six years. We share anecdotes and real life incidents throughout the talk. |
Session Abstract× Close Securing the Hadoop EcosystemAaron T. Myers, Alejandro AbdelnurNowadays a typical Hadoop deployment consists of core Hadoop components – HDFS and MapReduce – several other components such as HBase, HttpFS, Oozie, Pig, Hive, Sqoop, Flume, plus programmatic integration from external systems and applications. This effectively creates a complex and heterogenous distributed environment that runs across several machines and uses different protocols to communicate with each other; all of which is used concurrently by several users and applications. When a Hadoop deployment and its ecosystem is used to process sensitive data (such as financial records, payment transactions, healthcare records), several security requirements arise. These security requirements may be dictated by internal policies and/or government regulations. They may require strong authentication, selective authorization to access data/resources, and data confidentiality. This session covers in detail how different components in the Hadoop ecosystem and external applications can interact with each other in a secure manner providing authentication, authorization, and confidentiality when accessing services and transferring data to/from/between services. The session will cover topics like Kerberos authentication, Web UI authentication, File System permissions, delegation tokens, Access Control Lists, ProxyUser impersonation and network encryption. |
Session Abstract× Close Asking the Right Questions of our Big Data Mike Peterson, Anand VenugopalExecutives are still waiting on our “Big Data Deep Insights”. Many of us are down the path of collecting, extracting, and analyzing our ever-growing data in Hadoop environments. We are building our data science expertise and expanding data governance. Yet still we are not getting what we are waiting for.This talk is about: |
Session Abstract× Close Enabling R on HadoopPaul Codding, Ravi MutyalaHadoop, being a disruptive data processing framework, has made a large impact in the data ecosystems of today. Enabling business users to translate existing skills to Hadoop is necessary to encourage the adoption and allow businesses to get value out of their Hadoop investment quickly. R, being a prolific and rapidly growing data analysis language, now has a place in the Hadoop ecosystem. With the advent of technologies such as RHadoop, optimizing R workloads for use on Hadoop has become much easier. This session will help you understand how RHadoop projects such as RMR, and RHDFS work with Hadoop, and will show you examples of using these technologies on the Hortonworks Data Platform. |
|
| 11:40am - 11:50am | Break | ||||||
| 11:50am - 12:30pm | Session Abstract× Close LinkedIn Member Segmentation Platform: A Big Data ApplicationHien LuuCreating member segmentations is one of the main functions of a marketing team at any Internet company. Marketing teams are constantly creating various member segments to tailor to the needs of marketing campaigns and these needs change frequently. Therefore there is a huge need for a self-service member segmentation platform that is easy to use and scalable to support large member data set. This presentation will go into the architecture of the LinkedIn Member Segmentation platform and how it leverages Hadoop technologies like Apache Pig, Apache Hive and enterprise data warehouse system like Teradata to provide a self-service way to create and manage member segmentations. In addition, it will also cover some of the interesting challenges and lessons learned from building this platform. |
Session Abstract× Close Hadoop in LoveVaclav PetricekeHarmony was founded to give people a better chance to find someone for a long lasting relationship. As one of the first companies we have applied advanced technology what became known as Data Science these days to the age old problem of matchmaking. Over the years eHarmony has accumulated vast amount of data on variety of romantic interactions. This data a is a treasure trove of entertaining tidbits and nuggets of insight into human nature. I will share some of those in hope that people may find them useful but more importantly I will also demonstrate how we actually use this data to make recommendations and give single people an upper hand in finding “The One”. In particular I will show how we utilize hadoop (YARN) to process billions of pairs of user profiles to find ngrams and other features that are predictive of romantic attraction and how we use the features discovered for large scale machine learning using vowpal wabbit`s allreduce parallell learning. Finally I am going to describe an optimization technique that decides what matches to deliver to who and when but which is more broadly aplicable to other domains such as advertising or constrained recommendations. |
Session Abstract× Close Managing your Hadoop clusters with Apache AmbariYusaku Sako, Jeff SposettiDeploying, configuring, and managing large Apache Hadoop and HBase clusters can be quite complex. Once you have your clusters, keeping them up and running and making sure that the SLAs are met presents even more challenges and headaches to Hadoop operators. To make matters worse, managing upgrades can be a nightmare. Hadoop users are presented with their own fair share of difficulties such as slow running jobs and not knowing why they are slow. For third-party software vendors interested in incorporating Hadoop management and monitoring capabilities, there does not seem to be an obvious, easy solution. Apache Ambari is aimed at making lives of Hadoop operators, users, and integrators simpler by providing a management interface to do all of that and more. This session presents usages of Ambari`s Web UI for Hadoop operators (deploying, managing, and monitoring) as well as Hadoop users (job analytics). The talk will also touch upon Ambari`s REST API and how it is used in the real world. The session concludes by revealing the future roadmap of Ambari including queue management, upgrade, disaster recovery, high availability, and more. |
Session Abstract× Close Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy ModernizationPhil Shelley, Sunil KakadeIn spite of recent advances in computing, many core business processes are batch-oriented running on Mainframes. Annual Mainframe costs are counted in 6+ figure Dollars per year, potentially growing with capacity needs. In order to tackle the cost challenge, many organizations have considered or attempted multi-year mainframe migration/re-hosting strategies. Traditional approaches to Mainframe elimination call for large initial investments and carry significant risks – It is hard to match Mainframe performance and reliability. Using Hadoop, Sears/MetaScale developed an innovative alternative that enables batch processing migration to Hadoop, without the risks, time and costs of other methods. This solution has been adopted in multiple businesses with excellent results and associated cost savings, as Mainframes are physically eliminated or downsized: Millions of dollars in savings based on MIP reductions have been seen – A reduction of 200 MIPS can yield $1 million in annual savings. MetaScale eliminated over 900 MIPs and an entire Mainframe system for one fortune 500 client. This presentation illustrates reference architecture and approach successfully used by MetaScale to move mainframe processing to the Hadoop platform without altering user-facing business applications. |
Session Abstract× Close High Speed Continuous & Reliable Data Ingest into HadoopOleg Zhurakousky, Tom McCuchThis talk will explore the area of real-time data ingest into Hadoop and present the architectural trade-offs as well as demonstrate alternative implementations that strike the appropriate balance across the following common challenges: * Decentralized writes (multiple data centers and collectors) * Continuous Availability, High Reliability * No loss of data * Elasticity of introducing more writers * Bursts in Speed per syslog emitter * Continuous, real-time collection * Flexible Write Targets (local FS, HDFS etc.) |
Session Abstract× Close Hadoop for High-Performance Climate Analytics - Use Cases and Lessons LearnedGlenn TamkinScientific data services are a critical aspect of the NASA Center for Climate Simulation’s mission (NCCS). Hadoop, via MapReduce, provides an approach to high-performance analytics that is proving to be useful to data intensive problems in climate research. It offers an analysis paradigm that uses clusters of computers and combines distributed storage of large data sets with parallel computation. The NCCS is particularly interested in the potential of Hadoop to speed up basic operations common to a wide range of analyses. In order to evaluate this potential, we prototyped a series of canonical MapReduce operations over a test suite of observational and climate simulation datasets. The initial focus was on averaging operations over arbitrary spatial and temporal extents within Modern Era Retrospective- Analysis for Research and Applications (MERRA) data. After preliminary results suggested that this approach improves efficiencies within data intensive analytic workflows, we invested in building a cyberinfrastructure resource for developing a new generation of climate data analysis capabilities using Hadoop. This resource is focused on reducing the time spent in the preparation of reanalysis data used in data-model intercomparison, a long sought goal of the climate community. This paper summarizes the related use cases and lessons learned. |
|
| 12:30pm - 1:45pm | Lunch | ||||||
| 1:45pm - 2:55pm | Session Abstract× Close Panel: When Worlds Collide: SQL meets HadoopTony BaerWhile Hadoop is evolving in many directions, in the Hadoop platform market, the hotspot today is convergence with SQL. While Hadoop projects such as Hive (and HQL) and Pig were designed with SQL in mind, vendors are rapidly planting their stakes, with many of them going their own ways. This brings new – and all-too-familiar – questions regarding the boundary between open source and proprietary (or vendor-specific) technologies, concerns about vendor lock-in, and whether the Hadoop-SQL interface will fork in the same way as SQL itself. With all of these approaches, the natural question for customers to ask is why is one better than the other. On the other side of the debate is the question of whether all this activity is necessary. Specifically, will the rush to Hadoop distract enterprises from realizing the true value of Hadoop – the ability to explore, analyze, and query new kinds of data, in new ways. We will throw these and other questions to the players who are driving Hadoop vendor SQL strategies. And before the session, we will invite you to tweet your own questions to us using the dual hashtags #HadoopSummit13 and #SQLcollide. Our panel will includes representatives from: Cloudera, IBM Software Group, Hortonworks, Pivotal, Datameer |
Session Abstract× Close Agile Data: Building Hadoop Analytics ApplicationsRussell JurneyMining data requires a deep investment in people and time. How can you be sure you’re building the right models? What tools help you connect with the customer’s needs? With this hands-on presentation, you’ll learn a flexible toolset and methodology for building effective analytics applications. Agile Data (the book) shows you how to create an environment for exploring data, using lightweight tools such as Python, Apache Pig, and the D3.js (Data-Driven Documents) JavaScript library. You’ll learn an iterative approach that allows you to quickly change the kind of analysis you’re doing, as you discover what the data is telling you. All the example code in this book is available as working web applications. We will cover how to: * Build an application to mine your own email inbox * Use different data structures and algorithms to extract multiple features from a single dataset, and learn how different perspectives can yield insight * Rapidly boot your applications as simple front-ends to a document store * Add features driven by descriptive and inferential statistics, machine learning, and data visualization * Gather usage data and talk to real users to help guide your data-driven exploration |
Session Abstract× Close Hadoop Graph Processing with Apache GiraphJay TangPayPal prvoides an online transfer money network. Each payment flow connects senders and receivers into a giant network where each sender/receiver is a node and each transaction is an edge. Traditionally, the risk score of a transaction is computed based on the characteristics of the involved sender/receiver/transaction. In this talk, we will describe a novel network inference approach to calculate transaction risk score that also includes the risk profile of neighboring senders and receivers using Apache Giraph. The approach reveals additional risk insights not possible with the traditional method. We leverage Hadoop to support a graph computation involving hundreds of millions of nodes and edges. |
Session Abstract× Close Demystifying Systems for Interactive and Real-time AnalyticsShivnath BabuA number of systems have been released recently for use in interactive and real-time analytics. Examples include Drill, Druid, Impala, Muppet, Shark/Spark, Storm, and Tez. It can be confusing for a practitioner to pick the best system for her specific needs. Statements like “this system is 10x better than Hive” can be misleading without understanding factors like: (i) the workload and environment where the improvement can be repeatably obtained, (ii) whether proper system tuning can change the result, and (iii) whether the results can be different under other workloads. Duke and two other research institutions are jointly conducting a large-scale experimental study with multiple systems and workloads in order to answer these questions of broad interest. The workloads used in the study represent new-generation analytics needs that cover a diverse spectrum including SQL-like queries, machine-learning analysis, graph and matrix processing, and queries running continuously over rapid data streams. The talk will use the results from this study to present the strengths and weaknesses of each system, and rigorously characterize the scenarios where each system is the right choice. Opportunities to improve the systems with new features or by cross pollination of features from multiple systems will also be presented. |
Session Abstract× Close Big Data Transformation Method and PracticeTony ShanThis presentation discusses a Big Data transformation method, which provides a step-by-step approach to migrating an existing application to a NoSQL solution. This holistic framework comprises several sequential components: current state assessment, pain point analysis, future state formulation, phased roadmap, platform comparison, Hadoop distribution selection, prototyping, benchmarking, migration, tuning, and operation. The details of input, output and activity in each stage are defined in the model. Next, to illustrate the effective use of this framework, we will walk through a real-world implementation case of a quality assurance product, which conducts concurrent analysis of large-volume quality data in the supply chain domain. A variety of patterns and techniques will be discussed to help justify the tradeoffs and accelerate the transition, such as NoSQL rationalization, decoupling for data access, bulk loading to HBase, hybrid of transactions and predictive analytics, and aggregation. Best practices and lessons learned will be shared as well during the session. |
Session Abstract× Close How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million.Rob RosenA Fortune 100 company recently introduced Hadoop into their data warehouse environment and ETL workflow to save $30 Million. This session examines the specific use case to illustrate the design considerations, as well as the economics behind ETL offload with Hadoop. Additional information about how the Hadoop platform was leveraged to support extended analytics will also be referenced. |
|
| 2:55pm - 3:05pm | Break | ||||||
| 3:05pm - 3:45pm | Session Abstract× Close HDFS - What is New and FutureSanjay RadiaThe current major release, Hadoop 2.0 offers several significant HDFS improvements including new append-pipeline, federation, wire compatibility, NameNode HA, Snapshots, and performance improvements. We describe how to take advantages of these new features and their benefits. We cover some architectural improvements in detail such as HA, Federation and Snapshots. The second half of the talk describes the current features that are under development for the next HDFS release. This includes much needed data management features such as backup and Disaster Recovery. We add support for different classes of storage devices such as SSDs and open interfaces such as NFS; together these extend HDFS as a more general storage system. Hadoop has recently been extended to run first-class on Windows which expands its enterprise reach and allows integration with the rich tool-set available on Windows. As with every release we will continue improvements to performance, diagnosability and manageability of HDFS. To conclude, we discuss the reliability, the state of HDFS adoption, and some of the misconceptions and myths about HDFS. |
Session Abstract× Close Should I use Scalding or Scoobi or Scrunch?Christopher SeversIn the past year there has been a tremendous amount of activity on Scala APIs for Hadoop. In this talk we`ll talk about writing Map/Reduce jobs in a more functional manner and explore the three most popular Scala packages for Hadoop: Scalding, Scoobi and Scrunch. Detailed usage examples will be provided for each along with some real world use cases. |
Session Abstract× Close Trends in Supporting Production Apache HBase ClustersJonathan Hsieh, Kevin O`DellApache HBase is a distributed data store that is in production today at many enterprises and sites serving large volumes of near-real-time random-accesses. By supporting a wide range of production Apache HBase clusters with diverse use cases and sizes over the past year, we?ve noticed several new trends, learned lessons, and taken action to improve the HBase experience. We?ll present aggregated root-cause statistics on resolved support tickets from the past year. The comparison between this and the previous year?s shows an interesting shift away from problems internal to HBase (splitting, repairs, recovery time) that skews towards user-inflicted problems like poor application architecture level that can be mitigated by tuning (bulk load, r/w latencies and compaction policies). The talk will discuss several tuning tips used for a variety of production workloads running on top of the HBase 0.92.x/0.94.x clusters with 10s to 100s of nodes. This will include settings and their justification for sizing clusters, tuning bulk loads, region counts, and memory settings. We?ll also discuss recently added HBase features that alleviate these problems including an improved mean time to recovery, improved predictability, and improved performance. |
Session Abstract× Close Business Rules on HadoopEgor GryaznovBusiness rules are widely used by enterprises in order to apply logic to their constantly growing data sets. There are many business rule management systems (BRMS) that facilitate this process, however they take a long time in order to process large scale datasets. Today, with information volumes measured in terabytes, standalone business rule engines are simply cannot keep up. With the advent of distributed computing technologies, such as Hadoop, performing jobs in parallel has become a much simpler and less stressful task. Many business rules are ?embarrassingly parallel?, which makes them perfect candidates for running in a parallel computing environment. This is due to the property of most rules to rely simply on a single record to execute and enrich that specific record. Even the business rules that do not have this property can be adapted to run in a parallel environment. In this presentation, I will use the Drools BRMS to show how to utilize Hadoop and the MapReduce paradigm in order to scale business rules to massive datasets. |
Session Abstract× Close Sharing resources with non-Hadoop workloadsMatthew FarrelleeEnterprise data centers house numerous workloads. With Hadoop growing in these data centers, IT departments need tools to avoid creating silos, while maintaining SLAs, reporting and charge-back requirements. We present a completely open source reference architecture including Apache Hadoop, Linux cgroups and namespace isolation, Gluster and HTCondor. Topics to be covered – . Augmenting existing HDFS and MapReduce infrastructure with dynamically provisioned resources . On-demand creating, growing and shrinking MapReduce infrastructure for user workload . Isolating workloads to enable multi-tenant access to resources . Publishing of resource utilization and accounting information for ingest into charge-back systems |
Session Abstract× Close Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha, Arun MurhtyTez is the next generation Hadoop Query Processing framework written on top of YARN. Computation topologies in higher level languages like Pig/Hive can be naturally expressed in the new graph dataflow model exposed by Tez. Multi-stage queries can be expressed as a single Tez job resulting in lower latency for short queries and improved throughput for large scale queries. MapReduce has been the workhorse for Hadoop but its monolithic structure had made innovation slower. YARN separates resource management from application logic and thus enables the creation of Tez, a more flexible and generic new framework for data processing for the benefit of the entire Hadoop query ecosystem. |
|
| 3:45pm - 4:15pm | Break - Sodas and snacks | ||||||
| 4:15pm - 4:55pm | Session Abstract× Close Parquet: Columnar storage for the PeopleJulien Le Dem, Nong LiWe would like to introduce Parquet, a columnar file format for Hadoop. Performance and compression benefits of using columnar storage formats for storing and processing large amounts of data are well documented in academic literature as well as several commercial analytical databases. Parquet supports deeply nested structures, efficient encoding and column compression schemes, and is designed to be compatible with a variety of higher-level type systems. It is available as a standalone library, allowing any Hadoop framework or tool to build support for it with minimal dependencies. As of this release, Parquet is supported by Apache Pig, plain Hadoop Map-Reduce, and Cloudera?s Impala, and is being put into production at Twitter. We will discuss Parquet?s design and share performance numbers. |
Session Abstract× Close Data Science with Hadoop - A PrimerOfer MendelevitchApache Hadoop is quickly becoming the technology of choice for organizations investing in big data, powering their next generation data architecture. With Hadoop serving as both a scalable data platform and computational engine, data science is re-emerging as a center-piece of enterprise innovation, with applied data solutions such as online product recommendation, automated fraud detection and customer sentiment analysis. In this talk Ofer will provide an overview of data science and how to take advantage of Hadoop for large scale data science projects: * What is data science? * How can techniques like classification, regression, clustering and outlier detection help your organization? * What questions do you ask and which problems do you go after? * How do you instrument and prepare your organization for applied data science with Hadoop? * Who do you hire to solve these problems? You will learn how to plan, design and implement a data science project with Hado |
Session Abstract× Close Secure Hadoop @eBayBenoy Antony, Jos BackuseBay has large, multi-purpose Hadoop clusters with many petabytes of data. eBay Inc?s many subsidiaries and applications give rise to fascinating, complex scenarios for authentication, audit, access control, data protection, data safety, and privacy. To fulfill these complex requirements, security is enabled on Hadoop clusters at eBay. We have been experimenting with and implementing in our production systems several techniques to 1) Set up and operate large clusters (thousands of nodes) efficiently and effectively for a thousand users, and 2)Enable security transparently without impacting user jobs or over-burdening users with inconvenient restrictions Based on our experiments, we have assembled some effective ?best? practices for deploying, operating and using secure Hadoop?clusters. In this presentation, we explain these methods and rules-of-thumb that efficiently and effectively: 1)Set up reliable, scalable Infrastructure to enable strong security for large clusters 2)Set up a very large secure Hadoop cluster in a scalable, zero-touch automated way 3)Update software and configuration on the cluster efficiently while minimizing ?drift? among machines 4)Keep Hadoop services up and running with minimum human intervention 5)Manage user Access to Hadoop clusters? 6)Control access to data and other resources 7)Process highly confidential data using map-reduce programs |
Session Abstract× Close SQL on Hadoop: Defining the New Generation of Analytic DatabasesCarl SteinbachThe analytics and data warehousing industries are in the midst of a major period of transformation. Since the publication of Google?s MapReduce paper, we have witnessed the appearance of Apache Hadoop, followed by the arrival of batch-oriented SQL systems like Apache Hive, and the scramble by established SQL vendors to implement Hadoop connectors. This talk addresses the recent emergence of a new generation of analytic databases inspired by Google Dremel. These databases have been designed with the goal of running real-time SQL natively on Hadoop in a manner that fully exploits the flexibility and performance of the underlying platform. Characterized by features including schema-on-read, support for semi-structured data, and pluggable storage engines, these new systems share important architectural details that distinguish them from the previous generation of analytic databases. In this talk, we will discuss the performance limitations of the connector-based approach employed by many established vendors and explain the long-term significance of Apache Hive?s data model. Then, we will unravel the novel architectural features common to next generation analytic database systems like CitusDB and Impala that make real-time SQL-on-Hadoop feasible. Finally, we will conclude by reviewing several important database lessons learned over the previous decades that remain relevant today. |
Session Abstract× Close Simplifying Use of Hive with the Hive Query ToolStephen ScaffidiAs TripAdvisor moves increasing amounts of data into Hadoop and Hive, the need for simplifying, controlling, and expanding access to this data has grown. Having reviewed existing solutions without finding what we needed, we began working on our own solution to meet our specific goals and use-cases. The Hive Query Tool (HQT) is a web interface that allows anybody to configure and run Hive queries without requiring client-side installation or even knowledge of the query language. Users familiar with HQL can add sophisticated and highly customizable queries with a flexible and powerful template system. A primary innovation, the template system, allows one to define the inputs available to the end-user, validation checks, and what HQL to generate, easily and concisely. We plan to release the code as open-source. This talk will discuss: – The features of the HQT and how it is used for business intelligence – The challenges it was built to meet and how its design and architecture addresses them – Installing and running an HQT server – How to use, customize, and expand the template system – Known limitations and issues – Future plans and features |
Session Abstract× Close Efficient processing of large and complex XML documents in HadoopSujoe BoseMany systems capture XML data in Hadoop for analytical processing. When XML documents are large and have complex nested structures, processing such data repeatedly would be inefficient as parsing XML becomes CPU intensive, not to mention the inefficiency of storing XML in its native form. The problem is compounded in the Big Data space, when millions of such documents have to be processed and analyzed within a reasonable time. In this talk an efficient method is proposed by leveraging the Avro storage and communication format, which is flexible, compact and specifically built for Hadoop environments to model complex data structures. XML documents may be parsed and converted into Avro format on load, which can then be accessed via Hive using a SQL-like interface, Java MapReduce or Pig. A concrete use-case is provided that validates this approach along with variations of the same and their relative trade-offs. |
|
| 4:55pm - 5:05pm | Break | ||||||
| 5:05pm - 5:45pm | Session Abstract× Close Hello OpenStack, meet HadoopJohn Speidel, Ilya EltermanHadoop is often viewed as needing racks of dedicated boxes -despite the fact that in sheer number terms, the majority of Hadoop clusters ever created have been brought up on public cloud infrastructures -particularly Amazon`s. Yet the rest of datacenter computing is moving towards virtualization -be it in-cloud startups or in-enterprise IT departments. Some organizations are standing up private clouds: a rack or two of servers with an API for VM creation. Hadoop can live there -it just needs to integrate better. At the same time, OpenStack is emerging as the de-facto standard open source cloud platform for private use, and is available publicly from a number of cloud infrastructure service providers. This talk looks at what we`ve done -and are doing- to integrate Hadoop with OpenStack. This is taking it beyond Hadoop`s current support for Amazon`s infrastructure, making a combined Hadoop + OpenStack cluster something to consider in-house -and in-cloud. This work is being done in collaboration with members of the OpenStack community, showing how cloud and big data projects can not only co-exist, we can co-develop our platforms. |
Session Abstract× Close Pattern - an open source project for migrating predictive models from SAS, etc., onto HadoopPaco Nathan“Pattern” is an open source project which takes models trained in popular analytics frameworks, such as SAS, Microstrategy, SQL Server, etc., and runs them at scale on Apache Hadoop. This machine learning library works by translating PMML — an established XML standard for predictive model markup — into data workflows based on the Cascading API in Java. PMML models can be run in a pre-defined JAR file with no coding required. PMML can also be combined with other flows based on ANSI SQL (Lingual), Scala (Scalding), Clojure (Cascalog), etc. Multiple companies have collaborated to implement parallelized algorithms: Random Forest, Logistic Regression, K-Means, Hierarchical Clustering, etc., with more machine learning support being added. Benefits include greatly reduced development costs and less licensing issues at scale ?- while leveraging a combination of Apache Hadoop clusters, existing intellectual property in predictive models, and the core competencies of analytics staff. Sample code in the talk will show apps using predictive models built in SAS and R, e.g., anti-fraud classifiers. In addition, examples will show how to compare variations of models for large-scale customer experiments. Portions of this material come from the O`Reilly book “Enterprise Data Workflows with Cascading”, due June 2013. |
Session Abstract× Close A cluster is only as strong as its weakest linkDan RomikeEarly detection and correction of cluster health issues is a vital part of daily cluster management, no matter the size. Building and managing a healthy cluster is the best cure for meeting service level agreements and preventing or avoiding elongated troubleshooting. A cluster is effective and efficient when problems are detected and eliminated early. Fortunately, deploying simple tools and processes prevents minor problems from becoming major headaches. This talk covers how we developed, tested, and deployed a comprehensive health process based on real life events and experiences. The table driven health check runs a full scan in ~2 seconds and includes: a checklist, ‘positive’ error pattern matching, enabling and disabling node blacklisting, logging, validating file systems, processing very large log files, trapping in-rack network faults (adds 5 seconds to accurately detect packet loss), and recommissioning nodes into production. |
Session Abstract× Close Next Generation Analytics: A Reference ArchitectureZubin DowlatyWhat is the reference architecture for creating an analytics shop in today?s environment? This presentation will review how businesses are mapping out their next generation analytics architecture based on various decision layers within the Hadoop ecosystem. The architecture also takes into consideration how data engineering, data science, decision science, and decision support play a role in this architecture. This framework recommends how modern analytical departments structure their technology thinking, as well as locate the gaps. This presentation will provide attendees with a better understanding of how to map out their next generation analytics architecture within the Hadoop ecosystem: -Review standard examples witnessed in the industry, compare industry standard vs. reference architecture. Highlight areas where companies are capable, as well as under-represented areas. – Explore why the decision science layer is one of the most omitted areas in production analytics departments. -Review the importance of scale. Predictive methods are required to scale analytics into the operation. – Explain impact of intelligent systems on the technology selection, physics for data in flight vs. data at rest . – Constructing model management systems that are implemented into production. Discuss possible solutions of this perplexing problems. |
Session Abstract× Close Deploying Apache Flume to enable low-latency analyticsMike PercyThe driving question behind redesigns of countless data collection architectures has often been, ?how can we make the data available to our analytical systems faster?? Increasingly, the go-to solution for this data collection problem is Apache Flume. In this talk, architectures and techniques for designing a low-latency Flume-based data collection and delivery system to enable Hadoop-based analytics are explored. Techniques for getting the data into Flume, getting the data onto HDFS and HBase, and making the data available as quickly as possible are discussed. Best practices for scaling up collection, addressing de-duplication, and utilizing a combination streaming/batch model are described in the context of Flume and Hadoop ecosystem components. |
Session Abstract× Close Windows on Hadoop: Integrating the Desktop and the Apache Hadoop Linux PlatformWilliam HeinzmanWith its many applications and available presentation frameworks, the Microsoft Windows desktop can enable enterprise-wide business intelligence tools providing query, analysis, and reporting in a common and familiar environment. However, integrating the desktop, particularly to .NET, can be challenging, especially when cross-platform interoperability to Apache Hadoop running on Linux platforms is a requirement. Cross-platform interoperability technologies must not only be flexible and robust, they must also maximize ease of development and enable minimum time to deployment, particularly where silos of expertise for Hadoop/Linux/Java and Windows/.NET are present, but separate, in the enterprise. This session will look at current cross-platform technologies-both open source and third-party-in relation to design requirements, architecture and implementation. Using the available interoperability technologies with the current APIs in the Hadoop stack for HDFS, HBase and Hive, the paper will place emphasis on the integration ease, required expertise, robustness and life cycle management necessary for the enterprise to deploy custom desktop front-ends for Hadoop. |
|
| 5:45pm - 7:00pm | Exhibitor Reception | ||||||
| 7:00pm - 10:00pm | Party | ||||||
| Day 2 » Thursday, June 27thGo to Day 1 | |||||||
|---|---|---|---|---|---|---|---|
|
Tracks:
Hadoop Driven Business / Business Intelligence Reference Architectures Hadoop (Disruptive) Economics Future of Apache Hadoop Enterprise Data Architecture Deployment and Operations Applications and Data Science |
|||||||
| 7:30am - 8:30am | Breakfast | ||||||
| 8:30am - 10:00am | Keynotes and Plenary | ||||||
| 10:00am - 10:30am | Break - Coffee and pastries | ||||||
| 10:30am - 11:10am | Session Abstract× Close A Birds-Eye View of Pig and Scalding Jobs with hRavenGary Helmling, Joep RottinghuisAs Twitter’s use of mapreduce rapidly expands, tracking usage on our clusters grows correspondingly more difficult. With an ever increasing job load, and a reliance on higher level abstractions such as Pig and Scalding, the utility of existing tools for viewing job history decreases rapidly, and extracting insights becomes a challenge. At Twitter, we created hRaven to fill this gap. hRaven archives the full history and metrics from all mapreduce jobs on our clusters, and strings together each job from a Pig or Scalding script execution into a combined flow. From this archive, we can easily derive aggregate resource utilization by user, pool, or application. While the historical trending of an individual application allows us to perform runtime optimization of resource scheduling. We will cover how hRaven provides a rich historical archive of mapreduce job execution, and how the data is structured into higher level flows representing the job sequence for frameworks such as Pig, Scalding, and Hive. We will then explore how we mine hRaven data to account for Hadoop resource utilization, to optimize runtime scheduling, and to identify common anti-patterns in user jobs. Finally, we will look at the end user experience, including Ambrose integration for flow visualization. |
Session Abstract× Close An In-depth Look at Putting the Sting in HiveAlan GatesApache Hive is the most widely used SQL interface for Hadoop. As Hadoop usage continues its explosive growth, Hive`s performance and features do not meet the requirements and expectation of many users. This includes answering queries in human time (less than 30 seconds) and support for common analytics operations. The Hive community has risen to the challenge. Work is being done to drive down start up time of a Hive query, extend Hive to work on Tez (a Hadoop execution environment that is much faster than MapReduce), make Hive operators process records at 10x more than their current speed, add support for analytics and windowing functions such as RANK, NTILE, LEAD, LAG, etc., and add support to Hive for standard SQL datatypes. This talk will discuss the design and code changes that have been done as well as look at ongoing work and additional optimizations and features that could be added in the future. |
Session Abstract× Close Continuous Integration for the Applications on top of Hadoop.Wisely Chen, Neal LeeWriting pig script, push to cloud, waiting, get wrong answer , reprogram pig script…. tired of that development process. CI can help you. We setup a standard CI process using some open source tool like Jenkins,git,Puppet,hadoop,pig,pigunit, Ubuntu, dpkg. This process will get fast feedback if the pig script is wrong and improve your productive. We will give a brief demo in this talk. You can find the CI flow diagram in this https://docs.google.com/document/d/1lS_JgW1CPSWdTfL3DHR0tbCU9Sw3CnqbRD56VvNvnHM/edit . Bellow is the demo technical detail. The OS is Ubuntu and package system is dpkg. We use Jenkins for CI server. Git is source code repos and we use Puppet for deployment. Pigunit are the unit testing language. This abstract is approved by Yahoo! internal. |
Session Abstract× Close Using Hadoop for Vital signs and EMR data in Healthcare Research and Patient CareMohamed ElmallahResearchers and care providers wanted to have access to all of the patients` vitals signs (temperature, blood pressure, heart rate, and respiratory rate) but most of this data wasn?t recorded, only a few readings a day were posted to the patients Electronic Medical Record (EMR). The EMR isn`t meant to store such volume of data, let alone to perform any data mining on it. This session will describe the architecture of the solution that was implemented to collect these vital signs automatically from Bedside Medical Devices (BDMI), and store them into a temporary storage, then load them into a Hadoop cluster. The session will also cover how the team married this vital signs data in the HDFS (Hadoop File System) with the rest of the EMR data for our Principles Investigators (PI) in our research institute to search for correlations between administered medications, diagnosis, and vital signs readings. The session will describe the reasons behind the design decisions that were made, such as using a Cloud Hadoop cluster versus on-premises while maintaining HIPAA. |
Session Abstract× Close Wrangling Customer Usage Data with HadoopMatt Johnson, Carmen HallAt Clearwire we have a big data challenge: Processing millions of unique usage records comprising terabytes of data for millions of customers every week. Historically, massive purpose-built database solutions were used to process data, but weren?t particularly fast, nor did they lend themselves to analysis. As mobile data volumes increase exponentially, we needed a scalable solution that could process usage data for billing, provide a data analysis platform, and inexpensively store the data indefinitely. The solution? A Hadoop-based platform allowed us to architect and deploy an end-to-end solution based on a combination of physical data nodes and virtual edge nodes in less than six months. This solution allowed us to turn off our legacy usage processing solution and reduce processing times from hours to as little as 15-min. This improvement has enabled Clearwire to deliver actionable usage data to partners faster and more predictably than ever before. Usage processing was just the beginning; we?re now turning to the raw data stored in Hadoop, adding new data sources, and starting to analyze the data. Clearwire is now able to put multiple data sources in the hands of our analysts for further discovery and actionable intelligence. |
Session Abstract× Close Where to deploy Hadoop: Bare metal or Cloud?Sewook WeeDeciding the deployment model is critical when enterprises adopt Hadoop. Initially, the bare metal (on-premise cluster with physical servers) model was popular to avoid I/O overhead in the virtualized environments. However, these days, cloud is also a contending option with its compelling cost savings, and ease of operation. To aid in assessing the deployment options, Accenture Technology Labs developed Accenture Data Platform Benchmark suite, a total cost of ownership (TCO) model and has tuned and compared performance of bare metal Hadoop clusters and Hadoop cloud service. Interestingly enough, the study discovered that price/performance ratio is not a critical factor in making a Hadoop deployment decision. Employing empirical and systemic analyses, the study resulted in comparable price/performance ratio from both bare metal Hadoop clusters and Hadoop-as-a-service. Moreover, cheaper purchasing options (e.g., long term contracts) provides better ratio than the bare metal one in many cases. Thus, this result debunks the idea that the cloud is not suitable to Hadoop MapReduce workloads due to their heavy I/O requirements. Furthermore, the study finds that the Hadoop default configuration provides ample headroom for performance tuning, and the cloud infrastructure enables even further performance tuning opportunities. |
|
| 11:10am - 11:20am | Break | ||||||
| 11:20am - 12:00pm | Session Abstract× Close Realizing the Enterprise Data ReservoirBen WertherParalleling the rise of Hadoop, there is a new architecture taking hold at leading enterprises (in financial services, web, retail and elsewhere) that turns 30 years of data practices on its head. The ‘data reservoir’ is not an ETL dumping ground for all the data that hasn’t yet been promoted to the data warehouse. It is the opposite — a centralized Hadoop repository where a second copy of siloed data from across the company can be sent, including both transactional warehouse data and newer log/event style datasets, to allow much more exploratory and dynamic cross-functional discovery and data analysis. The Enterprise Data Reservoir is the successor to the Enterprise Data Warehouse — allowing unanticipated questions against all classes of data to be pursued immediately, rather than after 12 months of painstaking ETL, modeling and architecture. But realizing its benefits requires a realistic view of architectural patterns, security and governance, metadata management, scalable interactive exploration/analysis, and more. In this session I’ll share our learning from working with leading implementers, and paint the vision for the Data Reservoir that we find so compelling and transformative. |
Session Abstract× Close Mahout and Scalable Natural Language ProcessingCasey StellaPeter Norvig, the Director of Research at Google, said in the Amazon book review[1] for the book “Statistical Natural Language Processing” “If someone told me I had to make a million bucks in one year, and I could only refer to one book to do it, I`d grab a copy of this book and start a web text-processing company.” Despite this vast availability of free-form text, without a scalable way to process this data we have missed an opportunity. Apache Mahout is an open source machine learning library built atop Hadoop. We will cover how to use it to do some interesting, scalable machine learning (classification, clustering, etc.) on free-form text. We will learn about classification using Random Forests, clustering beyond simple K-Means using Latent Dirichlet Allocation. We will investigate the limitations and abilities of this platform together as it relates to natural language. 1. http://www.amazon.com/review/R3GSYXSKRU8V17 |
Session Abstract× Close Jubatus: real-time and highly-scalable machine learning platformShohei HidoReal-time analytics relates to many critical applications with Big data. Machine learning is a set of computational algorithms for understanding data and predicting the future for accurate decision making. Thus both will be the key factors in Big Data analytics. Though Hadoop-based frameworks such as Mahout are available, there was no framework for distributed AND real-time machine learning. The difficulty lies in the integration of online algorithms and distributed ones since online computation contradicts with distributed environment. Jubatus is the first OSS platform for scale-out deep analytics on Big Data. It efficiently realizes distributed and online machine learning by using three fundamental operations. The key is to share not data samples but only trained models between the servers by MIX operation to achieve both high scalability and fast convergence. UPDATE and ANALYZE correspond to model training and applying, and they work locally on each server so that adding more servers linearly improves the throughput. Jubatus is being used in an official service for reselling Japanese tweets taken from Twitter?s Firehose stream. We expect that Jubatus applications will expand into other areas involving bigger and faster data such as healthcare image analysis, IoT/M2M sensors, and network security. See http://jubat.us/ for more details. |
Session Abstract× Close The 3 T's - Using Hadoop to modernize with faster access to data and valuePhil Shelley, Wuheng LuoNear real-time, big data analytics is a reality via a new data pattern that avoids the latency and overhead of legacy ETL–the 3 T’s of Hadoop: Transfer, Transform, and Translate. Transfer: Once a Hadoop infrastructure is in place, a mandate is needed to immediately and continuously transfer all enterprise data, from external and internal sources and through different existing systems, into Hadoop. Previously, enterprise data was isolated, disconnected and monolithically segmented. Through this T, various source data are consolidated and centralized in Hadoop almost as they are generated in near real-time. Transform: Most of the enterprise data, when flowing into Hadoop, is transactional in nature. Analytics requires data be transformed from record-based OLTP form to column-based OLAP. This T is not the same T in ETL as we need to retain the granularity in the data feeds. The key is to transform in-place within Hadoop, without further data movement from Hadoop to other legacy systems. Translate: We pre-compute or provide on-the-fly views of analytical data, exposed for consumption. We facilitate analysis and reporting, for both scheduled and ad hoc needs, to be interactive with the data for analysts and end users, integrated in and on top of Hadoop. |
Session Abstract× Close Hadoop -- Enabling Expanded Financial Market Analysis Techniques while Improving Investment PerformanceKevin CooganFinancial market analysis has traditionally utilized structured data. Furthermore financial market participants have normally segregated themselves into two main camps; fundamental analysis (ie income statements, accounting information, business strategy, etc.) and technical analysis (ie asset prices, traded volume, market-related data, etc.). Unstructured data, meaning everything from news to press releases to interviews, was left to individual interpretation or ignored. Underperformance of hedge funds and the implosion of investment banks have accelerated the search for new innovative approaches. Big data and Hadoop will be at the core of financial market innovation going forward. Hadoop?s ability to consolidate multi-structured data in a scalable and easily accessible way will turn the advent of big data from a threat to an opportunity. The silo-ed analytical approaches are being replaced by a multi-factor approach based on multi-structured data. The focus of the session will be on big data as the catalyst for change within the financial market community, Hadoop as the enabler, and multiple successful applications of this form of analysis in the financial markets. |
Session Abstract× Close Using Hadoop to do Simulation for High Frequency TradingPaul HaefeleDeep Value has been using Hadoop to do simulations of trading strategies that trade over 3.5% of the US stock market. We provide both high frequency market making and execution strategies. Our largest customer is the NYSE where we provide execution services to the floor broker community. We have taken our high performance, fault tolerant Java trading engine and adapted it to run as a Map-Reduce job. Our execution engine Mapper is then used to pull out the order-by-order data of all orders going into the US stock market and replay these against our production algorithmic logic. We do this to understand if any changes made to the algorithmic logic improve the overall performance of our trading. However this approach, although solving one set of issues (“is this approach better than than that”), creates a new set of challenges. These include not blowing our compute budget (EC2 costs add up so we built our own 50 server base cluster), and deal with the escalating data that these simulations generate. Luckily these are first world problems that Hadoop itself can help us address. We will describe how we went about converting our execution engine to use Hadoop and what components are needed to build a suitable trading simulation environment. We will also examine the types of analysis that we have build on top of the trading data that have helped us us understand what we are doing. |
Session Abstract× Close Feeding the Elephant: Approaching 1PB/DayAaron WiebeAt BlackBerry we had a complex problem: several dozen services, fully distinct in their instrumentation and log formats and with wildly different needs in scale and analysis. The biggest single problem we faced was how to feed that data into Hadoop, and how to manage it once it was there. In this session we will review the use cases that led to the creation of LogDriver, our toolkit for loading, analyzing and managing logs in Hadoop. |
| 12:00pm - 1:15pm | Lunch | ||||||
| 1:15pm - 1:55pm | Session Abstract× Close Hadoop Hardware @Twitter: Size does matter!Joep Rottinghuis, Jay ShenoyAt Twitter we started out with a large monolithic cluster that served most of the use-cases. As the usage expanded and the cluster grew accordingly, we realized we needed to split the cluster by access pattern. This allows us to tune the access policy, SLA, and configuration for each cluster. We will explain our various use-cases, their performance requirements, and operational considerations and how those are served by the corresponding clusters. We will discuss what our baseline Hadoop node looks like. Various, sometimes competing, considerations such as storage size, disk IO, CPU throughput, fewer fast cores versus many slower cores, 1GE bonded network interfaces versus a single 10 GE card, 1T, 2T or 3T disk drives, and power draw all need to be considered in a trade-off where cost and performance are major factors. We will show how we have arrived at quite different hardware platforms at Twitter, not only saving money, but also increasing performance. |
Session Abstract× Close Watching Pigs Fly with the Netflix Hadoop ToolkitJeff Magnusson, Charles SmithFrameworks and technologies in the Hadoop ecosystem are undergoing rapid innovation, but the open source tooling around usability has lagged behind. We will present a suite of tools, deployable on top of the Hadoop ecosystem, that enables even non-technical users to develop, tune, and maintain efficient Pig workflows and easily interact with and visualize datasets. Netflix?s big data teams have worked for the past year implementing this framework in the AWS cloud. During that time, we have seen a massive influx of data and a corresponding increase in new development on our platform. This toolset has been a critical enabler in minimizing development time and effort. Using the development of a recommendation algorithm as an example, we?ll walk through use cases for this stack of tools, showing how they interact to facilitate development. The presentation will include demos, implementation details, and our roadmap to open source various key services in the framework, including restful services that: provide comprehensive metadata management across data sources; enable visualization and caching of results of Hadoop jobs; visualize the execution plans produced by languages such as Pig and Hive; and provide detailed analytics on the currently executing workload and trends in historical performance. |
Session Abstract× Close Building a Real-time Data Pipeline: Apache Kafka at LinkedInNeha NarkhedeApache Kafka is a simple, high-performance, distributed, fault-tolerant messaging system. It was initially developed at LinkedIn and is now used at many companies, including Twitter, Square, Mozilla, Foursquare, and Tumblr. This talk will cover the architecture of Kafka and how LinkedIn uses Kafka to build a distributed low-latency pipeline that handles all messaging, tracking, logging, and metrics data. This unified pipeline provides data feeds into Hadoop and a diverse set of user-facing real-time stream processing applications. We will describe the lessons learned scaling this service to thousands of data feeds and many terabytes of messages per day. |
Session Abstract× Close Enabling data management in a Big Data worldCraig Soules, Garth GoodsonHadoop has enabled a new scale of data processing that is paving the way for data driven businesses. However, business data is often riddled with compliance and regulatory requirements that can be easily lost as data is manipulated, transformed, and re-written within the Hadoop eco-system. Furthermore, enterprise data is often scattered across a wide array of systems, each with their own techniques for policy management. As data from these disparate systems is loaded into Hadoop, all of the carefully crafted policy is immediately lost, creating a potential risk for the business. Data provenance is widely recognized as a technique for applying policy in more traditional industries such as storage, databases and high-performance computing. By tracking data from its origin and across various transformations and computations, provenance tracking systems can answer questions such as: Who has seen a given piece of data? Where did this data come from? What policies existed on this data? In this talk, we will discuss traditional data management solutions, the challenges of bringing them to an eco-system like Hadoop, and approaches to enable data management in the growing Big Data world. |
Session Abstract× Close Hadoop and HBase @eBayMing MaeBay has one of the largest Hadoop clusters in the industry with many petabytes of data. This talk will give an overview of how Hadoop and HBase have been used within eBay, the lessons we have learned from supporting large-scale production clusters, as well as how we plan to use and improve Hadoop and HBase moving forward. Specific use cases, production issues and platform improvement work will be discussed. |
Session Abstract× Close High Performance Predictive Analytics in R and HadoopLee EdlefsenHadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high performance complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMA’s). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS. |
|
| 1:55pm - 2:05pm | Break | ||||||
| 2:05pm - 2:45pm | Session Abstract× Close Phoenix: How (and why) we put the SQL back into the NoSQLJames TaylorPhoenix is an open source project from Salesforce.com that puts a SQL layer on top of HBase, a NoSQL store. This talk will focus on answering two questions: 1) why put a SQL layer on top of a NoSQL store? and 2) how does Phoenix marry the SQL paradigm back together with the NoSQL world? Salesforce.com uses relational database technology extensively, so a big part of the ?why? for us is to provide developers with a familiar API to read and write their data. Another reason is to provide a higher level abstraction such that performance optimizations can be done without impacting the client API. Two good examples are with filtering data and aggregating data. Both operations can be executed on the client or the server-side with equivalent functional results. However, the performance characteristics between these two choices is dramatically different. This leads to the second question, of ?how? to implement a SQL layer on top of a key/value store. Here we?ll demonstrate the openness of the HBase API and how it provides enough flexibility to support a higher level language such as SQL to be surfaced, while at the same time allowing us to obtain an order of magnitude performance improvement over prior attempts to do the same. We?ll focus on how HBase enables us to follow the principle of ?bringing the computation to the data? to achieve these dramatic performance improvements. |
Session Abstract× Close Running YARN at scaleRobert EvansAt Yahoo! over the past year we have helped migrate hundreds of our grids? users to YARN. Our YARN clusters have in aggregate run over 18 million jobs with more than 3 billion tasks consuming over 10 thousand years of compute time. With one single cluster running 90 thousand jobs a day. From this experience we would like to share what we have learned about running YARN well, how this is different from running a 1.0 based cluster, and what it takes to migrate your jobs to YARN from 1.0. |
Session Abstract× Close Durkheim Project: Social Media Risk & Bayesian CountersChris Poulin, Alex KozlovCited by a 2012 TIME Magazine cover story (“One A Day”) suicide, particularly the military, is a severe public health problem: Veteran suicide rates, nearly double those of adults in the general U.S. population. And to date there has been a lack of success so far in military efforts to understand and address the suicide crisis: “No program, outreach or initiative has worked against the surge in Army suicides, and no one knows why nothing works.” (Time) In this talk we will describe how we have built a real time risk assessment framework with the US Veterans Administration. As well as how Hadoop and HBase are being used to build further systems based on our new Bayesian Counters framework to predict realtime risk. Bayesian Counters framework was, in part, developed to predict military mental health risks. Trying to help to solve this complicated puzzle, towards the goal of reducing suicidality among those who have served the nation. |
Session Abstract× Close Go Beyond 'Debug': Wire Tap your App for Knowledge with HadoopOleg Zhurakousky, Tom McCuchToday, application developers devote roughly 80% of their code to persisting roughly 20% of the total data flowing through the applications. That means two things: * 80% of the data flowing through our applications is at best lost in rolling log files, at worst never collected — without ever being analyzed or accounted for. * Application-level database programming, licensing, storage, administration, and ETL processing have maxed out IT budgets and have constrained app development teams from keeping pace with the rate of change in the business. The other 80% of the data is “Event Data” that can no longer be ignored if you want to stay competitive. Changes to application state are already stored as a sequence of events in application and middleware logs. In fact, since this data never held value to anyone but the developer in the past, a lot of potentially valuable information is often never collected. With Hadoop, we can: * store and query these events – Transaction tracing, * use the event log to reconstruct the application domain at any point in time – ETL, * use the same event log to construct new domains we haven`t planned for – ELT, and * automatically adjust our data domains to cope with retroactive changes – ??? In this talk, we will demonstrate how capturing all event data could dramatically simplify data collection and management within the enterprise. |
Session Abstract× Close Big data, Easy BITim Hsu, Neal LeeCurrently, the Yahoo EC Taiwan team provides business performance matrix to users by acquiring data from the Web production and Back office ERP systems. The reporting system is built using traditional BI technologies such as RDBMS, ETL tools, OLAP tools, home-made reporting tools, store procedures, web pages,?. With increasing usage growth of the user browsing data in the business decision on daily basis, The ability to provided data analytics on these Big Data is getting more and more important and needed. The traditional RDBMs have reaching its limit in process big data while connecting to OLAP tool. We started with the feasibility of connecting MicroStrategy with Hive 0.9 and created a prototype system to test in two scenarios – ad-hoc query to Hive and performance test of the predefined MicroStrategy Intelligent Cube for ad-hoc analytics. We did the performance test on Ad-hoc query via HiveQL and query from MicroStrategy cube, and will share the result in the session. Based on our test results, we will be able to provide the following applications to different types of users. A) Ad-hoc query running against Hadoop can allow well trained data analyst or power users to have deeper analysis on data within Hadoop. B) OLAP reports running against MicroStrategy Intelligent Cube can provide quicker response time on ad-hoc analytics with predefined data in Cube. |
Session Abstract× Close Video Analysis in Hadoop - A Case StudyAlex Gorbachev, Alan GardnerOur secure remote connectivity tool provides full video recording of all work our engineers perform on client systems. We have requirements to analyze the video log to detect suspicious activity, provide forensic and root cause analysis capabilities. Some of the obvious use cases include detection of credit card patterns or personally identifiable information (PII) as well as malicious activity like dropping database objects. We need to process hundreds of gigabytes per day representing thousands of hours of video. Our solution leverages a variety of Hadoop components to perform optical text recognition and indexing, keyboard and mouse movement analysis as well as integration with variety of other data sources such as our monitoring, documentation, ticketing and communication systems. We will present our complete architecture starting from multi-source data ingestion through data processing and analysis up to the end user interface, reporting and integration layer. |
|
| 2:45pm - 3:15pm | Break - Sodas and snacks | ||||||
| 3:15pm - 3:55pm | Session Abstract× Close Big Data 2.0: Hadoop as part of a Near-Real-Time Integrated Data EraPhil ShelleyA new era of big data is coming, an era we would call ?Big Data 2.0,? with characteristics including: 1. The lines between data and metadata, storage and processing logic become further blurred 2. Data integration pattern is shifting from ETL (extract, transform and load) to the 3 T?s in Hadoop (transfer, transform and translate) 3. Batch-oriented data pipeline is challenged, even surpassed by stream-based data flow 4. In-memory big data processing emerges as a new promising trend 5. Latency from raw data to business intelligence is dramatically shortened toward real-time or near real-time 6. Hadoop and other No-SQL solutions are further integrated into the same environment 7. Mapping and conversion between relational/row-based and column-based data becomes end-user friendly 8. More ad hoc, interactive, query-based analytics outgrow pure MapReduce 9. Hadoop evolves from data server-centric to client rich 10. Hadoop becomes the centerpiece of enterprise data systems, with roles of database, data warehouse, and data center storage, all in one, as integrated platform and solutions This vision of Big Data 2.0 is based on Sears? research, development and production experience, and best practice in enterprise data solutions, which indicate that Hadoop is ready for its prime time in this new era. |
Session Abstract× Close Fast, Scalable Graph Processing: Apache Giraph on YARNEli ReismanApache Giraph performs offline, batch processing of very large graph datasets on top of a Hadoop cluster. Giraph replaces iterative MapReduce-style solutions with Bulk Synchronous Parallel graph processing using in-memory or disk-based data sets, loosely following the model of Google`s Pregel. Many recent advances have left Giraph more robust, efficient, fast, and able to accept a variety of I/O formats typical for graph data in and out of the Hadoop ecosystem. Giraph’s recent port to a pure YARN platform offers increased performance, fine-grained resource control, and scalability that Giraph atop Hadoop MRv1 cannot, while paving the way for ports to other platforms like Apache Mesos. Come see whats on the roadmap for Giraph, what Giraph on YARN means, and how Giraph is leveraging the power of YARN to become a more robust, usable, and useful platform for processing Big Graph datasets. |
Session Abstract× Close Large scale near real-time log indexing with Flume and SolrCloudAri FlinkApache Flume’s extensible architecture allows Cisco to stream system and application logs from worldwide production data centers to a central Hadoop cluster and Solr. This architecture enables a new level of scalable indexing so that a larger volume of logs is searchable within seconds. Using Solr 4.0′s near real time features together with Hadoop, we can execute mission critical tasks much quicker, improving our ability to meet tight SLAs. At the same time, using the same infrastructure, we can perform large-scale historical analysis and pattern extraction to help further improve our services. This talk will explore our infrastructure and decisions we?ve made to meet key requirements, i.e. high indexing load, high availability and disaster recovery. We will further explore other uses of Flume and SolrCloud within Cisco including dynamic event routing, parsing and multi-tenancy. |
Session Abstract× Close Building and Improving Products with HadoopMatthew RathboneIn many instances the terms `big data` and `Hadoop` are reserved for conversations on business analytics. Instead, I posit that these technologies are most powerful when they are deployed as a way to both build new products, and improve existing ones. Measurement is a fundamental part of the process, but more importantly I will walk through an effective tool-chain that can be used to: a) build unique new products, based on data. b) test improvements to a product At Foursquare, we`ve used a Hadoop-based tool chain to build new products (like social-recommendations), and to improve existing features through initiatives such as experimentation, and offline data generation. These products and improvements are fundamental to our core business, yet their existence would not be possible without Hadoop. I will pull examples from Foursquare and other companies to demonstrate these points, and outline the infrastructure components needed to accomplish them. |
Session Abstract× Close Falcon - Data Management Platform on Hadoop (Beyond ETL)Srikanth Sundarrajan, Venkatesh SeetharamHadoop and its ecosystem of products have made storing and processing massive amounts of data common place. This has enabled numerous businesses to gain valuable foresights that they never could have in the past. While it is easy to leverage Hadoop for crunching large volumes of data, organizing data, managing life cycle of data and processing data is fairly involved. This is solved adequately well in a traditional data platform involving data warehouses and standard ETL (extract-transform-load) tools, but remains largely unsolved today. Besides data processing complexities, Hadoop presents new set of challenges relating to management of data. Data Management on Hadoop encompasses data motion (import/export), process orchestration (data pipelines, late/re-processing, scheduling), lifecycle management (retention, replication, DR, anonymization, archival), data discovery (data classification, Lineage), etc. among other concerns that are beyond ETL. The presentation focuses on a new data processing and management platform for Hadoop, Falcon that attempts to solve this problem by leveraging existing stacks in the Hadoop ecosystem. Falcon has been in production for nearly a year at InMobi and has been managing hundreds of feeds and processes. |
Session Abstract× Close Use dependency injection to get Hadoop *out* of your application codeEric ChangHadoop MapReduce provides transparent parallelization but often results in specialized code bases that interact with low-level data formats. We present a means of using dependency injection to manage data flows in MapReduce which in turn supports reusable, Hadoop-agnostic application code that interacts with high-level business domain objects. An example is provided that applies Dependency Injection to the Hadoop WordCount example and shows how the same code invoked from the WordCount MapReduce job can be reused in a real-time context. We then discuss Opower’s application of this pattern to employ the same core calculations in both batch processing and in servicing real-time requests from end users. This topic will be of interest to those interested in reusing core batch calculations in real-time contexts. It also provides a means forward for organizations moving to Hadoop that have existing code components that they would like to employ in batch MapReduce computations. |
|
| 3:55pm - 4:05pm | Break | ||||||
| 4:05pm - 4:45pm | Session Abstract× Close Master Chief Loves Hive -- Using Hadoop in the CloudMike FlaskoHalo 4 is a widely popular video game with millions of players. As the game prepared to launch, the Microsoft team behind the game was tasked with analyzing business intelligence data to gain insights into player preferences and support an online contest. To handle those requests, the team used the Windows Azure HDInsight Service, a Hadoop-based service in the cloud, to gain deep insight into the millions of concurrent gamers that lead to weekly Halo 4 updates and support email campaigns designed to increase player retention. In this session we?ll rebuild the Halo big data pipeline from start to finish. |
Session Abstract× Close Recommender System at scale using HBase and HadoopDhaval ShahRecommender Systems play a crucial role in a variety of businesses in today`s world. From E-Commerce web sites to News Portals, companies are leveraging data about their users to create a personalizes user experience, gain competitive advantage and eventually drive revenue. Dealing with the sheer quantity of data readily available can be a daunting task by itself. Consider applying machine learning algorithms on top of it and it makes the problem exponentially complex. Fortunately, tools like Hadoop and HBase make this task a little more manageable by taking out some of the complexities of dealing with a large amount of data. In this talk, we will share our success story of building a recommender system for Bloomberg.com leveraging the Hadoop ecosystem. We will describe the high level architecture of the system and discuss the pros and cons of our design choices. Bloomberg.com operates at a scale of 100s of millions of users. Building a recommendation engine for Bloomberg.com entails applying Machine Learning algorithms on terabytes of data and still being able to serve sub-second responses. We will discuss techniques for efficiently and reliably collecting data in near real-time, the notion of offline vs. online processing and most importantly, how HBase perfectly fits the bill by serving as a real-time database as well as input/output for running MapReduce. |
Session Abstract× Close Compression Options in Hadoop - A Tale of TradeoffsGovind Kamat, Sumeet SinghYahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!`s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on “Big Data” who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined. |
Session Abstract× Close How we solved Real-time User Segmentation using HBaseGiang Nguyen, Murtaza DoctorAt RichRelevance, we service 10 of the top 20 Internet retailer chains and deliver more than $5.5 billions in attributable sales. Every 21 milliseconds a shopper clicks on a recommendation that we have delivered, and we serve over 850 million product recommendations daily. Our Hadoop infrastructure has a capacity to handle upwards of 1.5+ PB. Behavioral Targeting, specifically user segmentation and building personas, is critical for us in generating triggers when a user is added to a segment or switches from a segment. In this presentation, we intend to demonstrate not only how the events are captured, but also how they are stored in HBase in real-time. It is critical to design the system so it can handle thousands of writes per second and, at the same time, be able to query any combination of behavioral attributes in HBase through real-time APIs. This session will walk attendees through the entire design & architecture starting from data Ingestion, schema design, and access patterns, as well as some major problems like sharing & hot spotting. Furthermore, performance metrics will be presented, including the number of read/write per second and details around cluster configuration. |
Session Abstract× Close Don't be Hadooped When Looking for Big Data ROI: How Use Case Segmentation Drives Target Architectures and Technology Selection at Deutsche TelekomJuergen UrbanskiExtracting value from Big Data is not easy. The field of technologies and vendors is fragmented and rapidly evolving. End-to-end, general purpose solutions that work out of the box don’t exist yet, and Hadoop is no exception. And most companies lack Big Data specialists. The key to unlocking real value lies with thinking smart and hard about the business requirements for a Big Data solution. There is a long list of crucial questions to think about. Is Hadoop really the best solution for all Big Data needs? Should companies run a Hadoop cluster on expensive enterprise-grade storage, or use cheap commodity servers? Should the chosen infrastructure be bare metal or virtualized? The picture becomes even more confusing at the analysis and visualization layer. The answer to Big Data ROI lies somewhere between the herd and nerd mentality. Thinking hard and being smart about each use case as early as possible avoids costly mistakes in choosing hardware and software. This talk will illustrate how Deutsche Telekom follows this segmentation approach to make sure every individual use case drives architecture design and the selection of technologies and vendors. |
Session Abstract× Close ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceEric Hanson, Jitendra Pandey, Owen O'MalleyHive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding — resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query. Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with column store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive, coupling it with ORC with a vectorized iterator. |
|
| 4:45pm - 4:55pm | Break | ||||||
| 4:55pm - 5:35pm | Session Abstract× Close EAP - Accelerating behavorial analytics at PayPal using HadoopRahul Bhartia, Alexei VassilievPayPal today generates massive amounts of data?from clickstream logs to transactions and routine business events. Analyzing customer behavior across this data can be a daunting task. Data Technology team at PayPal has built a configurable engine, Event Analytics Pipeline (EAP), using Hadoop to ingest and process massive amounts of customer interaction data, match business-defined behavioral patterns, and generate entities and interactions matching those patterns. The pipeline is an ecosystem of components built using HDFS, HBase, a data catalog, and seamless connectivity to enterprise data stores. EAP?s data definition, data processing, and behavioral analysis can be adapted to many business needs. Leveraging Hadoop to address the problems of size and scale, EAP promotes agility by abstracting the complexities of big-data technologies using a set of tools and metadata that allow end users to control the behavioral-centric processing of data. EAP abstracts the massive data stored on HDFS as business objects, e.g., customer and page impression events, allowing analysts to easily extract patterns of events across billions of rows of data. The rules system built using HBase allows analysts to define relationships between entities and extrapolate them across disparate data sources to truly explore the universe of customer interaction and behaviors through a single lens. |
Session Abstract× Close How Hadoop enables Analytics for Law enforcement and National securityGary M. Shiffman, PhDThe US National Security enterprise must fight and win the nation’s wars, protect and defend Americans citizens and US friends and allies, at home and abroad, and preserve the US democratic form of government and fee markets. Government leaders, although facing different budgetary and institutional goals and constraints than actors in the commercial world, must come to terms with the volume, velocity, and variety of data in the world and across their enterprises. Hadoop has recently arrived in federal office buildings in a big way, at least as a term to be discussed if not yet a data store deployed. As of June 2013, Hadoop appears destined for widespread adoption in the federal government, although the path forward remains uncertain. This session will provide a strategic view of the US federal government “big data” landscape as a marketplace, and suggest areas for highest return on investment for the national security community. Focusing on analytics—the use cases and the end customer at the “pointy end of the spear” seems likely to illuminate the most efficient way forward. Gary M. Shiffman is an economist, professor at Georgetown University in Washington, DC, and CEO of Giant Oak, a data analytics company focusing on detecting fraud, crime, insurgency, terrorism, and other forms of organized theft and violence. |
Session Abstract× Close Lessons learned with Hadoop in the cloud and migrating to the datacenterJoe CrobakMany organizations start using Apache Hadoop to solve a specific problem, and in this context, a dynamically-sized cluster in a cloud environment makes a lot of sense. Once Hadoop gets traction in your organization, though, there are a lot of challenges in scaling out a Hadoop deploy in the cloud. Traditional wisdom is that Hadoop works much better on bare-metal hardware than in a virtualized environment but being in the cloud also has its benefits. At Foursquare, we?ve moved from Elastic MapReduce to a persistent cluster in Amazon Web Services to a bare metal cluster in our colo. We?ll talk about our motivation for these migrations as well as what went well, what didn?t go so well, and what we?d never do again. In addition, we?ll share war stories about our setup, such as ?resizing HDFS in a weekend? and moving terabytes of data from S3 to HDFS in the colo. |
Session Abstract× Close A Reference Architecture for ETL 2.0George Trujillo, George VetticadenMore and more organizations are moving their ETL workloads to a Hadoop based ELT grid architecture. Hadoop`s inherit capabilities, especially it`s ability to do late binding addresses some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations and lessons around ETL for Hadoop. Areas such as pros and cons for different extract and load strategies, best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, advantages of different ways of exchanging data and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop. |
Session Abstract× Close Designing Data Pipelines using HadoopMarilson CamposThis presentation will cover the design principles and techniques used to build data pipelines taking into consideration the following aspects: architecture evolution, capacity, data quality, performance, flexibility and alignment with business objectives. The discussions will be based on the context of managing a pipeline with multi-petabyte data sets; a code-base composed of Java map/reduce jobs with HBase integration; Hive scripts and Kafka/Storm inputs. We?ll talk about how to make sure that data pipelines have the following features: 1) Assurance that the input data is ready at each step. 2) Workflows are easy to maintain. 3) Data quality and validation comes included in the architecture. Part of presentation will be dedicated to show how to organize the warehouse using layers of data sets. A suggested starting point for these layers are: 1) Raw Input (Logs, Messages, etc.), 2) Logical Input (Scrubbed data), 3) Foundational Warehouse Data (Most relevant joins), 4) Departmental/Project Data Sets and 5) Report Data Sets. (Used by Traditional Report engines) The final part will discuss the design of a rule-based system to perform validation and trending reporting. |
Session Abstract× Close What's New in the Berkeley Data Analytics StackTathagata Das, Reynold XinThe Berkeley Data Analytics Stack (BDAS) aims to address emerging challenges in data analysis through a set of systems, including Spark, Shark and Mesos, that enable faster and more powerful analytics. In this talk, we’ll cover two recent additions to BDAS: * Spark Streaming is an extension of Spark that enables high-speed, fault-tolerant stream processing through a high-level API. It uses a new processing model called “discretized streams” to enable fault-tolerant stateful processing with exactly-once semantics, without the costly transactions required by existing systems. This lets applications process much higher rates of data per node. It also makes programming streaming applications easier by providing a set of high-level operators on streams (e.g. maps, filters, and windows) in Java and Scala. * Shark is a Spark-based data warehouse system compatible with Hive. It can answer Hive QL queries up to 100 times faster than Hive without modification to existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions. It employs a number of novel and traditional database optimization techniques, including column-oriented storage and mid-query replanning, to efficiently execute SQL on top of Spark. The system is in early use at companies including Yahoo! and Conviva. |
Session Abstract× Close Building a geospatial processing pipeline using Hadoop and HBase and how Monsanto is using it to help farmers increase their yieldErich Hochmuth, Robert GrailerMonsanto built a geospatial platform on Hadoop and HBase capable of managing over 120 billion polygons. As a result of the extreme data volumes and compute complexities we were forced to migrate our data processing from a more traditional RDBMS to a scale out Hadoop implementation. Data processing that took over 30 days on 8% of the data now runs in under 12 hours on the entire data set. Very little concrete material exist for how you process spatial data via MapReduce or model it in HBase. We will provide concrete and novel examples for processing and storing spatial data on Hadoop and HBase. As part of the data processing pipeline we integrated the popular open source geospatial processing library GDAL with MapReduce to convert all geospatial datasets to a common format and projection. We developed a method for splitting and processing images via MapReduce in which the boundaries of splits needed to be shared by multiple tasks due to the nature of the computation being performed on the data. Bulk writes to HBase were performed by writing HFiles directly. Finally we developed a novel method for storing geospatial data in HBase that met the needs of our access pattern. |