Training

HDP Analyst: Data Science

Overview: Learn Data Science techniques and best practices leveraging the Hadoop ecosystem and tools.

Duration: Two Days - Monday, April 11–Tuesday, April 12

Location: The Convention Center Dublin

Objectives:

  • Recognize use cases for data science
  • Describe the architecture of Hadoop and YARN
  • Explain the differences between supervised and unsupervised learning
  • List the six machine learning tasks
  • Recognize use cases for clustering, outlier detection, affinity analysis, classification, regression, and recommendation
  • Use Mahout to run a machine learning algorithm on Hadoop 
  • Write Pig scripts to transform data on Hadoop
  • Use Pig to prepare data for a machine learning algorithm
  • Write a Python script 
  • Use NumPy to analyze big data
  • Use the data structure classes in the pandas library
  • Write a Python script that invokes a SciPy machine learning algorithm
  • Explain the options for running Python code on a Hadoop cluster
  • Write a Pig User Defined Function in Python
  • Use Pig streaming on Hadoop with a Python script
  • Write a Python script that invokes a scikit-learn machine learning algorithm
  • Use the k-nearest neighbor algorithm to predict values based on a data set
  • Run the k-means clustering algorithm on a distributed data set on Hadoop
  • Describe use cases for Natural Language Processing (NLP)
  • Run an NLP algorithm on a Hadoop cluster
  • Run machine learning algorithms on Hadoop using Spark MLlib

Labs:

  • Describe the architecture of Hadoop and YARN
  • Explain the differences between supervised and unsupervised learning
  • Recognize use cases for clustering, outlier detection, affinity analysis, classification, regression, and recommendation
  • Write Pig scripts to transform data on Hadoop
  • Use Pig to prepare data for a machine learning algorithm
  • Write a Python script using NumPy, SciPy, Matplotlib, pandas, and scikit-learn to analyze big data (see the sketch after this list)
  • Exercise the options for running Python code on a Hadoop cluster
  • Write a Pig User Defined Function in Python
  • Use Pig streaming on Hadoop with a Python script
  • Run a Hadoop Streaming job
  • Understand some key tasks in Natural Language Processing (NLP)
  • Run an NLP algorithm in IPython
  • Run machine learning algorithms on Hadoop using Spark MLlib
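
As a flavor of the scikit-learn and k-nearest-neighbor work referenced above, a minimal sketch might look like the following; the DataFrame, column names, and values are synthetic placeholders for illustration, not course data.

    # Illustrative sketch: k-nearest-neighbor regression with pandas and scikit-learn.
    # All column names and values below are made up for this example.
    import numpy as np
    import pandas as pd
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import train_test_split

    # A toy pandas DataFrame standing in for data prepared with Pig.
    rng = np.random.RandomState(42)
    df = pd.DataFrame({
        "feature_a": rng.rand(200),
        "feature_b": rng.rand(200),
    })
    df["target"] = 3.0 * df["feature_a"] - 2.0 * df["feature_b"] + rng.normal(0, 0.1, 200)

    X_train, X_test, y_train, y_test = train_test_split(
        df[["feature_a", "feature_b"]], df["target"], test_size=0.25, random_state=0)

    # Fit k-NN with k=5 and score predictions on the held-out rows.
    model = KNeighborsRegressor(n_neighbors=5)
    model.fit(X_train, y_train)
    print("R^2 on test data:", model.score(X_test, y_test))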

Target Audience: Developers and analysts who would like to learn more about developing data products using Hadoop tools such as Pig and Spark, and how to use common data science tools like Python on their Hadoop system.

Pre-requisites: No previous Hadoop or programming knowledge is required. It is helpful to have some college-level mathematics (such as linear algebra and statistics). Students must bring their own Wi-Fi-enabled laptop, pre-loaded with the Chrome or Firefox browser, in order to complete the hands-on labs.


HDP Developer: Apache Pig and Hive

Overview: This course is designed for developers who need to create applications to analyze Big Data stored in Apache Hadoop using Pig and Hive. Introductory Spark content will also be presented.

Duration: Two Days - Monday, April 11–Tuesday, April 12

Location: The Convention Center Dublin

Objectives:

  • Describe Hadoop ecosystem tools and frameworks
  • Describe the HDFS and YARN architectures
  • Use the Hadoop client to input data into HDFS
  • Transfer data between Hadoop and a relational database
  • Use Pig to explore and transform data in HDFS
  • Understand how Hive tables are defined and implemented
  • Use Hive to explore and analyze data sets
  • Explain and use the various Hive file formats
  • Use Hive to run SQL-like queries to perform data analysis
  • Explain the uses and purpose of HCatalog
  • Present the Spark ecosystem and high-level architecture

Labs:

  • Use HDFS commands to add/remove files and folders
  • Explore, transform, split and join datasets using Pig
  • Use Pig to transform and export a dataset for use with Hive
  • Use HCatLoader and HCatStorer
  • Perform a join of two datasets with Hive
  • Use advanced Hive features: windowing, views, ORC files
  • Use Hive analytics functions
  • Use Spark Core to read files and perform data analysis
  • Create and join DataFrames with Spark SQL (see the sketch after this list)
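
As a rough illustration of the Spark SQL lab noted above, a minimal PySpark sketch for creating and joining DataFrames might look like this; a Spark 2.x-style SparkSession and made-up column names and rows are assumed.

    # Illustrative sketch: create and join two DataFrames with Spark SQL.
    # Schemas and data are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-example").getOrCreate()

    customers = spark.createDataFrame(
        [(1, "alice"), (2, "bob")], ["customer_id", "name"])
    orders = spark.createDataFrame(
        [(101, 1, 25.0), (102, 1, 17.5), (103, 2, 40.0)],
        ["order_id", "customer_id", "amount"])

    # Join on customer_id and aggregate order totals per customer.
    totals = (orders.join(customers, on="customer_id")
                    .groupBy("name")
                    .sum("amount"))
    totals.show()

    spark.stop()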

Target Audience: Software developers who need to understand and develop applications for Hadoop

Pre-requisites: No previous Hadoop knowledge is required, though it will be useful. Students should be familiar with programming principles and have experience in software development. SQL knowledge is also helpful. Students must bring their own Wi-Fi-enabled laptop, pre-loaded with the Chrome or Firefox browser, in order to complete the hands-on labs.


HDP Operations: Hadoop Administration I

Overview: This course is designed for administrators who will be managing the Hortonworks Data Platform (HDP) 2.3 with Ambari. It covers installation, configuration, and other typical cluster maintenance tasks.

Duration: Two Days - Monday, April 11–Tuesday, April 12

Location: The Convention Center Dublin

Objectives:

  • Install HDP
  • Add, Remove, and Replace Cluster Nodes
  • Configure Rack Awareness
  • Configure High Availability NameNode and YARN Resource Manager
  • Manage Hadoop Services
  • Manage HDFS Storage
  • Manage YARN
  • Configure Capacity Scheduler
  • Monitor Cluster

Labs:

  • Install HDP
  • Managing Ambari Users and Groups
  • Manage Hadoop Services (see the sketch after this list)
  • Using Hadoop Storage
  • Managing Hadoop Storage
  • Managing YARN Service using Ambari Web UI
  • Managing YARN Service using CLI
  • Setting Up the Capacity Scheduler
  • Managing YARN Containers and Queues
  • Managing YARN ACLs and User Limits
  • Adding, Decommissioning and Recommissioning Worker Nodes
  • Configuring Rack Awareness
  • Configuring NameNode HA
  • Configuring ResourceManager HA
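
For reference alongside the service-management labs above, the sketch below shows roughly how a Hadoop service can be stopped and started through Ambari's REST API; the Ambari host, credentials, and cluster name are placeholders, and the labs themselves use the Ambari Web UI and CLI.

    # Illustrative sketch: stop and start the HDFS service via the Ambari REST API.
    # The Ambari host, credentials, and cluster name below are placeholders.
    import requests

    AMBARI = "http://ambari.example.com:8080/api/v1"
    AUTH = ("admin", "admin")
    HEADERS = {"X-Requested-By": "ambari"}
    CLUSTER = "mycluster"

    def set_service_state(service, state, context):
        """PUT the desired state (STARTED, or INSTALLED to stop) for a service."""
        body = {
            "RequestInfo": {"context": context},
            "Body": {"ServiceInfo": {"state": state}},
        }
        url = f"{AMBARI}/clusters/{CLUSTER}/services/{service}"
        resp = requests.put(url, json=body, auth=AUTH, headers=HEADERS)
        resp.raise_for_status()
        return resp

    set_service_state("HDFS", "INSTALLED", "Stop HDFS via REST")  # stop
    set_service_state("HDFS", "STARTED", "Start HDFS via REST")   # start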

Target Audience: IT administrators and operators responsible for installing, configuring and supporting an HDP 2.3 deployment in a Linux environment using Ambari.

Pre-requisites: No previous Hadoop knowledge is required, though it will be useful. Attendees should be familiar with data center operations and Linux system administration. Students must bring their own Wi-Fi-enabled laptop, pre-loaded with the Chrome or Firefox browser, in order to complete the hands-on labs.


HDP Developer: Apache Spark Using Python

Overview: This course is designed for developers who need to create applications to analyze Big Data stored in Apache Hadoop using Spark. The focus will be on utilizing the Spark API from Python.

Duration: Two Days - Monday, April 11–Tuesday, April 12

Objectives:

  • Describe Spark and Spark-specific use cases
  • Explain the differences between Spark and MapReduce
  • Explore data interactively through the Spark shell utility
  • Explain the RDD concept
  • Use the Python Spark APIs
  • Create all types of RDDs: Pair, Double, and Generic
  • Use RDD type-specific functions
  • Explain interaction of components of a Spark Application
  • Explain the creation of the DAG schedule
  • Build and package Spark applications
  • Use application configuration items
  • Deploy applications to the cluster using YARN
  • Use data caching to increase performance of applications
  • Implement advanced features of Spark
  • Learn general application optimization guidelines/tips
  • Create/transform data using DataFrames
  • Read, use, and save to different Hadoop file formats
  • Understand the concepts of Spark Streaming
  • Create a streaming application
  • Use Spark MLlib to gain insights from data

Labs:

  • Create a Spark "Hello World" word count application
  • Use advanced RDD programming to perform sort, join, pattern matching and regex tasks
  • Explore partitioning and the Spark UI
  • Increase performance using data caching
  • Build/package a Spark application using Maven
  • Use a broadcast variable to efficiently join a small dataset to a massive dataset
  • Use an accumulator for reporting data quality issues
  • Create a dataframe and perform analysis
  • Load/transform/store data using Spark with Hive tables
  • Create a point-in-time Spark Streaming application
  • Create a Spark Streaming application using window functions
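
The "Hello World" lab referenced above is the classic word count; a minimal PySpark sketch of it could look like the following (the HDFS input path is a placeholder, not the course's lab data).

    # Illustrative sketch: the classic Spark word count in Python.
    # The input path below is a placeholder.
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")

    counts = (sc.textFile("hdfs:///user/train/input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    # Print the ten most frequent words.
    for word, count in counts.takeOrdered(10, key=lambda kv: -kv[1]):
        print(word, count)

    sc.stop()

A script like this would typically be submitted to the cluster with spark-submit, which the course covers under deploying applications with YARN.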

Target Audience: Developers, Architects, and Admins who would like to learn more about developing data applications in Spark, how it will affect their environment, and ways to optimize applications.

Pre-requisites: No previous Hadoop knowledge is required, though it will be useful. Basic knowledge of Python is required. Previous exposure to SQL is helpful, but not required. Students must bring their own Wi-Fi-enabled laptop, pre-loaded with the Chrome or Firefox browser, in order to complete the hands-on labs.


HDP Overview: Apache Hadoop Essentials

Overview: This course details the business value for, and provides a technical overview of, Apache Hadoop. It includes high-level information about concepts, architecture, operation, and uses of the Hortonworks Data Platform (HDP) and the Hadoop ecosystem. The course serves as an optional primer for those who plan to attend a hands-on, instructor-led course.

Duration: One Day - Tuesday, April 12

Location: The Convention Center Dublin

Objectives:

  • Understand what constitutes "Big Data" and why Hadoop is critical for processing and analyzing it
  • Describe the business value and primary use cases for Hadoop
  • Understand how Hadoop fits into your existing infrastructure and processes
  • Explore the Hadoop ecosystem through HDP's five pillars
    • Data Management: HDFS and YARN
    • Data Access: Spark, Storm, Pig, Hive, Tez, MapReduce, HBase, Accumulo, HCatalog, Kafka, Solr, Mahout and Slider
    • Data Governance & Integration: Atlas, Falcon, Sqoop and Flume
    • Security: Knox and Ranger
    • Operations: Ambari, Oozie and ZooKeeper
  • Discuss the value of partner integrations and the Modern Data Architecture
  • Learn about the security features that span the Hadoop ecosystem
  • Share knowledge to support decisions about how Hadoop can be used in enterprise use cases and architectures

Demos:

  • Operational Overview with Ambari
  • Ingesting Data into HDFS
  • Streaming Data into HDFS
  • Data Manipulation with Hive
  • Risk Factor Analysis with Pig
  • Risk Factor Analysis with Spark
  • Securing Hive with Ranger

Target Audience: Data architects, data integration architects, managers, C-level executives, decision makers, technical infrastructure team, and Hadoop administrators or developers who want to understand the fundamentals of Big Data and the Hadoop ecosystem.

Pre-requisites: No previous Hadoop or programming knowledge is required. Students are encouraged to bring a Wi-Fi-enabled laptop pre-loaded with the Hortonworks Sandbox if they want to duplicate the demonstrations on their own machine.


HDP Operations: Security

Overview: This course is designed for experienced administrators who will be implementing secure Hadoop clusters using authentication, authorization, auditing and data protection strategies and tools.

Duration: Two Days - Monday, April 11–Tuesday, April 12

Objectives:

  • Describe the Five Pillars of Security
  • Choose the right security tool for a given use case
  • Security Prerequisites
  • Ambari Server Security
  • Apache Ranger
  • Apache Ranger KMS
  • Using Ranger to Secure Access (see the sketch after this list)
  • Perimeter Security - Apache Knox
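
As a rough illustration of using Ranger to secure access, the sketch below creates an HDFS path policy through Ranger's public REST API; the host, credentials, service name, path, and users are placeholders, and the exact JSON fields may vary by Ranger version.

    # Illustrative sketch: create an HDFS read-only policy via Ranger's public REST API.
    # Host, credentials, service (repository) name, path, and users are placeholders.
    import requests

    RANGER = "http://ranger.example.com:6080"
    AUTH = ("admin", "admin")

    policy = {
        "service": "cluster_hadoop",   # name of the HDFS service repo defined in Ranger
        "name": "sales-data-read-only",
        "isEnabled": True,
        "resources": {
            "path": {"values": ["/data/sales"], "isRecursive": True}
        },
        "policyItems": [{
            "users": ["analyst1"],
            "groups": ["analysts"],
            "accesses": [
                {"type": "read", "isAllowed": True},
                {"type": "execute", "isAllowed": True},
            ],
        }],
    }

    resp = requests.post(f"{RANGER}/service/public/v2/api/policy",
                         json=policy, auth=AUTH)
    resp.raise_for_status()
    print("Created policy id:", resp.json().get("id"))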

Labs:

  • Accessing Your Cluster
  • Configure Name Resolution and Certificate to Active Directory
  • Set Up Ambari to Active Directory Sync
  • Kerberize the Cluster
  • Set Up AD/OS Integration via SSSD
  • Configure Ambari Server for Kerberos
  • Ranger Prerequisites
  • Ranger Install
  • Ranger KMS/Data Encryption Setup
  • Ranger KMS/Data Encryption Exercise
  • Knox Configuration

Target Audience: Experienced IT administrators who will be implementing security on an existing HDP 2.3 cluster using Ambari.

Pre-requisites: Students should be experienced in the management of Hadoop using Ambari and in Linux environments. Completion of the Hadoop Administration I course is highly recommended. Students must bring their own Wi-Fi-enabled laptop, pre-loaded with the Chrome or Firefox browser, in order to complete the hands-on labs.


HDF Operations: Hortonworks DataFlow

Overview: This condensed course is designed for 'Data Stewards' or 'Data Flow Managers' who are looking to automate the flow of data between systems.

Duration: Two Days - Monday, April 11–Tuesday, April 12

Objectives:

  • Understand what HDF and NiFi are, including core concepts and use cases
  • Understand the NiFi architecture and key features
  • Learn the NiFi user interface in depth and how to build a dataflow
  • Understand NiFi Processors, Connections, Process Groups, and Remote Process Groups
  • Get a basic overview of dataflow optimization and data provenance
  • Understand the NiFi Expression Language
  • Install and configure a NiFi cluster
  • Understand the security and monitoring options for HDF
  • Integrate HDF and HDP
  • Apply HDF system and NiFi best practices

Labs:

  • Installing and Starting NiFi
  • Building a NiFi Data Flow
  • Working with Process Groups
  • Working with Remote Process Groups [Site-to-Site]
  • NiFi Expression Language
  • Using Templates
  • Working with a NiFi Cluster
  • NiFi Monitoring (see the sketch after this list)
  • HDF Integration with HDP [Spark, Kafka, HBase]
  • Securing HDF with 2-way SSL
  • NiFi User Authentication with LDAP
  • End-of-course project
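
As a small illustration of NiFi monitoring outside the UI, the sketch below polls NiFi's REST API for high-level flow status; the host and port are placeholders, an unsecured instance is assumed, and endpoint paths can differ between NiFi versions.

    # Illustrative sketch: poll NiFi's REST API for high-level flow status.
    # The URL is a placeholder and an unsecured NiFi instance is assumed.
    import requests

    NIFI = "http://nifi.example.com:8080/nifi-api"

    status = requests.get(f"{NIFI}/flow/status").json()
    controller = status["controllerStatus"]

    print("Active threads:", controller["activeThreadCount"])
    print("Queued:", controller["queued"])  # e.g. "12 / 4.5 MB"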

Target Audience: Data Engineers, Integration Engineers, and Architects who are looking to automate dataflows between systems.

Pre-requisites: Some experience with Linux and a basic understanding of dataflow tools are helpful. Students must bring their own Wi-Fi-enabled laptop, pre-loaded with the Chrome or Firefox browser, in order to complete the hands-on labs.

