Hortonworks Data Platform (HDP) Apache Spark Developer Certification BootCamp – Level 3: HDPCD

Hortonworks

Duration: 3 Days

Description

The Hortonworks certification is one of the most prominent certifications available in the Big Data field today. The use of Big Data technologies across industries is growing exponentially, and Apache Spark stands out for its in-memory computing capabilities among other features. Applications in these fields often need large volumes of data to be processed and analyzed rapidly and efficiently across distributed clusters. This course delves into advanced concepts such as Spark SQL, Dataframes and their uses, configuring a Spark cluster’s properties, and sourcing different types of data into Spark, and concludes with PySpark tuning and memory management. The objectives of this course align with the Hortonworks Data Platform Certified Developer (HDPCD) certification objectives and can be invaluable for data scientists and developers aiming to gain a comprehensive insight into Spark.

Objectives

  • Implement advanced Spark features and applications
  • Understand Spark SQL, Dataframes and Spark properties in depth
  • Learn how Apache Arrow is used in Spark and how it compares with Pandas
  • Learn how to configure and tune Spark performance
  • Prepare for the certification exam: Hortonworks Data Platform Certified Developer (HDPCD)

Who Should Attend

  • Developers who are working or are expected to work on Big Data and Analytics
  • Software professionals looking for a shift to developing Spark applications
  • Candidates looking to get certified as a Hortonworks Data Platform Certified Developer

Prerequisites

Should have attended the following courses offered by CloudThat Technologies or should have equivalent knowledge:

Course Outline

Day 1

  1. Spark SQL
    • Introduction to Spark SQL and Usage
    • Spark SQL Joins
    • RDD vs Dataframes
    • Converting between RDDs and Dataframes
    • Improving Performance of Spark SQL
    • Spark Session

    Hands-on: Converting between RDDs and Dataframes
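
    Example (an illustrative sketch, not course material): converting between an RDD and a Dataframe in PySpark. The column names and sample data below are invented for illustration.

        from pyspark.sql import SparkSession, Row

        spark = SparkSession.builder.appName("rdd-df-conversion").getOrCreate()

        # Start from a plain RDD of tuples.
        rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])

        # RDD -> Dataframe: supply column names directly...
        df = rdd.toDF(["name", "age"])

        # ...or map to Row objects and let Spark infer the schema.
        df_inferred = spark.createDataFrame(rdd.map(lambda t: Row(name=t[0], age=t[1])))

        # Dataframe -> RDD: .rdd yields an RDD of Row objects.
        print(df.rdd.map(lambda row: (row.name, row.age + 1)).collect())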

  2. Spark Dataframes
    • Introduction to Spark Dataframes
    • Basics of Spark Dataframes
    • Spark Dataframes Functions

    Hands-on: Create Dataframes and Perform Commonly used Operations
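
    Example (an illustrative sketch with assumed sample data, not course material): creating a Dataframe and applying a few commonly used operations.

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.appName("df-basics").getOrCreate()

        df = spark.createDataFrame(
            [("alice", "sales", 3000), ("bob", "sales", 4000), ("carol", "hr", 3500)],
            ["name", "dept", "salary"],
        )

        # Select, filter and derive a new column.
        df.select("name", "salary").filter(F.col("salary") > 3200).show()
        df.withColumn("bonus", F.col("salary") * 0.1).show()

        # Aggregate with groupBy.
        df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()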

  3. Spark Properties
    • Application Properties
    • Runtime Environment
    • Execution Behavior
    • Networking
    • Security
    • Encryption
    • Overriding Configuration Directories
    • Inheriting Hadoop Cluster Configuration

    Hands-on: Configuring a Spark Cluster with Custom Properties
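
    Example (a minimal sketch; the property values are arbitrary examples, not recommended settings): setting custom Spark properties when building a session. The same keys can also be passed to spark-submit with --conf or placed in spark-defaults.conf.

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("custom-properties")
            # Application properties
            .config("spark.executor.memory", "4g")
            .config("spark.executor.cores", "2")
            # Execution behavior
            .config("spark.sql.shuffle.partitions", "200")
            # Networking
            .config("spark.network.timeout", "120s")
            .getOrCreate()
        )

        # Runtime properties can be inspected via spark.conf.
        print(spark.conf.get("spark.sql.shuffle.partitions"))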

Day 2

  1. Data Sources
    • Load / Save Functions
    • Parquet File Format
    • ORC File Format
    • JSON Files
    • Hive Tables

    Hands-on: Creating a Data Source and Loading JSON File
    Hands-on: Using Hive Table as a Data Source in Apache Spark
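
    Example (an illustrative sketch; paths and table names are placeholders): loading a JSON file with the generic load function, saving to Parquet and ORC, and querying a Hive table once Hive support is enabled.

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("data-sources")
            .enableHiveSupport()   # required to read and write Hive tables
            .getOrCreate()
        )

        # Generic load function with an explicit format.
        events = spark.read.format("json").load("/data/events.json")

        # Parquet and ORC have dedicated save shortcuts.
        events.write.mode("overwrite").parquet("/data/events_parquet")
        events.write.mode("overwrite").orc("/data/events_orc")

        # Hive tables can be queried directly through Spark SQL.
        spark.sql("SELECT * FROM default.some_table LIMIT 10").show()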

  2. Distributed SQL Engine
    • Running the Thrift JDBC/ODBC Server
    • Running the Spark SQL CLI

    Hands-On: Querying Data using Spark SQL CLI

  3. Apache Arrow
    • PyArrow Use Cases
    • PyArrow Installation
    • Enable Conversion to and from Pandas

    Hands-on: Interface PyArrow with Pandas
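
    Example (a minimal sketch, assuming pyarrow is installed): enabling Arrow so that conversions between Spark and pandas Dataframes use columnar transfers instead of row-by-row serialization. The configuration key shown is the Spark 2.x name; Spark 3.x renamed it to spark.sql.execution.arrow.pyspark.enabled.

        import pandas as pd
        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("arrow-pandas")
            # Enable Arrow-based columnar data transfers (Spark 2.3+).
            .config("spark.sql.execution.arrow.enabled", "true")
            .getOrCreate()
        )

        pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

        # pandas -> Spark uses Arrow when the flag above is set.
        sdf = spark.createDataFrame(pdf)

        # Spark -> pandas also goes through Arrow.
        print(sdf.toPandas())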

  4. Utilizing Hive with ORC in Apache Spark REPL
    • Using the Interactive Spark Shell
    • Reading Files from HDFS
    • Creating RDDs from HDFS Files
    • Using Spark API for Dataset Analysis
    • Creating Hive Tables in ORC Format
    • Using Spark SQL to Query Hive Tables
    • Persisting Data

    Hands-on: Analyze Hive Data Using Spark
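
    Example (an illustrative sketch; the HDFS path, table and column layout are invented): the kind of statements run in the interactive shell (pyspark) to read a file from HDFS, persist it as an ORC-backed Hive table and query it with Spark SQL.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.enableHiveSupport().getOrCreate()

        # Read a raw CSV file from HDFS into a Dataframe.
        drivers = spark.read.option("header", "true").csv("hdfs:///data/drivers.csv")

        # Persist it as a Hive table stored in ORC format.
        drivers.write.format("orc").mode("overwrite").saveAsTable("default.drivers_orc")

        # Query the Hive table back through Spark SQL.
        spark.sql("SELECT COUNT(*) FROM default.drivers_orc").show()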

Day 3

  1. PySpark Memory Management and Data Handling
    • In-memory Computation and Tuning Overview
    • Memory Tuning (see the configuration sketch after this outline)
    • Large Scale Data Handling
    • Garbage Collection Tuning
    • Other Contributing Factors
  2. PySpark Common Issues
    • Tricky Deployment for Spark
    • Memory Issues
    • Frequent Releases and API Dependencies
    • Python Libraries Compared with Java
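
    Example (a hedged sketch; the values are arbitrary illustrations, not tuning recommendations): memory- and garbage-collection-related properties of the kind covered in the memory tuning topics above.

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("memory-tuning")
            # Executor heap sizing.
            .config("spark.executor.memory", "6g")
            # Fraction of heap shared by execution and storage, and the
            # portion of it reserved for cached data.
            .config("spark.memory.fraction", "0.6")
            .config("spark.memory.storageFraction", "0.5")
            # Garbage collection: use G1 on executors and log GC activity.
            .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -verbose:gc")
            .getOrCreate()
        )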

Project: Performing Data Analysis on Real Life Dataset

About The Trainer

Arzan Amaria
Sr. Solutions Architect – Cloud and IoT

Arzan has more than 9 years of experience in the Microsoft infrastructure technology stack, Data Science, Cloud and IoT. He has a great deal of experience in deploying cloud-based solutions. He is a multi-cloud professional with exposure to Azure, AWS and other IIoT cloud platforms such as GE Predix and IBM Watson.

As a Cloud Solution Architect at CloudThat, he is an expert at deploying, supporting and managing client infrastructures on Azure. With core training and consulting experience, he specializes in delivering individual and corporate training on Azure. He is also engaged in extensive research and development in the fields of IoT and Data Science and leads a team in this area. He has delivered training on IoT and is currently designing cloud-integrated solutions.

In the recent past, he has trained professionals from various Microsoft partners such as Wipro, HPE, HCL, Infosys, Accenture and TCS, among many others.

He holds the following certifications:

  • GE Predix Certified Developer
  • Microsoft Certified Trainer (MCT)
  • CTT+ (Certified Technical Trainer)
  • MCSD: Azure Solutions Architect
  • MCSE (Server Track)
  • MCTS in Machine Learning
  • VCA-DCV (Data Center Virtualization – Associate)
  • Microsoft Certified Specialist with Hyper-V Virtualization
  • AWS Certified Solutions Architect – Associate Level
  • CEH (Certified Ethical Hacker, EC Council University US)

Other Details

Questions?

For latest batch dates, fees, location and general inquiries, contact our sales team at: +91 8880002200 or sales@cloudthat.in

Upcoming Batches

TBA
