
Big Data Analytics with Apache Spark for Developers – Level 2

Apache Spark

Duration: 2 Days


This course is targeted at developers who want to step into the world of Big Data with Spark. Big Data usage is constantly on the rise and is useful for a multitude of applications, many of which need to process large volumes of data quickly and efficiently. Spark’s biggest strength is its in-memory computing capability, which performs operations on data rapidly across distributed clusters. Given its wide range of use cases and its efficiency, the framework is likely to remain relevant for a long time. This course covers the fundamental aspects of Spark and takes a deep dive into its concepts and capabilities. The objectives of this course align with Hortonworks’ HDPCD certification and can be invaluable for developers and data scientists looking for comprehensive insight into Spark.

Why should you get Hortonworks Data Platform Certified Developer (HDPCD) Certification?
  • The Hortonworks Data Platform certification is one of the most sought-after certifications


  • Learn the fundamentals of Spark
  • Implement Spark applications
  • Learn how Spark compares to other popular processing frameworks such as MapReduce
  • Understand advanced features of Spark such as performance tuning
  • Prepare for the certification exam: Hortonworks Data Platform Certified Developer

Who Should Attend

  • Developers who are working or are expected to work on Big Data and Analytics
  • Software professionals looking for a shift to developing Spark applications
  • Candidates looking to get certified as a Hortonworks Data Platform Certified Developer


Course Outline

Day 1

  1. Introduction to Apache Spark
    • Types of Data Processing
    • Apache Spark Features
    • RDDs
    • Spark Build Architecture
    • Spark Components
    • Spark Tools

    Hands-on: Installation and Configuration of Apache Spark

  2. PySpark Packages
    • SQL Module
    • Streaming Module
    • ML Package
    • MLlib Package

    Hands-on: Importing Relational Data to PySpark SQL Module and Summarizing the Data

  3. Working with PySpark
    • Overview of PySpark
    • Data Flow with SparkContext
    • Spark Data Frames
    • RDD Transformations
    • RDD Actions
    • Shuffle Operations
    • Understanding Closures
    • Operations on RDDs Using Sets
    • Persisting and Caching
    • Performance Tuning of Spark Jobs

    Hands-on: Perform Entity Count Processing on Sample Data Using PySpark

  4. SparkUI
    • Jobs
    • Stages
    • Storage
    • Environment
    • Executors

    Hands-on: Setting Up and Managing Jobs Using SparkUI

Day 2

  1. Pair RDDs
    • Introduction to Pair RDDs and Their Creation
    • Performing Transformations
    • Performing Reduce Operation
    • Partitioning Data
    • Performing Join Operations

    Hands-on: Performing Transformations and Actions on Pair RDDs

  2. Advanced Spark Features
    • Commonly Used Advanced Programming Features
    • Accumulators
    • Broadcast Variables

    Hands-on: Using Accumulators and Broadcast Variables

  3. Running Spark Jobs in a Cluster
    • Spark Cluster Architecture
    • Running a Job with spark-submit
    • Monitoring Spark Jobs on a Cluster
    • Scheduling Spark Jobs on a Cluster

    Hands-on: Running a Spark Job on a YARN Cluster

About The Trainer


Arzan Amaria
Sr. Solutions Architect – Cloud and IoT

Arzan has more than 9 years of experience in the Microsoft infrastructure technology stack, Data Science, Cloud and IoT. He has extensive experience in deploying cloud-based solutions. He is a multi-cloud professional with exposure to Azure, AWS and other IIoT cloud platforms such as GE Predix and IBM Watson.

As a Cloud Solution Architect at CloudThat, he is an expert at deploying, supporting and managing client infrastructures on Azure. With a core background in training and consulting, he specializes in delivering individual and corporate training on Azure. He is also engaged in extensive research and development in IoT and Data Science and leads a team in that area. He has delivered training on IoT and is currently designing cloud-integrated solutions.

In the recent past, he has trained professionals for various Microsoft partners such as Wipro, HPE, HCL, Infosys, Accenture, TCS and many more.

He holds the following certifications:

  • GE Predix Certified Developer
  • Microsoft Certified Trainer (MCT)
  • CTT+ (Certified Technical Trainer)
  • MCSD: Azure Solutions Architect
  • MCSE (Server Track)
  • MCTS in Machine Learning
  • VCA-DCV (Data Center Virtualization – Associate)
  • Microsoft Certified Specialist with Hyper-V Virtualization
  • AWS Certified Solutions Architect – Associate Level
  • CEH (Certified Ethical Hacker, EC Council University US)


Other Details


For the latest batch dates, fees, location and general inquiries, contact our sales team at: +91 8880002200 or
