Solve Data Analytics Problems with Spark, PySpark, and Related Open Source Tools
Spark is at the heart of today’s Big Data revolution, helping data professionals supercharge efficiency and performance in a wide range of data processing and analytics tasks. In this guide, Big Data expert Jeffrey Aven covers all you need to know to leverage Spark, together with its extensions, subprojects, and wider ecosystem.
Aven combines a language-agnostic introduction to foundational Spark concepts with extensive programming examples utilizing the popular and intuitive PySpark development environment. This guide’s focus on Python makes it widely accessible to large audiences of data professionals, analysts, and developers—even those with little Hadoop or Spark experience.
Aven’s broad coverage ranges from basic to advanced Spark programming, and Spark SQL to machine learning. You’ll learn how to efficiently manage all forms of data with Spark: streaming, structured, semi-structured, and unstructured. Throughout, concise topic overviews quickly get you up to speed, and extensive hands-on exercises prepare you to solve real problems.
Coverage includes:
• Understand Spark’s evolving role in the Big Data and Hadoop ecosystems
• Create Spark clusters using various deployment modes
• Control and optimize the operation of Spark clusters and applications
• Master Spark Core RDD API programming techniques
• Extend, accelerate, and optimize Spark routines with advanced API platform constructs, including shared variables, RDD storage, and partitioning
• Efficiently integrate Spark with both SQL and nonrelational data stores
• Perform stream processing and messaging with Spark Streaming and Apache Kafka
• Implement predictive modeling with SparkR and Spark MLlib
Preface
Introduction
PART I: SPARK FOUNDATIONS
Chapter 1 Introducing Big Data, Hadoop, and Spark
Introduction to Big Data, Distributed Computing, and Hadoop
A Brief History of Big Data and Hadoop
Hadoop Explained
Introduction to Apache Spark
Apache Spark Background
Uses for Spark
Programming Interfaces to Spark
Submission Types for Spark Programs
Input/Output Types for Spark Applications
The Spark RDD
Spark and Hadoop
Functional Programming Using Python
Data Structures Used in Functional Python Programming
Python Object Serialization
Python Functional Programming Basics
Summary
Chapter 2 Deploying Spark
Spark Deployment Modes
Local Mode
Spark Standalone
Spark on YARN
Spark on Mesos
Preparing to Install Spark
Getting Spark
Installing Spark on Linux or Mac OS X
Installing Spark on Windows
Exploring the Spark Installation
Deploying a Multi-Node Spark Standalone Cluster
Deploying Spark in the Cloud
Amazon Web Services (AWS)
Google Cloud Platform (GCP)
Databricks
Summary
Chapter 3 Understanding the Spark Cluster Architecture
Anatomy of a Spark Application
Spark Driver
Spark Workers and Executors
The Spark Master and Cluster Manager
Spark Applications Using the Standalone Scheduler
Spark Applications Running on YARN
Deployment Modes for Spark Applications Running on YARN
Client Mode
Cluster Mode
Local Mode Revisited
Summary
Chapter 4 Learning Spark Programming Basics
Introduction to RDDs
Loading Data into RDDs
Creating an RDD from a File or Files
Methods for Creating RDDs from a Text File or Files
Creating an RDD from an Object File
Creating an RDD from a Data Source
Creating RDDs from JSON Files
Creating an RDD Programmatically
Operations on RDDs
Key RDD Concepts
Basic RDD Transformations
Basic RDD Actions
Transformations on PairRDDs
MapReduce and Word Count Exercise
Join Transformations
Joining Datasets in Spark
Transformations on Sets
Transformations on Numeric RDDs
Summary
PART II: BEYOND THE BASICS
Chapter 5 Advanced Programming Using the Spark Core API
Shared Variables in Spark
Broadcast Variables
Accumulators
Exercise: Using Broadcast Variables and Accumulators
Partitioning Data in Spark
Partitioning Overview
Controlling Partitions
Repartitioning Functions
Partition-Specific or Partition-Aware API Methods
RDD Storage Options
RDD Lineage Revisited
RDD Storage Options
RDD Caching
Persisting RDDs
Choosing When to Persist or Cache RDDs
Checkpointing RDDs
Exercise: Checkpointing RDDs
Processing RDDs with External Programs
Data Sampling with Spark
Understanding Spark Application and Cluster Configuration
Spark Environment Variables
Spark Configuration Properties
Optimizing Spark
Filter Early, Filter Often
Optimizing Associative Operations
Understanding the Impact of Functions and Closures
Considerations for Collecting Data
Configuration Parameters for Tuning and Optimizing Applications
Avoiding Inefficient Partitioning
Diagnosing Application Performance Issues
Summary
Chapter 6 SQL and NoSQL Programming with Spark
Introduction to Spark SQL
Introduction to Hive
Spark SQL Architecture
Getting Started with DataFrames
Using DataFrames
Caching, Persisting, and Repartitioning DataFrames
Saving DataFrame Output
Accessing Spark SQL
Exercise: Using Spark SQL
Using Spark with NoSQL Systems
Introduction to NoSQL
Using Spark with HBase
Exercise: Using Spark with HBase
Using Spark with Cassandra
Using Spark with DynamoDB
Other NoSQL Platforms
Summary
Chapter 7 Stream Processing and Messaging Using Spark
Introducing Spark Streaming
Spark Streaming Architecture
Introduction to DStreams
Exercise: Getting Started with Spark Streaming
State Operations
Sliding Window Operations
Structured Streaming
Structured Streaming Data Sources
Structured Streaming Data Sinks
Output Modes
Structured Streaming Operations
Using Spark with Messaging Platforms
Apache Kafka
Exercise: Using Spark with Kafka
Amazon Kinesis
Summary
Chapter 8 Introduction to Data Science and Machine Learning Using Spark
Spark and R
Introduction to R
Using Spark with R
Exercise: Using RStudio with SparkR
Machine Learning with Spark
Machine Learning Primer
Machine Learning Using Spark MLlib
Exercise: Implementing a Recommender Using Spark MLlib
Machine Learning Using Spark ML
Using Notebooks with Spark
Using Jupyter (IPython) Notebooks with Spark
Using Apache Zeppelin Notebooks with Spark
Summary
Index
Jeffrey Aven is an independent Big Data, open source software, and cloud computing professional based in Melbourne, Australia. Jeffrey is a highly regarded consultant and instructor and has authored several other books, including Teach Yourself Apache Spark in 24 Hours and Teach Yourself Hadoop in 24 Hours.