Executive Summary. Problem definition, highlights, project plan

30-second pitch

  • Implement an Apache Spark Core application that aggregates the customer and sales data sets stored in HDFS
  • Report total sales revenue for each state at year, month, day and hour granularity

highlights

  • Data is stored in text format
  • Design considerations cover both a finite number of customers and a very large number of customers
  • Uses Apache Spark’s Python APIs for RDDs
  • Performance may suffer if the data is skewed, for example if a handful of customers are responsible for the majority of sales. A Python script is provided to check the skewness of the input data
  • The data pipeline executes inside Docker containers on a development machine. A single make command builds the Docker containers for Apache Spark and Apache Hadoop, initializes the environment, verifies the input data and generates the output report
  • Complete source code, runnable Docker containers and documentation, including the source code of this presentation, are available in a public repository on GitHub
  • This is a proof of concept only and is not yet production ready. The project plan below estimates how long it might take to make it production ready

project plan

The timeline may be pulled in by parallelizing some tasks. Security, compliance and scope could affect the plan


compliance checklist

  1. There may be customer-identifiable information in the input data
  2. There is no customer-identifiable information in the output data
  3. Security
  4. GDPR compliance
  5. Architectural approval

The figure below shows the topology of the data pipeline as it executes on a single-node development machine

Data processing pipeline. A simplified view of the transformations applied to the input data by the script sales_by_states.py
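
A simplified sketch of this kind of aggregation with the RDD API is shown below. It is illustrative only: the input paths, the # delimiter and the field positions in the customer and sales records are assumptions, not the exact logic of sales_by_states.py.

  # Illustrative sketch only: paths, delimiter and field positions are assumptions,
  # not the actual implementation of sales_by_states.py.
  from pyspark import SparkContext

  sc = SparkContext(appName="sales_by_states_sketch")

  # Assumed sales record: customer_id#year#month#day#hour#amount
  sales = sc.textFile("hdfs://hadoop:9000/input/sales.txt") \
            .map(lambda line: line.split("#"))

  # Assumed customer record: customer_id#state
  customers = sc.textFile("hdfs://hadoop:9000/input/customers.txt") \
                .map(lambda line: line.split("#")) \
                .map(lambda f: (f[0], f[1]))                 # (customer_id, state)

  # Key the sales by customer_id and join to pick up the state
  keyed_sales = sales.map(lambda f: (f[0], (f[1], f[2], f[3], f[4], float(f[5]))))
  joined = keyed_sales.join(customers)

  # Emit one record per rollup level: hour, day, month, year and state
  def rollups(record):
      (year, month, day, hour, amount), state = record[1]
      yield ((state, year, month, day, hour), amount)
      yield ((state, year, month, day, ""), amount)
      yield ((state, year, month, "", ""), amount)
      yield ((state, year, "", "", ""), amount)
      yield ((state, "", "", "", ""), amount)

  totals = joined.flatMap(rollups).reduceByKey(lambda a, b: a + b)

  report = totals.map(lambda kv: "#".join(list(kv[0]) + [str(int(kv[1]))]))
  report.saveAsTextFile("hdfs://hadoop:9000/output")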

Demo: walk through the code and discuss design considerations

A mobile-friendly presentation can be viewed here. It is generated from the R Markdown source inside docs

Prerequisites

  1. Tested on Ubuntu 16.04
  2. Requires docker 18.09.6 and docker-compose 1.24.0 to be available and ready to use
  3. Requires make
  4. Execute the commands in bash
  5. Assumes that the input data is clean

    You may need to configure a proxy to pull the Docker images. Do not run in production!

Start here

The data pipeline executes inside Docker containers on a development machine. The entire pipeline is automated through a self-documented Makefile. Executing make in the root of this repository builds the Docker containers for Spark and Hadoop, starts them, verifies the input data and generates the report

Either execute make in the root of the repository, or execute the individual targets make setup start verify report. Most targets are idempotent

Explore other commands using make help

  $ make help
  all                            setup start verify report
  clean-output                   Delete output data
  connect                        To enter the Spark container
  report                         Print the output report and save it to a file
  setup                          Build Docker containers
  start                          Starts Spark and Hadoop. Jupyter is at localhost:8888
  stop                           Stop and remove the containers
  verify                         Check if the input data is skewed

The output of make report is shown below. It is saved locally as well as in HDFS. Each line is # delimited, containing state, year, month, day, hour and total revenue; blank fields indicate a coarser rollup level

  $ make report
  AK#2016#8#1#11#123458
  AK#2016#8#1##123458
  AK#2016#8###123458
  AK#2016####123458
  AK#####123458
  AL#2017#8#1#10#123457
  AL#2017#8#1##123457
  AL#2017#8###123457
  AL#2017####123457
  AL#2016#8#1#12#123459
  AL#2016#8#1##123459
  AL#2016#8###123459
  AL#2016####123459
  AL#####246916
  CA#2016#2#1#9#246912
  CA#2016#2#1##246912
  CA#2016#2###246912
  CA#2016####246912
  CA#####246912
  OR#2016#2#1#9#123456
  OR#2016#2#1##123456
  OR#2016#2###123456
  OR#2016####123456
  OR#####123456
  The report has been saved to hdfs://hadoop:9000/output and locally at dataout/sales_by-state.txt
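
The report is hierarchical, so downstream consumers need to pick the rollup level they care about. As an illustration only (this snippet is not part of the repository), the state-level totals are the rows whose year, month, day and hour fields are all blank, and can be extracted like this:

  # Illustrative sketch: extract the state-level totals (all date fields blank)
  # from the locally saved report. The file name is taken from the output above.
  state_totals = {}
  with open("dataout/sales_by-state.txt") as report:
      for line in report:
          parts = line.rstrip("\n").split("#")
          if len(parts) != 6:
              continue
          state, year, month, day, hour, revenue = parts
          if not (year or month or day or hour):   # blank date fields = state rollup
              state_totals[state] = int(revenue)

  print(state_totals)   # e.g. {'AK': 123458, 'AL': 246916, 'CA': 246912, 'OR': 123456}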

When finished

Execute make stop on your host machine. This stops and removes the containers

Areas for improvement

  1. HDFS should preferably run on a separate machine or cluster. The same applies to Apache Spark
  2. Better management of data at rest
  3. Auto-generate test data of adequate size and auto-load it into Hadoop (see the sketch after this list)
  4. Add unit, functional and integration tests
  5. Add performance tests and harvest performance metrics for Apache Spark
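
As a hypothetical starting point for item 3, the sketch below generates synthetic customer and sales files. The record layout mirrors the assumptions used in the aggregation sketch earlier and may not match the real input schema; loading the files into HDFS (for example with hdfs dfs -put) would still be a separate step.

  # Hypothetical generator for synthetic input data; the record layout is an
  # assumption and may not match the real schema.
  import random

  STATES = ["AK", "AL", "CA", "OR"]

  def generate(path, customers=1000, sales=100000, seed=42):
      rng = random.Random(seed)
      with open(path + "/customers.txt", "w") as out:
          for cid in range(customers):
              out.write("%d#%s\n" % (cid, rng.choice(STATES)))
      with open(path + "/sales.txt", "w") as out:
          for _ in range(sales):
              out.write("%d#2016#%d#%d#%d#%d\n" % (rng.randrange(customers),
                                                   rng.randint(1, 12),
                                                   rng.randint(1, 28),
                                                   rng.randint(0, 23),
                                                   rng.randint(1, 500)))

  generate("datain")   # hypothetical local folder; then load into HDFS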

Is my data skewed? Identify and eliminate hot spots

Run check_input.py from the host machine

Now run the same script directly inside the container


  1. If the data is skewed, the partitions may be uneven and could create a hot spot
  2. Quantiles could provide a more effective measure of skewness (see the sketch after this list)
  3. Notice the difference between the real, user and sys times
  4. Notice that the user time exceeds the real time inside the container, which indicates that the work runs in parallel across multiple cores
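
check_input.py itself is not reproduced here. The sketch below is one illustrative way to measure skew: it computes the share of total revenue owned by the heaviest customers and a few per-customer revenue quantiles. The input path and field positions are assumptions.

  # Illustrative skew check, not the actual check_input.py. Assumes each sales
  # record is # delimited with customer_id first and the amount last.
  from collections import Counter

  totals = Counter()
  with open("datain/sales.txt") as sales:          # hypothetical local copy of the input
      for line in sales:
          fields = line.rstrip("\n").split("#")
          totals[fields[0]] += float(fields[-1])

  amounts = sorted(totals.values())
  n = len(amounts)
  grand_total = sum(amounts)

  # Share of revenue owned by the top 1% of customers
  top = amounts[int(n * 0.99):]
  print("top 1%% of customers own %.1f%% of revenue" % (100 * sum(top) / grand_total))

  # A few quantiles of per-customer revenue
  for q in (0.5, 0.9, 0.99):
      print("q%.2f = %.0f" % (q, amounts[min(n - 1, int(n * q))]))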

What does the output data look like? Plot the output and download summarized output data
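
One way to plot the summarized output is sketched below, assuming pandas and matplotlib are available (for example in the Jupyter container). The column layout and the local file name are taken from the report output above.

  # Illustrative plotting sketch; assumes pandas and matplotlib are installed.
  import pandas as pd
  import matplotlib.pyplot as plt

  cols = ["state", "year", "month", "day", "hour", "revenue"]
  df = pd.read_csv("dataout/sales_by-state.txt", sep="#", names=cols,
                   keep_default_na=False)

  # Keep only the state-level rollups (all date fields blank) and plot them
  states = df[(df.year == "") & (df.month == "") & (df.day == "") & (df.hour == "")]
  states.plot.bar(x="state", y="revenue", legend=False, title="Sales revenue by state")
  plt.tight_layout()
  plt.show()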


Check Spark performance metrics using sysstat. Plot the performance metrics as an interactive time series


This chart shows sysstat metrics for 4 executions of the sales_by_states.py job. The metrics are summarized at the quantiles given in the table below

  metric   unit    quantile
  cpu      %busy   75%
  disk     %util   99.5%
  runq     unit    99%

The dataout folder contains sysstat metrics for CPU, memory, network, disk and processes after you run the collect-sar and parse-sar commands inside the Jupyter container. Preview the files with:

  $ head -n2 dataout/*.dat
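
The exact layout of the parsed .dat files is not shown here. Assuming they are whitespace-delimited with a timestamp column, a time-series plot could look roughly like the sketch below; the file name and column names are hypothetical, so check head -n2 dataout/*.dat for the real ones.

  # Illustrative sketch for plotting one parsed sysstat metric as a time series.
  # The file name and column names are hypothetical.
  import pandas as pd
  import matplotlib.pyplot as plt

  cpu = pd.read_csv("dataout/cpu.dat", sep=r"\s+")        # hypothetical file
  cpu["timestamp"] = pd.to_datetime(cpu["timestamp"])     # hypothetical column
  cpu = cpu.set_index("timestamp")

  cpu["%busy"].plot(title="CPU %busy during sales_by_states.py runs")
  plt.ylabel("%busy")
  plt.show()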

How can we improve performance?

Detect and predict performance problems in distributed systems