Executive Summary. Problem definition, highlights, project plan

30-second pitch

  • Implement an Apache Spark Core application that aggregates the customer and sales data sets stored in HDFS
  • Report total sales revenue for each state at year, month, day and hour granularity

highlights

  • Data is stored in text format
  • Design considerations cover both a finite number of customers and a very large number of customers
  • Uses Apache Spark’s Python APIs for RDDs
  • Performance may suffer if the data is skewed, for example if a handful of customers are responsible for the majority of sales. A Python script is provided to check the skewness of the input data
  • The data pipeline executes inside Docker containers on a development machine. A single make command builds the Docker containers for Apache Spark and Apache Hadoop, initializes the environment, verifies the input data and generates the output report
  • Complete source code, runnable Docker containers and documentation, including the source code of this presentation, are available in a public repository on GitHub
  • This is a proof of concept only and is not yet production ready. The project plan below estimates how long it might take to make it production ready

project plan

The timeline may be pulled in by parallelizing some tasks. Security, compliance and scope could affect the plan


compliance checklist

  1. There may be customer-identifiable information in the input data
  2. There is no customer-identifiable information in the output data
  3. Security
  4. GDPR compliance
  5. Architectural approval

The figure below shows the topology of the data pipeline as it executes on a single-node development machine

Data processing pipeline. A simplified view of the transformations applied to the input data by the script sales_by_states.py
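
A simplified sketch of this kind of aggregation with the RDD API is shown below. It is illustrative only: the input paths, the # delimiter and the field positions in the customer and sales records are assumptions, not the exact logic of sales_by_states.py.

  # Illustrative sketch only: paths, delimiter and field positions are assumptions,
  # not the actual implementation of sales_by_states.py.
  from pyspark import SparkContext

  sc = SparkContext(appName="sales_by_states_sketch")

  # Assumed sales record: customer_id#year#month#day#hour#amount
  sales = sc.textFile("hdfs://hadoop:9000/input/sales.txt") \
            .map(lambda line: line.split("#"))

  # Assumed customer record: customer_id#state
  customers = sc.textFile("hdfs://hadoop:9000/input/customers.txt") \
                .map(lambda line: line.split("#")) \
                .map(lambda f: (f[0], f[1]))                 # (customer_id, state)

  # Key the sales by customer_id and join to pick up the state
  keyed_sales = sales.map(lambda f: (f[0], (f[1], f[2], f[3], f[4], float(f[5]))))
  joined = keyed_sales.join(customers)

  # Emit one record per rollup level: hour, day, month, year and state
  def rollups(record):
      (year, month, day, hour, amount), state = record[1]
      yield ((state, year, month, day, hour), amount)
      yield ((state, year, month, day, ""), amount)
      yield ((state, year, month, "", ""), amount)
      yield ((state, year, "", "", ""), amount)
      yield ((state, "", "", "", ""), amount)

  totals = joined.flatMap(rollups).reduceByKey(lambda a, b: a + b)

  report = totals.map(lambda kv: "#".join(list(kv[0]) + [str(int(kv[1]))]))
  report.saveAsTextFile("hdfs://hadoop:9000/output")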

Demo: walk through the code and discuss design considerations

A mobile-friendly presentation can be viewed here. It is generated from the R Markdown source inside docs

Prerequisites

  1. Tested on Ubuntu 16.04
  2. Requires docker 18.09.6 and docker-compose 1.24.0 to be available and ready to use
  3. Requires make
  4. Execute the commands in bash
  5. Assumes that the input data is clean

    You may need to configure a proxy to pull the Docker images. Do not run in production!

Start here

The data pipeline executes inside Docker containers on a development machine. The entire pipeline is automated through a self-documented Makefile. Executing make in the root of this repository builds the Docker containers for Spark and Hadoop, starts them, verifies the input data and generates the report

Either execute make in the root of the repository, or execute the individual targets make setup start verify report. Most targets are idempotent

Explore other commands using make help

  $ make help
  all                            setup start verify report
  clean-output                   Delete output data
  connect                        To enter the Spark container
  report                         Print the output report and save it to a file
  setup                          Build Docker containers
  start                          Starts Spark and Hadoop. Jupyter is at localhost:8888
  stop                           Stop and remove the containers
  verify                         Check if the input data is skewed

The output of make report is shown below. It is saved locally as well as in HDFS. Each line is # delimited, containing state, year, month, day, hour and total revenue; blank fields indicate a coarser rollup level

  $ make report
  AK#2016#8#1#11#123458
  AK#2016#8#1##123458
  AK#2016#8###123458
  AK#2016####123458
  AK#####123458
  AL#2017#8#1#10#123457
  AL#2017#8#1##123457
  AL#2017#8###123457
  AL#2017####123457
  AL#2016#8#1#12#123459
  AL#2016#8#1##123459
  AL#2016#8###123459
  AL#2016####123459
  AL#####246916
  CA#2016#2#1#9#246912
  CA#2016#2#1##246912
  CA#2016#2###246912
  CA#2016####246912
  CA#####246912
  OR#2016#2#1#9#123456
  OR#2016#2#1##123456
  OR#2016#2###123456
  OR#2016####123456
  OR#####123456
  The report has been saved to hdfs://hadoop:9000/output and locally at dataout/sales_by-state.txt
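
The report is hierarchical, so downstream consumers need to pick the rollup level they care about. As an illustration only (this snippet is not part of the repository), the state-level totals are the rows whose year, month, day and hour fields are all blank, and can be extracted like this:

  # Illustrative sketch: extract the state-level totals (all date fields blank)
  # from the locally saved report. The file name is taken from the output above.
  state_totals = {}
  with open("dataout/sales_by-state.txt") as report:
      for line in report:
          parts = line.rstrip("\n").split("#")
          if len(parts) != 6:
              continue
          state, year, month, day, hour, revenue = parts
          if not (year or month or day or hour):   # blank date fields = state rollup
              state_totals[state] = int(revenue)

  print(state_totals)   # e.g. {'AK': 123458, 'AL': 246916, 'CA': 246912, 'OR': 123456}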

When finished

Execute make stop on your host machine. This stops and removes the containers

Areas for improvement

  1. HDFS should preferably run on a separate machine or cluster. The same applies to Apache Spark
  2. Better management of data at rest
  3. Auto-generate test data of adequate size and auto-load it into Hadoop (see the sketch after this list)
  4. Add unit, functional and integration tests
  5. Add performance tests and harvest performance metrics for Apache Spark
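
As a hypothetical starting point for item 3, the sketch below generates synthetic customer and sales files. The record layout mirrors the assumptions used in the aggregation sketch earlier and may not match the real input schema; loading the files into HDFS (for example with hdfs dfs -put) would still be a separate step.

  # Hypothetical generator for synthetic input data; the record layout is an
  # assumption and may not match the real schema.
  import random

  STATES = ["AK", "AL", "CA", "OR"]

  def generate(path, customers=1000, sales=100000, seed=42):
      rng = random.Random(seed)
      with open(path + "/customers.txt", "w") as out:
          for cid in range(customers):
              out.write("%d#%s\n" % (cid, rng.choice(STATES)))
      with open(path + "/sales.txt", "w") as out:
          for _ in range(sales):
              out.write("%d#2016#%d#%d#%d#%d\n" % (rng.randrange(customers),
                                                   rng.randint(1, 12),
                                                   rng.randint(1, 28),
                                                   rng.randint(0, 23),
                                                   rng.randint(1, 500)))

  generate("datain")   # hypothetical local folder; then load into HDFS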

Is my data skewed? Identify and eliminate hot spots

Run check_input.py from the host machine

Now run the same script directly inside the container


  1. If the data is skewed, the partitions may be uneven and could create a hot spot
  2. Quantiles could provide a more effective measure of skewness (see the sketch after this list)
  3. Notice the difference between the real, user and sys times
  4. Notice that the user time exceeds the real time inside the container, which indicates that the work runs in parallel across multiple cores
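
check_input.py itself is not reproduced here. The sketch below is one illustrative way to measure skew: it computes the share of total revenue owned by the heaviest customers and a few per-customer revenue quantiles. The input path and field positions are assumptions.

  # Illustrative skew check, not the actual check_input.py. Assumes each sales
  # record is # delimited with customer_id first and the amount last.
  from collections import Counter

  totals = Counter()
  with open("datain/sales.txt") as sales:          # hypothetical local copy of the input
      for line in sales:
          fields = line.rstrip("\n").split("#")
          totals[fields[0]] += float(fields[-1])

  amounts = sorted(totals.values())
  n = len(amounts)
  grand_total = sum(amounts)

  # Share of revenue owned by the top 1% of customers
  top = amounts[int(n * 0.99):]
  print("top 1%% of customers own %.1f%% of revenue" % (100 * sum(top) / grand_total))

  # A few quantiles of per-customer revenue
  for q in (0.5, 0.9, 0.99):
      print("q%.2f = %.0f" % (q, amounts[min(n - 1, int(n * q))]))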

What does the output data look like? Plot the output and download summarized output data
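
One way to plot the summarized output is sketched below, assuming pandas and matplotlib are available (for example in the Jupyter container). The column layout and the local file name are taken from the report output above.

  # Illustrative plotting sketch; assumes pandas and matplotlib are installed.
  import pandas as pd
  import matplotlib.pyplot as plt

  cols = ["state", "year", "month", "day", "hour", "revenue"]
  df = pd.read_csv("dataout/sales_by-state.txt", sep="#", names=cols,
                   keep_default_na=False)

  # Keep only the state-level rollups (all date fields blank) and plot them
  states = df[(df.year == "") & (df.month == "") & (df.day == "") & (df.hour == "")]
  states.plot.bar(x="state", y="revenue", legend=False, title="Sales revenue by state")
  plt.tight_layout()
  plt.show()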


Check Spark performance metrics using sysstat. Plot the performance metrics as an interactive time series


This chart shows sysstat metrics for 4 executions of the sales_by_states.py job. The metrics are summarized at the quantiles given in the table below

  metric   unit    quantile
  cpu      %busy   75%
  disk     %util   99.5%
  runq     unit    99%

The dataout folder contains sysstat metrics for CPU, memory, network, disk and processes after you run the collect-sar and parse-sar commands inside the Jupyter container. Preview the files with:

  $ head -n2 dataout/*.dat
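
The exact layout of the parsed .dat files is not shown here. Assuming they are whitespace-delimited with a timestamp column, a time-series plot could look roughly like the sketch below; the file name and column names are hypothetical, so check head -n2 dataout/*.dat for the real ones.

  # Illustrative sketch for plotting one parsed sysstat metric as a time series.
  # The file name and column names are hypothetical.
  import pandas as pd
  import matplotlib.pyplot as plt

  cpu = pd.read_csv("dataout/cpu.dat", sep=r"\s+")        # hypothetical file
  cpu["timestamp"] = pd.to_datetime(cpu["timestamp"])     # hypothetical column
  cpu = cpu.set_index("timestamp")

  cpu["%busy"].plot(title="CPU %busy during sales_by_states.py runs")
  plt.ylabel("%busy")
  plt.show()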

How can we improve performance?

Detect and predict performance problems in distributed systems