## 30 second pitch

## Highlights

The `make` command will build the Docker containers for Apache Spark and Apache Hadoop, initialize the environment, verify the input data and generate the output report.

## Project plan

The timeline may be pulled in by running some tasks in parallel. Security, compliance and scope could affect the plan.

## Compliance checklist
The figure below shows the topology of how the data pipeline (`sales_by_states.py`) executes on a single-node development machine.

A mobile-friendly presentation can be viewed here. It was generated from the R Markdown sources inside `docs`.
## Prerequisites

- Ubuntu 16.04, docker 18.09.6 and docker-compose 1.24.0, available and ready to use
- `make` assumes that the input data is clean
- You may need to configure a proxy to pull the Docker images
- **Do not run in production!**
## Start here

The data pipeline executes inside Docker containers on a development machine. The entire pipeline is automated through a self-documented makefile. Executing the `make` command in the root of this repository will build the Docker containers for Spark and Hadoop, start them, verify the input data and generate the report.

Either execute `make` in the root of the repository, or run the individual commands: `make setup start verify report`. Most commands are idempotent.

Explore the other commands using `make help`:
```shell
$ make help
all           setup start verify report
clean-output  Delete output data
connect       To enter the Spark container
report        Print the output report and save it to a file
setup         Build Docker containers
start         Starts Spark and Hadoop. Jupyter is at localhost:8888
stop          Stop and remove the containers
verify        Check if the input data is skewed
```
The output of `make report` is shown below. It is saved locally as well as in Hadoop.
```shell
$ make report
AK#2016#8#1#11#123458
AK#2016#8#1##123458
AK#2016#8###123458
AK#2016####123458
AK#####123458
AL#2017#8#1#10#123457
AL#2017#8#1##123457
AL#2017#8###123457
AL#2017####123457
AL#2016#8#1#12#123459
AL#2016#8#1##123459
AL#2016#8###123459
AL#2016####123459
AL#####246916
CA#2016#2#1#9#246912
CA#2016#2#1##246912
CA#2016#2###246912
CA#2016####246912
CA#####246912
OR#2016#2#1#9#123456
OR#2016#2#1##123456
OR#2016#2###123456
OR#2016####123456
OR#####123456
```
The report is saved to `hdfs://hadoop:9000/output` and locally at `dataout/sales_by-state.txt`.
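The report rows follow a rollup pattern: each line aggregates sales at a progressively coarser level, from state/year/month/day/hour down to state alone, with the rolled-up dimensions left empty between the `#` separators. Below is a minimal pure-Python sketch of that idea; the actual job presumably does the equivalent with Spark's `rollup`, and the sample rows and column order here are assumptions, not the real input schema.

```python
from collections import defaultdict

# Hypothetical sample rows: (state, year, month, day, hour, amount).
sales = [
    ("OR", 2016, 2, 1, 9, 123456),
    ("CA", 2016, 2, 1, 9, 123456),
    ("CA", 2016, 2, 1, 9, 123456),
]

def rollup_report(rows, n_dims=5):
    """Aggregate at every prefix of the dimensions, like SQL ROLLUP.

    Rolled-up dimensions are left empty, yielding the
    'STATE#YEAR#MONTH#DAY#HOUR#TOTAL' lines of the report.
    """
    totals = defaultdict(int)
    for row in rows:
        *dims, amount = row
        for depth in range(n_dims, 0, -1):
            totals[tuple(dims[:depth])] += amount
    lines = []
    # Pad shorter keys with a high sentinel so deeper (more detailed)
    # rows sort before their rolled-up summaries, as in the report.
    sort_key = lambda k: [str(v) for v in k] + ["\uffff"] * (n_dims - len(k))
    for key in sorted(totals, key=sort_key):
        fields = [str(v) for v in key] + [""] * (n_dims - len(key))
        lines.append("#".join(fields) + "#" + str(totals[key]))
    return lines

print("\n".join(rollup_report(sales)))  # reproduces the CA and OR blocks
```

With the doubled CA row, the state-level line becomes `CA#####246912`, matching how `AL#####246916` in the report sums both AL years.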
## When finished

Execute `make stop` on your host machine. This stops and removes the containers.
## Areas of improvement

Run `check_input.py` from the host machine:
```shell
$ time make verify
customers-by-state (count: 4, mean: 1.25, stdev: 0.4330127018922193, max: 2.0, min: 1.0)
('CA', 2)
('AK', 1)
('AL', 1)
('OR', 1)
customers-by-transactions (count: 5, mean: 1.2, stdev: 0.4, max: 2.0, min: 1.0)
('123', 1)
('789', 2)
('456', 1)
('124', 1)
('101112', 1)

real	0m8.836s
user	0m0.041s
sys	0m0.025s
```
Now run the same script directly inside the container
```shell
$ make connect
jovyan@jupyter:~/work$ time spark-submit check_input.py 2>/dev/null
customers-by-state (count: 4, mean: 1.25, stdev: 0.4330127018922193, max: 2.0, min: 1.0)
('CA', 2)
('AK', 1)
('AL', 1)
('OR', 1)
customers-by-transactions (count: 5, mean: 1.2, stdev: 0.4, max: 2.0, min: 1.0)
('123', 1)
('789', 2)
('456', 1)
('124', 1)
('101112', 1)

real	0m8.280s
user	0m14.407s
sys	0m1.373s
```
Compare the `real`, `user` and `sys` times: inside the container the `user` time exceeds the `real` time, because the job runs on multiple cores in parallel.
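That `user` > `real` relationship can be observed in any multi-core workload. A small standard-library sketch (the worker count and busy-loop size are arbitrary choices, not part of the pipeline):

```python
import multiprocessing as mp
import os
import time

def burn(n=2_000_000):
    """CPU-bound busy work so the process accumulates user time."""
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    start = time.monotonic()
    procs = [mp.Process(target=burn) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    real = time.monotonic() - start
    # os.times().children_user accumulates CPU time of joined workers.
    t = os.times()
    print(f"real={real:.2f}s children_user={t.children_user:.2f}s")
    # With 2+ cores, children_user can exceed real, just as `user`
    # exceeded `real` for spark-submit inside the container.
```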
## sysstat

Plot the performance metrics as an interactive time series. The chart shows the sysstat metrics for 4 executions of the `sales_by_states.py` job. The metrics are reported at the quantiles given in the table below.
| metric | unit  | quantile |
|--------|-------|----------|
| cpu    | %busy | 75%      |
| disk   | %util | 99.5%    |
| runq   | unit  | 99%      |
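The quantile reduction itself is a one-liner once the samples are in memory; a sketch using only the standard library (the `%busy` sample values below are made up for illustration, and the real pipeline presumably computes these from the parsed sar data):

```python
import statistics

# Hypothetical %busy CPU samples from one sar collection run.
cpu_busy = [12.0, 35.5, 40.2, 55.1, 60.0, 62.3, 71.9, 80.4, 91.0, 99.8]

def quantile(samples, q):
    """Return the q-quantile (0 < q < 1) with inclusive interpolation."""
    # statistics.quantiles with n=1000 returns 999 cut points
    # at 0.1% steps; pick the one closest to q.
    pts = statistics.quantiles(samples, n=1000, method="inclusive")
    return pts[int(q * 1000) - 1]

print(f"cpu  %busy  75%   -> {quantile(cpu_busy, 0.75):.2f}")
print(f"cpu  %busy  99.5% -> {quantile(cpu_busy, 0.995):.2f}")
```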
The folder `dataout` contains sysstat metrics for CPU, memory, network, disk and proc after you run the `collect-sar` and `parse-sar` commands inside the `jupyter` container. Inspect them with `head -n2 dataout/*.dat`.
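If the `.dat` files follow the semicolon-separated layout that `sadf -d` emits, they can be loaded with the standard `csv` module. This is a sketch under that assumption; the actual layout depends on how `parse-sar` was written, and the sample rows below are fabricated for illustration:

```python
import csv
import io

# Hypothetical sadf -d style CPU output (semicolon-separated).
sample = """\
hostname;interval;timestamp;CPU;%user;%nice;%system;%iowait;%steal;%idle
jupyter;60;2019-06-01 10:00:01;-1;12.03;0.00;3.10;0.52;0.00;84.35
jupyter;60;2019-06-01 10:01:01;-1;25.40;0.00;4.92;0.10;0.00;69.58
"""

def read_sar(text):
    """Parse semicolon-separated sar rows into dicts, deriving %busy."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        row["%busy"] = 100.0 - float(row["%idle"])  # busy = 100 - idle
        rows.append(row)
    return rows

for r in read_sar(sample):
    print(r["timestamp"], f"{r['%busy']:.2f}")
```

The derived `%busy` column is what the 75% quantile in the table above would be computed over.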