Python and Spark for Data Analysis

These are the IPython notebooks I used for a 4-day training course on Python and Spark for data science, given in December 2015 to a Data Minded client. The audience consisted of experienced data analysts, familiar with technologies like R and SPSS, but who had never used Python and had never worked on a Hadoop cluster.

The content has been mildly redacted to remove all references to the actual client, but is otherwise unchanged.

Each day consisted of working through a series of IPython notebooks. Exercises are interspersed throughout. The last notebook of each day contains solutions to that day's exercises.

Objectives

The objectives of the training were to:

  • Learn the fundamentals of Python
  • Learn the fundamentals of its statistical and machine learning packages
  • Learn Apache Spark using Python
  • Learn how to apply these technologies on a live Hadoop cluster

Prerequisites

Before the start of the course, we required the following software to be installed on students' laptops (a quick version check follows the list):

  • Anaconda 2.4.1 64-bit for Windows. The packages in this version of Anaconda included:
    • Python 2.7.11
    • IPython 4.0.1
    • NumPy 1.9.3
    • SciPy 0.16.0
    • Matplotlib 1.5.0
    • Pandas 0.17.1
    • Seaborn 0.6.0
    • Scikit-learn 0.17
  • Apache Spark 1.2.0. The version was chosen to match that in the client's production cluster, even though the latest release at the time of the course was 1.5.2.
  • JDK 7u79.
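
A quick way to verify that the expected versions are in place (an illustrative check, not part of the original notebooks):

```python
# Print the installed versions of Python and the scientific stack,
# to compare against the list above.
import sys
import numpy, scipy, matplotlib, pandas, seaborn, sklearn

print("Python       %s" % sys.version.split()[0])
print("NumPy        %s" % numpy.__version__)
print("SciPy        %s" % scipy.__version__)
print("Matplotlib   %s" % matplotlib.__version__)
print("Pandas       %s" % pandas.__version__)
print("Seaborn      %s" % seaborn.__version__)
print("Scikit-learn %s" % sklearn.__version__)
```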

Syllabus

The four days covered the following content.

Day 0: Fundamentals of Python

This day was intended for people with very limited programming experience and/or no Python experience. Day 0 was optional.

At the end of this day, the students were able to:

  • Start and run Python programs interactively with the Python CLI
  • Use an IDE to write programs and execute them, including command line arguments
  • Create notebooks locally and on a server
  • Import libraries
  • Store data in variables and understand their scope
  • Know the standard operators
  • Control the flow of a program
  • Perform common string operations such as concatenation, substring extraction, and replacement (see the sketch after this list)
  • Use the correct data structures
  • Use functions to structure a program
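
To give a flavor of the Day 0 material, here is a minimal sketch touching on strings, control flow, data structures, and functions (illustrative only, not taken from the notebooks):

```python
# Functions, control flow, strings and basic data structures in one sketch.
def word_lengths(sentence):
    """Return a dict mapping each word in the sentence to its length."""
    lengths = {}                      # a dict is the natural structure here
    for word in sentence.split():     # split() breaks a string on whitespace
        cleaned = word.strip(".,!?")  # common string operation: strip punctuation
        lengths[cleaned] = len(cleaned)
    return lengths

greeting = "Hello" + ", " + "Spark!"  # string concatenation
print(word_lengths(greeting))         # {'Hello': 5, 'Spark': 5}
```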

Day 1: Statistical and Machine Learning Packages

On Day 1, we discussed several of the powerful statistical and machine learning libraries in Python. It was purposely a very hands-on introduction, and we did not dive into the mathematics behind any of the algorithms.

At the end of this day, the students were able to:

  • Import and export data in CSV format
  • Use NumPy/SciPy to perform mathematical computations
  • Slice and dice data
  • Use pandas to wrangle data
  • Plot data and perform exploratory analysis
  • Use scikit-learn to build machine-learning models
  • Perform regression analysis in Python
  • Perform classification analysis in Python (a condensed example follows this list)
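
A condensed sketch of the kind of workflow covered on Day 1, combining pandas and scikit-learn (the file name and column names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load a CSV file into a DataFrame (file name and columns are hypothetical).
df = pd.read_csv("measurements.csv")

# Slice and dice: keep complete rows and select feature/target columns.
df = df.dropna()
X = df[["feature_a", "feature_b"]].values
y = df["target"].values

# Fit a simple regression model and inspect its coefficients.
model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)
```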

Day 2: Apache Spark and Python

On the second day, we dove into Spark, focusing on the essential parts. After a brief introduction to Spark Core, we explored Spark SQL and Spark MLlib.

At the end of this day, the students were able to:

  • Understand the role of Spark and PySpark in the ecosystem
  • Run Spark locally from a shell
  • Run Spark locally in IPython notebooks
  • Perform a word count on an input file (sketched after this list)
  • Load data in Spark SQL
  • Query data in Spark SQL
  • Use Spark MLlib to perform regression and classification analyses at scale
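
As an example of the Day 2 exercises, the classic word count in PySpark looks roughly like this (the input path is hypothetical; the RDD API shown is that of the Spark 1.x line used in the course):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

# Classic word count: split each line into words, pair each word with a 1,
# then sum the counts per word.
counts = (sc.textFile("input.txt")                  # hypothetical input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

# Show the ten most frequent words.
for word, count in counts.takeOrdered(10, key=lambda wc: -wc[1]):
    print("%s: %d" % (word, count))
```

When working in the pyspark shell, a SparkContext is already available as sc, so the explicit construction above is only needed in standalone scripts.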

Day 3: Python and Apache Spark on a Cluster

On this last day, we set up a small Cloudera Hadoop cluster on AWS and explored how everything we had learned could be run in a cluster environment. The second half of the day was set aside for an open-ended project. Possible projects included:

  1. setting up a machine learning pipeline on data from the UCI Machine Learning Repository;
  2. implementing a machine learning algorithm using Spark Core;
  3. testing to what extent Spark running times scale linearly with data size.

At the end of this day, the students were able to:

  • Run Python scripts on the cluster from a shell and from IPython notebooks
  • Use Spark to read from and write to HDFS
  • Use Spark SQL to read data from and write data to Hive
  • Understand how YARN works
  • Submit Spark jobs to the cluster
  • Use Spark, Spark SQL and Spark MLlib to run algorithms on large-scale data (a submission sketch follows this list)
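
A sketch of what a cluster-side job might look like, reading from and writing to HDFS (the paths and script name are hypothetical):

```python
# Submit to the cluster with, e.g.:
#   spark-submit --master yarn-client wordcount_hdfs.py
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-hdfs")

# Read a directory of text files from HDFS (hypothetical path), count words,
# and write the (word, count) pairs back to HDFS.
counts = (sc.textFile("hdfs:///user/training/input")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///user/training/wordcounts")
sc.stop()
```

With --master yarn-client (the Spark 1.x syntax), the driver runs on the submitting machine while the executors run in YARN containers on the cluster.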
