Python and Spark for Data Analysis

These are the IPython notebooks I used for a 4-day training course on Python and Spark for data science, given in December 2015 to a Data Minded client. The audience consisted of experienced data analysts, familiar with technologies like R and SPSS, but who had never used Python and had never worked on a Hadoop cluster.

The content has been mildly redacted to remove all references to the actual client, but is otherwise unchanged.

Each day consisted of working through a series of IPython notebooks. Exercises are interspersed throughout. The last notebook of each day contains solutions to that day's exercises.

Objectives

The objectives of the training were to:

  • Learn the fundamentals of Python
  • Learn the fundamentals of its statistical and machine learning packages
  • Learn Apache Spark using Python
  • Learn how to apply these technologies on a live Hadoop cluster

Prerequisites

Before the start of the course, we required the following software to be installed on students' laptops (a quick version check follows the list):

  • Anaconda 2.4.1 64-bit for Windows. The packages in this version of Anaconda included:
    • Python 2.7.11
    • IPython 4.0.1
    • NumPy 1.9.3
    • SciPy 0.16.0
    • Matplotlib 1.5.0
    • Pandas 0.17.1
    • Seaborn 0.6.0
    • Scikit-learn 0.17
  • Apache Spark 1.2.0. The version was chosen to match that in the client's production cluster, even though the latest release at the time of the course was 1.5.2.
  • JDK 7u79.
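
A quick way to verify that the expected versions are in place (an illustrative check, not part of the original notebooks):

```python
# Print the installed versions of Python and the scientific stack,
# to compare against the list above.
import sys
import numpy, scipy, matplotlib, pandas, seaborn, sklearn

print("Python       %s" % sys.version.split()[0])
print("NumPy        %s" % numpy.__version__)
print("SciPy        %s" % scipy.__version__)
print("Matplotlib   %s" % matplotlib.__version__)
print("Pandas       %s" % pandas.__version__)
print("Seaborn      %s" % seaborn.__version__)
print("Scikit-learn %s" % sklearn.__version__)
```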

Syllabus

The four days covered the following content.

Day 0: Fundamentals of Python

This day was intended for people with very limited programming experience and/or no Python experience. Day 0 was optional.

At the end of this day, the students were able to:

  • Start and run Python programs interactively with the Python CLI
  • Use an IDE to write programs and execute them, including command line arguments
  • Create notebooks locally and on a server
  • Import libraries
  • Store data in variables and understand their scope
  • Know the standard operators
  • Control the flow of a program
  • Perform common string operations such as concatenation, substring extraction, and replacement (see the sketch after this list)
  • Use the correct data structures
  • Use functions to structure a program
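
To give a flavor of the Day 0 material, here is a minimal sketch touching on strings, control flow, data structures, and functions (illustrative only, not taken from the notebooks):

```python
# Functions, control flow, strings and basic data structures in one sketch.
def word_lengths(sentence):
    """Return a dict mapping each word in the sentence to its length."""
    lengths = {}                      # a dict is the natural structure here
    for word in sentence.split():     # split() breaks a string on whitespace
        cleaned = word.strip(".,!?")  # common string operation: strip punctuation
        lengths[cleaned] = len(cleaned)
    return lengths

greeting = "Hello" + ", " + "Spark!"  # string concatenation
print(word_lengths(greeting))         # {'Hello': 5, 'Spark': 5}
```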

Day 1: Statistical and Machine Learning Packages

On Day 1, we discussed several of the powerful statistical and machine learning libraries in Python. It was purposely a very hands-on introduction, and we did not dive into the mathematics behind any of the algorithms.

At the end of this day, the students were able to:

  • Import and export data in CSV format
  • Use NumPy/SciPy to perform mathematical computations
  • Slice and dice data
  • Use pandas to wrangle data
  • Plot data and perform exploratory analysis
  • Use scikit-learn to build machine-learning models
  • Perform regression analysis in Python
  • Perform classification analysis in Python (a condensed example follows this list)
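
A condensed sketch of the kind of workflow covered on Day 1, combining pandas and scikit-learn (the file name and column names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load a CSV file into a DataFrame (file name and columns are hypothetical).
df = pd.read_csv("measurements.csv")

# Slice and dice: keep complete rows and select feature/target columns.
df = df.dropna()
X = df[["feature_a", "feature_b"]].values
y = df["target"].values

# Fit a simple regression model and inspect its coefficients.
model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)
```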

Day 2: Apache Spark and Python

On the second day, we dove into Spark, focusing on the essential parts. After a brief introduction to Spark Core, we explored Spark SQL and Spark MLlib.

At the end of this day, the students were able to:

  • Understand the role of Spark and PySpark in the ecosystem
  • Run Spark locally from a shell
  • Run Spark locally in IPython notebooks
  • Perform a word count on an input file (sketched after this list)
  • Load data in Spark SQL
  • Query data in Spark SQL
  • Use Spark MLlib to perform regression and classification analyses at scale
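
As an example of the Day 2 exercises, the classic word count in PySpark looks roughly like this (the input path is hypothetical; the RDD API shown is that of the Spark 1.x line used in the course):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

# Classic word count: split each line into words, pair each word with a 1,
# then sum the counts per word.
counts = (sc.textFile("input.txt")                  # hypothetical input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

# Show the ten most frequent words.
for word, count in counts.takeOrdered(10, key=lambda wc: -wc[1]):
    print("%s: %d" % (word, count))
```

When working in the pyspark shell, a SparkContext is already available as sc, so the explicit construction above is only needed in standalone scripts.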

Day 3: Python and Apache Spark on a Cluster

On this last day, we set up a small Cloudera Hadoop cluster on AWS and explored how everything we had learned could be run in a cluster environment. The second half of the day was set aside for an open-ended project. Possible projects included:

  1. setting up a machine learning pipeline on data from the UCI Machine Learning Repository;
  2. implementing a machine learning algorithm using Spark Core;
  3. testing to what extent Spark running times scale linearly with data size.

At the end of this day, the students were able to:

  • Run Python scripts on the cluster from a shell and from IPython notebooks
  • Use Spark to read from and write to HDFS
  • Use Spark SQL to read data from and write data to Hive
  • Understand how YARN works
  • Submit Spark jobs to the cluster
  • Use Spark, Spark SQL and Spark MLlib to run algorithms on large-scale data (a submission sketch follows this list)
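
A sketch of what a cluster-side job might look like, reading from and writing to HDFS (the paths and script name are hypothetical):

```python
# Submit to the cluster with, e.g.:
#   spark-submit --master yarn-client wordcount_hdfs.py
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-hdfs")

# Read a directory of text files from HDFS (hypothetical path), count words,
# and write the (word, count) pairs back to HDFS.
counts = (sc.textFile("hdfs:///user/training/input")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///user/training/wordcounts")
sc.stop()
```

With --master yarn-client (the Spark 1.x syntax), the driver runs on the submitting machine while the executors run in YARN containers on the cluster.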
