
parallel-data-analysis's Introduction

Parallel Python: Analyzing Large Datasets

Student Goals

Students will walk away with a high-level understanding of both parallel problems and how to reason about parallel computing frameworks. They will also walk away with hands-on experience using a variety of frameworks easily accessible from Python.

Student Level

Knowledge of Python and general familiarity with the Jupyter notebook are assumed. This tutorial is aimed at a beginner to intermediate audience.

Outline

In the first half we cover basic ideas and common patterns in parallel computing: the embarrassingly parallel map, unstructured asynchronous submit, and large collections.
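As a rough illustration of these three patterns, here is a minimal sketch using only the standard library's concurrent.futures. It is not taken from the tutorial notebooks: the helper names (square, parallel_map, async_submit, partitioned_sum) are hypothetical, and a thread pool is assumed purely to keep the example self-contained. Python 3 is assumed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(x):
    """Stand-in for an independent unit of work (hypothetical)."""
    return x * x


def parallel_map(func, items, max_workers=4):
    """Embarrassingly parallel map: same function, independent inputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(func, items))


def async_submit(func, items, max_workers=4):
    """Unstructured asynchronous submit: collect results as they finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the input that produced it.
        futures = {pool.submit(func, item): item for item in items}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


def partitioned_sum(func, items, npartitions=4, max_workers=4):
    """Collection style: split into partitions, reduce each, then combine."""
    items = list(items)
    size = max(1, -(-len(items) // npartitions))  # ceiling division
    partitions = [items[i:i + size] for i in range(0, len(items), size)]
    partials = parallel_map(
        lambda part: sum(func(x) for x in part), partitions, max_workers
    )
    return sum(partials)


if __name__ == "__main__":
    print(parallel_map(square, range(5)))
    print(async_submit(square, range(5)))
    print(partitioned_sum(square, range(100)))
```

The same Executor interface is offered by ProcessPoolExecutor, so the choice of pool can usually be changed without touching the calling code.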

In the second half we cover complications arising from distributed-memory computing and exercise the lessons learned in the first half by running informative examples on provided clusters.

  • Part one
    • Parallel Map
    • Asynchronous Futures
    • High Level Datasets
  • Part two
    • Processes and Threads. The GIL, inter-worker communication, and contention.
    • Distributed deployment
    • Cluster computing exercises
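The "Processes and Threads" item in the outline can be previewed with a small experiment. The sketch below is an assumption about how one might demonstrate the GIL, not material from the tutorial: it runs the same pure-Python, CPU-bound function (the hypothetical cpu_bound) under a thread pool and a process pool and prints the wall-clock time of each. On CPython one would typically expect threads to give little speedup for this kind of work, though actual timings depend on the machine.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n):
    """Pure-Python loop; in CPython it holds the GIL while running."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_with(executor_cls, sizes, max_workers=2):
    """Map cpu_bound over sizes using the given executor class."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(cpu_bound, sizes))


def timed(executor_cls, sizes):
    """Return (results, elapsed seconds) for one executor class."""
    start = time.perf_counter()
    results = run_with(executor_cls, sizes)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # The guard matters for process pools on platforms that spawn workers.
    sizes = [200_000] * 4
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        _, seconds = timed(cls, sizes)
        print(f"{cls.__name__}: {seconds:.3f}s")
```

Process pools avoid the GIL but pay for it by pickling arguments and results between workers, which is one form of the inter-worker communication cost named in the outline.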

Chat Room

Stuck? Ask for help here: https://gitter.im/dask/pydata-dc-2016

Installation

  1. Install Anaconda

  2. Update select packages

    Everyone:

     conda install -c conda-forge ipyparallel ujson dask distributed bokeh scikit-learn pytables jupyter
     pip install snakeviz dask distributed --upgrade
    

    Python 2 users:

     conda install futures
    

    Linux/Mac users:

     conda install -c quasiben spark
    

Test your installation:

python -c 'import concurrent.futures, ipyparallel, dask, jupyter, pyspark'
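If the one-line check above fails, it reports only the first missing module. A slightly more verbose check, sketched below under the assumption that Python 3 is in use, lists every module that cannot be found. The helper name missing_modules is hypothetical. Note that pyspark is only installed by the Linux/Mac step, so it may legitimately appear as missing elsewhere.

```python
import importlib.util

# Modules imported by the tutorial's own installation test.
REQUIRED = ["concurrent.futures", "ipyparallel", "dask", "jupyter", "pyspark"]


def missing_modules(names):
    """Return the subset of names that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package is absent or a spec is malformed.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```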

Dataset Preparation

We will generate a dataset for local use. It takes up about 1 GB of space in a new local directory, data/.

pip install fakestockdata
python prep.py
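To confirm that the preparation step produced roughly the expected amount of data, one option is to total the file sizes under data/. The sketch below is a generic check and makes no assumption about the layout that prep.py creates. The helper names directory_size and format_gb are hypothetical.

```python
import os


def directory_size(path):
    """Sum the sizes in bytes of all regular files under path."""
    total = 0
    # os.walk yields nothing if the directory does not exist.
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def format_gb(nbytes):
    """Render a byte count in decimal gigabytes."""
    return f"{nbytes / 1e9:.2f} GB"


if __name__ == "__main__":
    print("data/ uses", format_gb(directory_size("data")))
```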

Part 1: Local Notebooks

Part one of this tutorial takes place on your laptop, using multiple cores. Run Jupyter Notebook locally and navigate to the notebooks/ directory.

jupyter notebook

The notebooks are numbered in order (1, 2, 3), so you can start with 01-map.ipynb.

Part 2: Remote Clusters

Part two of this tutorial takes place on a remote cluster.

Visit the following page to start an eight-node cluster: http://bigfatintegral.net/

If your cluster fails at any point, you can always start a new one by revisiting this page.

Warning: your cluster will be deleted when you close out. If you want to save your work, you will need to download your notebooks explicitly.

Slides

Brief, high-level slides are available at http://mrocklin.github.com/scipy-2016-parallel/.

Sponsored Cloud Provider

We thank Google for generously providing compute credits on Google Compute Engine.

Contributors

mrocklin, minrk, ahmadia, quasiben, rgbkrk
