
Data Engineering 101: Building a Data Pipeline

This repository contains the files and data from the workshop as well as resources around Data Engineering. For the workshop (and after) we will use a Gitter chatroom to keep the conversation going: https://gitter.im/Jay-Oh-eN/data-engineering-101.

Please do not hesitate to reach out to me directly via email at [email protected] or on Twitter @clearspandex

The presentation is available on Slideshare and in this repository (presentation.pdf); a video recording of the workshop is also available.

Throughout this workshop, you will learn how to build a scalable and sustainable data pipeline in Python with Luigi.

Learning Objectives

  • Run a simple one-stage Luigi flow that reads from and writes to local files (see the sketch after this list)
  • Write a Luigi flow containing stages with multiple dependencies
    • Visualize the progress of the flow using the centralized scheduler
    • Parameterize the flow from the command line
    • Output parameter specific output files
  • Manage serialization to/from a Postgres database
  • Integrate a Hadoop Map/Reduce task into an existing flow
  • Parallelize non-dependent stages of a multi-stage Luigi flow
  • Schedule a local Luigi job to run once every day
  • Run any arbitrary shell command in a repeatable way
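
As a first taste, here is a minimal sketch of a one-stage flow (the task and file names are hypothetical, not taken from the workshop code): it reads a local file, writes a parameter-specific output file, and is driven from the command line.

    import luigi

    class CountWords(luigi.Task):
        # Settable from the command line, e.g. --input-path other.txt
        input_path = luigi.Parameter(default="input.txt")

        def output(self):
            # Parameter-specific output; Luigi skips the task if this exists
            return luigi.LocalTarget("counts-%s.txt" % self.input_path)

        def run(self):
            with open(self.input_path) as f:
                n = len(f.read().split())
            with self.output().open("w") as out:
                out.write("%d\n" % n)

    if __name__ == "__main__":
        # e.g. python count_words.py CountWords --local-scheduler
        luigi.run()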

Prerequisites

Prior experience with Python and the scientific Python stack is beneficial. The workshop will focus on using the Luigi framework, but also uses code from the following libraries:

  • numpy
  • scikit-learn
  • Flask

Run the Code

Local

  1. Install libraries and dependencies: pip install -r requirements.txt
  2. Start the UI server: luigid --background --logdir logs
  3. Navigate with a web browser to http://localhost:[port], where [port] is the port the luigid server started on (by default, 8082)
  4. Start the API server: python app.py
  5. Evaluate Model: python ml-pipeline.py EvaluateModel --input-dir text --lam 0.8
  6. Run the evaluation server (at localhost:9191): python topmodel/topmodel_server.py
  7. Run the final pipeline: python ml-pipeline.py BuildModels --input-dir text --num-topics 10 --lam 0.8


For parallelism, set --workers (note that this is task-level parallelism):

python ml-pipeline.py BuildModels --input-dir text --num-topics 10 --lam 0.8 --workers 4
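
To see what task parallelism buys you, consider a sketch like the following (task names are hypothetical): the two training stages share no dependency, so with --workers 2 Luigi is free to run them at the same time.

    import luigi

    class TrainA(luigi.Task):
        def output(self):
            return luigi.LocalTarget("model_a.txt")

        def run(self):
            with self.output().open("w") as f:
                f.write("model A\n")

    class TrainB(luigi.Task):
        def output(self):
            return luigi.LocalTarget("model_b.txt")

        def run(self):
            with self.output().open("w") as f:
                f.write("model B\n")

    class BuildAll(luigi.WrapperTask):
        def requires(self):
            # Independent upstream tasks: candidates for parallel execution,
            # e.g. python flows.py BuildAll --local-scheduler --workers 2
            return [TrainA(), TrainB()]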

Hadoop

  1. Start the Hadoop cluster: sbin/start-dfs.sh; sbin/start-yarn.sh
  2. Set up the directory structure: hadoop fs -mkdir /tmp/text
  3. Put the files on the cluster: hadoop fs -put ./data/text /tmp/text
  4. Retrieve the results: hadoop fs -getmerge /tmp/text-count/2012-06-01 ./counts.txt
  5. View the results: head ./counts.txt
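
For reference, a Luigi map/reduce word count looks roughly like the sketch below, in the spirit of hadoop_word_count.py (the task names and HDFS paths here are assumptions):

    import luigi
    import luigi.contrib.hadoop
    import luigi.contrib.hdfs

    class InputText(luigi.ExternalTask):
        # The raw text already uploaded to HDFS (step 3 above)
        def output(self):
            return luigi.contrib.hdfs.HdfsTarget("/tmp/text")

    class WordCount(luigi.contrib.hadoop.JobTask):
        date = luigi.DateParameter()

        def requires(self):
            return InputText()

        def output(self):
            # e.g. /tmp/text-count/2012-06-01, matching the getmerge step above
            return luigi.contrib.hdfs.HdfsTarget("/tmp/text-count/%s" % self.date)

        def mapper(self, line):
            for word in line.strip().split():
                yield word, 1

        def reducer(self, word, counts):
            yield word, sum(counts)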

Flask

  1. docker run -it -v /LOCAL/PATH/TO/REPO/data-engineering-101:/root/workshop clearspandex/pydata-seattle bash
  2. pip2 install flask
  3. ipython2 app.py
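
The server itself is small; a minimal sketch of a Flask app serving a pickled scikit-learn model follows (the model file, route, and payload shape are assumptions, not necessarily what app.py does):

    import pickle
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    with open("model.pkl", "rb") as f:  # hypothetical pickled sklearn Pipeline
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        text = request.json["text"]            # assumed payload: {"text": "..."}
        prediction = model.predict([text])[0]  # Pipeline handles vectorization
        return jsonify({"prediction": str(prediction)})

    if __name__ == "__main__":
        app.run()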

Libraries Used

  • luigi
  • numpy
  • scikit-learn
  • Flask
  • topmodel (Stripe's evaluation library)

What's in here?

text/                   20newsgroups text files
topmodel/               Stripe's topmodel evaluation library
example_luigi.py        example scaffold of a luigi pipeline
hadoop_word_count.py    example luigi pipeline using Hadoop
ml-pipeline.py          luigi pipeline covered in workshop
app.py                  Flask server to deploy a scikit-learn model
LICENSE                 Details of rights of use and distribution
presentation.pdf        lecture slides from presentation
readme.md               this file!

The Data

The data (in the text/ folder) is from the 20 newsgroups dataset, a standard benchmark dataset for machine learning and NLP. Each file in text/ corresponds to a single 'document' (or post) from one of two selected newsgroups (comp.sys.ibm.pc.hardware or alt.atheism). The first line indicates which newsgroup the document came from; everything thereafter is the body of the post. For example:

comp.sys.ibm.pc.hardware
I'm looking for a better method to back up files.  Currently using a MaynStream
250Q that uses DC 6250 tapes.  I will need to have a capacity of 600 Mb to 1Gb
for future backups.  Only DOS files.

I would be VERY appreciative of information about backup devices or
manufacturers of these products.  Flopticals, DAT, tape, anything.  
If possible, please include price, backup speed, manufacturer (phone #?), 
and opinions about the quality/reliability.

Please E-Mail, I'll send summaries to those interested.

Thanx in advance,
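
Parsing such a document in Python is a two-line affair: read the label off the first line, then take the rest as the body (a minimal sketch; the helper name is hypothetical).

    def load_document(path):
        """Split a 20newsgroups file into (newsgroup label, post body)."""
        with open(path) as f:
            label = f.readline().strip()  # e.g. 'comp.sys.ibm.pc.hardware'
            body = f.read()
        return label, body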


License

Copyright 2015 Jonathan Dinu.

All files and content are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License.
