aws_emr's Introduction

AWS_EMR - MapReduce

Learn to run a MapReduce program (written in Python) on the Amazon cloud EMR service.

We will use AWS EMR on the Enron email dataset (http://aws.amazon.com/datasets/enron-email-data/; see also https://en.wikipedia.org/wiki/Enron_scandal). This dataset contains 1,227,255 emails from Enron employees. The version we use consists of 50 GB of compressed files.

Input files for accessing the Enron data:

  • http://s3.amazonaws.com/enron-scripts/enron-urls-small.txt (smaller set)

  • http://s3.amazonaws.com/enron-scripts/enron-urls.txt (larger set)

  • Step-1: From the above input files, we extract the required data (date timestamp, sender, recipient).

  • Step-2: The output of Step-1 is passed to Enron-Wordcount-Mapper.py, which filters the records as follows:
      - only consider emails between 2001-09-05 and 2001-09-08 (inclusive)
      - only consider messages going from Enron employees to someone who is not part of the organization
      - count the number of such foreign interactions, and only include accounts (senders) that have more than one outside contact that week

  • Step-3: The output of the mapper is passed to Enron-Wordcount-Reducer.py.

  • Step-4: Merge the output file(s) from the reducer jobs into one single file using S3DistCp (s3-dist-cp in AWS EMR). Add a step in AWS EMR with the JAR location "command-runner.jar", and in 'Arguments' give the following:

        s3-dist-cp --src=s3://bucket-name/folder --dest=s3://bucket-name/folder2 --groupBy=.(to_be_grouped_by).

    Note: The groupBy argument is quite useful. Files are grouped by whatever the regular expression inside the parentheses matches. Refer to http://mlpebbles.blogspot.nl/2014/03/note-to-myself-on-s3distcp.html
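Steps 2 and 3 above can be sketched in Python as follows. This is only an illustrative sketch, not the repository's Enron-Wordcount-Mapper.py or Enron-Wordcount-Reducer.py. It assumes that Step-1 emits tab-separated lines of timestamp, sender, recipient, that each timestamp starts with YYYY-MM-DD, and that an Enron employee is anyone whose address ends in @enron.com. The names map_records, reduce_counts and is_internal are hypothetical.

```python
from datetime import date, datetime
from itertools import groupby

START = date(2001, 9, 5)   # first day of the window (inclusive)
END = date(2001, 9, 8)     # last day of the window (inclusive)
DOMAIN = "@enron.com"      # assumption: how an Enron employee is recognised


def is_internal(address):
    """True if the address belongs to the (assumed) Enron domain."""
    return address.strip().lower().endswith(DOMAIN)


def map_records(lines):
    """Yield (sender, recipient) for in-window mails from Enron to outsiders."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed records
        stamp, sender, recipient = parts
        try:
            day = datetime.strptime(stamp[:10], "%Y-%m-%d").date()
        except ValueError:
            continue  # skip records with an unparseable date
        if START <= day <= END and is_internal(sender) and not is_internal(recipient):
            yield sender.strip().lower(), recipient.strip().lower()


def reduce_counts(pairs):
    """Count outside interactions per sender; keep senders with more than one.

    `pairs` must already be sorted by sender, as it would be after the
    shuffle-and-sort phase.
    """
    for sender, group in groupby(pairs, key=lambda pair: pair[0]):
        count = sum(1 for _ in group)
        if count > 1:
            yield sender, count
```

Called as reduce_counts(sorted(map_records(lines))), this yields one (sender, count) pair per qualifying account. Sorting the pairs before reducing stands in for the shuffle-and-sort that EMR performs between the two stages.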

Instructions on how to use Amazon EMR: http://homepages.cwi.nl/~manegold/UvA-ABS-MBA-BDBA-BDIT-2017/MapReduceEnron.pdf

Tips: Before running the code in AWS EMR, try running the Python code locally on your laptop. It is easier to debug locally than in EMR:

    python Enron-Wordcount-Mapper-Details.py < part-000000 | python sort.py > temp0.out
    python Enron-Wordcount-Reducer.py < temp0.out > final0.out

When the code is run in AWS EMR, the output of the mapper step is automatically shuffled and sorted before being passed as input to the reducer job(s). However, when we run it locally on the laptop, we need to sort it explicitly; sort.py can be used for that purpose.
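The explicit local sort can be sketched as below. This is not the repository's sort.py, only a minimal stand-in that assumes the mapper writes one record per line with the key before the first tab. The function name sort_lines is hypothetical.

```python
def sort_lines(lines):
    """Return mapper output lines ordered by key (the text before the first tab).

    Python's sort is stable, so lines sharing a key keep their original order.
    """
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])


# Small in-memory demonstration of what the reducer would receive.
mapper_output = ["b@enron.com\t1\n", "a@enron.com\t1\n", "b@enron.com\t1\n"]
for record in sort_lines(mapper_output):
    print(record, end="")
```

Used as a script in the local pipeline, the same function would be applied to sys.stdin and the result written to sys.stdout, so that all records for one key reach the reducer as a contiguous run.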

Athena

To check the results of the Python MapReduce scripts (see the README for details), I wanted to query the ETL results (several files) stored in AWS S3. I came across Athena, an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Check out the Wiki page for details.

References

  • https://www.tutorialspoint.com/python/time_strptime.htm

  • How to install the dateutil package in Python: http://stackoverflow.com/questions/879156/how-to-install-python-dateutil-on-windows

        python -m pip install python-dateutil

Contributors: jamespaultg