Intro to Hadoop and MapReduce from Udacity

A set of scripts written in Python 2.7 for the Udacity course Intro to Hadoop and MapReduce.

These scripts can be run locally or on Hadoop as MapReduce jobs. To run them locally, pipe the data through the mapper, sort the mapper output, and pipe it through the reducer:

cat example_data.csv | ./example_mapper.py | sort | ./example_reducer.py > results.txt
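The shell pipeline above can also be imitated in a single Python process, which is handy for stepping through a job in a debugger. This is only a sketch: `mapper`, `reducer`, and `run_local` are hypothetical names, and the two-column `key,value` CSV input is an assumption, not the format of any file in this repository.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'key<TAB>value' line per well-formed record."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 2:
            continue  # skip malformed records instead of failing the job
        key, value = fields
        yield "{0}\t{1}".format(key, value)


def reducer(sorted_lines):
    """Sum the values for each key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(float(value) for _, value in group)
        yield "{0}\t{1}".format(key, total)


def run_local(lines):
    """Stand-in for `cat data | mapper | sort | reducer`."""
    return list(reducer(sorted(mapper(lines))))
```

`sorted()` plays the role of the `sort` step, which in turn stands in for Hadoop's shuffle phase. The reducer only works because all lines with the same key arrive next to each other.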

Lesson 3:

Part 1 - Sales Data

These scripts process sales data from different stores: purchases.txt

q1 - Sales per Category: This MapReduce program gives a sales breakdown by product category across all stores. Results are output in two columns: (1) category and (2) total sales for that category.

q2 - Highest Sale: This MapReduce program finds the value of the highest individual sale in each store. Results are output in two columns: (1) store and (2) highest sale.

q3 - Total Sales: This MapReduce program finds the total sales value across all stores and the total number of sales. Results are output in two rows: (1) total sales and (2) total number of sales.
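As an illustration of how the sales questions map onto a key/value job, here is a minimal sketch. It assumes each line of purchases.txt has six tab-separated fields (date, time, store, category, cost, payment method); that layout is an assumption, and `parse_purchase`, `map_sales`, and `reduce_by_key` are hypothetical helpers rather than the scripts in this repository.

```python
from itertools import groupby

STORE, CATEGORY = 0, 1  # positions in the tuple returned by parse_purchase


def parse_purchase(line):
    """Return (store, category, cost), or None for a malformed line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:  # assumed layout: date, time, store, category, cost, payment
        return None
    _date, _time, store, category, cost, _payment = fields
    try:
        return store, category, float(cost)
    except ValueError:
        return None


def map_sales(lines, key_index):
    """Emit (key, cost) pairs, keyed by STORE or CATEGORY."""
    for line in lines:
        record = parse_purchase(line)
        if record is not None:
            yield record[key_index], record[2]


def reduce_by_key(pairs, combine):
    """Group pairs by key and fold each group's values with `combine`."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, combine(value for _, value in group)
```

Under these assumptions, `reduce_by_key(map_sales(lines, CATEGORY), sum)` corresponds to q1 and `reduce_by_key(map_sales(lines, STORE), max)` to q2. For q3, summing and counting the costs from `map_sales` gives the two totals.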

Part 2 - Web Log Data

These scripts process data from a web server log file: access_log.

q1 - Hits to Page: This MapReduce program finds the number of hits for each file on the website. Results are output in two columns: (1) file and (2) number of hits.

q2 - Hits from IP: This MapReduce program determines the number of hits to the site made by each IP address. Results are output in two columns: (1) IP and (2) number of hits.

q3 - Most Popular: This MapReduce program finds the most popular file on the website. Results are output in one row with two columns: (1) most popular file and (2) number of hits.
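The three web log questions share one pattern: extract a key from each log line, then count how often each key occurs. The sketch below assumes access_log uses the Common Log Format (IP, identity, user, bracketed timestamp, quoted request, status, size); the regular expression and the helper names `map_hits`, `reduce_counts`, and `most_popular` are illustrative assumptions, not the repository's code.

```python
import re
from itertools import groupby

# Group 1 = client IP, group 2 = requested path (assumes Common Log Format).
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)')
IP, PATH = 1, 2


def map_hits(lines, field):
    """Emit the IP or the path of every line that parses."""
    for line in lines:
        match = LOG_RE.match(line)
        if match:
            yield match.group(field)


def reduce_counts(keys):
    """Count runs of identical keys; sorted() stands in for the shuffle."""
    for key, group in groupby(sorted(keys)):
        yield key, sum(1 for _ in group)


def most_popular(counts):
    """Return the (key, hits) pair with the most hits."""
    return max(counts, key=lambda kv: kv[1])
```

`reduce_counts(map_hits(lines, PATH))` corresponds to q1, the same call with `IP` to q2, and `most_popular` applied to the q1 output to q3. Note that `most_popular` raises `ValueError` on empty input and, on a tie, returns the first key in sorted order.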

Lesson 4:

Project:

These scripts process data about the users of, and posts on, Udacity's forums: forum_node.tsv. To test locally, you can use student_test_posts.tsv.

q1 - Student Times: This MapReduce program finds, for each student, the hour of the day in which that student has posted the most. Results are output in two columns: (1) user ID and (2) hour with the most posts.

q2 - Post and Answer Length: This MapReduce program relates the length of a question to the length of its answers, so that the correlation between the two can be examined. Results are output in three columns: (1) question ID, (2) question length, and (3) average answer length.

q3 - Top Tags: This MapReduce program finds the top tags used in posts. Results are output in two columns, sorted by popularity: (1) tag and (2) number of posts with the tag.

q4 - Study Groups: This MapReduce program lists, for each question, the users who interacted through it. Results are output in two columns: (1) question ID and (2) list of user IDs.
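To make the project jobs more concrete, here is a rough sketch of the Student Times question (q1). It assumes forum_node.tsv is tab-separated with quoted fields, that the author ID is the fourth column and the timestamp the ninth, and that timestamps look like `2012-02-25 08:09:06...`; all of these are assumptions about the data, and the function names are hypothetical. When several hours tie for a student, this sketch emits one row per tied hour, which is a design choice and not necessarily what the repository's script does.

```python
import csv
from collections import Counter
from itertools import groupby

AUTHOR_ID, ADDED_AT = 3, 8  # assumed column positions in forum_node.tsv


def map_student_hours(lines):
    """Emit (author_id, hour) for every post with a parseable timestamp."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) <= ADDED_AT or row[0] == "id":
            continue  # skip short rows and the header row
        hour = row[ADDED_AT][11:13]  # 'YYYY-MM-DD HH:...' -> 'HH'
        if hour.isdigit():
            yield row[AUTHOR_ID], int(hour)


def reduce_busiest_hours(pairs):
    """Emit (author_id, hour) for each student's most frequent hour(s)."""
    for user, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        counts = Counter(hour for _, hour in group)
        top = max(counts.values())
        for hour in sorted(h for h, n in counts.items() if n == top):
            yield user, hour
```

`csv.reader` is used instead of a plain `split("\t")` because forum post bodies may contain quoted tabs or newlines. The other project questions follow the same shape with a different key (question ID or tag) and a different per-group computation.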
