
mapreduce's Introduction

Project 1: MapReduce on a single server

Collaborators: Alolika Gon, Kinnri Sinha

Runs the mapper and reducer on multiple worker processes on a single server. Fault tolerance is provided by restarting a worker process if it is killed.

Application Explanation

Application 1: Inverted Index

This application takes document IDs and the words in each document as input and builds an inverted index, i.e., a mapping from each word to all the documents in which it appears.

Input:

Document ID \t Contents i.e. words separated by space

Output:

Word \t Document IDs in which this word is present

Code: src/UDF1.cpp
Input File: inputFile1.txt
Python Code: invertedIndex.py
Outputs from Python execution: true_outputFile1.txt
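
The map and reduce steps for this application can be sketched in Python as follows. This is an illustrative sketch only, not the code in src/UDF1.cpp or invertedIndex.py; the function names, the whitespace tokenization and the space-separated list of document IDs in the output are assumptions.

```python
from collections import defaultdict


def map_inverted_index(line):
    """Emit (word, doc_id) pairs for one 'doc_id<TAB>contents' input line."""
    doc_id, _, contents = line.rstrip("\n").partition("\t")
    return [(word, doc_id) for word in contents.split()]


def reduce_inverted_index(pairs):
    """Group document IDs by word, dropping duplicate IDs per word."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}


# Two made-up input lines in the 'Document ID \t Contents' format.
lines = ["1\tthe quick fox", "2\tthe lazy dog"]
pairs = [pair for line in lines for pair in map_inverted_index(line)]
index = reduce_inverted_index(pairs)

# Write one 'Word \t Document IDs' line per word.
for word in sorted(index):
    print(word + "\t" + " ".join(index[word]))
```

In this sample, "the" maps to both documents while every other word maps to exactly one.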

Application 2: Word Count

The goal of this application is to count the number of occurrences of each word in a document.

Input:

A text document

Output:

Word \t Number of occurrences of the word

Code: src/UDF2.cpp
Input File: inputFile2.txt
Python Code: spark_word_count.py
Outputs from Python execution: true_outputFile2.txt
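
A minimal Python sketch of the word count map and reduce steps is shown below. It is not the code in src/UDF2.cpp or spark_word_count.py; the function names are hypothetical, and it assumes words are split on whitespace, case-sensitively, with no punctuation handling.

```python
from collections import Counter


def map_word_count(line):
    """Emit a (word, 1) pair for every whitespace-separated word in the line."""
    return [(word, 1) for word in line.split()]


def reduce_word_count(pairs):
    """Sum the emitted counts for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# A made-up two-line document.
document = ["a b a", "b a"]
pairs = [pair for line in document for pair in map_word_count(line)]
counts = reduce_word_count(pairs)

# Write one 'Word \t Number of occurrences' line per word.
for word in sorted(counts):
    print(word + "\t" + str(counts[word]))
```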

Application 3: k-mer counter

In bioinformatics, k-mers are substrings of length k contained within a genome sequence of nucleotides (A, C, T and G). This application finds all k-length substrings of a genome sequence and counts the number of occurrences of each. We use k=3 for this application.

Input:

A genome sequence containing the 4 nucleotides (A, C, T and G)

Output:

3-mer sequence \t Number of occurrences of the 3-mer sequence

Code: src/UDF3.cpp
Input File: inputFile3.txt
Python Code: kmerCount.py
Outputs from Python execution: true_outputFile3.txt
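
The k-mer count can be illustrated with a short Python sketch. It is not the code in src/UDF3.cpp or kmerCount.py; the function name count_kmers is hypothetical, and it assumes overlapping windows that slide one nucleotide at a time over a single sequence.

```python
from collections import Counter


def count_kmers(sequence, k=3):
    """Count every overlapping substring of length k in the sequence."""
    sequence = sequence.strip()
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return Counter(windows)


# A made-up sequence; its 3-mers are ACG, CGT, GTA, TAC, ACG.
kmer_counts = count_kmers("ACGTACG")

# Write one '3-mer sequence \t Number of occurrences' line per 3-mer.
for kmer in sorted(kmer_counts):
    print(kmer + "\t" + str(kmer_counts[kmer]))
```

A sequence of length n yields n - k + 1 windows, so a sequence shorter than k produces no k-mers.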

Running automated testing:

testfile.py compiles the code and runs it with the config file set to each of UDF1, UDF2 and UDF3 in turn. It also runs the test cases for the three UDFs and tests the output against the true results generated by the Python files.

Run testfile.py (pass command-line argument 1 to test fault tolerance; otherwise pass 0):
pip install psutil
python testfile.py <0/1>

Three output files will be created, one for each UDF: outputFile1.txt, outputFile2.txt and outputFile3.txt.
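
How testfile.py compares an output file with its true_outputFile counterpart is not described here. One plausible approach, sketched below as an assumption, is an order-insensitive comparison of lines, since reducers may write keys in any order. The function name outputs_match is hypothetical.

```python
def outputs_match(actual_text, expected_text):
    """Return True if both texts hold the same non-empty lines, in any order."""
    def normalize(text):
        # Ignore blank lines and trailing whitespace, then sort for comparison.
        return sorted(line.rstrip() for line in text.splitlines() if line.strip())
    return normalize(actual_text) == normalize(expected_text)


# Made-up contents standing in for outputFile2.txt and true_outputFile2.txt.
actual = "b\t2\na\t3\n"
expected = "a\t3\nb\t2\n"
print(outputs_match(actual, expected))
```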

Running the system for one application:

Making changes to config.txt:

config.txt is the config file that defines the input file, the output file, the number of mappers and reducers, and the UDF to run. It has the following format:

app.inputfilename=inputFile3
app.outputfilename=output_dir/outputFile3
app.N=3
app.class_name=UDF3

Changes to app.inputfilename: Type the name of the file in plaintext. Do not use file extensions or quotes.
Changes to app.outputfilename: Type the name of the file in plaintext after output_dir/. Do not use file extensions or quotes. All files generated will be in .txt format.
Changes to app.class_name: There are 3 choices for this: UDF1/UDF2/UDF3.
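
To make the key=value format concrete, here is a minimal Python sketch that reads such a config and checks app.class_name. It is only an illustration and not the parser used by the C++ code; the function name parse_config is hypothetical, and it assumes app.N is the integer number of mappers and reducers.

```python
def parse_config(text):
    """Parse 'key=value' lines into a dict, skipping blank or malformed lines."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


SAMPLE = """app.inputfilename=inputFile3
app.outputfilename=output_dir/outputFile3
app.N=3
app.class_name=UDF3
"""

config = parse_config(SAMPLE)
n_workers = int(config["app.N"])  # assumed to be the mapper/reducer count

# Only the three UDFs listed above are valid choices.
if config["app.class_name"] not in {"UDF1", "UDF2", "UDF3"}:
    raise ValueError("app.class_name must be UDF1, UDF2 or UDF3")

# File names carry no extension in the config; outputs are written as .txt.
print(config["app.outputfilename"] + ".txt", n_workers)
```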

Compiling the code:

  1. Run g++ -o mapreduce.exe src/master_fault_tolerance.cpp src/worker.cpp src/UDF1.cpp src/UDF2.cpp src/UDF3.cpp in the main directory to create the .exe file.
  2. Run ./mapreduce.exe.

Test fault tolerance by passing 1 as argument:

  1. Run g++ -o mapreduce.exe src/master_fault_tolerance.cpp src/worker.cpp src/UDF1.cpp src/UDF2.cpp src/UDF3.cpp in the main directory to create the .exe file.
  2. Run ./mapreduce.exe and kill_process.py concurrently.
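
The restart behaviour being tested here can be illustrated with a small Python sketch. It is not the logic in src/master_fault_tolerance.cpp; the names run_command and supervise and the max_restarts limit are assumptions. The idea is simply that the master reruns a worker whenever it exits abnormally, for example after being killed.

```python
import subprocess


def run_command(cmd):
    """Run cmd once and return its exit code (negative if killed by a signal on POSIX)."""
    return subprocess.call(cmd)


def supervise(run_worker, max_restarts=3):
    """Call run_worker until it returns 0; return how many restarts were needed."""
    restarts = 0
    while True:
        exit_code = run_worker()
        if exit_code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError("worker still failing after %d restarts" % restarts)
        restarts += 1


# Simulated worker: killed twice (exit code -9), then finishes normally.
exit_codes = iter([-9, -9, 0])
restarts_needed = supervise(lambda: next(exit_codes))
print(restarts_needed)
```

With a real process, run_worker could be something like lambda: run_command(["./some_worker"]), where the command is a placeholder.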

