This guide provides step-by-step instructions and results for running two MapReduce jobs on a managed Hadoop cluster using Google Dataproc.
- Create a Dataproc cluster using the Cloud Console
- Establish an SSH connection to the master node by clicking on the cluster and selecting the "VM Instances" tab to expose the nodes
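Both steps can also be done from a terminal with the gcloud CLI; a minimal sketch, where the cluster name, region, and zone are placeholders and not taken from this guide:

# Cluster name, region, and zone below are hypothetical; substitute your own.
gcloud dataproc clusters create example-cluster --region=us-central1
# Dataproc names the master node <cluster-name>-m.
gcloud compute ssh example-cluster-m --zone=us-central1-a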
Here we create a Hadoop MapReduce application to find the maximum temperature for each day of the years 1901 and 1902 from the National Climatic Data Center (NCDC) weather records. The records exist in two files, one per year.
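A minimal sketch of the two streaming scripts is given below. The NCDC field offsets used in the mapper (date and temperature columns, quality flag, missing-value sentinel) are assumptions and may need adjusting to the actual record layout; the reducer relies only on the shuffle phase grouping all values for a date together.

# temperature_mapper.py -- sketch: parse each record and emit "date<TAB>temperature".
import sys

for line in sys.stdin:
    line = line.strip()
    if len(line) < 93:
        continue                 # skip short or malformed records
    date = line[15:23]           # YYYYMMDD (assumed offset)
    temp = line[87:92]           # signed temperature in tenths of a degree (assumed offset)
    quality = line[92]
    if temp != "+9999" and quality in "01459":
        print("%s\t%d" % (date, int(temp)))

# temperature_reducer.py -- sketch: keep the maximum temperature seen for each date.
import sys

current_date, max_temp = None, None
for line in sys.stdin:
    date, temp = line.strip().split("\t")
    temp = int(temp)
    if date != current_date:
        if current_date is not None:
            print("%s\t%d" % (current_date, max_temp))
        current_date, max_temp = date, temp
    else:
        max_temp = max(max_temp, temp)
if current_date is not None:
    print("%s\t%d" % (current_date, max_temp))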
- Upload the map, reduce, and data files to the local file system of the master node
- Copy the data files from the local file system to HDFS
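For example, assuming the records sit in a local data/ directory, the following copies them into the user's HDFS home directory so that the relative ./data input path used in the job resolves:

hadoop fs -mkdir -p data
hadoop fs -put data/* data/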
- Run the MapReduce job using the following command
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-file temperature_mapper.py \
-mapper 'python temperature_mapper.py' \
-file temperature_reducer.py \
-reducer 'python temperature_reducer.py' \
-input ./data \
-output /OutputFolder
- Collect the job statistics reported by the streaming driver after the run and check the output folder
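For example, the part files and their contents can be inspected with:

hadoop fs -ls /OutputFolder
hadoop fs -cat /OutputFolder/part-*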
- Merge the result into a single output file
hadoop fs -getmerge /OutputFolder/ output.txt
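Because taking a maximum is associative and commutative, the aggregation can also be applied on the map side as a combiner, reducing the amount of intermediate data shuffled to the reducer. A minimal sketch of temperature_combiner.py, assuming the tab-separated date/temperature pairs emitted by the mapper above; it holds one entry per date in memory, which is small for two years of data:

# temperature_combiner.py -- sketch: pre-aggregate map output locally,
# emitting one (date, max temperature) pair per date seen by this map task.
import sys

max_by_date = {}
for line in sys.stdin:
    date, temp = line.strip().split("\t")
    temp = int(temp)
    if date not in max_by_date or temp > max_by_date[date]:
        max_by_date[date] = temp
for date, temp in max_by_date.items():
    print("%s\t%d" % (date, temp))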
- Upload the map, combiner, reduce, and data files to the local file system of the master node
- Run the MapReduce job with the combiner using the following command
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-file temperature_mapper.py \
-mapper 'python temperature_mapper.py' \
-file temperature_reducer.py \
-reducer 'python temperature_reducer.py' \
-file temperature_combiner.py \
-combiner 'python temperature_combiner.py' \
-input ./data/ \
-output /OutputFolder
- Collect the job statistics after the run, compare them with the first job, and check the output folder
Here we develop an efficient MapReduce algorithm to find the 10 most frequent words in a collection of text files. The structure of the Top-N algorithm is outlined below.
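A minimal sketch of the two scripts, assuming simple lower-case tokenization in the mapper; the reducer sums the counts for each word (its input arrives grouped by word) and keeps only the 10 most frequent words in a small min-heap:

# top_n_mapper.py -- sketch: tokenize the input text and emit "word<TAB>1" pairs.
import re
import sys

for line in sys.stdin:
    for word in re.findall(r"[a-z']+", line.lower()):
        print("%s\t1" % word)

# top_n_reducer.py -- sketch: sum counts per word, keep the 10 largest, print them.
import heapq
import sys

N = 10
heap = []                      # min-heap of (count, word), capped at N entries

def push(word, total):
    heapq.heappush(heap, (total, word))
    if len(heap) > N:
        heapq.heappop(heap)    # drop the least frequent so only the top N remain

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.strip().split("\t")
    if word != current_word:
        if current_word is not None:
            push(current_word, count)
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    push(current_word, count)

for total, word in sorted(heap, reverse=True):   # most frequent first
    print("%s\t%d" % (word, total))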
- Upload the map, combiner, reduce, and data files to the local file system of the master node
- Copy the data files from the local file system to HDFS
- Run the MapReduce job using the following command with only one reducer, so that a single reducer sees the count for every word and can select the global top 10
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-file top_n_mapper.py \
-mapper 'python top_n_mapper.py' \
-file top_n_reducer.py \
-reducer 'python top_n_reducer.py' \
-input ./data/ \
-output /OutputFolder \
-numReduceTasks 1
- Collect the job statistics after the run
- Get the results
hadoop fs -get /OutputFolder/part-00000 output.txt