This guide provides step-by-step instructions and results for running two MapReduce jobs on a managed Hadoop cluster using Google Dataproc.
- Create a Dataproc cluster using the Cloud Console
- Establish an SSH connection to the master node by clicking on the cluster and selecting the "VM Instances" tab to expose the nodes
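Both steps can also be done from a terminal with the gcloud CLI; a minimal sketch, where the cluster name, region, and zone are placeholders and not taken from this guide:

# Cluster name, region, and zone below are hypothetical; substitute your own.
gcloud dataproc clusters create example-cluster --region=us-central1
# Dataproc names the master node <cluster-name>-m.
gcloud compute ssh example-cluster-m --zone=us-central1-a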
Here we create a Hadoop MapReduce application to find the maximum temperature for each day of the years 1901 and 1902 from the National Climatic Data Center (NCDC) weather records. The records exist in two files, one per year.
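A minimal sketch of the two streaming scripts is given below. The NCDC field offsets used in the mapper (date and temperature columns, quality flag, missing-value sentinel) are assumptions and may need adjusting to the actual record layout; the reducer relies only on the shuffle phase grouping all values for a date together.

# temperature_mapper.py -- sketch: parse each record and emit "date<TAB>temperature".
import sys

for line in sys.stdin:
    line = line.strip()
    if len(line) < 93:
        continue                 # skip short or malformed records
    date = line[15:23]           # YYYYMMDD (assumed offset)
    temp = line[87:92]           # signed temperature in tenths of a degree (assumed offset)
    quality = line[92]
    if temp != "+9999" and quality in "01459":
        print("%s\t%d" % (date, int(temp)))

# temperature_reducer.py -- sketch: keep the maximum temperature seen for each date.
import sys

current_date, max_temp = None, None
for line in sys.stdin:
    date, temp = line.strip().split("\t")
    temp = int(temp)
    if date != current_date:
        if current_date is not None:
            print("%s\t%d" % (current_date, max_temp))
        current_date, max_temp = date, temp
    else:
        max_temp = max(max_temp, temp)
if current_date is not None:
    print("%s\t%d" % (current_date, max_temp))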
- Upload the map, reduce, and data files to the local file system of the master node
- Copy the data files from the local file system to HDFS
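For example, assuming the records sit in a local data/ directory, the following copies them into the user's HDFS home directory so that the relative ./data input path used in the job resolves:

hadoop fs -mkdir -p data
hadoop fs -put data/* data/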
- Run the MapReduce job using the following command
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-file temperature_mapper.py \
-mapper 'python temperature_mapper.py' \
-file temperature_reducer.py \
-reducer 'python temperature_reducer.py' \
-input ./data \
-output /OutputFolder
- Collect the job statistics reported by the streaming driver after the run and check the output folder
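For example, the part files and their contents can be inspected with:

hadoop fs -ls /OutputFolder
hadoop fs -cat /OutputFolder/part-*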
- Merge the result into a single output file
hadoop fs -getmerge /OutputFolder/ output.txt
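Because taking a maximum is associative and commutative, the aggregation can also be applied on the map side as a combiner, reducing the amount of intermediate data shuffled to the reducer. A minimal sketch of temperature_combiner.py, assuming the tab-separated date/temperature pairs emitted by the mapper above; it holds one entry per date in memory, which is small for two years of data:

# temperature_combiner.py -- sketch: pre-aggregate map output locally,
# emitting one (date, max temperature) pair per date seen by this map task.
import sys

max_by_date = {}
for line in sys.stdin:
    date, temp = line.strip().split("\t")
    temp = int(temp)
    if date not in max_by_date or temp > max_by_date[date]:
        max_by_date[date] = temp
for date, temp in max_by_date.items():
    print("%s\t%d" % (date, temp))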
- Upload the map, combiner, reduce, and data files to the local file system of the master node
- Run the MapReduce job with the combiner using the following command
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-file temperature_mapper.py \
-mapper 'python temperature_mapper.py' \
-file temperature_reducer.py \
-reducer 'python temperature_reducer.py' \
-file temperature_combiner.py \
-combiner 'python temperature_combiner.py' \
-input ./data/ \
-output /OutputFolder
- Collect the job statistics after the run, compare them with the first job, and check the output folder
Here we develop an efficient MapReduce algorithm to find the 10 most frequent words in a collection of text files. The structure of the Top-N algorithm is outlined below.
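A minimal sketch of the two scripts, assuming simple lower-case tokenization in the mapper; the reducer sums the counts for each word (its input arrives grouped by word) and keeps only the 10 most frequent words in a small min-heap:

# top_n_mapper.py -- sketch: tokenize the input text and emit "word<TAB>1" pairs.
import re
import sys

for line in sys.stdin:
    for word in re.findall(r"[a-z']+", line.lower()):
        print("%s\t1" % word)

# top_n_reducer.py -- sketch: sum counts per word, keep the 10 largest, print them.
import heapq
import sys

N = 10
heap = []                      # min-heap of (count, word), capped at N entries

def push(word, total):
    heapq.heappush(heap, (total, word))
    if len(heap) > N:
        heapq.heappop(heap)    # drop the least frequent so only the top N remain

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.strip().split("\t")
    if word != current_word:
        if current_word is not None:
            push(current_word, count)
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    push(current_word, count)

for total, word in sorted(heap, reverse=True):   # most frequent first
    print("%s\t%d" % (word, total))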
- Upload the map, combiner, reduce, and data files to the local file system of the master node
- Copy the data files from the local file system to HDFS
- Run the MapReduce job using the following command with only one reducer, so that a single reducer sees the count for every word and can select the global top 10
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-file top_n_mapper.py \
-mapper 'python top_n_mapper.py' \
-file top_n_reducer.py \
-reducer 'python top_n_reducer.py' \
-input ./data/ \
-output /OutputFolder \
-numReduceTasks 1
- Collect the job statistics after the run
- Get the results
hadoop fs -get /OutputFolder/part-00000 output.txt