This project aims to analyse the Hadoop map-reduce framework and explore the following problems:
- Count word frequency for a given dataset
- Top K frequently occurring words in N files
We explore these problems on different datasets and with different implementations of the Map-Reduce program, and present a comparative analysis.
- Ensure that Hadoop is installed. We use a single-node setup to run these programs. More details: Single Node Setup.
We evaluate the map-reduce word count program on three datasets:
- Quora Insincere Questions Classification Dataset: ~100K records
- IMDB Dataset of 50K Movie Reviews: ~50K records
- SMS Spam Collection Dataset: ~5K records
- First, create an input directory in HDFS
hadoop fs -mkdir /quora_dataset
hadoop fs -mkdir /quora_dataset/input
- Add file_quora_dataset.txt (containing 100K records) inside the input directory in HDFS.
hadoop fs -put './datasets/file_quora_dataset.txt' /quora_dataset/input
- Run ./word_count/word_count_classes/word_count.jar and store the results in an output directory in HDFS.
hadoop jar ./word_count/word_count_classes/word_count.jar WordCount /quora_dataset/input /quora_dataset/output
- Download the IMDB dataset from the link and create a text file (datasets/file_imdb_dataset.txt) containing all 50K reviews. Add file_imdb_dataset.txt inside the input directory in HDFS.
hadoop fs -mkdir /imdb_dataset
hadoop fs -mkdir /imdb_dataset/input
hadoop fs -put './datasets/file_imdb_dataset.txt' /imdb_dataset/input
hadoop jar ./word_count/word_count_classes/word_count.jar WordCount /imdb_dataset/input /imdb_dataset/output
hadoop fs -mkdir /spam_dataset
hadoop fs -mkdir /spam_dataset/input
hadoop fs -put './datasets/file_spam_dataset.txt' /spam_dataset/input
hadoop jar ./word_count/word_count_classes/word_count.jar WordCount /spam_dataset/input /spam_dataset/output
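The counting that the WordCount job performs on each dataset can be sketched locally in plain Java, without any Hadoop dependency. This is a minimal illustration, not the code inside word_count.jar: the class name `WordCountSketch` is hypothetical, and splitting on whitespace with case-sensitive matching is an assumption about tokenization.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class WordCountSketch {
    // "Map" step: split each line into tokens (assumed: whitespace-separated).
    // "Reduce" step: add one to the running total for every occurrence.
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // blank lines produce one empty token
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countWords(Arrays.asList("to be or", "not to be"));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Same "word<TAB>count" layout as the job's text output
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In the real job the map and reduce steps run in separate tasks and the framework groups values by key between them; here both steps are collapsed into one loop for readability.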
We take 20 large text files given in the link and add them to the folder 20_Files. Our aim is to calculate the top 100 most frequently occurring words in these files. We have 3 implementations for this problem.
This is a basic solution. We simply run the word count program on the 20 files above and find the top 100 most frequently occurring words using the commands:
hadoop fs -mkdir /Top_100
hadoop fs -mkdir /Top_100/input
hadoop fs -put './20_Files' /Top_100/input
hadoop jar ./word_count/word_count_classes/word_count.jar WordCount /Top_100/input/20_Files /Top_100/output
hadoop fs -cat /Top_100/output/* | sort -n -k2 -r | head -n100
We map every word in the 20 files to the same key: <1,w1>, <1,w2>, <1,w3>, ..., <1,wn>. The reducer therefore receives <1,[w1,w2,w3,...,wn]>. In the reducer we maintain a HashMap that stores the count of each word as <w1,c1>, <w2,c2>, <w3,c3>, ..., <wn,cn>. We then add every entry of the HashMap to a TreeMap keyed by count, <c1,w1>, <c2,w2>, <c3,w3>, ..., <cn,wn>, so that the entries are sorted by count. Finally, we write the top 100 entries from the TreeMap.
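The reducer logic described above can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than the project's reducer: the class `SingleKeyTopK` and its method are hypothetical names, and the TreeMap here maps each count to a list of words, a small deviation from a plain <count,word> TreeMap so that words with equal counts do not overwrite each other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SingleKeyTopK {
    // 'words' plays the role of the value list [w1, w2, ..., wn] that the
    // single reducer receives for the shared key 1.
    static List<String> topK(Iterable<String> words, int k) {
        // Step 1: HashMap of word -> count
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }

        // Step 2: TreeMap of count -> words, largest count first
        TreeMap<Integer, List<String>> byCount =
                new TreeMap<>(Collections.reverseOrder());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), c -> new ArrayList<>())
                   .add(e.getKey());
        }

        // Step 3: emit the first k entries as "word<TAB>count"
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : byCount.entrySet()) {
            List<String> tied = e.getValue();
            Collections.sort(tied); // deterministic order among ties
            for (String w : tied) {
                if (out.size() == k) {
                    return out;
                }
                out.add(w + "\t" + e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(topK(words, 2));
    }
}
```

Because every word shares one key, all of the data flows through a single reducer and the HashMap must hold every distinct word in memory, which is worth keeping in mind when comparing this approach with the others.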
hadoop fs -mkdir /Top_100_part_2
hadoop fs -mkdir /Top_100_part_2/input
hadoop fs -put './20_Files' /Top_100_part_2/input
hadoop jar ./top_100/top_100_classes/top_100_count.jar Top100WordCountDriver /Top_100_part_2/input/20_Files /Top_100_part_2/output
We first map all the words as <w1,1>, <w2,1>, <w3,1>, ..., <wn,1>. The reducer receives <w1,[1,1,1]>, <w2,[1,1]>, ..., <wn,[1,1,...]>, with one 1 per occurrence of the word. We maintain a TreeMap of size 100 inside the reducer. As we compute the count <wk,Ck> for each word at the reducer, we add <Ck,wk> to the TreeMap. As soon as the size of the TreeMap exceeds 100, we remove the minimum element. This leaves the top 100 elements in the TreeMap after processing. Finally, during context cleanup, we write all the elements of the TreeMap.
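The bounded-structure idea described above can be sketched in plain Java. This is a hedged illustration, not the code in top_100_count.jar: `BoundedTopK`, `offer` and `results` are hypothetical names, and a TreeSet of (word, count) entries is swapped in for the TreeMap so that two words with the same count are both kept.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class BoundedTopK {
    private final int k;
    private final TreeSet<Map.Entry<String, Integer>> best;

    BoundedTopK(int k) {
        this.k = k;
        // Order by count ascending; among equal counts, the alphabetically
        // later word sorts first and is therefore evicted first.
        Comparator<Map.Entry<String, Integer>> order = (a, b) -> {
            int c = Integer.compare(a.getValue(), b.getValue());
            return c != 0 ? c : b.getKey().compareTo(a.getKey());
        };
        this.best = new TreeSet<>(order);
    }

    // Corresponds to one reduce() call, after the 1s for a word are summed.
    void offer(String word, int count) {
        best.add(new AbstractMap.SimpleEntry<>(word, count));
        if (best.size() > k) {
            best.pollFirst(); // drop the current minimum
        }
    }

    // Corresponds to cleanup(): emit what survived, largest count first.
    List<String> results() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : best.descendingSet()) {
            out.add(e.getKey() + "\t" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        BoundedTopK top = new BoundedTopK(2);
        top.offer("a", 3);
        top.offer("b", 5);
        top.offer("c", 1);
        top.offer("d", 5);
        System.out.println(top.results());
    }
}
```

The structure never holds more than k + 1 entries, so memory use depends on k rather than on the number of distinct words. The result is a global top k only if a single reducer sees every word.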
hadoop fs -mkdir /Top_100_part_3
hadoop fs -mkdir /Top_100_part_3/input
hadoop fs -put './20_Files' /Top_100_part_3/input
hadoop jar ./top_100/top_100_classes_part_2/top_100_count.jar TopKWordCountDriver /Top_100_part_3/input/20_Files /Top_100_part_3/output
An analysis report for the above tasks, including screenshots of the outputs, can be found here
The following link was helpful while completing the project: HADOOP OFFICIAL TUTORIAL