Hadoop MapReduce implementation of Market Basket Analysis for frequent item-set and association rule mining using the Apriori algorithm.
This Big Data project is a simple working model of Market Basket Analysis, implemented on the Hadoop MapReduce framework. It runs multiple MapReduce jobs to produce the final output: a K-pass Apriori algorithm mines the frequent item-sets, and association rule mining then generates all valid rules together with their Support, Confidence and Lift. Frequent item-sets are selected using a minimum Support threshold, and rules are validated using a minimum Confidence threshold. Duplicate, reverse and redundant rules are removed so that only interesting and useful rules remain. The final output is this list of rules, sorted by consequent (the RHS of the association) first and then by Lift. Building and running the project is automated using Gradle. Check the Usage section for more details.
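To make the three measures concrete, here is a minimal Python sketch that computes Support, Confidence and Lift on a toy dataset using their standard definitions, then sorts the rules by consequent and by Lift. It is an illustration only, not the project's MapReduce code: the transactions, the function names and the descending Lift order are all assumptions.

```python
# Toy transactions (hypothetical data, not one of the project's datasets).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"eggs"},
]
n = len(transactions)


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / n


def rule_measures(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    sup_rule = support(set(antecedent) | set(consequent))
    confidence = sup_rule / support(antecedent)
    lift = confidence / support(consequent)
    return sup_rule, confidence, lift


rules = []
for lhs, rhs in [(("milk",), ("bread",)), (("bread",), ("butter",)), (("bread",), ("milk",))]:
    sup, conf, lift = rule_measures(lhs, rhs)
    rules.append((lhs, rhs, sup, conf, lift))

# Sort by consequent (RHS) first, then by lift; descending lift is an assumption.
rules.sort(key=lambda r: (r[1], -r[4]))

for lhs, rhs, sup, conf, lift in rules:
    print(f"{lhs} -> {rhs}: support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A Lift above 1.0 means the antecedent and consequent occur together more often than they would if they were independent.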
Make sure the following dependencies are installed and set up on your system first:
- Linux Operating System
- Java JDK 1.8+
- Hadoop 2.7.1+
- Gradle 4.0.1+
First download the project as a zip archive and extract it to your desired location, or clone the repository using:
$ git clone https://github.com/pranitbose/market-basket-analysis.git
This project uses Hadoop 2.7.3 by default. If you have an older release of Hadoop installed, you can update the project's jar dependencies to match it. For example, if you have Hadoop 2.7.1 installed, edit the build.gradle file and change the version in the following lines from 2.7.3 to 2.7.1:
dependencies {
compile 'org.apache.hadoop:hadoop-common:2.7.1'
compile 'org.apache.hadoop:hadoop-mapreduce-client-core:2.7.1'
compile 'org.apache.hadoop:hadoop-core:1.2.1'
}
NOTE: This project has not been tested with Hadoop releases older than 2.7.1. Using an older release is not recommended, to avoid compatibility issues.
Please refer to the Gradle installation documentation to install and set up Gradle on your system.
Make sure that you start your cluster first by running start-dfs.sh followed by start-yarn.sh. You can check whether the daemons are running using the jps command.
You can change the configuration parameters to run this project with your own settings. All configurable parameters are available in the config file; edit this file to change their default values. Lines starting with # are comments provided for your information. Some of these parameters are passed as command-line arguments to the jar file when running the project, while others are used to automate the run. Multiple transaction datasets are provided under the dataset/ directory. To use your own dataset, copy the file into this directory and change the dataset name parameter in the config file accordingly. To run the project with a different minimum Support or Confidence value, change the value and save the file before the next run.
NOTE: Please don't change the order of the parameters in the config file; otherwise you will run into errors while running this project. You can add additional comments by prefixing them with #.
Since the entire process of running this project has been automated, you can get started right away. Move into the project folder and do the following:
Compile and Build the jar
cd market-basket-analysis
gradle build
Run the jar
You can run the project like this:
$ hadoop jar ./build/libs/mba.jar <inp_dir> <out_dir> <min_sup (0.0-1.0)> <min_conf (0.0-1.0)> <txns_count> <delimiter> <max_pass> <filterbylift (0|1)>
- inp_dir: path to the input dataset in HDFS
- out_dir: path to the output directory in HDFS which will store all the intermediate and final results.
- min_sup: minimum support value for frequent item-set mining. The value should be in the range 0.0 - 1.0
- min_conf: minimum confidence value for association rule mining. The value should be in the range 0.0 - 1.0
- txns_count: total number of transactions in the input dataset. Value should not be 0.
- delimiter: literal used in the dataset to separate the items within a single transaction line. For a .csv file the separator is a comma (,). If the separator is whitespace, enclose it in quotes like this: " "
- max_pass: maximum number of iterations you want the Apriori algorithm to run for. A value of 5 will find all frequent item-sets of size up to 5, if any exist at the threshold support specified above.
- filterbylift: a value of 1 keeps only rules with lift > 1.0 in the final output; a value of 0 outputs all rules irrespective of their lift value.
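To show how min_sup and max_pass interact, the following is a minimal in-memory Python sketch of level-wise (K-pass) Apriori. It is a generic illustration of the technique under stated assumptions, not the project's MapReduce implementation: the function name `apriori`, the `>=` comparison against min_sup and the toy data are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_sup, max_pass):
    """Return {frozenset: support} for frequent item-sets of size 1..max_pass."""
    n = len(transactions)  # plays the role of txns_count
    frequent = {}
    # Pass 1 candidates: every distinct single item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    for k in range(1, max_pass + 1):
        # Count how many transactions contain each candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Keep candidates that reach the threshold (>= is an assumption).
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}
        if not level:
            break
        frequent.update(level)
        # Join step: unions of two frequent k-item-sets that have size k + 1.
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        if not candidates:
            break
    return frequent


# Hypothetical toy data: one set of items per transaction.
toy = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"eggs"},
]
for itemset, sup in sorted(apriori(toy, 0.4, 3).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The loop stops early when a pass yields no frequent item-sets or no new candidates, which is why max_pass is an upper bound on the item-set size rather than a guarantee.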
For example:
hdfs dfs -mkdir -p /input
hdfs dfs -mkdir -p /output
# Put the dataset groceries.csv in the market-basket-analysis folder
# Move into the project folder and do the following
cd market-basket-analysis
hdfs dfs -copyFromLocal groceries.csv /input
hadoop jar ./build/libs/mba.jar /input /output 0.02 0.4 9835 "," 10 1
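The txns_count argument (9835 in the command above) has to match your dataset. As a rough sketch, assuming one transaction per line and that blank lines are not transactions, you could count them in Python as shown below. The helper names `count_transactions` and `min_count` are hypothetical, and the `>=` comparison used to turn min_sup into an absolute count is an assumption about how the threshold is applied.

```python
import io
import math


def count_transactions(lines):
    """Count non-empty lines, assuming one transaction per line."""
    return sum(1 for line in lines if line.strip())


def min_count(min_sup, txns_count):
    """Smallest transaction count that reaches min_sup (assumes a >= comparison)."""
    # round() guards against floating-point noise such as 0.1 * 30 = 3.0000000000000004.
    return math.ceil(round(min_sup * txns_count, 9))


# Hypothetical sample standing in for a real dataset file.
sample = io.StringIO("milk,bread\nbread,butter\n\nmilk,bread,butter\n")
txns_count = count_transactions(sample)

print(txns_count)
print(min_count(0.02, 9835))
```

For a real file, pass an open file object instead of the sample, for instance `count_transactions(open("groceries.csv"))`.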
Get the result:
hdfs dfs -cat /output/final-output/*
This project is licensed under the terms of the MIT license.