Hadoop Twitter Triangle Count using partition join
Example code for CS6240
Fall 2019
Liang Xue
These components are installed:
- JDK 1.8
- Hadoop 2.8.5
- Maven
- AWS CLI (for EMR execution)
- Example ~/.bash_aliases:
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home
    export HADOOP_PREFIX=/usr/local/bin/hadoop-2.8.5    # Change this to where you unpacked hadoop to.
    export HADOOP_HOME=$HADOOP_PREFIX
    export HADOOP_COMMON_HOME=$HADOOP_PREFIX
    export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
    export HADOOP_HDFS_HOME=$HADOOP_PREFIX
    export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
    export HADOOP_YARN_HOME=$HADOOP_PREFIX
    export YARN_HOME=/usr/local/bin/yarn
    export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
- Explicitly set JAVA_HOME in $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home
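As a quick sanity check of the environment settings above, a small shell function along these lines can report which variables are unset or point to a directory that does not exist. This is only a sketch: check_hadoop_env is a hypothetical name, not something shipped with the project, and the set of variables it checks is an assumption.

```shell
# Hypothetical helper (not part of the project): check a few of the
# environment variables from the example ~/.bash_aliases.
# Prints one line per variable and returns non-zero if any check fails.
check_hadoop_env() {
  status=0
  for name in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "MISSING $name"
      status=1
    elif [ ! -d "$value" ]; then
      echo "NOT A DIRECTORY $name=$value"
      status=1
    else
      echo "OK $name=$value"
    fi
  done
  return "$status"
}
```

After sourcing ~/.bash_aliases in a new shell, running check_hadoop_env should print an OK line for each variable if the paths were adjusted correctly.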
All of the build & execution commands are organized in the Makefile.
- Unzip the project file.
- Open a command prompt.
- Navigate to the directory where the project files were unzipped.
- Edit the variables at the top of the Makefile to match your environment.
    Sufficient for standalone: hadoop.root, jar.name, local.input
    The other defaults are acceptable for running standalone.
- Standalone Hadoop:
    make switch-standalone    -- set standalone Hadoop environment (execute once)
    make local
- Pseudo-Distributed Hadoop: (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation)
    make switch-pseudo        -- set pseudo-clustered Hadoop environment (execute once)
    make pseudo               -- first execution
    make pseudoq              -- later executions since namenode and datanode already running
- AWS EMR Hadoop: (you must configure the emr.* config parameters at top of Makefile)
    make upload-input-aws     -- only before first execution
    make aws                  -- check for successful execution with web interface (aws.amazon.com)
    make download-output-aws  -- after successful execution & termination
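The choice between the pseudo and pseudoq targets in the pseudo-distributed step depends on whether the namenode and datanode are already running. One way to automate that choice is sketched below. It assumes the daemons appear in the output of the JDK's jps tool under the class names NameNode and DataNode; pseudo_target and has_daemon are hypothetical helper names, not part of the project Makefile.

```shell
# Hypothetical helpers, not part of the project Makefile.

# Succeeds when the second column of some line of jps-style output
# ("<pid> <ClassName>") equals the given daemon name exactly, so that
# "SecondaryNameNode" is not mistaken for "NameNode".
has_daemon() {
  printf '%s\n' "$1" | awk -v name="$2" '$2 == name { found = 1 } END { exit !found }'
}

# Prints the make target suggested by the steps above, given jps-style
# output as the first argument.
pseudo_target() {
  if has_daemon "$1" NameNode && has_daemon "$1" DataNode; then
    echo pseudoq   # daemons already running: later execution
  else
    echo pseudo    # daemons not running: first execution
  fi
}
```

With these defined, make "$(pseudo_target "$(jps)")" would run make pseudo the first time and make pseudoq once both daemons are up.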