enron-network-using-mapreduce-and-r's Introduction

Network analysis of Enron emails

Enron Network over Time Based on Sent Emails | Date Range: 1998-11->2001-04

What is it about?

We explore the social network aspect of the Enron Email dataset. The goal is to see how a network of people behaved in a company that was caught committing fraud and how knowledge flowed through that network. The project is also a good way to learn how to handle large unstructured data with the MapReduce model and open source tools such as a Hadoop virtual machine and R.

We use Hadoop, with plain shell as an alternative, to transform the semi-structured email data into something we can work with. With this data we can visualize the social network grouped by metrics such as edge betweenness, centrality index, etc.
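To make the idea of a centrality metric concrete, here is a minimal sketch of degree centrality computed from a list of sender/recipient pairs. It is written in Python for illustration only; the project's actual analysis is done in R, and the function name `degree_centrality` and the undirected, unweighted treatment of the edges are assumptions made for this example.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected, unweighted edge list.

    Hypothetical helper for illustration; not part of the repository.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops (emails sent to oneself)
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    # Repeated emails between the same two people count as a single edge.
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}
```

A node with a score of 1.0 has exchanged email with every other person in the network. Edge betweenness needs shortest-path counting and is better left to a graph library.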

The most challenging part of this project was coming up with rules to extract the metrics we wanted from the emails using MapReduce. We did this with regular expressions, Python, and Unix shell/Hadoop. It also turned out that Hadoop with a single worker node was much slower than plain shell when executing the MapReduce jobs. We provide examples in the code section for both shell and Hadoop. We ran two MapReduce jobs in shell, and each completed in around 25 minutes.
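As a rough illustration of the kind of extraction rule described above, the sketch below pulls the sender and the recipients out of one raw message with regular expressions. The patterns and the names `extract_pairs` and `emit` are assumptions made for this example; the repository's mapper1.py may use different rules. It also assumes the whole message is available as one string, whereas a streaming mapper reads stdin line by line and has to track the header lines itself.

```python
import re
import sys

# Hypothetical header patterns; the repository's mapper1.py may differ.
FROM_RE = re.compile(r"^From:[ \t]*(\S+@\S+)", re.MULTILINE)
# The To: header may continue on following lines that start with whitespace.
TO_RE = re.compile(r"^To:[ \t]*(.+(?:\n[ \t]+.+)*)", re.MULTILINE)
ADDRESS_RE = re.compile(r"[\w.\-+']+@[\w.\-]+")


def extract_pairs(message):
    """Return (sender, recipient) pairs found in one raw email message."""
    headers = message.split("\n\n", 1)[0]  # headers end at the first blank line
    sender = FROM_RE.search(headers)
    recipients = TO_RE.search(headers)
    if not sender or not recipients:
        return []
    sender_addr = sender.group(1).lower()
    return [(sender_addr, addr.lower())
            for addr in ADDRESS_RE.findall(recipients.group(1))]


def emit(pairs, out=sys.stdout):
    """Write one tab-separated 'sender<TAB>recipient' line per pair."""
    for sender, recipient in pairs:
        out.write(f"{sender}\t{recipient}\n")
```

Anchoring the patterns at the start of a line keeps headers such as X-From and X-To, and any address mentioned in the body, from being picked up by mistake.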

R Network Analysis

Links to R analysis with code and without code.

Code

Upload Enron data to hadoop

We assume the enron-emails dataset folder is in the data folder. You can download the Enron Email Dataset from this link. Commands are executed from the current (enron-network-using-mapreduce-and-R/) directory.

For Hadoop we use the CDH 5.5 virtual machine by Cloudera.

First, run the emails-rename.sh shell script to give the email files unique names instead of repeating numbers. Refer to the README.md in shell-scripts/ for more info.
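The renaming idea can be sketched in Python as follows. This is not the emails-rename.sh script itself. It assumes the dataset is laid out as `<root>/<user>/<folder>/<file>` and builds unique names by prefixing each file with its user directory; both the layout and the naming scheme are assumptions made for this example, and `flatten_folder` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def flatten_folder(source_root, folder_name, dest_dir):
    """Copy every file under <source_root>/<user>/<folder_name>/ into dest_dir.

    Each copy is prefixed with the user directory name so that files which
    share a number (1., 2., ...) across users no longer collide.
    Returns the number of files copied.
    """
    source_root, dest_dir = Path(source_root), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for user_dir in sorted(p for p in source_root.iterdir() if p.is_dir()):
        folder = user_dir / folder_name
        if not folder.is_dir():
            continue  # this user has no such folder
        for email_file in sorted(p for p in folder.iterdir() if p.is_file()):
            target = dest_dir / f"{user_dir.name}-{email_file.name}"
            shutil.copy(email_file, target)
            copied += 1
    return copied
```

Under these assumptions, `flatten_folder("data/enron-emails", "sent", "data/enron-emails-sent")` mirrors the argument order of the shell command below.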

mkdir data/enron-emails-sent data/enron-emails-inbox
sh shell-scripts/emails-rename.sh data/enron-emails sent data/enron-emails-sent
sh shell-scripts/emails-rename.sh data/enron-emails inbox data/enron-emails-inbox

Now we can upload the uniquely named files to HDFS.

hadoop fs -mkdir enron-sent enron-inbox
hadoop fs -put data/enron-emails-sent/* enron-sent
hadoop fs -put data/enron-emails-inbox/* enron-inbox

Mapreduce

Refer to the README.md in mapreducers/ for more info about the mappers and reducers. They can be executed in shell or with the Hadoop streaming API. Below are examples for the first MapReduce job. To execute the other one, change the digit in the names of the mapper and reducer from 1 to 2. The commands below reference mapper1.py and ../data directly, so they appear to assume that mapreducers/ is the working directory.
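For readers new to streaming jobs, the sketch below shows the usual contract for a reducer. It reads tab-separated `key<TAB>value` lines that are already sorted by key and sums the values for each key. It is a generic counting reducer under those assumptions, not the repository's reducer1.py, and the names `reduce_counts` and `main` are hypothetical.

```python
import sys
from itertools import groupby


def reduce_counts(lines):
    """Yield (key, total) for each run of consecutive lines sharing a key.

    Assumes every line is 'key<TAB>integer' and that the input is sorted by
    key, which Hadoop's shuffle phase (or `sort` in a shell pipeline) provides.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        yield key, sum(int(value) for _, value in group)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming entry point: read sorted lines, write 'key<TAB>total'."""
    for key, total in reduce_counts(stdin):
        stdout.write(f"{key}\t{total}\n")
```

To run it as a streaming reducer, call `main()` at the bottom of the file. Because `groupby` only merges adjacent lines, unsorted input would produce several partial totals for the same key, which is why the shell variant pipes the mapper output through `sort`.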

Hadoop

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input enron-sent \
-output conns-sent -file mapper1.py -file reducer1.py \
-mapper mapper1.py -reducer reducer1.py

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input enron-inbox \
-output conns-inbox -file mapper1.py -file reducer1.py \
-mapper mapper1.py -reducer reducer1.py

# download results (each job output is a directory of part-* files)
hadoop fs -cat 'conns-sent/part-*' > conns-sent.txt
hadoop fs -cat 'conns-inbox/part-*' > conns-inbox.txt

Edit the reducer1.py file: comment out lines 37 and 43, and uncomment lines 38 and 44.

For the commands below, we only changed the output names (n-conns-sent and n-conns-inbox).

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input enron-sent \
-output n-conns-sent -file mapper2.py -file reducer2.py \
-mapper mapper2.py -reducer reducer2.py

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input enron-inbox \
-output n-conns-inbox -file mapper2.py -file reducer2.py \
-mapper mapper2.py -reducer reducer2.py

# download results (each job output is a directory of part-* files)
hadoop fs -cat 'n-conns-sent/part-*' > n-conns-sent.txt
hadoop fs -cat 'n-conns-inbox/part-*' > n-conns-inbox.txt

Shell

find ../data/enron-emails-inbox/* -name '*.' -exec cat {} \; | python mapper1.py | sort | python reducer1.py > ../data/conns-inbox.txt

find ../data/enron-emails-sent/* -name '*.' -exec cat {} \; | python mapper1.py | sort | python reducer1.py > ../data/conns-sent.txt

As in the Hadoop section, edit reducer1.py before running the commands below; only the output filenames change.

find ../data/enron-emails-inbox/* -name '*.' -exec cat {} \; | python mapper1.py | sort | python reducer1.py > ../data/n-conns-inbox.txt

find ../data/enron-emails-sent/* -name '*.' -exec cat {} \; | python mapper1.py | sort | python reducer1.py > ../data/n-conns-sent.txt

enron-network-using-mapreduce-and-r's People

Contributors

kobakhit
