Converting the Enron Email Dataset to mbox Format

The Enron Email Dataset is distributed in maildir format, which means that each message is stored in a separate file. With more than half a million messages, this is unwieldy to work with. Here's how you can convert maildir into mbox, where all messages in a folder are stored in a single mbox file.

Go fetch the dataset and then unpack:

$ tar xvfz enron_mail_20150507.tgz

The dataset should unpack into a directory called maildir. Use the script count_messages.sh to gather an inventory of the messages in each folder:

$ ./count_messages.sh

Verify the total number of messages in the dataset:

$ ./count_messages.sh | cut -d' ' -f1 | awk '{s+=$1} END {print s}'
517401
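If you'd rather do the inventory in Python, the same count can be sketched as below. This is not the repo's count_messages.sh; the function name `count_messages` is made up for illustration, and the sketch assumes that every regular file under maildir/ is one message.

```python
import os
from collections import Counter


def count_messages(root="maildir"):
    """Map each folder (relative to root) to the number of files it holds."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        # Assumption: every regular file in a folder is exactly one message.
        if filenames:
            counts[os.path.relpath(dirpath, root)] = len(filenames)
    return counts


if __name__ == "__main__":
    counts = count_messages()
    for folder, n in sorted(counts.items()):
        print(n, folder)
    print("total:", sum(counts.values()))
```

Run it from the directory that contains maildir/, before running the conversion script, since the conversion changes the directory layout.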

Now run the conversion script:

$ ./convert_enron_to_mbox.py

It might take a bit, so go grab a cup of coffee...

Note that the script is destructive: it alters the original directory structure of the dataset. This is necessary to put every folder into a proper maildir layout that Python tools can process (in particular, the script creates cur/ and new/ directories, which are part of the expected layout). Keep the tarball around if you want to restore the original layout later.
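To make the restructuring step concrete, here is a minimal sketch of how a single folder could be converted with Python's standard mailbox module. It is not the code in convert_enron_to_mbox.py; the function name `folder_to_mbox` and its arguments are hypothetical, and the sketch assumes that every loose file in the folder is one message.

```python
import mailbox
import os
import shutil


def folder_to_mbox(folder, mbox_path):
    """Rearrange `folder` into maildir layout in place, then copy its
    messages into the mbox file at `mbox_path`. Returns the message count."""
    # mailbox.Maildir expects cur/, new/ and tmp/ subdirectories.
    for sub in ("cur", "new", "tmp"):
        os.makedirs(os.path.join(folder, sub), exist_ok=True)

    # Destructive step: move the loose message files into cur/.
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            shutil.move(path, os.path.join(folder, "cur", name))

    src = mailbox.Maildir(folder, create=False)
    dst = mailbox.mbox(mbox_path)
    copied = 0
    try:
        for message in src:
            dst.add(message)  # appends the message with an mbox "From " line
            copied += 1
    finally:
        dst.close()  # close() also flushes pending writes to disk
    return copied
```

A call such as `folder_to_mbox("maildir/allen-p/_sent_mail", "enron/enron.allen-p._sent_mail")` would mirror the naming seen in the output directory, assuming enron/ already exists.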

After the script completes, the resulting mbox files are stored in the enron/ directory:

$ ls enron | wc
    3311    3311   93804

The repo includes ReadMbox.java, a very simple Java program that uses the JavaMail API to read the mbox files. The jars it depends on are checked into the repo for convenience, so you can compile directly:

$ javac -cp lib/javax.mail-1.5.6.jar:lib/mbox.jar ReadMbox.java

You can now examine a particular mbox file:

$ java -cp .:lib/javax.mail-1.5.6.jar:lib/mbox.jar ReadMbox enron/enron.allen-p._sent_mail

The program prints out the subject line of each email.
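If you don't have a Java toolchain handy, a rough Python counterpart can be sketched with the standard mailbox module. This is an illustration, not a port of ReadMbox.java; the function name `subjects` is hypothetical.

```python
import mailbox
import sys


def subjects(mbox_path):
    """Yield the Subject header of each message ('' when it is missing)."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            yield str(message.get("Subject", ""))
    finally:
        box.close()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python read_mbox.py <mbox file>
    for subject in subjects(sys.argv[1]):
        print(subject)
```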

To verify the integrity of the entire dataset in mbox format, run:

$ ./verify_mbox.sh > mbox.log &

The script runs in the background. Once it finishes, confirm that the number of messages is exactly the same as the maildir total from earlier:

$ cut -d' ' -f3 mbox.log | awk '{s+=$1} END {print s}'
517401
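The same cross-check can be sketched in Python. This is independent of verify_mbox.sh; the function name `count_mbox_messages` is hypothetical, and the sketch assumes that every regular file in the directory is an mbox file. Python's mbox reader splits messages on lines starting with "From ", so the count may be off if a file contains such lines unescaped inside a message body.

```python
import mailbox
import os


def count_mbox_messages(directory="enron"):
    """Return the total number of messages across all mbox files in `directory`."""
    total = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        box = mailbox.mbox(path, create=False)
        try:
            total += len(box)  # len() scans the file for message boundaries
        finally:
            box.close()
    return total
```

If the conversion was lossless, `count_mbox_messages("enron")` should equal the total reported for the maildir inventory.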

Contributors

snutesh
