pentateu / ipld-eml Goto Github PK

This project forked from rtradeltd/ipld-eml

An RFC-5322 compatible email parser that stores data on IPFS

License: GNU Affero General Public License v3.0

Makefile 4.22% Go 95.78%

ipld-eml

ipld-eml is an RFC-5322 compliant IPLD object format for storing email messages in a manner that is both space-efficient and time-efficient. TemporalX is used as the interface into IPFS. Emails are converted into a protocol buffer object before being stored on IPFS. There are currently two methods for storing the IPLD objects:

  • Entirely as a UnixFS object
  • Chunked into 1MB blocks, with all blocks wrapped in a single unixfs object.

This repository also includes a CLI tool that lets you convert emails manually or generate fake emails.

data format overview

unixfs workflow

  • Email is converted into protocol buffer object
  • Protocol buffer object is saved onto IPFS as a unixfs object
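The first step above can be sketched with Go's standard `net/mail` package, which parses RFC 5322 messages. This is only an illustration: `ParsedEmail` and `ParseEmail` are hypothetical names standing in for the project's protocol buffer message and conversion code, whose actual schema is not shown here, and the step that saves the object to IPFS through TemporalX is left out.

```go
package main

import (
	"io"
	"net/mail"
	"strings"
)

// ParsedEmail is a hypothetical stand-in for the project's protocol buffer
// message. The real schema may carry many more fields.
type ParsedEmail struct {
	From    string
	To      []string
	Subject string
	Body    []byte
}

// ParseEmail reads a raw RFC 5322 message and copies a few header fields and
// the body into a ParsedEmail.
func ParseEmail(raw string) (*ParsedEmail, error) {
	msg, err := mail.ReadMessage(strings.NewReader(raw))
	if err != nil {
		return nil, err
	}
	// AddressList parses the comma-separated recipients of the To header.
	to, err := msg.Header.AddressList("To")
	if err != nil {
		return nil, err
	}
	body, err := io.ReadAll(msg.Body)
	if err != nil {
		return nil, err
	}
	out := &ParsedEmail{
		From:    msg.Header.Get("From"),
		Subject: msg.Header.Get("Subject"),
		Body:    body,
	}
	for _, addr := range to {
		out.To = append(out.To, addr.Address)
	}
	return out, nil
}
```

Calling `ParseEmail` on the contents of an `.eml` file returns the structured fields. A real converter would also need to walk MIME parts to extract attachments, which this sketch does not attempt.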

chunked workflow

  • Email is converted into protocol buffer object
  • Object is serialized
  • Serialized byte slice is split into chunks slightly under 1MB in size
  • Each chunk is stored on IPFS as a block
  • A protocol buffer "chunked email" object is created
  • A map of chunk number -> block hash is stored in the chunked email object
  • Chunked email object is stored on IPFS as a UnixFS object (done to avoid possible issues with storing a protocol buffer object larger than 1MB directly)
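The chunked workflow above can be sketched roughly as follows. Everything named here is an assumption for illustration: `BlockStore` is a toy in-memory map standing in for IPFS block storage, a hex SHA-256 digest stands in for the block hash (IPFS itself uses CIDs), `ChunkedEmail` stands in for the "chunked email" protocol buffer object, and `defaultChunkSize` is just one possible value "slightly under 1MB".

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// defaultChunkSize is an assumed value slightly under 1MB.
const defaultChunkSize = 1<<20 - 1024

// BlockStore is a toy in-memory stand-in for IPFS block storage,
// keyed by the hex SHA-256 digest of each block.
type BlockStore map[string][]byte

// Put stores a copy of data and returns its key.
func (s BlockStore) Put(data []byte) string {
	sum := sha256.Sum256(data)
	key := hex.EncodeToString(sum[:])
	s[key] = append([]byte(nil), data...)
	return key
}

// ChunkedEmail is a hypothetical stand-in for the "chunked email" object:
// it maps chunk number -> block hash.
type ChunkedEmail struct {
	Parts map[int32]string
}

// ChunkAndStore splits the serialized email into chunks of at most size
// bytes, stores each chunk as a block, and records its hash by chunk number.
func ChunkAndStore(serialized []byte, size int, store BlockStore) ChunkedEmail {
	if size <= 0 {
		size = defaultChunkSize
	}
	out := ChunkedEmail{Parts: make(map[int32]string)}
	var n int32
	for start := 0; start < len(serialized); start += size {
		end := start + size
		if end > len(serialized) {
			end = len(serialized)
		}
		out.Parts[n] = store.Put(serialized[start:end])
		n++
	}
	return out
}

// Reassemble fetches the chunks in order and concatenates them.
// It reports false if any block is missing from the store.
func Reassemble(c ChunkedEmail, store BlockStore) ([]byte, bool) {
	var out []byte
	for n := int32(0); n < int32(len(c.Parts)); n++ {
		block, ok := store[c.Parts[n]]
		if !ok {
			return nil, false
		}
		out = append(out, block...)
	}
	return out, true
}
```

Because every chunk is addressed by its hash, different nodes can each hold a subset of the blocks, and the small `ChunkedEmail` index is all that is needed to find and reorder them.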

The chunked method has a very minor overhead compared to the pure UnixFS method, but enables more fine-grained distribution of chunks across nodes in the network.

samples

To reliably estimate space savings and performance, a set of sample emails is included in the repository in the samples directory. The root of the samples directory contains emails I've sent to myself as an initial test dataset, and an email I received from a newsletter. The samples/generated directory contains 5000 emails randomly generated with the analysis package. The non-generated samples contain highly duplicated data and are meant to showcase a best-case space savings example.

Overview of the various samples:

  • sample1.eml is a basic email message with no attachments
  • sample2.eml is an email message with an attachment
  • sample3.eml is sample2.eml but forwarded to myself
  • sample4.eml is a few replies to sample3.eml and sending the same image back
  • sample5.eml is a few replies to sample4.eml with roughly 1.6MB in attachments/embedded files
  • sample6.eml is a reply to sample5.eml with CC+BCC, and more files
  • sample7.eml is a reply to sample6.eml but with samples 1 -> 6 attached
  • sample8.eml is an email I received from the golang weekly mailing list

samples (generated)

The generated directory contains 5000 emails generated with the fake email generator in the analysis package. Each email has a randomly generated 720x720 image attached to it, as well as one emoji per paragraph, with a total of 100 paragraphs.

The following command was used to generate the data:

$> ./eml-util gfe --paragraph.count 100 --email.count 5000 --emoji.count 100 --outdir=samples/generated

Size on disk is about 144MB and size on IPFS is about 133MB, which gives a space savings of about 8%.
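The 8% figure follows from the two sizes quoted above. As a small worked check (the function name `SavingsPercent` is just illustrative, not part of the repository):

```go
package main

// SavingsPercent returns the percentage by which ipfsSize is smaller than
// diskSize. Both arguments must use the same unit (for example MB).
func SavingsPercent(diskSize, ipfsSize float64) float64 {
	return (diskSize - ipfsSize) / diskSize * 100
}
```

With the rounded sizes quoted above, `SavingsPercent(144, 133)` comes out a little above 7.6, which rounds to the quoted 8%.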

space savings

The non-generated samples are intended to showcase best-case space savings, where the data is predominantly duplicated. In the non-generated samples, the bulk of the data is composed of the same picture, meant to simulate a situation where the same photo (e.g. a cat picture) is sent to many different people.

The generated samples are intended to showcase the "average/worst-case" space savings, where the deduplication is largely derived from emails sharing a small subset of their total chunks due to the way content-addressing works, as opposed to savings from not having to store the same data/file more than once when that data composes most of the emails.
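To make the deduplication effect concrete, here is a toy content-addressed store. It is a sketch under stated assumptions, not how IPFS or this project chunks data: `DedupStore` is a hypothetical type, chunks are fixed-size, and a SHA-256 digest is used as the content address. Identical chunks hash to the same key and are therefore kept only once.

```go
package main

import "crypto/sha256"

// DedupStore is a toy content-addressed store (hypothetical, for
// illustration). Blocks are keyed by the SHA-256 digest of their content.
type DedupStore struct {
	blocks  map[[32]byte][]byte
	logical int // total bytes handed to Add, before deduplication
}

// NewDedupStore returns an empty store.
func NewDedupStore() *DedupStore {
	return &DedupStore{blocks: make(map[[32]byte][]byte)}
}

// Add splits data into fixed-size chunks and stores each chunk under its
// hash. A chunk whose hash is already present is not stored again.
func (s *DedupStore) Add(data []byte, chunkSize int) {
	if chunkSize <= 0 {
		chunkSize = len(data) + 1 // treat the input as a single chunk
	}
	for start := 0; start < len(data); start += chunkSize {
		end := start + chunkSize
		if end > len(data) {
			end = len(data)
		}
		chunk := data[start:end]
		key := sha256.Sum256(chunk)
		if _, seen := s.blocks[key]; !seen {
			s.blocks[key] = append([]byte(nil), chunk...)
		}
		s.logical += len(chunk)
	}
}

// LogicalBytes is the size the data would occupy without deduplication.
func (s *DedupStore) LogicalBytes() int { return s.logical }

// StoredBytes is the size actually kept after deduplication.
func (s *DedupStore) StoredBytes() int {
	total := 0
	for _, b := range s.blocks {
		total += len(b)
	}
	return total
}
```

Adding two messages that differ only in a short header but share the same attachment bytes stores the attachment chunks once. Note that fixed-size chunking only deduplicates chunks that line up identically in both inputs, which is one reason the savings drop when little content is shared.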

Sample Set | IPLD Format | Number Of Emails | IPFS Size | Disk Size | Space Savings | Scenario
Real | Pure UnixFS | 8 | 1.93MB | 11MB | 578% | Best Case (lots of duplicated emails + images)
Generated | Pure UnixFS | 5000 | 133MB | 144MB | 8% | Worst Case (virtually no duplicated emails and images)

At face value, the worst-case savings of 8% might not seem like much. However, when extrapolated to larger data sizes, even 8% savings makes a large difference.

Scenario | Disk Size | IPFS Size | Space Saved (no raid / raid-0) | Space Saved (raid-1)
Best | 20PB | 3.46PB | 16.54PB | 33.08PB
Best | 20GB | 3.46GB | 16.54GB | 33.08GB
Best | 20MB | 3.46MB | 16.54MB | 33.08MB
Worst | 20PB | 18.4PB | 1.6PB | 3.2PB
Worst | 20GB | 18.4GB | 1.6GB | 3.2GB
Worst | 20MB | 18.4MB | 1.6MB | 3.2MB
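The rows above can be reproduced with a short calculation. The reading used here is inferred from the numbers rather than stated in the text: the best-case rows appear to divide the disk size by 5.78 (the 578% figure treated as a size ratio), the worst-case rows apply an 8% reduction, and the raid-1 column doubles the saved space because mirroring stores every byte twice. `Projection` and `Project` are illustrative names only.

```go
package main

// Projection holds the extrapolated sizes for one table row.
// All values share the unit of the disk size passed to Project.
type Projection struct {
	IPFSSize   float64 // size after storing on IPFS
	Saved      float64 // space saved with no RAID or RAID-0
	SavedRAID1 float64 // space saved with RAID-1 (two copies of everything)
}

// Project extrapolates from a disk size and the fraction of that size that
// remains once the data is stored on IPFS (for example 0.92 for 8% savings).
func Project(diskSize, ipfsFraction float64) Projection {
	ipfs := diskSize * ipfsFraction
	saved := diskSize - ipfs
	return Projection{IPFSSize: ipfs, Saved: saved, SavedRAID1: 2 * saved}
}
```

Under that reading, `Project(20, 0.92)` gives the worst-case row (18.4, 1.6, 3.2) and `Project(20, 1/5.78)` gives values that round to the best-case row (3.46, 16.54, 33.08).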

Even at 20PB, saving 1.6PB amounts to significant real-world financial savings, which matters a great deal when you're operating at that scale of storage. Massive email stores and archives aren't just taking 20PB and using a bunch of cheap Western Digital disks without any redundancy. They're using enterprise-grade hard drives, which are expensive in and of themselves, and they're also using things like RAID, RAID-Z, etc., which amplifies the space savings even more.

cli usage

fake email generation

$> eml-util generate-fake-emails --paragraph.count 100 --email.count 5000 --emoji.count 100 --outdir=samples/generated
$> eml-util gen-fake-emails --paragraph.count 100 --email.count 5000 --emoji.count 100 --outdir=samples/generated
$> eml-util gfe --paragraph.count 100 --email.count 5000 --emoji.count 100 --outdir=samples/generated

converts emails to ipld eml objects

$> eml-util --email.dir=samples/generated/10k convert
$> eml-util --email.dir=samples/generated/10k con
$> eml-util --email.dir=samples/generated/10k c

Benchmarking

$> eml-util benchmark
$> eml-util bench
$> eml-util b
