A distributed data processing framework in pure Haskell. Inspired by Apache Spark.
An example is worth a thousand words.
This package provides a `Dataset` type which lets you express and execute transformations on a distributed multiset. Its API is heavily inspired by Apache Spark.
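As a rough sketch of what a `Dataset` pipeline looks like (the combinator names `dFilter`, `dMap`, and `dToList` are my reading of `Control.Distributed.Dataset` — check the Haddocks for the exact names and signatures):

```haskell
{-# LANGUAGE StaticPointers #-}

import Control.Distributed.Dataset
import Data.Function ((&))

-- Keep the even numbers and double them, then pull the result
-- back to the driver. 'static' builds Closures that can be
-- serialised and shipped to remote executors.
pipeline :: Dataset Int -> DD [Int]
pipeline ds =
  ds
    & dFilter (static (even :: Int -> Bool))
    & dMap    (static ((* 2) :: Int -> Int))
    & dToList
```

Because every transformation takes a `Closure`, only statically known functions can run on executors; this is what makes the pipeline serialisable without a full code-shipping runtime.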
It uses pluggable `Backend`s for spawning executors and `ShuffleStore`s for exchanging information. See 'distributed-dataset-aws' for an implementation using AWS Lambda and S3.
It also exposes a more primitive `Control.Distributed.Fork` module which lets you run `IO` actions remotely. It is especially useful when your task is embarrassingly parallel.
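A minimal sketch of the fork API (assuming `fork`, `await`, and an `initDistributedFork` entry point, and a `localProcessBackend` for local testing — the exact module and function names may differ; see the Haddocks):

```haskell
{-# LANGUAGE StaticPointers #-}

import Control.Distributed.Fork
import Control.Distributed.Fork.LocalProcessBackend (localProcessBackend)

main :: IO ()
main = do
  -- Must run first: the same binary is re-executed on the
  -- executors, and this call takes over when it is.
  initDistributedFork
  -- Run a serialisable IO action on a backend; here a local
  -- process backend stands in for a real remote one.
  handle <- fork localProcessBackend
                 (static Dict)
                 (static (pure (21 * 2 :: Int)))
  result <- await handle
  print result
```

The `static Dict` argument is how the library obtains a serialisation dictionary for the result type at the remote side.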
This package provides a backend for 'distributed-dataset' using AWS services. Currently it supports running functions on AWS Lambda and using an S3 bucket as a shuffle store.
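A hedged sketch of wiring the AWS backend in (I believe the package exposes `withLambdaBackend` and `lambdaBackendOptions`; the `s3ShuffleStore` name and its parameters are assumptions — consult the package's documentation):

```haskell
import Control.Distributed.Dataset
import Control.Distributed.Dataset.AWS

main :: IO ()
main =
  -- 'withLambdaBackend' packages the current binary up as a
  -- Lambda function, using the given bucket for the artifact.
  withLambdaBackend (lambdaBackendOptions "my-s3-bucket") $ \backend ->
    -- An S3 prefix serves as the shuffle store between stages.
    runDD backend (s3ShuffleStore "my-s3-bucket" "shuffle/") app

app :: DD ()
app = return ()  -- your Dataset pipeline goes here
```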
Provides `Dataset`s reading from public open datasets. Currently it can fetch GitHub event data from GH Archive.
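A small sketch of fetching GH Archive data (assuming a `ghArchive` function taking a date range and a `GHEvent` type, per my reading of the opendatasets package — the module path is an assumption):

```haskell
import Control.Distributed.Dataset
import Control.Distributed.Dataset.OpenDatasets.GHArchive
import Data.Time.Calendar (fromGregorian)

-- A distributed Dataset of every public GitHub event over two
-- days; the executors download and decode the archives, so the
-- driver never touches the raw data.
events :: Dataset GHEvent
events = ghArchive (fromGregorian 2019 1 1, fromGregorian 2019 1 2)
```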
- Clone the repository:
  $ git clone https://github.com/utdemir/distributed-dataset
  $ cd distributed-dataset
- Make sure that you have AWS credentials set up. The easiest way is to install the AWS command line interface and run:
  $ aws configure
- Create an S3 bucket to put the deployment artifact in. You can use the console or the CLI:
  $ aws s3api create-bucket --bucket my-s3-bucket
- Build and run the example:
  - If you use Nix on Linux:
    - You can use my binary cache on Cachix so that you don't recompile half of Hackage.
    - Then:
      $ $(nix-build -A example-gh)/bin/example-gh my-s3-bucket
  - If you use stack (requires Docker; works on Linux and macOS):
    $ stack run --docker-mount $HOME/.aws/ --docker-env HOME=$HOME example-gh my-s3-bucket
Experimental. Expect lots of missing features, bugs, instability, and API changes. You will probably need to modify the source if you want to do anything serious. See the issues.
I am open to contributions; any issue, PR or opinion is more than welcome.
- You can use Nix, `cabal-install`, or `stack`.
If you use Nix:
- You can use my binary cache on Cachix so that you don't recompile half of Hackage.
- `nix-shell` gives you a development shell with the required Haskell dependencies, along with `cabal-install`, `ghcid`, and `stylish-haskell`. Example:
$ nix-shell --pure --run 'ghcid -c "cabal new-repl distributed-dataset-opendatasets"'
- Use `stylish-haskell` and `hlint`:
$ nix-shell --run 'find -name "*.hs" -exec stylish-haskell -i {} \;'
$ nix-shell --run 'hlint .'
- You can generate the Haddocks using:
$ nix-build -A docs
- Towards Haskell in the Cloud by Jeff Epstein, Andrew P. Black, Simon Peyton Jones
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.
- Apache Spark.
- Sparkle: Run Haskell on top of Apache Spark.
- HSpark: Another attempt at porting Apache Spark to Haskell.