A distributed data processing framework in pure Haskell. Inspired by Apache Spark.
An example is worth a thousand words.
This package provides a `Dataset` type which lets you express and execute transformations on a distributed multiset. Its API is heavily inspired by Apache Spark.
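As a rough sketch of what a `Dataset` pipeline looks like (the combinator names `dFilter`, `dMap`, and `dToList` are my reading of `Control.Distributed.Dataset` — check the Haddocks for the exact names and signatures):

```haskell
{-# LANGUAGE StaticPointers #-}

import Control.Distributed.Dataset
import Data.Function ((&))

-- Keep the even numbers and double them, then pull the result
-- back to the driver. 'static' builds Closures that can be
-- serialised and shipped to remote executors.
pipeline :: Dataset Int -> DD [Int]
pipeline ds =
  ds
    & dFilter (static (even :: Int -> Bool))
    & dMap    (static ((* 2) :: Int -> Int))
    & dToList
```

Because every transformation takes a `Closure`, only statically known functions can run on executors; this is what makes the pipeline serialisable without a full code-shipping runtime.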
It uses pluggable `Backend`s for spawning executors and `ShuffleStore`s for exchanging information. See 'distributed-dataset-aws' for an implementation using AWS Lambda and S3.
It also exposes a more primitive `Control.Distributed.Fork` module which lets you run `IO` actions remotely. It is especially useful when your task is embarrassingly parallel.
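A minimal sketch of the fork API (assuming `fork`, `await`, and an `initDistributedFork` entry point, and a `localProcessBackend` for local testing — the exact module and function names may differ; see the Haddocks):

```haskell
{-# LANGUAGE StaticPointers #-}

import Control.Distributed.Fork
import Control.Distributed.Fork.LocalProcessBackend (localProcessBackend)

main :: IO ()
main = do
  -- Must run first: the same binary is re-executed on the
  -- executors, and this call takes over when it is.
  initDistributedFork
  -- Run a serialisable IO action on a backend; here a local
  -- process backend stands in for a real remote one.
  handle <- fork localProcessBackend
                 (static Dict)
                 (static (pure (21 * 2 :: Int)))
  result <- await handle
  print result
```

The `static Dict` argument is how the library obtains a serialisation dictionary for the result type at the remote side.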
This package provides a backend for 'distributed-dataset' using AWS services. Currently it supports running functions on AWS Lambda and using an S3 bucket as a shuffle store.
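A hedged sketch of wiring the AWS backend in (I believe the package exposes `withLambdaBackend` and `lambdaBackendOptions`; the `s3ShuffleStore` name and its parameters are assumptions — consult the package's documentation):

```haskell
import Control.Distributed.Dataset
import Control.Distributed.Dataset.AWS

main :: IO ()
main =
  -- 'withLambdaBackend' packages the current binary up as a
  -- Lambda function, using the given bucket for the artifact.
  withLambdaBackend (lambdaBackendOptions "my-s3-bucket") $ \backend ->
    -- An S3 prefix serves as the shuffle store between stages.
    runDD backend (s3ShuffleStore "my-s3-bucket" "shuffle/") app

app :: DD ()
app = return ()  -- your Dataset pipeline goes here
```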
Provides `Dataset`s reading from public open datasets. Currently it can fetch GitHub event data from GH Archive.
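A small sketch of fetching GH Archive data (assuming a `ghArchive` function taking a date range and a `GHEvent` type, per my reading of the opendatasets package — the module path is an assumption):

```haskell
import Control.Distributed.Dataset
import Control.Distributed.Dataset.OpenDatasets.GHArchive
import Data.Time.Calendar (fromGregorian)

-- A distributed Dataset of every public GitHub event over two
-- days; the executors download and decode the archives, so the
-- driver never touches the raw data.
events :: Dataset GHEvent
events = ghArchive (fromGregorian 2019 1 1, fromGregorian 2019 1 2)
```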
- Clone the repository:
  $ git clone https://github.com/utdemir/distributed-dataset
  $ cd distributed-dataset
- Make sure that you have AWS credentials set up. The easiest way is to install the AWS command line interface and run:
  $ aws configure
- Create an S3 bucket to put the deployment artifact in. You can use the console or the CLI:
  $ aws s3api create-bucket --bucket my-s3-bucket
- Build and run the example:
  - If you use Nix on Linux:
    - You can use my binary cache on Cachix so that you don't recompile half of Hackage.
    - Then:
      $ $(nix-build -A example-gh)/bin/example-gh my-s3-bucket
  - If you use stack (requires Docker; works on Linux and macOS):
    $ stack run --docker-mount $HOME/.aws/ --docker-env HOME=$HOME example-gh my-s3-bucket
Experimental. Expect lots of missing features, bugs, instability, and API changes. You will probably need to modify the source if you want to do anything serious. See the issues.
I am open to contributions; any issue, PR or opinion is more than welcome.
- You can use Nix, `cabal-install`, or `stack`.
If you use Nix:
- You can use my binary cache on Cachix so that you don't recompile half of Hackage.
- `nix-shell` gives you a development shell with the required Haskell dependencies, along with `cabal-install`, `ghcid`, and `stylish-haskell`. Example:
$ nix-shell --pure --run 'ghcid -c "cabal new-repl distributed-dataset-opendatasets"'
- Use `stylish-haskell` and `hlint`:
$ nix-shell --run 'find -name "*.hs" -exec stylish-haskell -i {} \;'
$ nix-shell --run 'hlint .'
- You can generate the Haddocks using:
$ nix-build -A docs
- Towards Haskell in the Cloud by Jeff Epstein, Andrew P. Black, Simon Peyton Jones
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.
- Apache Spark.
- Sparkle: Run Haskell on top of Apache Spark.
- HSpark: Another attempt at porting Apache Spark to Haskell.