This project forked from utdemir/distributed-dataset

A distributed data processing framework in pure Haskell.

License: BSD 3-Clause "New" or "Revised" License

Nix 3.25% Haskell 96.54% Shell 0.22%

distributed-dataset

A distributed data processing framework in pure Haskell. Inspired by Apache Spark.

An example is worth a thousand words.

Packages

distributed-dataset

This package provides a Dataset type which lets you express and execute transformations on a distributed multiset. Its API is highly inspired by Apache Spark.
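To make the "distributed multiset" idea concrete, here is a minimal local sketch. It does not use this package's API: `Partitioned`, `pMap`, `pFilter`, `pCountBy` and the sample data are all hypothetical names invented for illustration. It only models the general shape of a Spark-style pipeline, assuming rows are split into partitions, per-row transformations run on each partition independently, and grouped aggregations merge partial results.

```haskell
module Main where

import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Hypothetical local model, NOT the library's Dataset type:
-- a list of partitions, each partition being a list of rows.
-- In a distributed run each partition would live on its own executor.
newtype Partitioned a = Partitioned [[a]]

-- Per-row transformations touch each partition independently,
-- so no data has to move between partitions.
pMap :: (a -> b) -> Partitioned a -> Partitioned b
pMap f (Partitioned ps) = Partitioned (map (map f) ps)

pFilter :: (a -> Bool) -> Partitioned a -> Partitioned a
pFilter p (Partitioned ps) = Partitioned (map (filter p) ps)

-- A grouped count needs rows with equal keys to meet: count inside each
-- partition first, then merge the partial counts (a stand-in for a shuffle).
pCountBy :: Ord k => (a -> k) -> Partitioned a -> Map.Map k Int
pCountBy key (Partitioned ps) = Map.unionsWith (+) (map countOne ps)
  where
    countOne = foldl' (\m x -> Map.insertWith (+) (key x) 1 m) Map.empty

-- Made-up (author, commit message) rows spread over two partitions.
sample :: Partitioned (String, String)
sample =
  Partitioned
    [ [("alice", "fix cabal file"), ("bob", "update readme")]
    , [("alice", "bump cabal bounds"), ("carol", "add tests")]
    ]

mentionsCabal :: String -> Bool
mentionsCabal msg = "cabal" `elem` words msg

-- Number of commits mentioning "cabal", per author.
cabalAuthors :: Map.Map String Int
cabalAuthors = pCountBy id (pMap fst (pFilter (mentionsCabal . snd) sample))

main :: IO ()
main = print (Map.toList cabalAuthors)
```

In this model the filter and the map never look outside a single partition, which is what makes them cheap to distribute; only the final count has to combine results across partitions.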

It uses pluggable Backends for spawning executors and ShuffleStores for exchanging information. See 'distributed-dataset-aws' for an implementation using AWS Lambda and S3.

It also exposes a more primitive Control.Distributed.Fork module which lets you run IO actions remotely. It is especially useful when your task is embarrassingly parallel.
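As a rough illustration of what "embarrassingly parallel" means here, the sketch below runs a list of independent IO actions concurrently and collects their results in input order. It is a local stand-in only: it uses threads from `base` rather than remote executors, and `runAll`, `spawn`, `collect` and `sumTo` are hypothetical names, not part of Control.Distributed.Fork.

```haskell
module Main where

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, evaluate, throwIO, try)

-- Start one action on its own thread; its result (or exception)
-- is delivered through the returned MVar.
spawn :: IO a -> IO (MVar (Either SomeException a))
spawn act = do
  var <- newEmptyMVar
  _ <- forkIO (try act >>= putMVar var)
  pure var

-- Wait for one result, rethrowing the worker's exception if it failed.
collect :: MVar (Either SomeException a) -> IO a
collect var = takeMVar var >>= either throwIO pure

-- Hypothetical helper: the actions share no state, so they can all be
-- started up front and awaited afterwards, preserving input order.
runAll :: [IO a] -> IO [a]
runAll actions = mapM spawn actions >>= mapM collect

-- A made-up unit of independent work.
sumTo :: Int -> Int
sumTo n = sum [1 .. n]

main :: IO ()
main = do
  -- 'evaluate' forces each sum inside its worker thread rather than lazily
  -- in the main thread when the result is printed.
  results <- runAll (map (evaluate . sumTo) [10, 100, 1000])
  print results
```

The point of the shape is that no task depends on another, so the only coordination needed is collecting results at the end. A remote version additionally has to ship each action to its executor, which this sketch does not attempt.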

distributed-dataset-aws

This package provides a backend for 'distributed-dataset' using AWS services. Currently it supports running functions on AWS Lambda and using an S3 bucket as a shuffle store.

distributed-dataset-opendatasets

Provides Datasets that read from public open datasets. Currently it can fetch GitHub event data from GH Archive.

Running the example

  • Clone the repository.

    $ git clone https://github.com/utdemir/distributed-dataset
    $ cd distributed-dataset
  • Make sure that you have AWS credentials set up. The easiest way is to install the AWS command-line interface and run:

    $ aws configure
  • Create an S3 bucket to put the deployment artifact in. You can use the console or the CLI:

    $ aws s3api create-bucket --bucket my-s3-bucket
  • Build and run the example:

    • If you use Nix on Linux:

      • You can use my binary cache on cachix so that you don't recompile half of Hackage.

      • Then:

        $ $(nix-build -A example-gh)/bin/example-gh my-s3-bucket
    • If you use stack (requires Docker, works on Linux and macOS):

      $ stack run --docker-mount $HOME/.aws/ --docker-env HOME=$HOME example-gh my-s3-bucket

Stability

Experimental. Expect lots of missing features, bugs, instability and API changes. You will probably need to modify the source if you want to do anything serious. See issues.

Contributing

I am open to contributions; any issue, PR or opinion is more than welcome.

Hacking

  • You can use Nix, cabal-install or stack.

If you use Nix:

  • You can use my binary cache on cachix so that you don't recompile half of Hackage.
  • 'nix-shell' gives you a development shell with the required Haskell dependencies along with cabal-install, ghcid and stylish-haskell. Example:

    $ nix-shell --pure --run 'ghcid -c "cabal new-repl distributed-dataset-opendatasets"'
  • Use stylish-haskell and hlint:

    $ nix-shell --run 'find -name "*.hs" -exec stylish-haskell -i {} \;'
    $ nix-shell --run 'hlint .'
  • You can generate the Haddocks using:

    $ nix-build -A docs
