haptork / easylambda

distributed dataflows with functional list operations for data processing with C++14

Home Page: https://haptork.github.io/easyLambda/

License: Boost Software License 1.0

Makefile 0.51% C++ 98.66% CMake 0.83%
dataflow-programming mpi parallel functional-programming distributed-computing hpc cpp14

easylambda's Introduction

ezl: easyLambda

Parallel data processing made easy using functional and dataflow programming with modern C++

Welcome to easyLambda and thanks for your interest. The site aims to be a comprehensive guide for easyLambda.

What is easyLambda

EasyLambda is a header-only C++14 library for parallel data processing with functional list operations (map, filter, reduce, scan, zip) that are tied together in a type-safe dataflow.

EasyLambda is parallel. It scales from multiple cores to hundreds of distributed nodes without any need to deal with parallelism in user code.

EasyLambda is fast. It has minimal overhead in serial execution and builds on high-performance MPI parallelism, which is reported to be more efficient than comparable alternatives [1].

EasyLambda is expressive and succinct, thanks to column selection for composing functions and to many generic algorithms such as a configurable parallel file reader, predicates, correlation, summary, etc.

EasyLambda is intuitive and easy to understand with its uniform property based (or ExpressionBuilder) interface for everything from configuring parallelism to changing behavior of generic algorithms to routing dataflow.

EasyLambda interoperates easily with other libraries, such as the standard library, and with raw MPI code, since it uses standard data types and imposes no special structure, data types or requirements on user functions.

Why easyLambda

EasyLambda is a good fit for the following tasks:

  • table/list processing and analysis from CSV or flat text files.
  • post-processing of scientific simulation results.
  • running iterative machine learning algorithms.
  • parallel type-safe data reading.
  • experimenting with dataflow programming and functional list operations.

Since it interoperates smoothly with other libraries, easyLambda can add distributed parallelism to existing libraries or codebases wherever its programming abstraction fits well. For example, it can be used alongside bare MPI code, or with a machine learning library to add distributed training and testing.

EasyLambda will also interest you if you

  • are a modern C++ enthusiast
  • want to dabble with metaprogramming
  • like functional and dataflow programming
  • have cluster resources that you want to put to use in everyday tasks without much effort.
  • have always wanted a high-level MPI interface.

Benchmarks

EasyLambda combines the efficiency of MPI with a high-level programming abstraction. With easyLambda you get easy-to-understand code with good run-time performance. Check out the benchmarks and comparisons for performance and ease of use.

Getting Started

Check out the Getting Started section of the library webpage to learn how to install and begin with easyLambda. The library can also be used on AWS elastic cloud or a single instance.

Examples

A detailed walkthrough of the library is given here. The examples directory contains various examples and demonstrations with explanations of features and options.

Here we mention some examples in short.

The following program calculates the frequency of each word in the data files.

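// Each row read from the file(s) holds one word in a single string column.
auto reader = ezl::fromFile<string>(argv[1]).rowSeparator('s').colSeparator("");
ezl::rise(reader)
  .reduce<1>(ezl::count(), 0).dump()  // group by column 1 (the word), count from 0
  .run();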

The dataflow pipeline starts with rise, and subsequent operations are added to it. In the above example, the pipeline begins by reading data from the specified file(s). fromFile is a library function that takes the column types as template parameters and a glob pattern for the input file(s), and reads the file(s) in parallel. It has many properties for controlling data format, parallelism, denormalization, etc. (shown in demoFromFile).

In reduce, we pass the index of the key column to group by, the library function for counting, and the initial value of the result.
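
To picture what grouping by a key column and counting amounts to, here is a serial sketch that uses only the standard library. It is not how easyLambda implements reduce, and countWords is a hypothetical helper introduced just for this illustration; it assumes that words are separated by whitespace.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical serial stand-in for the word-count dataflow:
// split the text on whitespace, group by word, and count each group.
std::map<std::string, int> countWords(const std::string& text) {
  std::map<std::string, int> counts;  // key (word) -> count, implicitly starting at 0
  std::istringstream in(text);
  std::string word;
  while (in >> word) {  // operator>> skips spaces, tabs and newlines
    ++counts[word];
  }
  return counts;
}
```

Calling countWords on the contents of a file gives one entry per distinct word. The dataflow version expresses the same grouping declaratively and leaves the distribution of rows over processes to the library.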

The following dataflow estimates pi using the Monte Carlo method.

ezl::rise(ezl::kick(10000)) // 10000 trials shared over all processes
  .map([] { 
    return pow(rnd(), 2) + pow(rnd(), 2);
  })
  .filter(ezl::lt(1.))
  .reduce(ezl::count(), 0)
  .map([](int inCircleCount) { 
    return (4.0 * inCircleCount / 10000); 
  }).dump()
  .run();

The dataflow starts with rise, to which we pass a library function that calls the next unit a given number of times. The steps of the algorithm are expressed as a composition of small operations: some are common library functions like count() and lt() (less-than), and some are user-defined functions specific to the problem.
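
For comparison, the same steps can be written as a plain serial loop. This is only a sketch under assumptions: estimatePi is a hypothetical name, and std::mt19937 with a uniform distribution on [0, 1) stands in for the user-defined rnd() of the dataflow, whose definition is not shown here.

```cpp
#include <cassert>
#include <random>

// Hypothetical serial version of the Monte Carlo steps:
// generate a point, map it to its squared distance, filter, count, scale.
double estimatePi(int trials, unsigned seed) {
  assert(trials > 0);
  std::mt19937 gen(seed);
  std::uniform_real_distribution<double> rnd(0.0, 1.0);
  int inCircleCount = 0;
  for (int i = 0; i < trials; ++i) {
    const double x = rnd(gen);
    const double y = rnd(gen);
    const double distSq = x * x + y * y;  // map: squared distance from the origin
    if (distSq < 1.0) {                   // filter: keep points inside the unit circle
      ++inCircleCount;                    // reduce: count the surviving rows
    }
  }
  return 4.0 * inCircleCount / trials;    // final map: scale the ratio to estimate pi
}
```

With 10000 trials the estimate typically lands within a few hundredths of pi. In the dataflow version each stage is a separate unit, which is what lets the library share the trials over all processes.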

Here is another example from cods2016. A stripped version of the input data-file is given with ezl here. The data contains student profiles with scores, gender, job-salary, city etc.

auto scores = ezl::fromFile<char, array<float, 3>>(fileName)
                .cols({"Gender", "English", "Logical", "Domain"})
                .colSeparator("\t");

ezl::rise(scores)
  .filter<2>(ezl::gtAr<3>(0.F))   // filter valid domain scores > 0
  .map<1>([] (char gender) {      // transforming with 0/1 for isMale
    return float(gender == 'm');
  }).colsTransform()
  .reduceAll(ezl::corr<1>())
    .dump("", "Corr. of gender with scores\n(gender|E|L|D)")
  .run();

The above example prints the correlation of English, logical and domain scores with gender. The code resembles the steps of a spreadsheet analysis or an SQL query. We select the columns to work with, viz. gender and the three scores. We filter the rows based on a column and a predicate. Next, we transform a selected column in place, and then compute an aggregate property (correlation) over all the rows.
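
As a rough illustration of the aggregate step, the sketch below computes a correlation between two columns in one pass over the rows. It assumes that the reported value is the Pearson coefficient, which is not confirmed here for ezl::corr; pearson is a hypothetical helper and is not part of the library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: Pearson correlation of two equally long columns,
// accumulated from running sums in a single pass.
// The result is undefined (division by zero) if either column is constant.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
  assert(x.size() == y.size() && x.size() > 1);
  const double n = static_cast<double>(x.size());
  double sumX = 0, sumY = 0, sumXX = 0, sumYY = 0, sumXY = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sumX += x[i];
    sumY += y[i];
    sumXX += x[i] * x[i];
    sumYY += y[i] * y[i];
    sumXY += x[i] * y[i];
  }
  const double cov = sumXY - sumX * sumY / n;   // n times the covariance
  const double varX = sumXX - sumX * sumX / n;  // n times the variance of x
  const double varY = sumYY - sumY * sumY / n;  // n times the variance of y
  return cov / std::sqrt(varX * varY);
}
```

Here x would hold the 0/1 isMale values and y one of the score columns. Because the result depends only on running sums, an aggregate of this kind suits a reduce over all rows.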


Contributing

Suggestions and feedback are welcome. Feel free to get in touch by mail or through issues with any questions.

Some of the possible directions of improvement:

  • compile time optimization
  • use of specialized data structures in various units like reduce etc.
  • addition of more examples e.g. neural nets, simulations etc.
  • design simplifications
  • parallelism optimization
  • code reviews
  • documentation

Possible ideas for future extensions:

  • fault tolerance
  • algorithms / functions to plot streaming and buffered data
  • domain specific algorithms
  • MPI single-sided communications
  • experiments to extend the current programming abstraction to cover more problems, such as domain decomposition

Check internals and blog for design and implementation details.

Acknowledgments

A big thanks to cppcon, meetingc++ and other conferences, and to all the C++ expert speakers, committee members and compiler implementers, for modernising C++ and teaching it with so much enthusiasm. I had fun implementing this and hope you will have fun using it. Looking forward to learning more from the community.

I wish to thank eicossa and Nitesh for their (less online, more offline :P) contributions.

easylambda's People

Contributors

eicossa, haptork, utkarshj1303


easylambda's Issues

Plots and animations

Addition of UDFs in algorithms that plot and animate rows as they stream in a map unit, possibly using Cinder.

Since there is good support for loading data and column selection, this can be quite useful and fits naturally.

However, it goes against the fact that plotting and analysis are two separate things, so I don't know how good of an idea this is.

Don't std::move local return values

It's a bad idea to apply std::move when returning a local object. Scott Meyers says: "Never apply std::move or std::forward to local objects if they would otherwise be eligible for the return value optimization" (Effective Modern C++).

The idea is that the compiler can optimize away the copy if you return by value, and if it can't, it must move. If you apply std::move, it has to move either way, so you're preventing RVO from happening.

I saw this in a few places around the code: https://github.com/haptork/easyLambda/search?q="return+std%3A%3Amove"&type=Code
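
To make the difference concrete, here is a small self-contained sketch that counts move constructions. It is not taken from the easyLambda code; Tracker, makePlain, makeWithMove and countMoves are hypothetical names used only for this illustration.

```cpp
#include <utility>

// Hypothetical type that records how often it is copied or moved.
struct Tracker {
  static int moves;
  static int copies;
  Tracker() = default;
  Tracker(const Tracker&) { ++copies; }
  Tracker(Tracker&&) noexcept { ++moves; }
};
int Tracker::moves = 0;
int Tracker::copies = 0;

Tracker makePlain() {
  Tracker t;
  return t;  // eligible for copy elision (NRVO); if not elided, moved implicitly
}

Tracker makeWithMove() {
  Tracker t;
  return std::move(t);  // not the name of a local, so elision is ruled out
}

// Resets the counter, builds one object with the given factory,
// and returns how many move constructions that took.
int countMoves(Tracker (*make)()) {
  Tracker::moves = 0;
  Tracker result = make();
  (void)result;
  return Tracker::moves;
}
```

countMoves(makeWithMove) reports at least one move on any conforming compiler, while countMoves(makePlain) never reports more than that and is commonly zero when NRVO applies. GCC and Clang can flag the second form with -Wpessimizing-move.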

Running on Windows/MSVC 2015

Hi - Thank you for the excellent library. Looks very interesting. Do you have any plans of adding instructions for pre-requisites and compilation instructions on Windows (e.g., using MSVC 2015)?
