ucbrise / jarvis

Build, configure, and track workflows with Jarvis.

Home Page: https://ucbrise.github.io/jarvis

License: Apache License 2.0

Languages: Python 100.00%
Topics: jarvis, workflow, replication, reproduction, data-scientists

jarvis's Introduction

Jarvis (has been renamed to Flor; please use that repo instead)

What is Jarvis?

Jarvis is a system with a declarative DSL embedded in Python for managing the workflow development phase of the machine learning lifecycle. Jarvis enables data scientists to describe ML workflows as directed acyclic graphs (DAGs) of Actions and Artifacts, and to experiment with different configurations by automatically running the workflow many times, varying the configuration each time. To date, Jarvis serves both as a build system for producing a desired artifact and as a versioning system that tracks the evolution of artifacts across multiple runs in support of reproducibility.

How do I run it?

Clone or download this repository.

You'll need Anaconda, preferably version 4.4 or later.

Please read this guide to set up a Python 3.6 environment inside Anaconda. Whenever you work with Jarvis, make sure the Python 3.6 environment is active.

Once the Python 3.6 environment in Anaconda is active, please run the following command (use the requirements.txt file in this repo):

pip install -r requirements.txt

Next, we will install Ray, a Jarvis dependency:

brew update
brew install cmake pkg-config automake autoconf libtool boost wget

pip install numpy funcsigs click colorama psutil redis flatbuffers cython --ignore-installed six
conda install libgcc

pip install git+https://github.com/ray-project/ray.git#subdirectory=python

Next, add the directory containing this jarvis package (repo) to your PYTHONPATH.
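
To confirm that the PYTHONPATH change took effect, a quick check from Python can help. This is a minimal sketch, not part of Jarvis: the helper name is_importable and its extra_dir parameter are hypothetical, and it only relies on the standard library.

```python
import importlib.util
import sys


def is_importable(package_name, extra_dir=None):
    """Return True if `package_name` can be found on the import path.

    `extra_dir` is an optional directory to prepend to sys.path first
    (a hypothetical convenience; it mutates sys.path for this process only).
    """
    if extra_dir is not None and extra_dir not in sys.path:
        sys.path.insert(0, extra_dir)
    return importlib.util.find_spec(package_name) is not None


# Prints True once the directory containing the jarvis package is on PYTHONPATH.
print(is_importable("jarvis"))
```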

For examples on how to write your own jarvis workflow, please have a look at:

examples/twitter.py -- classic example
examples/plate.py -- multi-trial example

Make sure you:

  1. Import jarvis
  2. Initialize a jarvis.Experiment
  3. Set the experiment's groundClient to 'ground'.

Once you build the workflow, call pull() on the artifact you want to produce. You can find the result in ~/jarvis.d/.

If you pass a non-empty dict to pull (see lifted_twitter.py), the call returns a pandas DataFrame whose columns are the literals and requested artifacts, and whose rows are the individual trials.

Note on data

The dataset used in some of our examples has migrated.

Example program

Contents of the examples/plate.py file:

import jarvis

ex = jarvis.Experiment('plate_demo')

ex.groundClient('ground')

ones = ex.literal([1, 2, 3], "ones")
ones.forEach()

tens = ex.literal([10, 100], "tens")
tens.forEach()

@jarvis.func
def multiply(x, y):
    z = x*y
    print(z)
    return z

doMultiply = ex.action(multiply, [ones, tens])
product = ex.artifact('product.txt', doMultiply)

product.pull()
product.plot()

Running it produces:

10
20
30
100
200
300
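
The six values correspond to every combination of the two literals. The following plain-Python sketch reproduces that enumeration without Jarvis. It assumes that forEach() on both literals yields one trial per (x, y) pair, and it fixes the loop order so that tens varies slowest to match the output above; the order Jarvis uses when trials run in parallel may differ.

```python
from itertools import product

ones = [1, 2, 3]
tens = [10, 100]


def multiply(x, y):
    return x * y


# One trial per (x, y) combination. `tens` is the outer loop here so the
# order matches the printed output in the README.
trials = [
    {"ones": x, "tens": y, "product": multiply(x, y)}
    for y, x in product(tens, ones)
]

for trial in trials:
    print(trial["product"])
```

Each dict in trials plays the role of one row: the literal values used and the artifact they produced.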

Motivation

Jarvis should facilitate the development of auditable, reproducible, justifiable, and reusable data science workflows. Is the data scientist building the right thing? We want to encourage discipline and best practices in ML workflow development by making dependencies explicit, while improving the productivity of adopters by automating multiple runs of the workflow under different configurations.

Features

  • Simple and Expressive Object Model: The Jarvis object model consists only of Actions, Artifacts, and Literals. These are connected to form dataflow graphs.
  • Data-Centric Workflows: Machine learning applications have data dependencies that obscure traditional abstraction boundaries. So, the data "gets everywhere": in the models, and the applications that consume them. It makes sense to think about the data carefully and specifically. In Jarvis, data is a first-class citizen.
  • Artifact Versioning: Jarvis uses git to automatically version every Artifact (data, code, etc.) and Literal that is in a Jarvis workflow.
  • Artifact Contextualization: Jarvis uses Ground to store data about the context of Artifacts: their relationships, their lineage. Ground and git are complementary services used by Jarvis. Together, they enable experiment reproduction and replication.
  • Parallel Multi-Trial Experiments: Jarvis should enable data scientists to try more ideas quickly. For this, we need to enhance speed of execution. We leverage parallel execution systems such as Ray to execute multiple trials in parallel.
  • Visualization and Exploratory Data Analysis: To establish the fitness of data for some particular purpose, or gain valuable insights about properties of the data, Jarvis will leverage visualization techniques in an interactive environment such as Jupyter Notebook. We use visualization for its ability to give immediate feedback and guide the creative process.
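
To make the object model concrete, here is a minimal sketch of how Literals, Actions, and Artifacts could be connected into a dataflow graph and evaluated on demand. The classes below are hypothetical stand-ins written for illustration; they are not Jarvis's implementation and omit versioning, Ground, and parallel execution.

```python
class Literal:
    """A named constant that feeds into the graph."""

    def __init__(self, value, name):
        self.value = value
        self.name = name

    def pull(self):
        return self.value


class Action:
    """A function applied to the values of its input nodes."""

    def __init__(self, func, in_nodes):
        self.func = func
        self.in_nodes = in_nodes

    def run(self):
        # Resolve every input first, then apply the function.
        args = [node.pull() for node in self.in_nodes]
        return self.func(*args)


class Artifact:
    """A named output produced by a parent Action."""

    def __init__(self, name, parent):
        self.name = name
        self.parent = parent

    def pull(self):
        return self.parent.run()


def add(a, b):
    return a + b


two = Literal(2, "two")
three = Literal(3, "three")
total = Artifact("total.txt", Action(add, [two, three]))
print(total.pull())  # 5
```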

License

Jarvis is licensed under the Apache v2 License.

jarvis's People

Contributors

jegonzal, jhellerstein, malharpatel, rlnsanz, sonajeswani

jarvis's Issues

Current directory should be where the JarvisFile is

If we run a JarvisFile from a different directory (e.g. python [absolute_path]), Jarvis versions the wrong directory.

The 'current directory' that Jarvis cares about is the directory where the JarvisFile is located.
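
One possible way to address this is to derive the directory from the script's own path instead of the process's working directory. The helper below is a sketch under that assumption; the name workflow_dir is hypothetical and this is not code from the Jarvis repository.

```python
import os


def workflow_dir(script_path):
    """Return the directory containing the workflow script.

    The result does not depend on where `python` was invoked from,
    as long as `script_path` resolves to the script itself.
    """
    return os.path.dirname(os.path.abspath(script_path))
```

Inside a JarvisFile, calling workflow_dir(__file__) would give the directory to version.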

Jarvis.Aggregate

We have a way to fan out multi-trial experiments, but no way to fan in (such as picking the best model from many trials and then using it later in the same pipeline). Jarvis needs a way to express aggregation to become more expressive.
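
As an illustration of the fan-in being requested, the sketch below reduces a list of trial results to a single one. Everything here is hypothetical: the aggregate function, the trial dictionaries, and the accuracy numbers are made up for the example and do not come from Jarvis.

```python
def aggregate(trials, key, pick=max):
    """Reduce many trial results to one, e.g. the trial with the best score."""
    return pick(trials, key=key)


# Made-up trial results, one dict per trial.
trials = [
    {"learning_rate": 0.1, "accuracy": 0.81},
    {"learning_rate": 0.01, "accuracy": 0.88},
    {"learning_rate": 0.001, "accuracy": 0.84},
]

best = aggregate(trials, key=lambda trial: trial["accuracy"])
print(best["learning_rate"])  # 0.01
```

The selected trial could then feed the next step of the same pipeline.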

Performance optimization (Make)

Don't re-run everything on execution of the workflow. Only run the sub-graph that could change given the modification of an artifact by the user.
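
A sketch of the idea: given the workflow graph and the node the user modified, collect only the nodes reachable downstream of it, since those are the ones that could change. The graph encoding (a dict from node name to its consumers) and the node names are assumptions made for this example, not Jarvis's internal representation.

```python
from collections import deque


def downstream(edges, changed):
    """Return the set of nodes that may need to re-run after `changed` is modified.

    `edges` maps each node name to the list of nodes that consume it.
    """
    stale = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        if node in stale:
            continue
        stale.add(node)
        queue.extend(edges.get(node, []))
    return stale


# Hypothetical workflow: raw.csv -> clean -> clean.csv -> train -> model.pkl,
# with params also feeding train.
edges = {
    "raw.csv": ["clean"],
    "clean": ["clean.csv"],
    "clean.csv": ["train"],
    "params": ["train"],
    "train": ["model.pkl"],
}

print(sorted(downstream(edges, "params")))  # ['model.pkl', 'params', 'train']
```

Changing params leaves the cleaning step untouched, so only train and model.pkl would be rebuilt.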

jarvis.pull() for multiple out_artifacts

My workflow has two out_artifacts at the end of the DAG, and I have to run pull() on both of them, which leads to repeated computation. I currently work around it by creating a "phony" final_artifact, but I shouldn't need to do that. A possible fix is for pull to accept an array of out_artifacts and run only the subgraphs where the computation differs.
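
One way to avoid the repeated computation is to evaluate all requested artifacts against a shared cache, so that common upstream nodes run once. The sketch below shows that technique in plain Python; pull_all and the graph encoding are hypothetical and are not the Jarvis API.

```python
def pull_all(targets, graph, cache=None):
    """Evaluate several target nodes, computing each upstream node at most once.

    `graph` maps a node name to a (function, [input node names]) pair.
    """
    cache = {} if cache is None else cache

    def evaluate(name):
        if name not in cache:
            func, inputs = graph[name]
            cache[name] = func(*[evaluate(i) for i in inputs])
        return cache[name]

    return {name: evaluate(name) for name in targets}


calls = []


def load():
    # Record each invocation so we can see how often the shared step runs.
    calls.append("load")
    return [1, 2, 3]


graph = {
    "data": (load, []),
    "total": (sum, ["data"]),
    "largest": (max, ["data"]),
}

results = pull_all(["total", "largest"], graph)
print(results, calls)  # {'total': 6, 'largest': 3} ['load']
```

Both outputs depend on data, yet load runs only once.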

Jarvis Bundle Experiment

Export experiments and trials, push-button execution in any environment.

  • Docker containers?
  • Something like CDE?

Jarvis Serial Pull

parallelPull has strong support and is integrated with Ray. The default pull ceased to be supported, and we need to bring it back. The new, corrected pull will behave like the original as far as execution goes, but it will version artifacts the way parallelPull does. Data context tracking will also be supported, as in parallelPull.

Jarvis Reproduce

Reproduce an experiment or trial by materializing it in a user-specified directory. Executing a Python script then lets the data scientist re-run an experiment, or a single trial within the experiment.
