
iterative / mlem


๐Ÿถ A tool to package, serve, and deploy any ML model on any platform. Archived to be resurrected one day๐Ÿคž

Home Page: https://mlem.ai

License: Apache License 2.0

Python 99.32% Jinja 0.21% HCL 0.18% Jupyter Notebook 0.17% Shell 0.12%
python data-science machine-learning developer-tools model-registry deployment git mlem cli

mlem's Introduction



MLEM helps you package and deploy machine learning models. It saves ML models in a standard format that can be used in a variety of production scenarios such as real-time REST serving or batch processing.

  • Run your ML models anywhere: Wrap models as a Python package or Docker Image, or deploy them to Heroku, SageMaker or Kubernetes (more platforms coming soon). Switch between platforms transparently, with a single command.

  • Model metadata in YAML, automatically: Python requirements and input data specifications are captured automatically in a human-readable, deployment-ready format. The same metafile works with any ML framework.

  • Stick to your training workflow: MLEM doesn't ask you to rewrite model training code. Add just two lines around your Python code: one to import the library and one to save the model.

  • Developer-first experience: Use the CLI when you feel like DevOps, or the API if you feel like a developer.

Why is MLEM special?

The main reason to use MLEM instead of other tools is to adopt a GitOps approach to manage model lifecycles.

  • Git as a single source of truth: MLEM writes model metadata to a plain text file that can be versioned in Git along with code. This enables GitFlow and other software engineering best practices.

  • Unify model and software deployment: Release models using the same processes used for software updates (branching, pull requests, etc.).

  • Reuse existing Git infrastructure: Use familiar hosting like GitHub or GitLab for model management, instead of having separate services.

  • UNIX philosophy: MLEM is a modular tool that solves one problem very well. It integrates well into a larger toolset from Iterative.ai, such as DVC and CML.

Usage

This is a quick walkthrough showcasing the deployment functionality of MLEM.

Please read the Get Started guide for the full version.

Installation

MLEM requires Python 3.

$ python -m pip install mlem

To install the pre-release version:

$ python -m pip install git+https://github.com/iterative/mlem

Saving the model

# train.py
from mlem.api import save
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

def main():
    data, y = load_iris(return_X_y=True, as_frame=True)
    rf = RandomForestClassifier(
        n_jobs=2,
        random_state=42,
    )
    rf.fit(data, y)

    save(
        rf,
        "models/rf",
        sample_data=data,
    )

if __name__ == "__main__":
    main()

Codification

Check out what we have:

$ ls models/
rf
rf.mlem
$ cat models/rf.mlem
artifacts:
  data:
    hash: ea4f1bf769414fdacc2075ef9de73be5
    size: 163651
    uri: rf
model_type:
  methods:
    predict:
      args:
      - name: data
        type_:
          columns:
          - sepal length (cm)
          - sepal width (cm)
          - petal length (cm)
          - petal width (cm)
          dtypes:
          - float64
          - float64
          - float64
          - float64
          index_cols: []
          type: dataframe
      name: predict
      returns:
        dtype: int64
        shape:
        - null
        type: ndarray
    predict_proba:
      args:
      - name: data
        type_:
          columns:
          - sepal length (cm)
          - sepal width (cm)
          - petal length (cm)
          - petal width (cm)
          dtypes:
          - float64
          - float64
          - float64
          - float64
          index_cols: []
          type: dataframe
      name: predict_proba
      returns:
        dtype: float64
        shape:
        - null
        - 3
        type: ndarray
  type: sklearn
object_type: model
requirements:
- module: sklearn
  version: 1.0.2
- module: pandas
  version: 1.4.1
- module: numpy
  version: 1.22.3
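
The `hash` and `size` fields under `artifacts` look like a plain md5 digest and byte count of the saved model file. A minimal stdlib sketch of how such fields could be recomputed to verify an artifact; the helper name `artifact_fingerprint`, and the assumption that the digest is md5, are ours, not a MLEM API:

```python
import hashlib
import os


def artifact_fingerprint(path: str, chunk_size: int = 1 << 20) -> dict:
    """Stream a file through md5 and report its byte size, like the fields above."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return {"hash": digest.hexdigest(), "size": os.path.getsize(path)}
```

If the digest really is md5, running this on models/rf should reproduce the hash and size recorded in rf.mlem.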

Deploying the model

If you want to follow this Quick Start, you'll need to sign up at https://heroku.com, create an API key, and set the HEROKU_API_KEY env var (or run heroku login on the command line). You'll also need to run heroku container:login, which logs you in to the Heroku container registry.

Now we can deploy the model with mlem deploy (you'll need to use a different app_name, since it's going to be published on https://herokuapp.com):

$ mlem deployment run heroku app.mlem \
  --model models/rf \
  --app_name example-mlem-get-started-app
โณ๏ธ Loading model from models/rf.mlem
โณ๏ธ Loading deployment from app.mlem
๐Ÿ›  Creating docker image for heroku
  ๐Ÿ›  Building MLEM wheel file...
  ๐Ÿ’ผ Adding model files...
  ๐Ÿ›  Generating dockerfile...
  ๐Ÿ’ผ Adding sources...
  ๐Ÿ’ผ Generating requirements file...
  ๐Ÿ›  Building docker image registry.heroku.com/example-mlem-get-started-app/web...
  โœ…  Built docker image registry.heroku.com/example-mlem-get-started-app/web
  ๐Ÿ”ผ Pushing image registry.heroku.com/example-mlem-get-started-app/web to registry.heroku.com
  โœ…  Pushed image registry.heroku.com/example-mlem-get-started-app/web to registry.heroku.com
๐Ÿ›  Releasing app example-mlem-get-started-app formation
โœ…  Service example-mlem-get-started-app is up. You can check it out at https://example-mlem-get-started-app.herokuapp.com/
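
Once the service is up, the model's predict method is exposed over REST. A hedged sketch of building a request for it with only the standard library; the /predict route and the `{"data": {"values": [...]}}` payload shape are assumptions based on the DataFrame signature in rf.mlem, so check the interactive docs at https://example-mlem-get-started-app.herokuapp.com/docs for the schema the deployed app actually expects:

```python
import json
from typing import List
from urllib.request import Request

APP_URL = "https://example-mlem-get-started-app.herokuapp.com"


def build_predict_request(rows: List[dict]) -> Request:
    """Wrap iris feature rows into a JSON POST for the (assumed) /predict route."""
    payload = {"data": {"values": rows}}  # payload shape is an assumption
    return Request(
        f"{APP_URL}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


row = {
    "sepal length (cm)": 5.1,
    "sepal width (cm)": 3.5,
    "petal length (cm)": 1.4,
    "petal width (cm)": 0.2,
}
req = build_predict_request([row])
# urllib.request.urlopen(req) would send it once the app is live.
```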

Contributing

Contributions are welcome! Please see our Contributing Guide for more details.

Thanks to all our contributors!

Copyright

This project is distributed under the Apache license version 2.0 (see the LICENSE file in the project root).

By submitting a pull request to this project, you agree to license your contribution under the Apache license version 2.0 to this project.

mlem's People

Contributors

0x2b3bfa0, aguschin, aliahari, aminalaee, casperdcl, daavoo, dacbd, itstargetconfirmed, jorgeorpinel, keithing, madhur-tandon, maheshambule, mike0sv, omesser, renovate[bot], suor, terryyylim, vvssttkk, ykasimov


mlem's Issues

Set up releasing process to other places

Register "mlem" at snap, choco, brew, conda, apt-get, yum and other indexes

We have a release on PyPI. We definitely need conda-forge. DVC has a number of other distribution channels, like .deb, .rpm, brew, etc. Investigate whether we need those and set them up.

Set up repo workflow

Set up linters, black, mypy, etc.
Set up pre-commit hooks.
See what the dvc, cml and viewer repos use.

Design `mlem.api.list()` for common model registry case

Use the Model Registry in Studio as an example of how people would like to see a model registry => this may affect mlem.api.list()

Besides listing all models in a repo, we may want to list all versions of one model.

As a corner case, it can be a monorepo with .mlem located not in the repo root but in a subfolder. We need to add an option to list models in such cases.

Make docs and readme ready for closed alpha release

  • Tutorial (current blog post)
  • Detailed info about CLI/API commands (Get started?) [Decided to skip this for now in favour of having docstrings. Need to do this for Beta.]
  • Readme
  • Example repo (+ README.md)

Review linters exceptions

To move faster while adding flake8 and pylint, we had to add some exceptions, which should be reviewed.

`dvclive` integration?

Sorry if I'm missing some context regarding the scope of mlem; I have read a bit of the existing documentation in Notion and skimmed through the code in this repository. I'm commenting here instead of the dvclive repo because afaik mlem is not currently "public".

So, in dvclive there is an open discussion on how to (or even whether to) add support for saving models:

iterative/dvclive#105

I was just thinking about what other ML loggers offer in that regard and came across MLflow Models (I was actually a user of this feature at my previous company). I think that having a "unified" model metadata format could be a good point to justify adding a dvclive.save_model functionality.

As far as I understand from the components description, mlem follows a similar approach to MLflow Models, so, given that it is in dvclive's plans to work on and extend its integrations with ML frameworks, it seems like this model serialization could be a common point of interest for both projects.

Does this make sense for those working on mlem?

Investigate options to apply MLEM models within Airflow DAGS

As discussed with @mnrozhkov and @tapadipti, it would be great if MLEM could somehow simplify model application within Airflow DAGs. Two questions to start with:

  1. We can create something like MLEMOperator. What would its functionality be? How would it help users?
  2. We need to build either a virtual environment or a Docker image to apply the model in the required environment. Two options for providing those: build them in CI, or as a task in the same DAG. We need to explore these options and find out how MLEM can simplify the work for users here. Note: if you run multiple workers, it may be beneficial to build the env in advance. If you have one worker, you may be OK with building it while running the MLEMOperator.

Other notes:

  1. Sometimes data is huge and you need to process it in chunks (this may or may not be the case with PySpark; without PySpark it can be too hard to fit all the data in RAM). We need some way to resolve this, e.g. iterate over batches and then compile an answer containing predictions from all batches.
  2. Usually, your DAG = processing + scoring. Roughly, in 25% of cases you load data from disk; in another 50% you work with big data (PySpark, Hadoop); in the last 25% you work with distributed computing (Spark).

DAGs example https://gitlab.com/iterative.ai/cse/use_cases/home_credit_default/-/blob/airflow-dags/dags/scoring.py
Showcase of different options: https://gitlab.com/iterative.ai/cse/rnd/deploy-to-airflow

Summary from Mikhail https://iterativeai.slack.com/archives/C0249LW0HAQ/p1631885782026400
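
As a starting point for question 1, a rough sketch of what an MLEMOperator might look like. Everything here is hypothetical: real code would subclass Airflow's BaseOperator and load via mlem.api.load, both of which are stubbed out below so the example stays self-contained and only illustrates the load-once-then-score shape of the task.

```python
from typing import Callable, Iterable, List


class MLEMOperator:
    """Hypothetical operator: apply a saved MLEM model as one scoring task in a DAG."""

    def __init__(self, model_uri: str, load_model: Callable):
        self.model_uri = model_uri
        self.load_model = load_model  # stand-in for mlem.api.load

    def execute(self, rows: Iterable) -> List:
        # Load the model once per task instance, then score every incoming row.
        model = self.load_model(self.model_uri)
        return [model.predict(row) for row in rows]
```

With one worker, the environment could be built while the operator runs, as noted above; with many workers, pre-building it in CI or an upstream task avoids repeating the setup.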

Add proper DVC support

Save/load files with DVC.
Question: should this be part of core, or should we make it a plugin to facilitate the creation of other storage extensions?

Load models from GH

When you do mlem.api.save(model, name), name can point to local fs, GitHub repo or other remote storages like s3, etc.
We need to

  • Create a precise textual description on how we should parse and resolve name
  • Write tests which describe this behaviour
  • Correct the actual code to satisfy tests

Deploying models with seldon-core

@DavidGOrtega suggested on slack:

Hey guys. Reviewing OpenMlOps they are exposing the models via Seldon and Ambassador API Gateway. This is similar to the QuaroML stack that I did and a very convenient and easy way to expose and of course scale the models.
Maybe we can introduce MLEM inside the TPI in conjunction with Consul

We need to get back to this discussion to shape the vision for the MLEM deployment part. It's better to do this as soon as we are finished with the closed alpha release.

Create PR to `pydantic`

Extract mlem.polydantic and suggest adding it to pydantic itself. It's not clear whether they will accept the change. There is also an issue with the OpenAPI schema, which probably cannot be generated for polymorphic objects.

Website

Draft document for website design and start iterating with Yaroslav or Serge

Disambiguate passing multiple datasets to `apply`

@mike0sv comment:

it seems that this logic will apply model to multiple datasets in one run. My vision was to apply model once, but using multiple datasets (for example, .fit with X and y)

It seems we need to support both approaches. I had in mind a case where you have a folder with images and want to run your NN on all of them at once.

Support links to objects located in git repos

Consider different options users will want to use links for:

# case: link to a git repo; the link itself can be stored wherever you want
link_type: model
mlem_link:
    path: data/model/mlem.yaml
    root_path: examples/dvc-pipeline
    repo: https://github.com/iterative/mlem-prototype
object_type: link


# case: just link to an object inside local folder with .mlem
link_type: model
mlem_link:
    path: data/model/mlem.yaml
object_type: link


# case: link to external object with fsspec-parsed URI
link_type: model
mlem_link:
    path: github://iterative:mlem-prototype@committed-data/examples/dvc-pipeline/data/model
object_type: link

Right now we support reading the last two examples, but instead of `mlem_link: path: bla-bla` we expect `mlem_link: path`.
Regarding creation, we only support creating the second kind of link. We need to support these examples:

# fsspec-parsed URI (GitHub protocol is only the example, this should support all fsspec-supported protocols)
mlem link github://iterative:mlem-prototype@committed-data/examples/dvc-pipeline/data/model gh_model

# using `repo` and `rev` args
mlem link examples/dvc-pipeline/data/model --repo https://github.com/iterative/mlem-prototype --rev committed-data gh_model

We also may flatten keys in the YAML to better process this with Pydantic; not sure which option will be better.
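
The flattening idea can be illustrated in a few lines of plain Python. The `flatten` helper and the dotted-key convention are just one possible shape, not an agreed design:

```python
def flatten(d: dict, prefix: str = "") -> dict:
    """Turn nested mappings into dotted top-level keys, e.g. mlem_link.path."""
    out = {}
    for key, value in d.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, prefix=f"{dotted}."))
        else:
            out[dotted] = value
    return out


link = {
    "link_type": "model",
    "mlem_link": {
        "path": "data/model/mlem.yaml",
        "repo": "https://github.com/iterative/mlem-prototype",
    },
    "object_type": "link",
}
flat = flatten(link)
# flat["mlem_link.path"] == "data/model/mlem.yaml"
```

Dotted keys map more directly onto flat pydantic fields, at the cost of a less conventional YAML layout.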

Interface for batch loading for datasets

For big datasets and potentially for multi-file datasets (eg collection of images) we need an alternative way of reading data in batches. Potential use-case:
mlem apply model some-very-big.csv --batch 10000

  1. Design interface changes. Probably this should be a new method in DatasetReader; Dataset and DatasetMeta could also be affected. For some dataset types batching is impossible, so we should either ignore the batch option and fall back to a regular read, or raise an error (this should be configurable). Also, support for batch reading should be specified for each dataset type.
  2. Add a new option to the mlem apply CLI and change the underlying API call to support this feature.
  3. Implement batch reading for pandas as a POC; other dataset types may not support this.
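
For step 1, a minimal stdlib sketch of the batch-reading shape being proposed. The `read_batches` name and the CSV-only scope are assumptions for illustration; real support would live on DatasetReader:

```python
import csv
from itertools import islice
from typing import Iterator, List


def read_batches(path: str, batch_size: int) -> Iterator[List[dict]]:
    """Yield rows of a CSV file batch_size at a time, never loading it whole."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch
```

An `mlem apply model some-very-big.csv --batch 10000` call could then iterate such batches and compile one answer from the per-batch predictions.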

Move support for misc libraries

For now, we only have fastapi, numpy, pandas, and sklearn support. We need to port all the other previously supported libs.

For the first closed alpha release, we'll add only what's related to tabular data.

Resolve git urls in `mlem get`

What doesn't work right now:

mlem get https://github.com/iterative/example-mlem/data/model
mlem get data/model --repo https://github.com/iterative/example-mlem

Related to #4

Connect with volunteers interested in using MLEM

  • Draft an intro message about MLEM; check that everything in the repo is (or soon will be) ready
  • Search Discord for interested people
  • Schedule calls for onboarding

We should start this ~1 week before everything (code + docs) is ready, because scheduling calls will take some time.
