star-whale / starwhale

an MLOps/LLMOps platform

Home Page: https://starwhale.ai

License: Apache License 2.0

Java 38.53% Python 26.95% Makefile 0.20% Shell 0.47% JavaScript 0.45% HTML 0.03% TypeScript 29.24% SCSS 0.19% CSS 0.33% Smarty 0.03% Jupyter Notebook 3.30% Dockerfile 0.01% MDX 0.05% EJS 0.20% Jinja 0.01%
mlops ai infra cloud-native kubernetes model-evaluation dataset runtime datastore llm

starwhale's Issues

add some probes for agent daemonset

Current Behavior
No check for agent daemonset.

Proposed Behavior
Add livenessProbe/readinessProbe/startupProbe for agent daemonset.

  • add an API or cmd for agent pods that the probes can call.
  • configure the helm charts.

config timezone in controller

Current Behavior
The UTC timezone is always used.

Proposed Behavior
The controller can configure a timezone for the cluster, which tasks and agents then use.

display system version on UI

Is your feature request related to a problem? Please describe.
Users don't know which version of SW they are running when they try to file a bug report.

Describe the solution you'd like

  1. make the version info available as an OS environment variable when building the controller and agent images.
  2. the agent reads the OS environment variable and reports its version info to the controller.
  3. the controller reads its own OS environment variable and the agents' reports.
  4. the controller offers an API that exposes this version info to the UI and the CLI.
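
The four steps above can be sketched in a few lines. This is only an illustration: the variable name SW_VERSION and both function names are assumptions, since the issue does not name the environment variable or the API.

```python
import os

# Hypothetical variable name; the issue only says "OS environment variable".
VERSION_ENV = "SW_VERSION"


def read_version(env=None):
    """Return the version baked into the image, or 'unknown' if it is unset."""
    env = os.environ if env is None else env
    return env.get(VERSION_ENV, "unknown")


def collect_versions(controller_env, agent_reports):
    """Combine the controller's own version with the versions agents reported.

    agent_reports maps an agent identifier to the version string it sent.
    """
    return {
        "controller": read_version(controller_env),
        "agents": dict(agent_reports),
    }
```

A controller API handler could return the result of collect_versions directly, so the UI and the CLI read the same payload.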

auto-detect python dependencies

Is your feature request related to a problem? Please describe.
Today swcli only uses a pre-defined pip-req.txt, pip freeze in a venv environment, or conda export in a conda environment. Can we auto-detect python dependencies by static code analysis? It would be simpler for the end user.

Describe the solution you'd like
WIP.
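
Since the solution is still WIP, the following is only one possible starting point, not the project's design: a sketch that uses the standard ast module to collect the top-level modules a source file imports. The function name is hypothetical. An import name is not always the pip package name (for example, the module yaml is installed as PyYAML), so a real implementation would still need a mapping step.

```python
import ast


def top_level_imports(source):
    """Return the set of top-level module names imported by the given source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import, which is not a dependency.
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return found
```

Static analysis of this kind misses dynamic imports such as importlib.import_module calls, so it could complement pip freeze rather than replace it.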

design swmp 2.0

Current Behavior
swmp 1.0 has a very simple structure: a single tar file. Changing one line produces a whole new swmp tar, which is uneconomical.

Proposed Behavior

  • introduce a layer mechanism in swmp.
  • add a blake2b checksum to the meta file.
  • if no files changed, should swcli build a new swmp version?
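
A minimal sketch of how a blake2b checksum per file could answer the "no files changed" question. The manifest layout and the function names are assumptions for illustration; the issue does not define the meta file format.

```python
import hashlib


def blake2b_digest(data):
    """Hex BLAKE2b digest of a bytes object (default 64-byte digest)."""
    return hashlib.blake2b(data).hexdigest()


def manifest(files):
    """Map each relative path to the digest of its content.

    files maps a relative path to the file content as bytes.
    """
    return {path: blake2b_digest(content) for path, content in sorted(files.items())}


def changed_paths(old, new):
    """Paths that were added, removed, or whose digest differs."""
    return {p for p in old.keys() | new.keys() if old.get(p) != new.get(p)}
```

If changed_paths returns an empty set, swcli could skip the build or reuse the previous version; with layers, only the layers that contain changed paths would need to be rebuilt.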

swmp list is too slow

Describe the bug
Running swcli model list takes a long time because it reads the meta from the swmp tar files.

Expected behavior
A quick, efficient method for listing.

support tag for swmp and swds

Current Behavior
Only a very long version string is available to track swmp/swds.

Proposed Behavior
Tags reduce the user's short-term memory load and make interaction more human-friendly.

Solution Proposal
WIP.

design starwhale cloud demo version

Current Behavior
Only an on-premises version exists.

Proposed Behavior
Provide a cloud version that includes these features:

  • user isolation
  • resource isolation
  • user registration
  • cloud host plan
  • deployment and release

[Pending] design evaluation report schema

Current Behavior

  • Only multi_classification report.

Proposed Behavior

  • Define a general report schema that can describe a variety of machine learning problems. The web UI and the client terminal UI will use the schema to show tables, graphs, summaries, etc.
  • a schema
  • a validator
  • some machine learning examples
  • define some frontend UI primitives

Ref

python ci

Current Behavior
no ci

Proposed Behavior

  • add ci framework for python #291
  • python black lint: #299
  • flake8 check: #302
  • mypy check: #305
  • unittest: #306
  • code test coverage: #310
  • release pypi by tag action: #316

refactor client setup.py requirements

Current Behavior
setup.py and requirements.txt list the same python requirements.

Proposed Behavior
List the python requirements in only one place.

model.yaml/dataset.yaml runtime bug

Describe the bug
Users can set a runtime field that does not match the actual python runtime. This leads to importing the wrong python version.

To Reproduce
Write runtime: 3.9 while the local venv mode uses python3.7.

Expected behavior

  • use _manifest.yaml runtime first. #147
  • auto check runtime field.
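
The automatic check could look roughly like this sketch, which compares a declared major.minor runtime with the running interpreter. The function name is hypothetical. One caveat worth knowing: an unquoted YAML value such as runtime: 3.10 is loaded as the float 3.1, so the field is safer when quoted.

```python
import sys


def runtime_matches(declared, actual=None):
    """True if the declared 'major.minor' runtime equals the interpreter version.

    actual defaults to the running interpreter; pass a (major, minor) tuple
    to check against another environment.
    """
    actual = sys.version_info[:2] if actual is None else tuple(actual)[:2]
    parts = str(declared).strip().split(".")
    try:
        wanted = tuple(int(p) for p in parts[:2])
    except ValueError:
        # Values such as "python3" are not a valid major.minor pair.
        return False
    return len(wanted) == 2 and wanted == actual
```

swcli could call such a check before a build and print a console warning, or fail, when it returns False.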

conda export sometimes omits the starwhale package

Describe the bug

  • conda export omits the starwhale python package at random.
  • Reproduce Case: conda cannot export an editable starwhale install.

Expected behavior
conda export MUST include starwhale; otherwise it leads to an import error in the ppl/cmp phase.

Proposal
Add some docs and console warnings.

complete swds/swmp push mechanism

Current Behavior
Running the push cmd just waits for the cmd to exit: no upload progress, no auto retry.

Proposed Behavior

  • more friendly output.
  • head first for version.

Some Controller API suggestions

Enhancement

  • createTime vs createdTime vs startTime
    We can use one uniform verb to describe the creation time in all APIs. pr: #190
  • /api/v1/project/{pid}/dataset/{did} may return the serialized meta field directly.
  • /api/v1/project/{pid}/model/{mid}/version: add a meta field.
  • /api/v1/login: return the user's details, such as role and createdtime.
  • should we add an owner field to the model/dataset list API? It would bring a lot of redundant fields.
  • could the /api/v1/project API accept a username query string to show that user's projects? By default, only show the projects of the currently logged-in user.

Bugs

  • /api/v1/project/{pid}/job/{jid}/task
    • this API lacks the finishTime and duration fields.
    • this API lacks a task type field that indicates whether a task is cmp or ppl.
  • is modelName required for /api/v1/project/{pid}/model?
    • make it an optional field?

Feature Request

  • need one API to show all models with external parameters
    • scenario: run the swcli model list --remote cmd in a local environment. Users may specify multi-dimensional parameters, such as --project, --model-name, --self (models in their own projects).
  • need one API to show all datasets with external parameters
    • scenario: same as the models API.
  • /api/v1/project/{pid}/model/{mid}: add the version to the URL path or params.
    • this API only shows the latest version's files now. The client needs to fetch the files of different versions.
    • model and dataset may use a consistent schema, which lets the CLI and the frontend use simpler code.
  • /api/v1/project/{pid}/model/{mid}: show more details in one API.
    • such as model create time, meta, files, etc.
  • /api/v1/project/{pid}/dataset/{mid}: show more details in one API.
    • such as dataset create time, meta, files, etc.

cc @dreamlandliu @anda-ren

agent uniqueness change

Is your feature request related to a problem? Please describe.
Currently we enforce a uniqueness constraint on agents through the IP address. But an agent's IP address changes constantly in a K8S context. A new uniqueness field for agents is required.

Describe the solution you'd like

  • add a unique agent field called serial_number
  • agents report the serial_number to the controller, and these numbers must be universally unique
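
As a rough illustration of the proposal, the sketch below keys agents by serial_number instead of IP, so an IP change updates the existing record rather than creating a second agent. The class, the function names, and the use of a random UUID as the serial number are assumptions, not part of the issue.

```python
import uuid


def new_serial_number():
    """Generate a serial number; a random UUID is one way to get uniqueness."""
    return uuid.uuid4().hex


class AgentRegistry:
    """Controller-side view of agents, keyed by serial_number rather than IP."""

    def __init__(self):
        self._ip_by_serial = {}

    def report(self, serial_number, ip):
        """Record a report; a changed IP overwrites the old one."""
        self._ip_by_serial[serial_number] = ip

    def ip_of(self, serial_number):
        return self._ip_by_serial.get(serial_number)

    def __len__(self):
        return len(self._ip_by_serial)
```

The agent would need to persist its serial number across restarts for this to work, which the sketch leaves out.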

define a `.swignore` file to ignore some files when build swmp/swds

Is your feature request related to a problem? Please describe.
We need to write exclude_pkg_data twice, once in dataset.yaml and once in model.yaml. In addition, a long list of ignore entries can clutter model.yaml/dataset.yaml.

Describe the solution you'd like
Define a .swignore file and REMOVE the exclude_pkg_data field.
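
A minimal sketch of how a .swignore file could be applied, assuming a simplified gitignore-like syntax: one glob pattern per line, with # comments. The syntax and the function names are assumptions, since the issue does not specify the file format.

```python
import fnmatch


def parse_ignore(text):
    """Return the glob patterns in the file, skipping blank lines and comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(rel_path, patterns):
    """True if the relative path, or its file name, matches any pattern."""
    name = rel_path.rsplit("/", 1)[-1]
    return any(
        fnmatch.fnmatchcase(rel_path, p) or fnmatch.fnmatchcase(name, p)
        for p in patterns
    )
```

The same helper could serve both the swmp and the swds build, which removes the duplication the issue describes.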

typo mutlilabel

Describe the bug
/api/v1/project/{pid}/job/{jid}/task/result uses a misspelled word.

To Reproduce
visit result api.

Expected behavior
mutlilabel -> multilabel

add auc/roc output for multi_classification report

Current Behavior
multi_classification does not include auc/roc data.

Proposed Behavior
auc/roc is very common and important for multi_classification problems.

Solution Proposal
Add methods to the multi_classification decorator.
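
For reference, a one-vs-rest AUC can be computed without extra dependencies. The sketch below uses the pairwise-comparison formulation of AUC (the fraction of positive/negative pairs that are ranked correctly, with ties counting half). It is an illustration with hypothetical function names, not the decorator's implementation, and it is quadratic in the number of samples.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, y_prob):
    """Per-class AUC; y_prob[i][c] is the predicted probability of class c."""
    n_classes = len(y_prob[0])
    return {
        c: binary_auc([y == c for y in y_true], [row[c] for row in y_prob])
        for c in range(n_classes)
    }
```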

block in loading data from s3/minio

Describe the bug
If minio/s3 cannot be reached, the loading program blocks until the user kills the process.

Expected behavior

  • config timeout
  • auto retry
  • actively exit the process
  • more user friendly log
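
The retry and exit behavior could be sketched as a small wrapper with exponential backoff. The function name and the default values are assumptions; the connect and read timeouts themselves would come from the storage client's own settings and are not shown here.

```python
import time


def call_with_retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func, retrying on connection errors with exponential backoff.

    Raises RuntimeError once all attempts fail, so the caller can log a
    clear message and exit instead of blocking forever.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up after {attempts} attempts") from last_error
```

The sleep parameter is injectable only so the backoff can be tested without waiting.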

agent dynamically generates OSS connections

Current Behavior
The agent generates fixed OSS connections from properties at bootstrap.

Proposed Behavior
Generate connections dynamically according to the connection information sent by the controller.

Solution Proposal
Generate the connections dynamically.

eval run cmd in local environment

Current Behavior
We only run the complete-flow evaluation in the controller.

Proposed Behavior
When we run swcli eval run --local, the CLI will use the local swmp and swds to run the complete-flow evaluation. This feature is a great help for debugging.

  • Use docker to reproduce eval flow.
  • Do not depend on minio.

Design Proposal

  • cmd schema: swcli eval run [--local/--remote] --model xx --dataset xx --dataset yy [--project xx] [--baseimage xx] [--resource gpu:1] [--name xx] [--description xx] [--phase]
    • --local/--remote: optional, in local or remote cluster, remote cluster is the default option.
    • --model: required, model id or model name:version(local mode)
    • --dataset: required, dataset id or dataset name:version(local mode)
    • --project: optional, project id, only for remote cluster mode.
    • --baseimage: optional, task run image. if omitted, starwhale will use the latest baseimage, name is: starwhaleai/starwhale:latest
    • --resource: optional, only for remote cluster mode, fmt is [resource:gpu|cpu]:[cnt:int >0], default is cpu:1
    • --name: optional, eval job name, the username-timestamp-randomstr is the default name.
    • --desc: optional.
    • --gencmd: optional, only generate docker run cmd in local mode.
    • --phase: optional, only for local mode. choices: all|ppl|cmp, default is all.
  • local mode:
    • search swmp and extract swmp into workdir
    • generate swds fuse input.json
    • render/run docker run cmd for ppl
    • render/run docker run cmd for cmp
    • show progress
    • parse and render cmp report
    • eval storage for local mode
    • in {@snapshot_dir}/run/eval/{version}/, we will store all result artifacts.
    • swcli eval list --local will show local eval list.
    • swcli eval inspect xxx --local will show local result and report.
  • external cmd:
    • swcli dataset fuse xxx : generate fuse input.json
    • swcli model extract xxx: extract swmp tar

tune size for mnist swmp

Current Behavior
Even the very simple mnist example produces a swmp larger than 2GB. Should we do some optimizations to reduce the swmp size?

Proposed Behavior

  • swmp layers
  • pre-install pytorch in the base image

js/ts ci

Current Behavior
no ci

Proposed Behavior

  • add js/ts lint ci
  • add compile checker

java ci

Current Behavior
no ci

Proposed Behavior

  • java lint
  • java unittest
  • release controller/agent docker image

agent online & offline management

Is your feature request related to a problem? Please describe.

  • the controller can't tell whether an agent is down or just slow to report
  • the controller can't tell whether an agent has been removed or is down

Describe the solution you'd like

  • when the controller receives an agent's report, it defines the deadline by which the agent should report next. The deadline is dispatched to the agent.
  • the controller sets the agent status to down if it doesn't receive a report from the agent before the deadline
  • the user removes a removed agent on the UI (let the user make the removal decision)
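
The deadline scheme can be sketched as follows. The class name, the status strings, and the interval and grace values are assumptions for illustration; timestamps are passed in explicitly so the logic is easy to test.

```python
class AgentTracker:
    """Tracks the next-report deadline per agent and derives its status."""

    def __init__(self, interval_seconds=30.0, grace=2.0):
        self.interval = interval_seconds
        self.grace = grace
        self.deadlines = {}

    def on_report(self, agent_id, now):
        """Record a report and return the next deadline to send to the agent."""
        deadline = now + self.interval * self.grace
        self.deadlines[agent_id] = deadline
        return deadline

    def status(self, agent_id, now):
        deadline = self.deadlines.get(agent_id)
        if deadline is None:
            return "unknown"
        return "online" if now <= deadline else "down"

    def remove(self, agent_id):
        """Called when the user removes the agent on the UI."""
        self.deadlines.pop(agent_id, None)
```

An agent marked down stays in the tracker until the user removes it, which keeps the removal decision with the user as the issue proposes.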

Subtasks

controller proxy/cache for pypi/docker image/conda

Current Behavior
The controller does not support a proxy or cache. When an agent reproduces the swmp python packages or pulls an image, it spends a lot of time.

Proposed Behavior

  • cache: pypi or docker image or conda channel
  • settings: upstream address, conda channel, cache expire time, timeout settings
  • storage: use minio or s3
  • auto-inject: agent will auto inject envs or files into task container.
    • ~/.pip/pip.conf
    • docker /etc/docker/daemon.json mirror field
    • ~/.condarc
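
As an example of the auto-inject step, the sketch below renders a pip.conf that points pip at a cache. The index-url and timeout keys under [global] are standard pip configuration options; the function name and the controller URL in the usage note are placeholders, not real endpoints.

```python
import configparser
import io


def render_pip_conf(index_url, timeout=60):
    """Return the text of a pip.conf that uses the given index URL."""
    # interpolation=None keeps '%' characters in URLs from being interpreted.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg["global"] = {"index-url": index_url, "timeout": str(timeout)}
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()
```

For instance, render_pip_conf("http://controller.example/pypi/simple") yields text the agent could write to ~/.pip/pip.conf inside the task container.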
