dstackai / dstack Goto Github PK

A lightweight alternative to Kubernetes for AI, simplifying container orchestration on any cloud or on-premises and accelerating AI development, training, and deployment.

Home Page: https://dstack.ai/docs

License: Mozilla Public License 2.0

Python 72.74% Dockerfile 0.14% Shell 0.68% Go 5.77% Mako 0.02% HCL 0.15% Jinja 0.12% TypeScript 19.36% HTML 0.04% SCSS 0.44% CSS 0.01% JavaScript 0.52%

machine-learning python aws azure gcp gpu llms cloud orchestration fine-tuning

dstack's Introduction

dstack is a lightweight alternative to Kubernetes, designed specifically for managing the development, training, and deployment of AI models at any scale.

dstack is easy to use with any cloud provider (AWS, GCP, Azure, OCI, Lambda, TensorDock, Vast.ai, RunPod, etc.) or any on-prem clusters.

If you already use Kubernetes, dstack can be used with it.

Accelerators

dstack supports NVIDIA GPU and Google Cloud TPU out of the box.

Major news ✨

[2024/07] dstack 0.18.8: GCP volumes (Release)
[2024/07] dstack 0.18.7: Fleets, RunPod volumes, dstack apply, and more (Release)
[2024/05] dstack 0.18.4: Google Cloud TPU, and more (Release)
[2024/05] dstack 0.18.3: OCI, and more (Release)
[2024/05] dstack 0.18.2: On-prem clusters, private subnets, and more (Release)

Installation

Before using dstack through CLI or API, set up a dstack server.

1. Configure backends

If you want the dstack server to run containers or manage clusters in your cloud accounts (or use Kubernetes), create the ~/.dstack/server/config.yml file and configure backends.

2. Start the server

Once the ~/.dstack/server/config.yml file is configured, proceed to start the server:

$ pip install "dstack[all]" -U
$ dstack server

Applying ~/.dstack/server/config.yml...

The admin token is "bbae0f28-d3dd-4820-bf61-8f4bb40815da"
The server is running at http://127.0.0.1:3000/

Note It's also possible to run the server via Docker.

The dstack server can run anywhere: on your laptop, a dedicated server, or in the cloud. Once it's up, you can use either the CLI or the API.

3. Set up the CLI

To point the CLI to the dstack server, configure it with the server address, user token, and project name:

$ pip install dstack
$ dstack config --url http://127.0.0.1:3000 \
    --project main \
    --token bbae0f28-d3dd-4820-bf61-8f4bb40815da
    
Configuration is updated at ~/.dstack/config.yml

4. Create on-prem fleets

If you want the dstack server to run containers on your on-prem servers, use fleets.

How does it work?

Before using dstack, install the server and configure backends.

1. Define configurations

dstack supports the following configurations:

Dev environments — for interactive development using a desktop IDE
Tasks — for scheduling jobs (incl. distributed jobs) or running web apps
Services — for deployment of models and web apps (with auto-scaling and authorization)
Fleets — for managing cloud and on-prem clusters
Volumes — for managing persisted volumes
Gateways — for configuring the ingress traffic and public endpoints

Configuration can be defined as YAML files within your repo.

2. Apply configurations

Apply the configuration either via the dstack apply CLI command or through a programmatic API.

dstack automatically manages provisioning, job queuing, auto-scaling, networking, volumes, run failures, out-of-capacity errors, port-forwarding, and more — across clouds and on-prem clusters.

More information

For additional information and examples, see the following links:

Contributing

You're very welcome to contribute to dstack. Learn more about how to contribute to the project at CONTRIBUTING.md.

License

Mozilla Public License 2.0

dstack's People

Contributors

Stargazers

Watchers

Forkers

jeffamaxey sudipkrk ivan-bokov techrajk kengz balciiberk carolinedlu dmayboroda mohamedtaoufik techieteee tendaishoko sam-khan sandeepvikram elm8116 energyray-com paulowiz michelkok christellejulias riwajsapkota jaedukseo ankittkp patrickloeber pavadik r4victor ismoilovmuhriddin stepanbakshayev sxl1993 shalevy1 sauravsolanki ch3rny1 ksasi techthiyanes arpitjain799 kamilkrzyskow kshitijdwivedi28 harsimranmaan13 brunoscaglione blu-geek axitkhurana mmby85 eternalerrors odoochain silvacarl2 polya20 alxfoster muhtasham jesusoctavioas ego capdevc dantegpt xc0r deep-diver promsoft tmonty12 liljenstolpe tleyden mvandermeulen smtm-capital plutov hbcbh1999 yuelight lawweiliang kevkibe mhoudini bijay555 apollohuang1 spott raviakash irohith beeehappyandfree dsdanielpark ganbayard mekongdelta-mind tnissen375 pirero muddi900 loghijiaha bihan de30 sorokinvld yixiaoer gerbylev gluver quocdat-le-insacvl vineetp6 adamrashi suyambuganesh82 namesdeepak tamanobi stophobia er-sadiq swsvc

dstack's Issues

Provide on-demand compute

Currently, in order to use dstack, the user needs to either have an existing cloud account or own hardware.
It would be great if dstack provided its own compute provider and allow users to use dstack without having their own cloud account or hardware.
On one hand, dstack could provide a number of free GPU hours for the trial.
On the other hand, dstack could provide a way to pay for the spent hours, e.g. via a card.

Unable to add key, value pair in Secrets

I am unable to add key, value pair in Secrets.

I go through the following steps. In the end, I do not see any key, value pair (see last image).

Allow on-demand runners inbound traffic

Pass job environment variables to the container

Now, every job may have its own environment variables set by the provider – see the property environment in the job. It's a map of string to string. The runner should pass these environment variables to the job container.

AWS_DEFAULT_REGION wasn't recognized when started via Docker

PyTorch DDP provider

tag doesn't work after uploading

using dstack artifacts upload I've provided the tag I wanted to assign to the data. Unfortunately, further runs depending on this data were failing without any logs. I've just removed the tag from the data (after it was successfully uploaded) and assigned it (the same tag) once again. Without any changes in the local repo, the code was resurrected and launched easily on dstack.

Show workflow parameters in dashboard

1. apart from depends-on would be nice to have a textual representation of the dependency (name) 2. apart from run itself, it's tag may be helpful

Allow runners to expose job ports

Dask provider

Support "restart" command

The command should work similarly to dstack run but instead of creating new jobs, it should change the existing jobs to theSubmitted status.

Hide jobs from the user

Allow runners to configure hostname

Assign hostnames to on-demand runners

Logs are displayed in the wrong order

Support multiple CUDA versions

Here's one way to do it:

Allow dstack-runner to read the CUDA version from the runner.yaml. Add config --cuda <...> argument to dstack-runner.
Allow dstack-runner to replace ${{ cuda }} within jobs' image_namewith the configured CUDA version. Do the same for the docker image that is used to runnvidia-smi`.

Simplify AWS settings

Questions:

How to avoid requiring the user to add limits manually through the UI?
How to determine in what regions and what instance types it's allowed to use?
Is it possible to let the user configure it through the code and not UI?
How to make AWS configuration as easy as possible?

Allow users to configure secrets

Add the Docker option to the website

Allow runners to configure server address

Proper handling of STOPPING and ABORTING run statuses of jobs and runs in runner

[Feature] Allow to configure Google Cloud as a backend

Throttle logs when sending to AWS

Pass user secrets as environment variables to jobs

Now, the /runners/ping response inside users provides secrets with the list of secrets to pass as environment variables to the jobs.

Support parsing .ssh/config by the CLI

If I use dstack init on a repository that is using SSH, dstack should be able to parse ~/.ssh/config automatically.

Report provider logs as runner logs rather than run logs

Currently, the user sees the logs from the provider in the run logs.

We should treat them as runner logs and not as run logs so the user doesn't see them.

Reduce the number of dependencies

There are tons of dependencies apart from the ones passed by the user. These dependencies are installed each time the run is submitted. It would be nice to optimize this part.
Ideas:

Select really necessary libraries
Make several pre-built sets for the most common use-cases

Containers from colab\kaggle would be really nice as they are +- classical and have expected behaviour regarding popular libraries

Allow to immediately stop and abort runs that are assigned to the runners that are not live yet (e.g. requested)

Git patch apply error

If a run or a job is restarted on the same runner, the runner tries to apply the Git patch (repo diff) and fails because of a conflict as it's trying to apply it to the folder where it has already applied the patch.

Steps to reproduce:

Make sure you have only one runner (e.g. disable on-demand runners, and start a runner locally)
Submit a run (or job, and wait when it finishes (you can stop if if needed)
Restart the run.

Expected:

The run is executed exactly as the first time

Actual:

There is an error

Log:

ERRO[2022-05-25T11:21:58Z] diff applier error                            ae=ApplyError{Fragment: 1, FragmentLine: 3, Line: 3} run_name=odd-rabbit-1 job_id=e7fa162e70b1 workflow=train-mnist filename=.dstack/variables.yaml err=conflict: fragment line does not match src line
ERRO[2022-05-25T11:21:58Z] run job is finished with error                job_id=e7fa162e70b1 err=conflict: fragment line does not match src line workflow=train-mnist run_name=odd-rabbit-1
INFO[2022-05-25T11:24:57Z] New job submitted                             job_id=e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1
WARN[2022-05-25T11:24:57Z] count of log arguments must be odd            job_id=e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1 count=1
INFO[2022-05-25T11:24:58Z] git checkout                                  path=/root/.dstack/tmp/runs/odd-rabbit-1/e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1 url=https://github.com/dstackai/dstack-examples.git branch=main hash=f219066b2379c69263f281f65167c8f6046874a2 job_id=e7fa162e70b1 auth=*http.BasicAuth
WARN[2022-05-25T11:24:58Z] git clone ref==nil                            branch=main hash=f219066b2379c69263f281f65167c8f6046874a2 job_id=e7fa162e70b1 path=/root/.dstack/tmp/runs/odd-rabbit-1/e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1 url=https://github.com/dstackai/dstack-examples.git
INFO[2022-05-25T11:24:58Z] apply diff start                              run_name=odd-rabbit-1 dir=/root/.dstack/tmp/runs/odd-rabbit-1/e7fa162e70b1 job_id=e7fa162e70b1 workflow=train-mnist
ERRO[2022-05-25T11:24:58Z] diff applier error                            job_id=e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1 filename=.dstack/variables.yaml err=conflict: fragment line does not match src line ae=ApplyError{Fragment: 1, FragmentLine: 3, Line: 3}
ERRO[2022-05-25T11:24:58Z] run job is finished with error                run_name=odd-rabbit-1 job_id=e7fa162e70b1 err=conflict: fragment line does not match src line workflow=train-mnist

dstack run tagging feature

Would be nice to be able to set a tag for the run right from the console like
dstack run train-model --tag latest

Show workflow parameters and variables in UI and allow to override them via CLI

This relates to the new UI, where we hide jobs.

Make the runner submit workflow run (workflow parameters, variables, depends-on)

Allow providers to refer to master jobs

Pass master hostname and port mapping to jobs

Support interruptible in workflows.yaml and providers

User documentation for server Docker image

Handle STOPPING and ABORTING statuses of jobs in runner

In case of stopping, send SIGINT and wait until the job finishes.
In case of aborting, immediately kill the job.

Also, make sure we don't wait extra time when cleaning up job resources.

[Backend] Allow to configure Microsoft Azure as a backend

Allow CLI to configure server address

dstack artifacts download to raise error

dstack artifacts download doesn't tell you anything if there is a typo in a run name. would be nice to make some warning

Do not allow GPU instances handle tasks that don't require GPU

[Epic] Support Kubernetes backend type

Allow to schedule runs

It will be great to have an option either to specify the time to launch for the run or just say "this run to start in 2 hours"

Allow to run server via Docker image

Allow CLI to upload artifacts

In the tutorial, the data is downloaded via the library which is not customisable enough.
It would be nice to have an option to pass the data to the execution environment. For example, it may be a tag in the workflows for the data to be taken from the specified path to the aws instance.

Thank you in advance!

dstack fails with large uncommited changes

Regularly happens if you work with ipynb notebooks locally and going to submit a python file regardless if the latter was changed or not.
sometimes fails on the stage or dstack run with requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://api.dstack.ai/runs/submit
what is worse, sometimes this successfully goes to the servers but fails there without any notification

workflows.yaml is missing
Specified workflow cannot be found
Specified provider cannot be found
Specified tag cannot be found
Can’t fetch the repo
Can’t apply the diff + error message
Can’t download the provider
Can’t create/start Docker container
Can’t find/mount the artifact

etc