
competitions-v1-compute-worker

Alternative workers

An alternative worker uses Azure Container Instances (ACI) to run the compute worker Docker container in a serverless environment. Compared with the standard worker, it:

  • adds support for NVIDIA GPUs
  • adds support for real-time detailed results

Running

Edit .env_sample and save it as .env, setting at least:

```
BROKER_URL=<Your queue's broker URL>
BROKER_USE_SSL=True
```
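For reference, a filled-in .env might look like the sketch below. The broker URL follows Celery's usual transport://user:password@host:port/vhost scheme; every value shown is a placeholder, not a real endpoint:

```
# Placeholder values only -- substitute your queue's real credentials
BROKER_URL=pyamqp://user:password@queue.example.com:5671/vhost
BROKER_USE_SSL=True
```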

Run the following command:

```
docker run \
    --env-file .env \
    --name compute_worker \
    -d \
    --restart unless-stopped \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /tmp/codalab:/tmp/codalab \
    codalab/competitions-v1-compute-worker:1.1.5
```

For more details, see the wiki page codalab/codalab-competitions/wiki/Using-your-own-compute-workers.

If you want to run with GPU:

Install CUDA, the NVIDIA drivers, Docker, and nvidia-docker (system dependent).

Make sure that nvidia-container-toolkit is set up; this also involves updating to Docker 19.03 and installing the NVIDIA drivers.

Edit .env_sample and save it as .env, making sure to uncomment USE_GPU=True.

Make sure the temp directory you select exists (the command below creates /tmp/codalab), then run the following command:

```
sudo mkdir -p /tmp/codalab && nvidia-docker run \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/nvidia-docker/nvidia-docker.sock:/var/lib/nvidia-docker/nvidia-docker.sock \
    -v /tmp/codalab:/tmp/codalab \
    -d \
    --name compute_worker \
    --env-file .env \
    --restart unless-stopped \
    --log-opt max-size=50m \
    --log-opt max-file=3 \
    codalab/competitions-v1-nvidia-worker:v1.5-compat
```
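Before starting the GPU worker, it can help to confirm that containers can actually see the GPU. This is a quick sanity check, not part of the worker setup; the CUDA image tag is only an example, and any image that ships nvidia-smi works:

```shell
# Should print the nvidia-smi device table; if this fails,
# the compute worker will not get GPU access either.
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```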

To see the worker's output:

```
docker logs -f compute_worker
```

To stop the worker:

```
docker kill compute_worker
```

Development

To re-build the image:

```
docker build -t competitions-v1-compute-worker .
```

Updating the image

```
docker build -t codalab/competitions-v1-compute-worker:latest .
docker push codalab/competitions-v1-compute-worker
```
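When publishing a new release, it is safer to push a pinned version tag alongside latest, so running workers can be reproduced later. A sketch, with an illustrative version number:

```shell
VERSION=1.1.6  # illustrative; pick the next real release number

# Tag the same build as both "latest" and the pinned version
docker build -t codalab/competitions-v1-compute-worker:latest \
             -t codalab/competitions-v1-compute-worker:$VERSION .

docker push codalab/competitions-v1-compute-worker:latest
docker push codalab/competitions-v1-compute-worker:$VERSION
```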

Special env flags

USE_GPU

Default False. When false, the worker does not pass the --gpus all flag to Docker.

Note: GPU mode also requires Docker v19.03 or greater, nvidia-container-toolkit, and the NVIDIA drivers.

SUBMISSION_TEMP_DIR

Default /tmp/codalab

SUBMISSION_CACHE_DIR

Default /tmp/cache

CODALAB_HOSTNAME

Default socket.gethostname()

DONT_FINALIZE_SUBMISSION

Default False

Sometimes it may be useful to pause the compute worker and return instead of finalizing a submission. This leaves the submission in a state where it hasn't been cleaned up yet, so you can attempt to re-run it manually.
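These flags can also be overridden per container with -e instead of editing .env. A sketch based on the run command above (values shown are the documented defaults, except USE_GPU):

```shell
docker run \
    --env-file .env \
    -e USE_GPU=True \
    -e SUBMISSION_TEMP_DIR=/tmp/codalab \
    -e SUBMISSION_CACHE_DIR=/tmp/cache \
    -e DONT_FINALIZE_SUBMISSION=False \
    --name compute_worker \
    -d \
    --restart unless-stopped \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /tmp/codalab:/tmp/codalab \
    codalab/competitions-v1-compute-worker:1.1.5
```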


Issues

The branches need to be cleaned

We basically have 3 Docker images for the compute workers:

  • codalab/competitions-v1-compute-worker:1.1.5 (CPU)
  • codalab/competitions-v1-nvidia-worker:v1.5-compat (GPU)
  • codalab/competitions-v1-compute-worker:latest

It is not clear whether latest is even used.

However, we have many branches in the repository:

  • master (updated 6 minutes ago by Didayolo)
  • dependabot/pip/celery-5.2.2 (updated 2 months ago by dependabot[bot])
  • dependabot/pip/pyyaml-5.4 (updated 12 months ago by dependabot[bot])
  • 17-legacy-nvidia-worker-compat-py3 (updated 14 months ago by Tthomas63)
  • python3 (updated 2 years ago by Tthomas63)
  • 17-legacy-nvidia-worker-compat (updated 2 years ago by ckcollab)
  • 162-nvidia-worker (updated 2 years ago by zhengying-liu)
  • dependabot/pip/psutil-5.6.6 (updated 2 years ago by dependabot[bot])
  • dependabot/pip/requests-2.20.0 (updated 2 years ago by dependabot[bot])
  • 162-nvidia-worker-monitor (updated 3 years ago by ckcollab)
  • feature/realtime-detailed-results (updated 3 years ago by ckcollab)
  • feature/legacy-azure-version-fix (updated 3 years ago by ckcollab)
  • 162-nvidia-worker-celery-4-3-0 (updated 3 years ago by Tthomas63)
  • feature/fix-prune (updated 3 years ago by Tthomas63)
  • feature/suppress-warning (updated 3 years ago by Tthomas63)
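A cleanup sketch, assuming the obsolete branches are already merged (feature/fix-prune is used only as an example name; verify each branch before removing it). The demo below works in a throwaway repository so it is safe to run as-is; on the real repository you would run the same git branch commands from a checkout:

```shell
# Demo in a throwaway repo; stand-in branch name is illustrative
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email dev@example.com
git config user.name dev
git commit -q --allow-empty -m "init"
git branch feature/fix-prune            # stand-in for an obsolete merged branch

# List branches already merged into the current branch (deletion candidates)
git branch --merged

# Delete a branch once confirmed obsolete; -d refuses unmerged branches
git branch -d feature/fix-prune
# (on the real repo, also: git push origin --delete feature/fix-prune)
```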

bug in 17-legacy-nvidia-worker-compat worker.py

With recent Docker versions, the run command seems to be

```
docker run --gpus all ...
```

instead of

```
nvidia-docker run ...
```

I have tested the former and it worked for me.

With the latter, there is the following error:
docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/1f97be47849eff264efa7a02e7fb6767ceb51ac432ddc502820f667035d84db1/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown.
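A hedged compatibility sketch: prefer --gpus all when the local Docker CLI advertises it, and fall back to legacy nvidia-docker otherwise. This detection heuristic is an assumption, not code from worker.py:

```shell
# Pick the GPU invocation supported by the local Docker CLI.
# Docker >= 19.03 lists a --gpus flag in its `docker run` help text.
if docker run --help 2>/dev/null | grep -q -- '--gpus'; then
    gpu_cmd="docker run --gpus all"
else
    gpu_cmd="nvidia-docker run"
fi
echo "Using: $gpu_cmd"
```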

Weird logs on Google Cloud VMs

This issue concerns the Google Cloud version of the compute worker for AutoDL.

If we SSH to some workers (Google Cloud VMs, login required) and check the Docker logs:

```
gcloud beta compute --project "autodl-221715" ssh --zone "us-west1-a" "gpu-06-25-2019-20-30-04-000"
nvidia-docker logs -f compute_worker
```

we see a lot of error messages (originally attached as a screenshot from 2019-08-22) such as:

```
entr: cannot stat '/tmp/codalab/tmp9CPTvU/run/output/detailed_results.html': No such file or directory
```

It seems that this issue doesn't completely block the submission-handling process, but it may be the cause of other issues.
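If the error comes from entr being pointed at a file the submission has not created yet, one hedged mitigation (the path below is illustrative, mirroring the log message) is to pre-create the file before the watcher starts:

```shell
# Create the results file up front so entr can stat it from the start.
RESULTS=/tmp/codalab/run/output/detailed_results.html   # illustrative path
mkdir -p "$(dirname "$RESULTS")"
touch "$RESULTS"
```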

Execution time limit doesn't work as expected

```
'--stop-timeout={}'.format(execution_time_limit),
```

The way we implement execution_time_limit in the docker command doesn't work as expected: this option does not stop the container once the time limit is reached.
Here is the Docker documentation: https://docs.docker.com/engine/reference/commandline/run/#stop-timeout

"The --stop-timeout flag sets the number of seconds to wait for the container to stop after sending the pre-defined (see --stop-signal) system call signal. If the container does not exit after the timeout elapses, it's forcibly killed with a SIGKILL signal.

If you set --stop-timeout to -1, no timeout is applied, and the daemon waits indefinitely for the container to exit.

The Daemon determines the default, and is 10 seconds for Linux containers, and 30 seconds for Windows containers."
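One hedged workaround, assuming the worker shells out to docker run in the foreground: enforce the limit from outside the container with GNU timeout rather than --stop-timeout (which only controls shutdown grace, not total runtime). The image name and script are placeholders:

```shell
# Sketch: kill the submission if it exceeds the execution time limit.
EXECUTION_TIME_LIMIT=300   # illustrative limit, in seconds

# GNU timeout sends SIGTERM when the limit elapses; a foreground
# `docker run` forwards the signal to the container by default.
# Exit status 124 marks a timeout.
status=0
timeout "$EXECUTION_TIME_LIMIT" docker run --rm submission-image ./run.sh || status=$?
if [ "$status" -eq 124 ]; then
    echo "Submission exceeded ${EXECUTION_TIME_LIMIT}s limit" >&2
fi
```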
