Stateless Docker Machine Learning Experiments

Here we want to explore how we can use docker as a thin wrapper to run machine learning experiments in a stateless way. What do I mean by stateless? Let's look at a stateful solution first: We use some base image, e.g. nvcr.io/nvidia/pytorch:23.12-py3, and start an interactive container based on this image. We then attach a shell to this container and set up our system dependencies and project dependencies (e.g. pip requirements). To run our experiment, we execute some command such as python main.py --foo --bar. The results end up somewhere in the container and need to be copied to a volume-mapped directory on the host. We further modify something here and there in the container. Now the running container carries a certain state. Sure, we can detach from and re-attach to the container and hope that the server on which it runs is never rebooted, or that we never have to switch servers and set everything up again. In most of these cases, we end up losing our state, and the actual experiment (running python main.py ...) becomes harder to reproduce.

A stateless setup: In a stateless docker setup, we want docker to act as a virtual environment for everything that is not (a) our code, (b) our data, and (c) our results. That is, we want docker to define the operating system, the system dependencies, the Python version and environment, and the Python dependencies. In docker, this can be done by defining a custom image in a Dockerfile:

# Select the base image
FROM nvcr.io/nvidia/pytorch:23.12-py3

# Select the working directory
WORKDIR /app

# Setup image: install system dependencies etc.
# RUN apt-get update && apt-get install -y ...

# Install Python requirements
COPY ./requirements.txt ./requirements.txt
RUN pip install -r requirements.txt

You can follow these steps by cloning this repository:

git clone https://github.com/braun-steven/docker-stateless-ml.git
cd docker-stateless-ml

We start by building the docker image:

docker build -t tutorial .
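
To check that the image is now available locally, we can list it by name (the tag tutorial is the one we chose above):

docker image ls tutorial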

To test if everything works, we can now start a container using this image:

docker run --gpus all -it --rm tutorial python -c "print('Hello World from docker')"

Flags:

  • --gpus all: Give the container access to all GPUs on the host machine. Note that this flag requires the nvidia-container-runtime; a quick GPU check follows below.
  • -it: Run the container with an interactive terminal attached, so we see the output directly.
  • --rm: Remove the container once it stops, since we do not care about the container state.
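
As that GPU check, we can run nvidia-smi inside the container; assuming the nvidia-container-runtime mentioned above is installed, the GPUs of the host should be listed:

docker run --gpus all --rm tutorial nvidia-smi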

To make use of our project code, read the required data, and store our results, we need to mount volumes into the container. We can do this with the --volume flag. The following runs the repository's example script src/run.py, reading data from data/ and writing results to results/:

docker run --gpus all --rm \
    --volume "$(pwd)"/src:/app/src \
    --volume "$(pwd)"/data:/data \
    --volume "$(pwd)"/results:/results \
    tutorial \
    python /app/src/run.py

Output:

[...]  # truncated

PyTorch Version: 2.0.1+cu117
CUDA Available: True
CUDA Version: 11.7
CuDNN Version: 8500
CUDA Device Name: Tesla V100-SXM3-32GB-H
Number of CUDA Devices Available: 16
Current CUDA Device Index: 0
Reading /data/samples.csv
Writing /results/sums.csv
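
Since this command is rather long, it can be convenient to wrap it in a small shell script. A minimal sketch (the file name run.sh and the argument forwarding are our own convention, not part of the repository):

#!/usr/bin/env bash
# run.sh: hypothetical wrapper around the stateless docker run invocation
set -euo pipefail

docker run --gpus all --rm \
    --volume "$(pwd)"/src:/app/src \
    --volume "$(pwd)"/data:/data \
    --volume "$(pwd)"/results:/results \
    tutorial \
    "$@"

Running ./run.sh python /app/src/run.py then reproduces the call above.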

We can now inspect the saved result on the host (i.e., outside the docker container) in results/sums.csv:

$ cat results/sums.csv
3.0
12.0
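
If we ever need to poke around interactively, e.g. for debugging, we can start a throwaway shell with the same mounts; anything we change inside the container is discarded on exit, so the setup stays stateless:

docker run --gpus all -it --rm \
    --volume "$(pwd)"/src:/app/src \
    --volume "$(pwd)"/data:/data \
    --volume "$(pwd)"/results:/results \
    tutorial \
    bash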

With this, we have successfully used docker to thinly wrap our project in an environment defined in our Dockerfile. To reproduce our experiment, we only need to ensure that a different server has the project code, the data, and a build of the docker image (docker build -t tutorial .). Bonus points if we are able to synchronize or symlink the project, data, and result directories across servers via some shared storage.
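
One simple way to earn those bonus points is rsync over SSH. A sketch, assuming a second machine reachable as other-server with the same directory layout (the host name and remote path are placeholders):

# push the code, data, and build context to the other server
rsync -avz ./Dockerfile ./requirements.txt ./src ./data other-server:docker-stateless-ml/

# after running the experiment there, pull the results back
rsync -avz other-server:docker-stateless-ml/results/ ./results/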
