Hi, I am trying to use AdaptDL to adaptively change the batch size. However, I found that training becomes much slower with AdaptDL (6.656 iters/s vs. 17.467 iters/s), even though according to the Pollux paper (OSDI '21) the system overhead should be small enough to be negligible. By timing every line of code, I found that the main overhead is in the backward pass (loss.backward()). I don't know whether it comes from the backward hook or from something wrong in my code.
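For reference, this is a minimal sketch of the kind of per-step timing I mean (the timed helper is just for illustration; the torch.cuda.synchronize() calls matter, because CUDA kernels launch asynchronously and without them the wait can show up on an unrelated line):

import time
import torch

def timed(label, fn):
    # synchronize before and after so the asynchronous CUDA kernels
    # launched by fn() are actually included in the measurement
    torch.cuda.synchronize()
    start = time.time()
    out = fn()
    torch.cuda.synchronize()
    print("{}: {:.4f} s".format(label, time.time() - start))
    return out

# inside the training loop:
#   outputs = timed("forward", lambda: net(inputs))
#   timed("backward", lambda: loss.backward())
#   timed("step", lambda: optimizer.step())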
I am using Python 3.6.13, torch 1.10.2, and CUDA 11.6 on an A100 GPU.
Here is the code to reproduce the problem on a single GPU. Although only one GPU is needed, the baseline uses PyTorch DDP for a fair comparison with AdaptDL.
The code with AdaptDL:
import os
import time
import argparse

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.optim.lr_scheduler import StepLR

import adaptdl  # Changed in step 1
import adaptdl.torch  # Changed in step 1

MAX_GLOBAL_BATCHSIZE = 1024


def main():
    # 0. set up the distributed device and process group
    rank = int(os.environ["ADAPTDL_REPLICA_RANK"])
    adaptdl.torch.init_process_group("nccl" if torch.cuda.is_available()
                                     else "gloo")  # Changed in step 1
    # 1. define network, optimizer and scheduler
    device = torch.device("cuda:" + str(rank))
    net = torchvision.models.vgg16()
    net = net.to(device)
    optimizer = torch.optim.SGD(
        net.parameters(),
        lr=0.001 * 2,
        momentum=0.9,
        weight_decay=0.0001,
        nesterov=True,
    )
    scheduler = StepLR(optimizer, step_size=1)
    net = adaptdl.torch.AdaptiveDataParallel(net, optimizer, scheduler)  # Changed in step 1
    # 2. load dataset
    transform = transforms.Compose(
        [transforms.RandomHorizontalFlip(),
         transforms.RandomGrayscale(),
         transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    trainset = torchvision.datasets.CIFAR10(root="./", train=True,
                                            download=True, transform=transform)
    train_loader = adaptdl.torch.AdaptiveDataLoader(
        trainset, drop_last=True, batch_size=args.batch_size)  # Changed in step 2
    if args.adaptive_batch_size != 0:
        train_loader.autoscale_batch_size(
            MAX_GLOBAL_BATCHSIZE,
            local_bsz_bounds=(args.batch_size // 4 + 1,
                              args.batch_size * 2))  # Changed in step 3, optional
    # 3. define loss
    criterion = nn.CrossEntropyLoss()
    # use a timestamp to measure scaling overhead
    print("training iteration start: " + time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    # 4. start to train
    net.train()
    epochs = 2
    start_time = time.time()
    for epoch in adaptdl.torch.remaining_epochs_until(epochs):  # Changed in step 4
        train_loss = correct = total = 0
        for idx, (inputs, targets) in enumerate(train_loader):
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = net(inputs)
            loss = criterion(outputs, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            total += targets.size(0)
            correct += torch.eq(outputs.argmax(dim=1), targets).sum().item()
            if rank == 0 and idx % 20 == 0:
                print(
                    " == Train Epoch: [{:3}/{}] [{}/{}] | loss: {:.3f} | acc: {:6.3f}%".format(
                        idx + 1,
                        len(train_loader),
                        epoch,
                        epochs,
                        train_loss / (idx + 1),
                        100.0 * correct / total,
                    )
                )
                print("throughput: {:.3f} iters/s".format(
                    (epoch * len(train_loader) + idx + 1) / (time.time() - start_time)))
        scheduler.step()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Launch DDP processes")
    parser.add_argument('--local_rank', type=int, help='local rank of GPU')
    parser.add_argument('-b', '--batch_size', type=int, default=64,
                        help='local batch size')
    parser.add_argument('--adaptive_batch_size', type=int, default=1, required=True,
                        help='autoscale batch size')
    args = parser.parse_args()
    main()
To run this, the command line is:
ADAPTDL_MASTER_ADDR=127.0.0.1 ADAPTDL_MASTER_PORT=22223 ADAPTDL_NUM_REPLICAS=1 ADAPTDL_REPLICA_RANK=0 ADAPTDL_CHECKPOINT_PATH=./ ADAPTDL_NUM_RESTARTS=0 python vgg16_adaptdl.py --adaptive_batch_size 0
The code without AdaptDL:
import os
import time
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim.lr_scheduler import StepLR

MAX_GLOBAL_BATCHSIZE = 1024


def main():
    # 0. set up the distributed device and process group
    rank = int(os.environ["RANK"])
    dist.init_process_group("nccl" if torch.cuda.is_available()
                            else "gloo")
    # 1. define network, optimizer and scheduler
    device = torch.device("cuda:" + str(rank))
    net = torchvision.models.vgg16()
    net = net.to(device)
    optimizer = torch.optim.SGD(
        net.parameters(),
        lr=0.001 * 2,
        momentum=0.9,
        weight_decay=0.0001,
        nesterov=True,
    )
    scheduler = StepLR(optimizer, step_size=1)
    net = DDP(net, device_ids=[rank])
    # 2. load dataset
    transform = transforms.Compose(
        [transforms.RandomHorizontalFlip(),
         transforms.RandomGrayscale(),
         transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    trainset = torchvision.datasets.CIFAR10(root="./", train=True,
                                            download=True, transform=transform)
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        trainset,
        shuffle=True,
        rank=rank,
    )
    train_loader = torch.utils.data.DataLoader(
        trainset,
        batch_size=args.batch_size,
        num_workers=1,
        pin_memory=True,
        sampler=train_sampler,
        drop_last=True,
    )
    # 3. define loss
    criterion = nn.CrossEntropyLoss()
    # use a timestamp to measure scaling overhead
    print("training iteration start: " + time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    # 4. start to train
    net.train()
    epochs = 2
    start_time = time.time()
    for epoch in range(epochs):
        train_sampler.set_epoch(epoch)  # reshuffle each epoch
        train_loss = correct = total = 0
        for idx, (inputs, targets) in enumerate(train_loader):
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = net(inputs)
            loss = criterion(outputs, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            total += targets.size(0)
            correct += torch.eq(outputs.argmax(dim=1), targets).sum().item()
            if rank == 0 and idx % 20 == 0:
                print(
                    " == Train Epoch: [{:3}/{}] [{}/{}] | loss: {:.3f} | acc: {:6.3f}%".format(
                        idx + 1,
                        len(train_loader),
                        epoch,
                        epochs,
                        train_loss / (idx + 1),
                        100.0 * correct / total,
                    )
                )
                print("throughput: {:.3f} iters/s".format(
                    (epoch * len(train_loader) + idx + 1) / (time.time() - start_time)))
        scheduler.step()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Launch DDP processes")
    parser.add_argument('--local_rank', type=int, help='local rank of GPU')
    parser.add_argument('-b', '--batch_size', type=int, default=64,
                        help='local batch size')
    parser.add_argument('--adaptive_batch_size', type=int, default=1, required=True,
                        help='autoscale batch size')
    args = parser.parse_args()
    main()
To run this, the command line is:
MASTER_ADDR=127.0.0.1 MASTER_PORT=22222 RANK=0 WORLD_SIZE=1 python vgg16.py --adaptive_batch_size 0
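In case it is useful, I can also break the backward pass down further with torch.profiler; a minimal sketch of what I have in mind, reusing net, criterion, optimizer, device, and train_loader from the scripts above (torch 1.10.2 includes this profiler):

from torch.profiler import profile, ProfilerActivity

# profile a few steps to see which ops dominate loss.backward()
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for idx, (inputs, targets) in enumerate(train_loader):
        if idx == 10:  # a handful of steps is enough for a breakdown
            break
        inputs, targets = inputs.to(device), targets.to(device)
        loss = criterion(net(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))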
I would really appreciate any help!