
aws / sagemaker-pytorch-training-toolkit


Toolkit for running PyTorch training scripts on SageMaker. Dockerfiles used for building SageMaker Pytorch Containers are at https://github.com/aws/deep-learning-containers.

License: Apache License 2.0

Shell 0.85% Python 97.87% C 1.28%
aws sagemaker pytorch docker

sagemaker-pytorch-training-toolkit's Introduction

SageMaker PyTorch Training Toolkit

SageMaker PyTorch Training Toolkit is an open-source library for using PyTorch to train models on Amazon SageMaker.

This toolkit depends on and extends the base SageMaker Training Toolkit with PyTorch-specific support.

For inference, see SageMaker PyTorch Inference Toolkit.

For the Dockerfiles used for building SageMaker PyTorch Containers, see AWS Deep Learning Containers.

For information on running PyTorch jobs on Amazon SageMaker, please refer to the SageMaker Python SDK documentation.

For notebook examples: SageMaker Notebook Examples.

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

License

SageMaker PyTorch Training Toolkit is licensed under the Apache 2.0 License. It is copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. The license is available at: http://aws.amazon.com/apache2.0/

sagemaker-pytorch-training-toolkit's People

Contributors

abhinavs95, ajaykarpur, arjkesh, bveeramani, chaibapchya, choibyungwook, chuyang-deng, danabens, ddavydenko, hyandell, icywang86rui, iquintero, jesterhazy, kalyc, knakad, laurenyu, lianyiding, lokiiiiii, metrizable, mufaddal-rohawala, mvsusp, nadiaya, roshrini, saimidu, satishpasumarthi, tusharkanekidey, vandanavk, yangaws, yl-to, yystreet


sagemaker-pytorch-training-toolkit's Issues

Running local training inside docker container (No /opt/ml/input/config/resourceconfig.json)

Hi there,

Given the following simplified setup, I can't get local training to work inside Docker. Outside of Docker it works fine, but when I run within the container it blows up with the error below.

docker-compose.yml

---
version: '2'

services:
  training:
    build:
      context: .
    stdin_open: true
    tty: true
    volumes:
      - ../:/root/code
      - $HOME/.aws:/root/.aws
      - /var/run/docker.sock:/var/run/docker.sock

I've tried this with the 520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-pytorch:1.0.0-cpu-py3 Docker image and also with other base images; the error is always the same.

I have a simple train.py script which gets mounted in /root/code. I run docker-compose up and, inside the container, run python train.py.

The output is:

root@8ce53fb18e5c:/root/code# python train.py 
Creating tmp6d4_wpt6_algo-1-nexl2_1 ... done
Attaching to tmp6d4_wpt6_algo-1-nexl2_1
algo-1-nexl2_1  | jq: error: Could not open file /opt/ml/input/config/resourceconfig.json: No such file or directory
algo-1-nexl2_1  | changehostname.c: In function ‘gethostname’:
algo-1-nexl2_1  | changehostname.c:15:21: error: expected expression before ‘;’ token
algo-1-nexl2_1  |    const char *val = ;
algo-1-nexl2_1  |                      ^
algo-1-nexl2_1  | gcc: error: changehostname.o: No such file or directory
algo-1-nexl2_1  | ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
algo-1-nexl2_1  | ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
algo-1-nexl2_1  | Reporting training FAILURE
algo-1-nexl2_1  | framework error: 
algo-1-nexl2_1  | Traceback (most recent call last):
algo-1-nexl2_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 47, in train
algo-1-nexl2_1  |     env = sagemaker_containers.training_env()
algo-1-nexl2_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/__init__.py", line 26, in training_env
algo-1-nexl2_1  |     resource_config=_env.read_resource_config(),
algo-1-nexl2_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 237, in read_resource_config
algo-1-nexl2_1  |     return _read_json(resource_config_file_dir)
algo-1-nexl2_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 193, in _read_json
algo-1-nexl2_1  |     with open(path, 'r') as f:
algo-1-nexl2_1  | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-nexl2_1  | 
algo-1-nexl2_1  | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
tmp6d4_wpt6_algo-1-nexl2_1 exited with code 2

My issue is therefore: how do I run local training in a dockerised environment? Obviously the training itself runs in Docker, but I'd also like to dockerise the environment that the pre-training script runs in, to pin down dependencies, reduce onboarding time, etc.
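For context, the file the container is looking for is the resource configuration that SageMaker normally writes before launching training. A minimal sketch of creating it by hand for a single-host run inside the container (the field values below are assumptions based on the error above, not an official recipe):

import json
import os

# Hypothetical single-host resource config; "algo-1" and "eth0" are the
# conventional defaults, adjust if your container expects different names.
resource_config = {
    "current_host": "algo-1",
    "hosts": ["algo-1"],
    "network_interface_name": "eth0",
}

os.makedirs("/opt/ml/input/config", exist_ok=True)
with open("/opt/ml/input/config/resourceconfig.json", "w") as f:
    json.dump(resource_config, f)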

"Train": executable file not found in $PATH

BUG Description
I am facing an error that gives no direction on how to resolve it when migrating to run on SageMaker.

The code runs perfectly on the local machine.

To reproduce

role = "arn:..."

    estimator = PyTorch(
        image_uri="1...ecr...amazonaws.com/...:prototype",
        git_config={"repo": "https://github.com/celsofranssa/LightningPrototype.git", "branch": "sagemaker"},
        entry_point="main.py",
        role=role,
        region="us-...",
        instance_type="local", # ml.g4dn.2xlarge
        instance_count=1,
        volume_size=225,
        hyperparameters=hparams
    )
    estimator.fit()

Expected behavior
The model is expected to start to train and log metrics and losses.

Screenshots or logs

Cloning into '/tmp/tmpycpzvkcn'...
remote: Enumerating objects: 246, done.
remote: Counting objects: 100% (246/246), done.
remote: Compressing objects: 100% (190/190), done.
remote: Total 246 (delta 40), reused 232 (delta 29), pack-reused 0
Receiving objects: 100% (246/246), 39.10 MiB | 27.69 MiB/s, done.
Resolving deltas: 100% (40/40), done.
Branch 'sagemaker' set up to track remote branch 'sagemaker' from 'origin'.
Switched to a new branch 'sagemaker'
[2023-10-12 19:22:15,073][sagemaker][INFO] - Creating training-job with name: xmtc-2023-10-13-02-22-09-781
[2023-10-12 19:22:15,116][sagemaker.local.image][INFO] - 'Docker Compose' found using Docker CLI.
[2023-10-12 19:22:15,117][sagemaker.local.local_session][INFO] - Starting training job
[2023-10-12 19:22:15,118][sagemaker.local.image][INFO] - Using the long-lived AWS credentials found in session
[2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-55row:
    command: train
    container_name: 1l7x1nzly6-algo-1-55row
    environment:
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    - '[Masked]'
    image: 179395270822.dkr.ecr.us-east-2.amazonaws.com/xmtc:prototype
    networks:
      sagemaker-local:
        aliases:
        - algo-1-55row
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmpsvd2b_wm/algo-1-55row/output/data:/opt/ml/output/data
    - /tmp/tmpsvd2b_wm/algo-1-55row/input:/opt/ml/input
    - /tmp/tmpsvd2b_wm/algo-1-55row/output:/opt/ml/output
    - /tmp/tmpsvd2b_wm/model:/opt/ml/model
version: '2.3'

[2023-10-12 19:22:15,121][sagemaker.local.image][INFO] - docker command: docker compose -f /tmp/tmpsvd2b_wm/docker-compose.yaml up --build --abort-on-container-exit
time="2023-10-12T19:22:15-07:00" level=warning msg="a network with name sagemaker-local exists but was not created for project \"tmpsvd2b_wm\".\nSet `external: true` to use an existing network"
 Container 1l7x1nzly6-algo-1-55row  Creating
 Container 1l7x1nzly6-algo-1-55row  Created
Attaching to 1l7x1nzly6-algo-1-55row
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "train": executable file not found in $PATH: unknown
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 296, in train
    _stream_output(process)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 984, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_on_sagemaker.py", line 28, in run_on_sagemaker
    estimator.fit()
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 1311, in fit
    self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/estimator.py", line 2374, in start_new
    estimator.sagemaker_session.train(**train_args)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 941, in train
    self._intercept_create_request(train_request, submit, self.train.__name__)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 5618, in _intercept_create_request
    return create(request)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/session.py", line 939, in submit
    self.sagemaker_client.create_training_job(**request)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/local_session.py", line 203, in create_training_job
    training_job.start(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/entities.py", line 243, in start
    self.model_artifacts = self.container.train(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/sagemaker/local/image.py", line 301, in train
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker', 'compose', '-f', '/tmp/tmpsvd2b_wm/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: sagemaker 2.192.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Pytorch 2.0.1
  • Python version: Python 3.10
  • Docker: 24.0.6
  • Custom Docker image (Y/N): Yes, on ECR.

fastai libraries have different versions between cpu and gpu images

Is there a specific reason why the GPU-based Docker image is pinned to fastai 1.0.39 while the CPU image has the latest version (currently 1.0.49)? This makes exporting models challenging when using the GPU image for training and the CPU image for inference. It would be good to have the latest version for the GPU image as well.

fastai latest version(1.0.51) pin

Hi Team,

Right now fastai is at version 1.0.51, and all of the latest fastai docs are based on the latest release, but the PyTorch container still pins 1.0.39. Is there a timeline for pinning a newer fastai version in the PyTorch container? It would be helpful if you could upgrade it to the latest version.

Getting cudnn error while training on ml.p2.xlarge instance

I am training my model on ml.p2.xlarge using the conda_amazonei_tensorflow2_p36 notebook. I updated TensorFlow to 2.3.0 and Keras to 2.4.3 and installed the keras-unet package (through which I am applying U-Net), but while trying to train I am getting this error:

WARNING:tensorflow:From <ipython-input-16-a832c74cfe86>:7: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.
Epoch 1/3
---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
<ipython-input-16-a832c74cfe86> in <module>
      5 
      6     validation_data=(x_val, y_val),
----> 7     callbacks=[callback_checkpoint]
      8 )

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py in new_func(*args, **kwargs)
    322               'in a future version' if date is None else ('after %s' % date),
    323               instructions)
--> 324       return func(*args, **kwargs)
    325     return tf_decorator.make_decorator(
    326         func, new_func, 'deprecated',

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1477         use_multiprocessing=use_multiprocessing,
   1478         shuffle=shuffle,
-> 1479         initial_epoch=initial_epoch)
   1480 
   1481   @deprecation.deprecated(

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
     64   def _method_wrapper(self, *args, **kwargs):
     65     if not self._in_multi_worker_mode():  # pylint: disable=protected-access
---> 66       return method(self, *args, **kwargs)
     67 
     68     # Running inside `run_distribute_coordinator` already.

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
    846                 batch_size=batch_size):
    847               callbacks.on_train_batch_begin(step)
--> 848               tmp_logs = train_function(iterator)
    849               # Catch OutOfRangeError for Datasets of unknown size.
    850               # This blocks until the batch has finished executing.

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
    578         xla_context.Exit()
    579     else:
--> 580       result = self._call(*args, **kwds)
    581 
    582     if tracing_count == self._get_tracing_count():

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
    642         # Lifting succeeded, so variables are initialized and we can run the
    643         # stateless function.
--> 644         return self._stateless_fn(*args, **kwds)
    645     else:
    646       canon_args, canon_kwds = \

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/eager/function.py in __call__(self, *args, **kwargs)
   2418     with self._lock:
   2419       graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 2420     return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
   2421 
   2422   @property

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/eager/function.py in _filtered_call(self, args, kwargs)
   1663          if isinstance(t, (ops.Tensor,
   1664                            resource_variable_ops.BaseResourceVariable))),
-> 1665         self.captured_inputs)
   1666 
   1667   def _call_flat(self, args, captured_inputs, cancellation_manager=None):

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1744       # No tape is watching; skip to running the function.
   1745       return self._build_call_outputs(self._inference_function.call(
-> 1746           ctx, args, cancellation_manager=cancellation_manager))
   1747     forward_backward = self._select_forward_and_backward_functions(
   1748         args,

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/eager/function.py in call(self, ctx, args, cancellation_manager)
    596               inputs=args,
    597               attrs=attrs,
--> 598               ctx=ctx)
    599         else:
    600           outputs = execute.execute_with_cancellation(

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node model/conv2d/Conv2D (defined at <ipython-input-16-a832c74cfe86>:7) ]] [Op:__inference_train_function_9610]

Function call stack:
train_function

Not able to find any solution. Please guide.

RuntimeError in training a model of resnet152 using transfer learning: "models cannot register a hook on a tensor that doesn't require gradient"

Hi,

I am struggling with a strange error in my transfer learning model. The PyTorch estimator raises RuntimeError: cannot register a hook on a tensor that doesn't require gradient on both GPU and CPU SageMaker instances. I cannot find any solution for this error.

Traceback (most recent call last):
  File "run_aws.py", line 159, in 
    _train(parser.parse_args())
  File "run_aws.py", line 116, in _train
    log_interval=log_interval
  File "/opt/ml/code/trainer.py", line 25, in fit
    train_loss, metrics = train_epoch(train_loader, model, loss_fn, optimizer, cuda, log_interval, metrics)
  File "/opt/ml/code/trainer.py", line 61, in train_epoch
    outputs = model(*data)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 531, in _call_
    hook.register_hook(self)
  File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 194, in register_hook
    self.register_module(module)
  File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 227, in register_module
    self._backward_apply(module)
  File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 186, in _backward_apply
    param.register_hook(self.backward_hook(pname))
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 227, in register_hook
    raise RuntimeError("cannot register a hook on a tensor that "
RuntimeError: cannot register a hook on a tensor that doesn't require gradient

My notebook code is below:

import sagemaker
import os

from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()

# Define IAM role
role = sagemaker.get_execution_role()

bucket = 'shoe-dataset'
prefix = 'sagemaker/singleColor'

data_location = 's3://{}/{}'.format(bucket, prefix)
!pygmentize run_aws.py

estimator = PyTorch(entry_point='run_aws.py',
                    source_dir='.',
                    role=role,
                    framework_version='1.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge')

estimator.fit({
    'train': os.path.join(data_location, 'train'), 
    'test': os.path.join(data_location, 'test')
})

run_aws.py code is here:
https://github.com/FurkanArslan/deep-shoe-implementation/blob/master/run_aws.py

Here is the actual code that gives an error when forwarding the resnet152 model (line 61):
https://github.com/FurkanArslan/deep-shoe-implementation/blob/master/trainer.py
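For what it's worth, the traceback shows the backward hook being registered by smdebug (SageMaker Debugger) rather than by the training code itself. A hedged sketch of one possible workaround, reusing role from the snippet above and assuming an SDK version that exposes the parameter (an assumption on my part, not a confirmed fix): disable the default debugger hook on the estimator so no hooks are attached to frozen parameters.

estimator = PyTorch(entry_point='run_aws.py',
                    source_dir='.',
                    role=role,
                    framework_version='1.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    debugger_hook_config=False)  # skip smdebug's automatic hook registration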

Dockerfile installation of torch and torchvision from s3, replacing original versions.

What did you find confusing?
In the Dockerfile.gpu, around line 138, there is a point where torch and torchvision are uninstalled, to be replaced with re-installed, specialized versions of both packages.

Describe how documentation can be improved
Why are these replaced? During image building, this step is time-consuming. What would happen if I removed these versions and kept the originals?

Worker initialization

It seems like there is a bug in the initialization logic.
Gunicorn processes are initialized not at container start but at the time the first request arrives.
The global app variable here is not shared between gunicorn processes, so each process is initialized only when a request reaches it.

This causes random behavior: if a request reaches a worker that was already initialized, it is processed quickly; if it reaches a worker that is not yet initialized, the response is delayed for quite some time (>30 sec in my case).
This can even cause /ping requests to time out and make it impossible to deploy a container to AWS.
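To illustrate the pattern being asked for, here is a minimal, generic WSGI sketch (my own illustration, not the toolkit's code): do the expensive initialization at module import time, once per gunicorn worker, rather than lazily inside the request handler.

def _load_model():
    # Stand-in for the real model loading / transformer initialization step.
    return object()

# Runs when gunicorn imports the module, i.e. when each worker process starts,
# so the first request no longer pays the initialization cost.
MODEL = _load_model()


def app(environ, start_response):
    # MODEL is already loaded; /ping and /invocations respond immediately.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]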

requirements.txt not working

My training job fails with the following error:

ModuleNotFoundError: No module named 'ignite'

I wrote pytorch-ignite==0.3.0 in src/requirements.txt in my project.

The directory structure of my project is:

.
├── exec_train_job.py
└── src
    ├── requirements.txt
    └── train.py

I executed the job via exec_train_job.py using aws/sagemaker-python-sdk's PyTorch Estimator with the following arguments:

framework_version=1.4.0
source_dir="src"
entry_point="train.py"

I executed it from the current directory (.)

$ python exec_train_job.py

The cause of the failure may be that requirements.txt is not recognized, even though it is in the same directory as the entry point (src, which contains train.py).
The CloudWatch log shows:

sagemaker-containers INFO     Installing module with the following command:
/opt/conda/bin/python -m pip install . 

I looked at aws/sagemaker-containers/src/sagemaker_containers/_modules.py and know that requirements.txt is used when it exists in /opt/ml/code (= source_dir).

I also checked that s3://{default bucket}/{job name}/source/sourcedir.tar.gz includes the requirements.txt, indicating that the requirements.txt has been transferred into /opt/ml/code.

Environment variables set for NCCL and Distributed training are not passed onto the sagemaker-training entrypoint

Describe the bug
In _set_nccl_environment(training_environment.network_interface_name) and _set_distributed_environment(training_environment.hosts), some environment variables are set in os.environ for NCCL and distributed training.

However, os.environ is not included when the entrypoint is called with env_vars=training_environment.to_env_vars(). Only training_environment.to_env_vars() is set as the env_vars for the entrypoint, essentially discarding the os.environ vars set in the above two calls for NCCL and distributed training.

Expected behavior
The env vars passed via env_vars=training_environment.to_env_vars() should include the environment variables set for NCCL and distributed training.
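A minimal sketch of the behavior this asks for (an illustration, not the toolkit's actual code): merge the process environment, where the NCCL and distributed-training variables were written, into the variables handed to the entrypoint, letting the toolkit's computed values win on conflicts.

import os


def merged_env_vars(training_environment):
    # `training_environment` is assumed to expose to_env_vars() as in the lines
    # quoted above; everything already set in os.environ is preserved.
    env_vars = dict(os.environ)
    env_vars.update(training_environment.to_env_vars())
    return env_vars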

Training on GPU with a custom container based on official pytorch-training container

TL;DR

We train on a custom container based on the official pytorch-training:1.3.1-gpu-py3 container. The training script runs just fine, but it doesn't utilize the GPU. The GPU Utilization according to CloudWatch sits at 0%, even though the GPU Memory is being used. My question: is there anything special we need to do to make sure the GPU is used?

And, yes, we use .to('cuda:0') liberally throughout the code, and verify with logging (see details).

Thanks!

Details

I am using a custom container to train a pytorch model. The custom Dockerfile uses the following base container:

763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training:1.3.1-gpu-py3

Here is the Dockerfile (cutting out some custom package installs):

FROM 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training:1.3.1-gpu-py3
LABEL maintainer="Aware Behavioral Intelligence"
WORKDIR /

[snip]

COPY model /opt/ml/code
ENV PATH="/opt/ml/code:${PATH}"
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM train
WORKDIR /opt/ml/code

As you can see from SAGEMAKER_PROGRAM, the script that is run is named train.

We submit our jobs using a boto3.client(service_name='sagemaker') client with the following (approximate) request:

response = client.create_training_job(
    TrainingJobName=SM_JOB_NAME,
    HyperParameters=HYPERPARAMETERS,
    AlgorithmSpecification={
        'TrainingImage': 'xxxxxxxxxxxx.dkr.ecr.us-east-2.amazonaws.com/sagemaker-cv-trainer',
        'TrainingInputMode': 'File',
        'MetricDefinitions': []
    },
    RoleArn=SM_ARN,
    InputDataConfig=[
        SM_TRAIN_INPUT,
        SM_DEV_INPUT
    ],
    OutputDataConfig=SM_OUTPUT,
    ResourceConfig={        
        'InstanceType': 'ml.p3.2xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 10
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 1*24*60*60
    },
    Tags=[
        {
            'Key': 'model',
            'Value': 'cv'
        }
    ]
)

Logging shows that pytorch detects the GPU:

(screenshot omitted)
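A minimal check along these lines (not the author's exact logging code) would confirm what the screenshot showed: the device is visible and tensors can actually be placed on it.

import torch

print(torch.cuda.is_available())        # expected: True on a GPU instance
print(torch.cuda.get_device_name(0))    # e.g. a Tesla V100 on p3 instances
x = torch.randn(8, 8, device="cuda:0")
print(x.device)                         # cuda:0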

For each training batch / prediction batch, I make sure the data are on the GPU:

(screenshot omitted)

And data do appear to get moved to the GPU (but the GPU is otherwise not used):

(screenshot omitted)

Any insights into how we may be using Sagemaker incorrectly? We have been training and deploying models on Sagemaker using these same patterns for the last year, and they have worked fine, but now we are in need of GPU resources for computer vision tasks.

Thank you.

Using CUDA version 10.1

I'm having some compatibility issues when using pytorch-1.1.0 and cuda 9. What should I include in the Dockerfile in order to use 10.1?

Thanks.

Error importing torchaudio

I'm trying to install torchaudio inside the PyTorch container and run into this error. Online forums indicate that having multiple torch versions or CUDA issues can lead to it. I tried installing a version that is compatible with the existing torch version (1.6.0) in the container, but it failed with the same error.

Traceback (most recent call last):
  File "train_model.py", line 32, in <module>
    import torchaudio
  File "/opt/conda/lib/python3.6/site-packages/torchaudio/__init__.py", line 1, in <module>
    from . import extension
  File "/opt/conda/lib/python3.6/site-packages/torchaudio/extension/__init__.py", line 5, in <module>
    _init_extension()
  File "/opt/conda/lib/python3.6/site-packages/torchaudio/extension/extension.py", line 12, in _init_extension
    _init_script_module(ext)
  File "/opt/conda/lib/python3.6/site-packages/torchaudio/extension/extension.py", line 19, in _init_script_module
    torch.classes.load_library(path)
  File "/opt/conda/lib/python3.6/site-packages/torch/_classes.py", line 46, in load_library
    torch.ops.load_library(path)
  File "/opt/conda/lib/python3.6/site-packages/torch/_ops.py", line 105, in load_library
    ctypes.CDLL(path)
  File "/opt/conda/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /opt/conda/lib/python3.6/site-packages/torchaudio/_torchaudio.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSs
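One quick diagnostic (a sketch of my own, not a fix): print the installed versions without importing torchaudio, since the extension must be built against the exact torch release it sits next to (e.g. torchaudio 0.6.x pairs with torch 1.6.x); a mismatch typically produces exactly this undefined-symbol error.

import pkg_resources

# Compare the installed torch and torchaudio versions without triggering the
# failing import shown above.
for pkg in ("torch", "torchaudio"):
    print(pkg, pkg_resources.get_distribution(pkg).version)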

model_fn is not recognized. Sagemaker Studio template for model building, training, and deployment

Hello everyone, I'm very new to SageMaker and I'm facing a strange issue that I can't solve.

My goal : I have created a CNN that I would like to train, build and deploy in a MLOPS pipeline with sagemaker.

First of all, I created a notebook instance in SageMaker in which I created a wasteClassification.ipynb and a train.py file.
The train.py file contains my neural network definition, some functions to train and save it, and several overridden functions: model_fn, predict_fn, input_fn. In my wasteClassification.ipynb I was able to create a PyTorch estimator, train the model, deploy the endpoint and make predictions using the invoke_endpoint function without any issues.

After that, I decided to create a pipeline to automate training, building and deployment using the new SageMaker tooling for that.
I created a SageMaker Studio project based on the MLOps template for model building, training, and deployment. This template provides two Git repos: modelbuild and modeldeploy. I simply modified the modelbuild repo, in which I put my train.py script in the folder "/pipelines/abalone/", and I modified the file "pipelines/abalone/pipeline.py", in which I created a PyTorch estimator linked to my train.py script.
When the pipeline is launched, I can see in the training job logs that my model trains without any issue and the final endpoint is created. But when I try to invoke the endpoint (invoke_endpoint), I get an error: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "
Please provide a model_fn implementation."

This is strange because I did provide a model_fn implementation in my train.py file...

Do you have any idea how to solve this issue?
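One thing that may be worth checking (an assumption on my part, not the template's documented fix): the PyTorch inference container only picks up handler functions that ship with the model artifact, typically under a code/ directory inside model.tar.gz. A hedged repackaging sketch, where "model.pth" and "train.py" are placeholders for your actual artifacts:

import tarfile

# Repack the model artifact so the inference container can find the handlers
# (model_fn, input_fn, predict_fn, output_fn) under code/.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pth", arcname="model.pth")
    tar.add("train.py", arcname="code/inference.py")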

Need help writing 'serve' exec file for custom pytorch container

Hi,

I created a custom container for SageMaker PyTorch and was able to create a training job. But while deploying the model, I am getting a 'ping health check' error and didn't find anything in the logs.

I included model_fn(), input_fn(), predict_fn(), output_fn() in the training script. But I don't understand how to write the 'serve' exec file. I think this is where I am going wrong. I referred to https://github.com/aws/sagemaker-pytorch-container/blob/master/src/sagemaker_pytorch_container/serving.py to write the serve file, which is as follows.
I am getting a "no module named 'sagemaker_containers'" error.
It would be great if you could give some suggestions on how to write the serve file.

#!/usr/bin/env python3

from __future__ import absolute_import

import logging
from sagemaker_containers.beta.framework import (content_types, encoders, env, modules, transformer, worker)

import torch

logger = logging.getLogger(__name__)

def default_model_fn(model_dir):
    """Loads a model. For PyTorch, a default function to load a model cannot be provided.
    Users should provide customized model_fn() in script.
    Args:
        model_dir: a directory where model is saved.
    Returns: A PyTorch model.
    """
    return transformer.default_model_fn(model_dir)


def default_input_fn(input_data, content_type):
    """A default input_fn that can handle JSON, CSV and NPZ formats.
    Args:
        input_data: the request payload serialized in the content_type format
        content_type: the request content_type
    Returns: input_data deserialized into torch.FloatTensor or torch.cuda.FloatTensor depending if cuda is available.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    np_array = encoders.decode(input_data, content_type)
    tensor = torch.FloatTensor(
        np_array) if content_type in content_types.UTF8_TYPES else torch.from_numpy(np_array)
    return tensor.to(device)


def default_predict_fn(data, model):
    """A default predict_fn for PyTorch. Calls a model on data deserialized in input_fn.
    Runs prediction on GPU if cuda is available.
    Args:
        data: input data (torch.Tensor) for prediction deserialized by input_fn
        model: PyTorch model loaded in memory by model_fn
    Returns: a prediction
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    input_data = data.to(device)
    model.eval()
    with torch.no_grad():
        output = model(input_data)
    return output


def default_output_fn(prediction, accept):
    """A default output_fn for PyTorch. Serializes predictions from predict_fn to JSON, CSV or NPZ 
format.
    Args:
        prediction: a prediction result from predict_fn
        accept: type which the output data needs to be serialized
    Returns: output data serialized
    """
    if type(prediction) == torch.Tensor:
        prediction = prediction.detach().cpu().numpy()

    return worker.Response(encoders.encode(prediction, accept), accept)


def _user_module_transformer(user_module):
    model_fn = getattr(user_module, 'model_fn', default_model_fn)
    input_fn = getattr(user_module, 'input_fn', default_input_fn)
    predict_fn = getattr(user_module, 'predict_fn', default_predict_fn)
    output_fn = getattr(user_module, 'output_fn', default_output_fn)

    return transformer.Transformer(model_fn=model_fn, input_fn=input_fn, predict_fn=predict_fn,
                               output_fn=output_fn)


app = None


def main(environ, start_response):
    global app
    if app is None:
        serving_env = env.ServingEnv()
        user_module = modules.import_module(serving_env.module_dir, serving_env.module_name)
        user_module_transformer = _user_module_transformer(user_module)
        user_module_transformer.initialize()
        app = worker.Worker(transform_fn=user_module_transformer.transform,
                        module_name=serving_env.module_name)

    return app(environ, start_response)

Thanks,
Harathi

"bash: cannot set terminal process group (-1): Inappropriate ioctl for device" printed at the start of sagemaker jobs

The entrypoint script for the containers is executed with monitor mode enabled (using the -m flag), e.g. here: https://github.com/aws/sagemaker-pytorch-container/blob/97e611b4cb2df13d966d508e56d1c990439b2163/docker/1.3.1/py3/Dockerfile.gpu#L166

This prints the following message at the start of any sagemaker job that uses the container:

bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

Removing the -m flag gets rid of this message. Is there a particular reason for using this flag? If not, we should remove it from all Dockerfiles.

FastAI v1.0.59 causes failed training job

Describe the bug
The fastprogress progress bar used by fastai doesn't work correctly with a PyTorch training job.
I got an error TypeError: __init__() got an unexpected keyword argument 'auto_update' when launching a training job. The attached screenshot is what led me to this error. The error is described here:
https://forums.fast.ai/t/fastprogress-auto-update/58830

To reproduce
Launch a training job using fastai version 1.0.59 (the current default).

Expected behavior
A successful training job.

Screenshots or logs
(screenshot omitted)

System information
A description of your system. Please provide:

  • Toolkit version: n/a
  • Framework version: pytorch 1.4.0
  • Python version: 3
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

Additional context
This problem should simply require a version increment of fastai to resolve; they've already released the bugfix in 1.0.60.

PyTorch: increasing --shm-size to allow multiprocessing data loaders

As explained here, the default shared memory for Docker is quite low. --shm-size seems to be the solution found online (preferable to --ipc=host), but in any case these are args passed to docker run and cannot be set at build time (?).

What options do we have through SageMaker to configure this? Running training with num_workers=0 makes training way slower... For that matter, is there any way to set args we want to pass to docker run?

Thanks.
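One workaround that sometimes helps (my suggestion, not an official SageMaker setting): tell PyTorch's DataLoader workers to exchange tensors through the filesystem instead of shared memory, which sidesteps a small /dev/shm at some I/O cost.

import torch.multiprocessing as mp

# Avoid /dev/shm for inter-process tensor transfer in DataLoader workers.
mp.set_sharing_strategy("file_system")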

PyTorch 0.4.1?

SageMaker notebooks use PyTorch 0.4.1 in their Anaconda environment, but when I try to move into training and specify framework_version='0.4.1' in my PyTorch estimator, there does not appear to be a default image hosted on AWS (using 0.4.0 works, however). As there were a number of bug fixes in 0.4.1, are there plans to update the container to 0.4.1 any time soon?

Custom serving code with framework_version beyond 1.1.0

I have a serve.py script that contains the functions model_fn, input_fn, predict_fn, output_fn. I've been using PyTorchModel with framework_version='1.1.0' successfully like this:

model = PyTorchModel(name='model-name',
                     model_data=model_uri,
                     role=role,
                     framework_version='1.1.0',
                     entry_point='serve.py',
                     source_dir=os.path.join(os.getcwd(),'src'),
                     sagemaker_session=sess)

predictor = model.deploy(endpoint_name='endpoint-name',
                         instance_type='ml.p2.xlarge',
                         initial_instance_count=1)

However, I tried upgrading framework_version to either 1.2.0 or 1.3.1. In either case, in the endpoint logs, it appears that it doesn't even enter my serve.py script at all, let alone error out.

After exploring it further I've read that with 1.2.0 and beyond the training/serving containers were split, and inference now uses MXNet Model Server. Does that imply that it's no longer possible to use custom serving code with newer framework versions?

unable to build final dockerfile.cpu

Command used to build Dockerfile: docker build -t preprod-pytorch:1.0.0-cpu-py3 -f docker/1.0.0/final/Dockerfile.cpu --build-arg py_version=3 .

Sending build context to Docker daemon 147.1MB
Step 1/11 : ARG py_version
Step 2/11 : FROM pytorch-base:1.1.0-cpu-py$py_version
---> 721d122daba2
Step 3/11 : LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
---> Using cache
---> 7a87d543152b
Step 4/11 : COPY lib/changehostname.c /
COPY failed: stat /var/lib/docker/tmp/docker-builder479190093/lib/changehostname.c: no such file or directory

TrainingJobAnalytics

Is there any way we can output a CSV file for the new TrainingJobAnalytics features with PyTorch and the Python SDK in our code?
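TrainingJobAnalytics in the Python SDK exposes its metrics as a pandas DataFrame, so writing a CSV should be a one-liner; a minimal sketch, with "my-training-job" as a placeholder for the real job name:

from sagemaker import TrainingJobAnalytics

metrics = TrainingJobAnalytics(training_job_name="my-training-job")
metrics.dataframe().to_csv("metrics.csv", index=False)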

Sagemaker PyTorch Not Recognizing Model_FN

If this is not the right place to ask this I apologize, but I wasn't sure where else to post!

Preamble

I have developed a custom PyTorch neural network model for predicting whether or not two colors are complementary. In order to make it more publicly accessible I have begun looking into hosting it on AWS.

At first I hoped that I would be able to host something simple by loading the torch model's .PT file in S3 and making endpoints in AWS Lambda which load that model and make a prediction, but I quickly found out this would not be feasible. So then I looked into AWS SageMaker. After a bit of toying I was able to load my torch model and make predictions in SageMaker notebooks; however, I could not deploy the notebook to allow endpoints to access it no matter what I tried.

Eventually I found a useful tutorial on developing your own custom ML model in Docker based on SageMaker. I have primarily followed along with the tutorial, but changed it to serve my own purposes (i.e. using the PyTorch Docker base container, using my own model configuration, etc.).

The Problem

My model trains properly and I can run the docker image to serve the model up. When I ping the docker image I get a healthy response back and this is the response I get server-side:

2020-01-30 02:58:09,711 [INFO ] pool-1-thread-7 ACCESS_LOG - /IP:STUFF "GET /ping HTTP/1.1" 200

Yet when I try and make a prediction, using this command

command:
      
    ./predict.sh localhost:8080 example.csv text/csv

./predict.sh:
      
    #!/bin/bash

    url=$1
    payload=$2
    content=${3:-text/csv}

    curl --data-binary @${payload} -H "Content-Type: ${content}" -v ${url}/invocations

The curl call hangs, and server side I keep running into this error report:

2020-01-30 03:06:41,652 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 2
2020-01-30 03:06:41,652 [WARN ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend worker thread exception.
java.lang.IllegalArgumentException: reasonPhrase contains one of the following prohibited characters: \r\n:
Please provide a model_fn implementation.
See documentation for model_fn at https://github.com/aws/sagemaker-python-sdk

	at io.netty.handler.codec.http.HttpResponseStatus.<init>(HttpResponseStatus.java:555)
	at io.netty.handler.codec.http.HttpResponseStatus.<init>(HttpResponseStatus.java:537)
	at io.netty.handler.codec.http.HttpResponseStatus.valueOf(HttpResponseStatus.java:465)
	at com.amazonaws.ml.mms.wlm.Job.response(Job.java:85)
	at com.amazonaws.ml.mms.wlm.BatchAggregator.sendResponse(BatchAggregator.java:85)
	at com.amazonaws.ml.mms.wlm.WorkerThread.run(WorkerThread.java:146)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2020-01-30 03:06:41,654 [ERROR] W-9001-model com.amazonaws.ml.mms.wlm.BatchAggregator - Unexpected job: PRIVATE_JOB_ID
2020-01-30 03:06:41,656 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Retry worker: 9001 in 1 seconds.
2020-01-30 03:06:41,667 [INFO ] epollEventLoopGroup-4-5 com.amazonaws.ml.mms.wlm.WorkerThread - 9001 Worker disconnected. WORKER_STOPPED
2020-01-30 03:06:42,761 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.mms.sock.9001
2020-01-30 03:06:42,762 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID]126
2020-01-30 03:06:42,762 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MXNet worker started.
2020-01-30 03:06:42,762 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.6.6
2020-01-30 03:06:42,762 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9001
2020-01-30 03:06:42,766 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9001.
2020-01-30 03:06:43,060 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 293

What I Have Already Tried

I have read up on the documentation that is listed in that error print out and it seems as if the issue would be that I do not have a model_fn implementation. I read a similar issue someone had here and tried to use their solution to solve the problem (that is I implemented model_fn, input_fn, output_fn, and predict_fn in my train file) but it still does not work.

I have tried looking into similar issues such as this pytorch-container issue but was not able to use it to fix my problem. It is possible I have the _fn series of functions somewhere wrong, but based on what I have read in other documentation I would be led to believe it is correct (it currently lives in my train file).

For Reference

To see what my train file looks like in opt/program I have attached the code of the file minus the actual training part. As you can see most of it is default code so I am not sure what is going wrong.

def model_fn(model_dir):
    model = NeuralNet.NeuralNet()
    with open(os.path.join(model_dir, 'model.pt'), 'rb') as f:
        # model.load_state_dict(torch.load(f))
        model = torch.load(f)
    return model


def predict_fn(input_data, model):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()
    with torch.no_grad():
        return model(input_data.to(device))


def input_fn(request_body, request_content_type):
    if request_content_type == 'application/python-pickle':
        return torch.load(BytesIO(request_body))
    else:
        if request_content_type == 'text/csv':
            return pd.read_csv(BytesIO(request_body), header=None, sep=",")


def output_fn(prediction, content_type):
    return json.dumps(prediction)


def train():

I am more than happy to add any additional information you may need, and I really appreciate you taking the time to read this!

Pytorch 1.5 build issue

Step 29/43 : COPY changehostname.c /
COPY failed: stat /var/lib/docker/tmp/docker-builder982420481/changehostname.c: no such file or directory

Where's changehostname.c? I didn't see it in the root folder.

Issue with torchvision::nms using custom Pytorch and TorchVision

I've been trying to run some training jobs using the pytorch-training:1.4.0-cpu-py3 image and have been running into a RuntimeError: No such operator torchvision::nms error. From what I can tell, it works if you uninstall the custom torch and torchvision packages and install the ones from PyPI. Comparing the two, it looks like torch is not loading the torchvision library.

https://github.com/aws/sagemaker-pytorch-container/blob/e87ca0714862ccdba4b380944db3d828cb8c7871/docker/1.4.0/py3/Dockerfile.cpu#L101

$ docker run --rm -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-cpu-py3 bash
root@e1e9293e2bd8:/# python
Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchvision
>>> torch.ops.torchvision.nms
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.6/site-packages/torch/_ops.py", line 61, in __getattr__
    op = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator torchvision::nms
>>> torch.ops.loaded_libraries
set()

After pip uninstall and install

root@e1e9293e2bd8:/# python
Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchvision
>>> torch.ops.torchvision.nms
<built-in method nms of PyCapsule object at 0x7ff2ca8bd9f0>
>>> torch.ops.loaded_libraries
{'/opt/conda/lib/python3.6/site-packages/torchvision/_C.so'}

I've been trying to manually build that image locally and having some issues that are related to #141 but that is another issue I'm working through.

unable to build

I am unable to build an image with the Dockerfile in this repo (specifically, I am using the CPU version).

running the command

docker build -t sagemaker-inference-images .

ends at:

Step 20/29 : COPY lib/changehostname.c /
COPY failed: stat /var/lib/docker/tmp/docker-builder354479123/lib/changehostname.c: no such file or directory

PipelineModel Docker Image Bind-to-port flag

Hi,

I'm trying to create a PipelineModel which chains a SKLearn model into a PyTorch model, and I'm encountering the following error:

ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Your Ecr Image 763104351884.dkr.ecr.eu-west-1.amazonaws.com/pytorch-inference:1.2.0-cpu-py3 does not contain required com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true Docker label(s).

My code roughly looks like the following:

preprocesor_estimator = SKLearn.attach('deepevent-feature-processor-model-05g').create_model()

de_estimator= sagemaker.pytorch.PyTorchModel(model_data='s3://<path-to-my-model-data>',
                              role=role,
                              entry_point='de_model.py',
                              image='763104351884.dkr.ecr.eu-west-1.amazonaws.com/pytorch-inference:1.2.0-cpu-py3',
                              framework_version='1.2.0',
                              sagemaker_session=sagemaker_session)

model_name = 'Model-inference-pipeline-' + timestamp_prefix
endpoint_name = 'Model-inference-pipeline-ep-' + timestamp_prefix

sm_model = PipelineModel(
    name=model_name, 
    role=role, 
    models=[
        preprocesor_estimator, 
        de_estimator])

sm_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)

Now, this is pretty confusing because, as best as I can tell, the required bind-to-port flag is set (see here). Any guidance would be appreciated, thank you.

[bug] Torch does not find GPU on pytorch-training:1.10.0-gpu-py38 container

Describe the bug
Torch does not find CUDA on a GPU instance with the official SageMaker training container.

To reproduce

sudo docker pull 763104351884.dkr.ecr.eu-west-2.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker
sudo docker run -it --entrypoint /bin/bash 709fa9395949
python -c "import torch; print(torch.cuda.is_available())"   # -> False

Expected behavior
python -c "import torch; print(torch.cuda.is_available()) -> True

System information
This command was run on SageMaker Notebook instance ml.p3.2xlarge (docker pull from console) and EC2 instance p3.2xlarge

Python versions different in this repo's build vs SageMaker official Pytorch image

When using a sagemaker.pytorch.PyTorch estimator with the default image, the Python version is 3.5.2. I also tried using the specific image 520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-pytorch:1.0.0.dev-gpu-py3 which also uses 3.5.2.

The Dockerfiles in this repo use 3.6. Is the official SageMaker image out of date? Maybe this issue belongs on the Python SDK repo?

ModuleNotFoundError: No module named 'sagemaker_pytorch_container.serving'

I built both the pytorch-1.2.0-gpu-py3 and pytorch-1.1.0-gpu-py3 images and tried to deploy a model in local mode using PyTorchModel. I always get the error:

ModuleNotFoundError: No module named 'sagemaker_pytorch_container.serving'

Which makes sense, since, looking both inside the Docker image and inside src/sagemaker_pytorch_container, I don't see any serving.py script, only training.py.

Did I miss something? How can I build a PyTorch image able to do both training and serving from this repo?

cannot recognize num_gpus for more than 1 gpu per instance

I tried to run ./test/resources/horovod/simple on 2 ml.p3.8xlarge instances. It returns:
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 06:10:38,173 sagemaker-containers INFO Reporting training SUCCESS
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 06:10:39,540 sagemaker-containers INFO Reporting training SUCCESS
It only recognizes 1 GPU per instance.

Is there anything I did wrong?

[FATAL tini (7)] exec train failed: No such file or directory

BUG Description
I'm trying to automate and scale a large collection of experiments using AWS SageMaker via the Python SDK. However, I am facing an error that gives no direction on how to resolve it.

To reproduce

role = "arn:..."

    estimator = PyTorch(
        image_uri="1...ecr...amazonaws.com/...:prototype",
        entry_point="main.py",
        role=role,
        region="us-...",
        instance_type="ml...xlarge",
        instance_count=1,
        volume_size=225,
        hyperparameters=hparams
    )
    estimator.fit()

Expected behavior
The model is expected to start to train and log metrics and losses.

Screenshots or logs

[2023-09-10 23:08:59,329][sagemaker][INFO] - Creating training-job with name: xmtc-2023-09-11-02-08-56-094
2023-09-11 02:09:00 Starting - Starting the training job...
2023-09-11 02:09:18 Starting - Preparing the instances for training......
2023-09-11 02:10:27 Downloading - Downloading input data
2023-09-11 02:10:27 Training - Downloading the training image..................
2023-09-11 02:13:33 Training - Training image download completed. Training in progress..[FATAL tini (7)] exec train failed: No such file or directory

2023-09-11 02:14:15 Uploading - Uploading generated training model
2023-09-11 02:14:15 Failed - Training job failed
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/celso/projects/LightningPrototype/run_on_sagemaker.py", line 32, in run_on_sagemaker
    estimator.fit()
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/estimator.py", line 1292, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/estimator.py", line 2474, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 4849, in logs_for_job
    _logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 6760, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 6813, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job xmtc-2023-09-11-02-08-56-094: Failed. Reason: AlgorithmError: , exit code: 127

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Process finished with exit code 1

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: sagemaker 2.177.1
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Pytorch 2.0.1
  • Python version: Python 3.10
  • Custom Docker image (Y/N): Yes, on ECR.

ModuleNotFoundError: SageMaker only copies the entry_point file to /opt/ml/code/ instead of the whole cloned source code

I am using the SageMaker PyTorch Estimator based on a custom Docker image stored in AWS ECR.

from sagemaker.pytorch.estimator import PyTorch

role = "arn:..."

estimator = PyTorch(
    image_uri="1...ecr...amazonaws.com/...:prototype",
    git_config={"repo": "https://github.com/celsofranssa/LightningPrototype.git", "branch": "sagemaker"},
    entry_point="main.py",
    role=role,
    region="us-...",
    instance_type="local",  # ml.g4dn.2xlarge
    instance_count=1,
    volume_size=225,
    hyperparameters=hparams
)
estimator.fit()

SageMaker correctly clones the sources from GitHub and checks out the specified branch.

The Bug:
However, it only copies main.py to /opt/ml/code inside the container instead of the whole cloned source code, which causes ModuleNotFoundError: No module named 'source':

Traceback (most recent call last):
2y9byzwyxr-algo-1-reuoy  |   File "/opt/ml/code/main.py", line 15, in <module>
2y9byzwyxr-algo-1-reuoy  |     from source.helper.EvalHelper import EvalHelper
2y9byzwyxr-algo-1-reuoy  | ModuleNotFoundError: No module named 'source'

Logging the /opt/ml/code content only shows the main.py:

print(f"Content: {os.listdir(os.getcwd())}")
['main.py']
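One thing to try (an assumption, not a confirmed fix): when git_config is used, the estimator's source_dir selects which directory of the cloned repo is uploaded to /opt/ml/code, so pointing it at the repo root should bring the source package along with main.py. A sketch, reusing role and hparams from the snippet above:

estimator = PyTorch(
    image_uri="1...ecr...amazonaws.com/...:prototype",
    git_config={"repo": "https://github.com/celsofranssa/LightningPrototype.git", "branch": "sagemaker"},
    source_dir=".",          # upload the whole cloned repo, not just the entry point
    entry_point="main.py",
    role=role,
    instance_type="local",
    instance_count=1,
    hyperparameters=hparams,
)
estimator.fit()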

Prebuilt PyTorch image difference

Hi there,

I am bringing my own PyTorch model from outside of SageMaker.

Here are my steps:

  1. Build my own Docker image on top of the prebuilt images (pytorch-training vs. pytorch-inference vs. sagemaker-pytorch (before 1.2.0)).
  2. Finish the customized model_fn, predict_fn, input_fn, output_fn (a minimal sketch of these handlers is shown after this list).
  3. Deploy the model.
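
For context, here is a minimal sketch of what those four handlers can look like in an inference script; the tiny nn.Linear architecture, the model.pth file name, and the JSON input/output format are illustrative assumptions, not details taken from this report.

import json
import os

import torch
import torch.nn as nn


def model_fn(model_dir):
    # Load the weights that were saved into model_dir.
    model = nn.Linear(10, 2)  # placeholder architecture; replace with the real model class
    state_path = os.path.join(model_dir, "model.pth")  # assumed file name
    model.load_state_dict(torch.load(state_path, map_location="cpu"))
    model.eval()
    return model


def input_fn(request_body, request_content_type):
    # Accept JSON arrays and turn them into a float tensor.
    if request_content_type == "application/json":
        return torch.tensor(json.loads(request_body), dtype=torch.float32)
    raise ValueError("Unsupported content type: {}".format(request_content_type))


def predict_fn(input_data, model):
    with torch.no_grad():
        return model(input_data)


def output_fn(prediction, accept):
    # Return predictions as JSON regardless of the requested accept type.
    return json.dumps(prediction.tolist())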

Here are my observations:

  1. With sagemaker-pytorch version 1.1.0, CPU, everything works.
  2. With pytorch-inference, version 1.2.0, CPU, the code is not copied to the container. I guess I should follow this documentation? https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html
  3. With pytorch-training, version 1.2.0, CPU, when I try to deploy the model locally, it throws the following errors:
Attaching to tmpkyn4_ew2_algo-1-dgrlv_1
algo-1-dgrlv_1  | Traceback (most recent call last):
algo-1-dgrlv_1  |   File "/opt/conda/bin/serve", line 8, in <module>
algo-1-dgrlv_1  |     sys.exit(main())
algo-1-dgrlv_1  |   File "/opt/conda/lib/python3.6/site-packages/sagemaker_containers/cli/serve.py", line 17, in main
algo-1-dgrlv_1  |     server.start(env.ServingEnv().framework_module)
algo-1-dgrlv_1  |   File "/opt/conda/lib/python3.6/site-packages/sagemaker_containers/_server.py", line 75, in start
algo-1-dgrlv_1  |     nginx = subprocess.Popen(['nginx', '-c', nginx_config_file])
algo-1-dgrlv_1  |   File "/opt/conda/lib/python3.6/subprocess.py", line 709, in __init__
algo-1-dgrlv_1  |     restore_signals, start_new_session)
algo-1-dgrlv_1  |   File "/opt/conda/lib/python3.6/subprocess.py", line 1344, in _execute_child
algo-1-dgrlv_1  |     raise child_exception_type(errno_num, err_msg, err_filename)
algo-1-dgrlv_1  | FileNotFoundError: [Errno 2] No such file or directory: 'nginx': 'nginx'
tmpkyn4_ew2_algo-1-dgrlv_1 exited with code 1
Aborting on container exit...

Then it waits for the container to run until it times out.

My questions are:

  1. Any insights for the problem above?
  2. What is the difference between pytorch-training and pytorch-inference?
  3. I checked the Dockerfiles among those 3 versions, and it seems there are a lot of changes for pytorch-<inference|training> compared to sagemaker-pytorch. If I am not missing something here, it is probably worth revisiting the images for pytorch-<inference|training>?

Strange errors using AWS-provided PyTorch container

Hi, I am running a training job using a SageMaker-provided PyTorch container, and I get the following errors at the top of my Cloudwatch Log file:

bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
sed: can't read changehostname.c: No such file or directory
gcc: error: changehostname.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
gcc: error: changehostname.o: No such file or directory
ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/libchangehostname.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
Hyper parameters: 
{}
Found input files: ['/opt/ml/input/data/training/7ff7df01-0981-4e6e-96dc-df4f5db3dfab.csv']

and so on. The "Found input files" line seems to be from my job; all the lines above it seem to have executed before my job.

What do these mean? The training job still seems to execute (although it seems slower - I don't think this is related at the moment).

My Dockerfile looks like:

FROM 520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-pytorch:1.1.0-gpu-py3
COPY pytorch_exp/* /opt/program/
RUN chmod +x /opt/program/train
# No need to install requirements. All the requirements of this project are already in the base image above.

ENV PYTHONUNBUFFERED=TRUE
ENV PATH="/opt/program:${PATH}"
WORKDIR /opt/program

Note that I am using the AWS pytorch image.

README link broken

Please see the link at:

All "final" Dockerfiles use base images for building.

Example use case

Would it be possible to have an example use case of this repository?

Would I clone this while in SageMaker Studio? Would it be possible to build an image from this repository, push it up to ECR, and then attach it as an image to my SageMaker Studio?

need help with pytorch model serving

Thank you for creating this repository for onboarding PyTorch-based models. I have a question regarding onboarding PyTorch models in SageMaker: can a customer take a pretrained model saved as a .pth file and load that model file in a SageMaker container? I have been using TFS for model hosting, so it was straightforward because we did not have to write an Estimator class. Would you have any suggestions on how this is done in PyTorch to port to SageMaker? Thank you for your advice.
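
One commonly used approach, sketched below under assumptions (the bucket, role ARN, file names, and versions are placeholders): package the pretrained .pth into a model.tar.gz on S3 and point a PyTorchModel at it, together with an inference script that defines the model_fn/predict_fn handlers.

from sagemaker.pytorch import PyTorchModel

# model.tar.gz is assumed to contain the pretrained model.pth;
# inference.py is assumed to define model_fn/input_fn/predict_fn/output_fn.
model = PyTorchModel(
    model_data="s3://my-bucket/models/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    entry_point="inference.py",
    framework_version="1.2.0",
    py_version="py3",
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")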

Pytorch 1.2

A PyTorch 1.2 Dockerfile is provided in this repo, but the image is not built (not available in ECR):

520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-pytorch:1.2.0-cpu-py3 not found.
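
As a side note, a recent SageMaker Python SDK can report which image URI it would resolve for a given framework version, which helps confirm whether an image exists before launching a job; the region and instance type below are assumptions.

from sagemaker import image_uris

# Ask the SDK which container image it would use for PyTorch 1.2.0 training.
uri = image_uris.retrieve(
    framework="pytorch",
    region="eu-west-1",
    version="1.2.0",
    py_version="py3",
    instance_type="ml.c5.xlarge",
    image_scope="training",
)
print(uri)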

container for fastai2

Fastai v2 is already out, with the course starting pretty soon. I would like to run the examples on SageMaker, so my feature request would be to have a container for it.
Currently, the Docker code is as follows:

# The following section uninstalls torch and torchvision before installing the
# custom versions from an S3 bucket. This will need to be removed in the future
RUN pip install \
    --no-cache-dir smdebug==0.5.0.post0 \
    sagemaker-experiments==0.1.3 \
    --no-cache-dir fastai==1.0.59 \
 && pip install --no-cache-dir -U https://pytorch-aws.s3.amazonaws.com/pytorch-1.3.1/py3/gpu/torch-1.3.1-cp36-cp36m-manylinux1_x86_64.whl \
 && pip uninstall -y torchvision \
 && pip install --no-cache-dir -U \
    https://torchvision-build.s3.amazonaws.com/1.3.1/gpu/torchvision-0.4.2-cp36-cp36m-linux_x86_64.whl

I installed fastai v2 in colab easily with the following lines.

pip install -q torch torchvision feather-format kornia pyarrow Pillow==6.2.1 wandb nbdev fastprogress --upgrade
pip install -q git+https://github.com/fastai/fastcore --upgrade
pip install -q git+https://github.com/fastai/fastai2
Would the same work for AWS? I'm not sure why you are pulling torch and torchvision from S3.
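
One possible interim path, sketched below under assumptions (file names, role ARN, and instance type are placeholders): the framework containers install a requirements.txt found in source_dir before training starts, so the fastai2 dependencies could be pulled in at job start instead of baking a new image.

from sagemaker.pytorch import PyTorch

# src/ is assumed to contain train_fastai.py plus a requirements.txt listing
# fastai2, fastcore, kornia, nbdev, etc.; the container installs it before training.
estimator = PyTorch(
    entry_point="train_fastai.py",
    source_dir="src",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="1.3.1",
    py_version="py3",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
)
estimator.fit("s3://my-bucket/fastai-data/")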
