Comments (7)
The logic in the PyTorch image expects to start in the root directory, where changehostname.c is located, but your image changes the WORKDIR to /opt/program. changehostname was originally necessary for NCCL distributed training to work properly on SageMaker. If you are not running distributed training with the NCCL backend, this error should be harmless, though as a quick workaround you can copy this file from the root directory into your working directory.
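The copy workaround could be sketched roughly as follows. This is a minimal illustration, not part of the toolkit: the helper name `copy_into_workdir` is hypothetical, and the paths in the usage section are the ones mentioned in this thread and may differ in your image.

```python
import shutil
from pathlib import Path


def copy_into_workdir(source: Path, workdir: Path) -> Path:
    """Copy `source` into `workdir` unless a file of that name is already there."""
    target = workdir / source.name
    if not target.exists():
        workdir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
    return target


if __name__ == "__main__":
    # Assumed paths, taken from the discussion above; adjust to your image.
    src = Path("/changehostname.c")
    if src.exists():
        print(copy_into_workdir(src, Path("/opt/program")))
```

The same effect can be had with a single copy step at image build time; the Python form is shown only to keep the examples in this thread in one language.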
from sagemaker-pytorch-training-toolkit.
Thank you @nadiaya, this fixes those errors. BTW it also needed the changehostname library, so I just made my working directory / instead.
However, I have another issue. It seems if I try using boto3 within the container to do something, it fails with the following error:
2020-01-13 17:11:35,800 botocore.utils [DEBUG] Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/api/token: Could not connect to the endpoint URL: "http://169.254.169.254/latest/api/token"
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
OSError: [Errno 22] Invalid argument
The stack trace is pretty long, but it all seems to come down to not being able to connect to the metadata API. I did not have these issues with the SageMaker TensorFlow repo, and I'm using the same role for both.
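One way to narrow this down is to test whether the container can open a TCP connection to the metadata address at all, independent of boto3. The sketch below is only a diagnostic aid. The helper name `can_connect` is hypothetical, and port 80 is assumed because the URL in the log is plain HTTP.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and errors such as
        # "[Errno 22] Invalid argument" from the traceback.
        return False


if __name__ == "__main__":
    # 169.254.169.254 is the metadata address that appears in the log.
    print("metadata endpoint reachable:", can_connect("169.254.169.254", 80))
```

If this prints `False` inside the container but the same role works elsewhere, the cause is more likely the container's network setup than the credentials themselves.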
can you paste your code that is using boto3?
Hi @laurenyu, I whittled it down to the following:
import boto3
from datetime import datetime, timezone
import os
import sys
import traceback

boto3.set_stream_logger(name="botocore", level="DEBUG")

prefix = "/opt/ml/"
output_path = os.path.join(prefix, "output")

try:
    client = boto3.client("events", region_name="us-east-1")
    client.put_events(
        Entries=[
            {
                "Time": datetime.now(timezone.utc).isoformat(),
                "Source": "com.business.sagemaker-trainer",
                "DetailType": "Training progress notification",
                "Detail": """ {"foo": "bar"} """,
            }
        ]
    )
    sys.exit(0)
except Exception as e:
    trc = traceback.format_exc()
    with open(os.path.join(output_path, "failure"), "w") as s:
        s.write("Exception during training: " + str(e) + "\n" + trc)
    print("Exception during training: " + str(e) + "\n" + trc)
    sys.exit(255)
This fails with the above error. My Dockerfile looks like:
FROM 520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-pytorch:1.1.0-gpu-py3
COPY . /
RUN chmod +x /train
# No need to install requirements. All the requirements of this project are already in the base image above.
# Upgrade boto3:
RUN pip install -I -U boto3
ENV PYTHONUNBUFFERED=TRUE
ENV PATH="/:${PATH}"
WORKDIR /
The bottom of the long stacktrace essentially says botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/". I also tried replacing the FROM ... above with the SageMaker TensorFlow images and it gives me the same error, so now I'm worried that something changed recently? Or am I missing something really simple?
maybe it's an AWS CLI version issue? aws/aws-cli#4682
does the same thing happen if you use the latest pre-built images? https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html
Yes - just tried using 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.3.1-gpu-py3 -- the latest pytorch, and I get the same error :/
Hi @laurenyu the problem was that Container Isolation was on. I was unaware that this option existed. In any case, turning it off fixes the issue. Thank you!
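For anyone hitting the same error, a quick check is to look at the training job's description. The sketch below assumes that "Container Isolation" here corresponds to SageMaker's network isolation setting, which is exposed as the `EnableNetworkIsolation` field in the `describe_training_job` response. The helper names are hypothetical, and `describe_job` needs boto3 and valid credentials on a machine outside the isolated container.

```python
def network_isolation_enabled(description: dict) -> bool:
    """Read the EnableNetworkIsolation flag from a DescribeTrainingJob-style response."""
    return bool(description.get("EnableNetworkIsolation", False))


def describe_job(job_name: str, region: str = "us-east-1") -> dict:
    """Fetch the job description from SageMaker (assumes credentials are configured)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("sagemaker", region_name=region)
    return client.describe_training_job(TrainingJobName=job_name)
```

With a real job name substituted in, `network_isolation_enabled(describe_job("my-training-job"))` returning `True` would be consistent with the symptoms in this thread, since an isolated container cannot make outbound calls such as the `put_events` request above.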