Comments (7)

nadiaya commented on June 1, 2024

The logic in the PyTorch image expects to start in the root directory, where changehostname.c is located, but your image changes the WORKDIR to /opt/program.

changehostname was originally necessary for NCCL distributed training to work properly on SageMaker. If you are not running distributed training with the NCCL backend, this error should be harmless. As a quick workaround, you can copy this file from the root directory into your working directory.
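
For readers who want to script that workaround, here is a minimal sketch. It is not part of the toolkit: the helper name copy_into_workdir is hypothetical, and the paths in the usage comment are the ones mentioned in this thread, so adjust them to your own image layout.

```python
import shutil
from pathlib import Path


def copy_into_workdir(source: str, workdir: str) -> Path:
    """Copy `source` into `workdir` unless a file with that name already exists.

    Hypothetical helper for illustration only; not part of the SageMaker toolkit.
    """
    src = Path(source)
    target = Path(workdir) / src.name
    if not target.exists():
        # Create the working directory if needed, then copy the file over.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, target)
    return target


# Example call, using the paths from this thread (assumed, adjust as needed):
# copy_into_workdir("/changehostname.c", "/opt/program")
```

The same effect can usually be had with a single copy step in the Dockerfile; the function form is only meant to make the "copy if missing" logic explicit.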

from sagemaker-pytorch-training-toolkit.

domino14 commented on June 1, 2024

Thank you @nadiaya, this fixes those errors. BTW it also needed the changehostname library, so I just made my working directory / instead.

However, I have another issue. It seems if I try using boto3 within the container to do something, it fails with the following error:

2020-01-13 17:11:35,800 botocore.utils [DEBUG] Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/api/token: Could not connect to the endpoint URL: "http://169.254.169.254/latest/api/token"
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
OSError: [Errno 22] Invalid argument

The stack trace is pretty long, but it all seems to come down to being unable to connect to the metadata API. I did not have these issues with the SageMaker TensorFlow repo, and I'm using the same role for both.

laurenyu commented on June 1, 2024

Can you paste the code that uses boto3?

domino14 commented on June 1, 2024

Hi @laurenyu, I whittled it down to the following:

import boto3
from datetime import datetime, timezone
import os
import sys
import traceback

boto3.set_stream_logger(name="botocore", level="DEBUG")
prefix = "/opt/ml/"

output_path = os.path.join(prefix, "output")

try:
    client = boto3.client("events", region_name="us-east-1")
    client.put_events(
        Entries=[
            {
                "Time": datetime.now(timezone.utc).isoformat(),
                "Source": "com.business.sagemaker-trainer",
                "DetailType": "Training progress notification",
                "Detail": """ {"foo": "bar"} """,
            }
        ]
    )
    sys.exit(0)
except Exception as e:
    trc = traceback.format_exc()
    with open(os.path.join(output_path, "failure"), "w") as s:
        s.write("Exception during training: " + str(e) + "\n" + trc)
    print("Exception during training: " + str(e) + "\n" + trc)
    sys.exit(255)

This fails with the above error. My Dockerfile looks like:

FROM 520713654638.dkr.ecr.us-east-1.amazonaws.com/sagemaker-pytorch:1.1.0-gpu-py3

COPY . /

RUN chmod +x /train
# No need to install requirements. All the requirements of this project are already in the base image above.
# Upgrade boto3:
RUN pip install -I -U boto3
ENV PYTHONUNBUFFERED=TRUE
ENV PATH="/:${PATH}"
WORKDIR /

The bottom of the long stack trace essentially says botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://169.254.169.254/latest/meta-data/iam/security-credentials/". I also tried replacing the FROM ... above with the SageMaker TensorFlow images and got the same error, so now I'm worried that something changed recently. Or am I missing something really simple?

laurenyu commented on June 1, 2024

Maybe it's an AWS CLI version issue? See aws/aws-cli#4682.

Does the same thing happen if you use the latest pre-built images? https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html

domino14 commented on June 1, 2024

Yes, I just tried 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.3.1-gpu-py3 (the latest PyTorch image), and I get the same error :/

domino14 commented on June 1, 2024

Hi @laurenyu, the problem was that Container Isolation was on. I was unaware that this option existed. In any case, turning it off fixes the issue. Thank you!
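
For anyone hitting the same error, a rough way to check whether a training job was launched with isolation enabled is sketched below. It assumes that the "Container Isolation" option mentioned above corresponds to the EnableNetworkIsolation field returned by the SageMaker DescribeTrainingJob API; the helper names and the job name in the usage comment are hypothetical. Run it from outside the training container, since an isolated container cannot reach AWS endpoints.

```python
def is_network_isolated(description: dict) -> bool:
    """Return True if a DescribeTrainingJob response reports network isolation.

    Assumes the setting appears under the "EnableNetworkIsolation" key.
    """
    return bool(description.get("EnableNetworkIsolation", False))


def describe_job(job_name: str, region: str = "us-east-1") -> dict:
    """Fetch the description of a training job (hypothetical helper)."""
    import boto3  # imported lazily so is_network_isolated works without boto3

    client = boto3.client("sagemaker", region_name=region)
    return client.describe_training_job(TrainingJobName=job_name)


# Example call ("my-training-job" is a placeholder name):
# print(is_network_isolated(describe_job("my-training-job")))
```

If this reports True, calls such as the put_events example earlier in this thread would be expected to fail in the way shown, because the container cannot reach the metadata endpoint for credentials.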
