Comments (2)
@bbalaji-ucsd Which PyTorch container (PT version) did you use for this? What is the torchvision version?
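One way to answer this from inside the running container is to print the installed versions directly. The sketch below is a minimal illustration, not part of the toolkit. `report_versions` is a hypothetical helper. It catches `OSError` as well as `ImportError` on the assumption that a torchaudio build mismatched with torch may fail while loading its native library rather than at import resolution.

```python
import importlib


def report_versions(packages=("torch", "torchvision", "torchaudio")):
    """Hypothetical helper: map each package to its __version__.

    Returns None for packages that are not installed (or lack __version__),
    and a "load error: ..." string if the module exists but fails to load.
    """
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        except OSError as exc:  # e.g. a native extension built for another torch
            versions[name] = "load error: " + str(exc)
            continue
        versions[name] = getattr(module, "__version__", None)
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
    try:
        import torch
        print("CUDA build:", torch.version.cuda)
        print("CUDA available:", torch.cuda.is_available())
    except ImportError:
        pass
```

Running this once in the base image and again after the `pip install` layers shows whether the upgrades replaced the container's original torch build.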
from sagemaker-pytorch-training-toolkit.
@Roshrini I tried both the 1.7 and 1.8 containers, and neither worked. I was able to get my custom Docker image working with 1.6, though, and I'm using that container for my current experiments:
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu101-ubuntu16.04
# FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.7.1-gpu-py36-cu110-ubuntu18.04
# FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04-v1.5
RUN pip install -U pip
RUN pip install -U torch
RUN pip install -U torchvision
# torchaudio doesn't work with the 1.7.1 or 1.8.1 containers
RUN pip install torchaudio
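The thread doesn't say why torchaudio fails on the 1.7.1 and 1.8.1 containers. One plausible cause, which is an assumption here, is a version mismatch: `pip install -U torch` replaces the container's torch with the newest release, and an unpinned `pip install torchaudio` may pull a build compiled against a different torch. The sketch below checks a pair of version strings against the convention that, for torch 1.x, torch 1.N.* pairs with torchaudio 0.N.* (for example torch 1.8.1 with torchaudio 0.8.1, or torch 1.7.1 with torchaudio 0.7.2). `parse_major_minor` and `torchaudio_matches_torch` are hypothetical helpers, not toolkit APIs.

```python
def parse_major_minor(version):
    """'1.8.1+cu111' -> (1, 8). Drops the local build tag and patch level."""
    public = version.split("+", 1)[0]
    parts = public.split(".")
    return int(parts[0]), int(parts[1])


def torchaudio_matches_torch(torch_version, torchaudio_version):
    """Assumed 1.x-era convention: torch 1.N.* pairs with torchaudio 0.N.*."""
    t_major, t_minor = parse_major_minor(torch_version)
    a_major, a_minor = parse_major_minor(torchaudio_version)
    if t_major != 1:
        raise ValueError("convention only assumed for torch 1.x, got " + torch_version)
    return a_major == 0 and a_minor == t_minor


if __name__ == "__main__":
    try:
        import torch
        import torchaudio
    except (ImportError, OSError) as exc:
        print("could not import torch/torchaudio:", exc)
    else:
        ok = torchaudio_matches_torch(torch.__version__, torchaudio.__version__)
        print(f"torch {torch.__version__}, torchaudio {torchaudio.__version__}, match={ok}")
```

If the check fails, pinning torchaudio to the release that matches the container's torch, instead of running `pip install -U torch`, keeps the image's CUDA-specific torch build in place.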