Comments (15)
Could you share how to create a training job and then deploy the trained model locally?
Before, we had one container (sagemaker-pytorch) with both training and serving/inference functionality. To reduce the size of the images, we split it into two: pytorch-training and pytorch-inference. The intent is that pytorch-training would only be used for training, and pytorch-inference would be used to deploy a model and run predictions against it.
From the error message you posted, it seems the problem is caused by using the training image to run inference, though I would need more information about how you are training and hosting the model.
from sagemaker-pytorch-training-toolkit.
There is no training; the model is pretrained.
Pseudocode like the following:
pytorch_estimator = PyTorchModel(entry_point='entrypoint.py',
                                 model_data=MODEL_PATH,
                                 name=MODEL_NAME,
                                 role=role,
                                 image=CONTAINER_IMAGE)
predictor = pytorch_estimator.deploy(instance_type='local',
                                     initial_instance_count=1)
Please let me know if you want more details.
What image (CONTAINER_IMAGE) do you use to create the PyTorchModel?
This is a customized image on top of a prebuilt AWS SageMaker image.
For the prebuilt images, I tried:
1. sagemaker-pytorch
2. pytorch-training
3. pytorch-inference
Only 1 works; 2 and 3 failed in different ways.
2 is expected to fail.
1 and 3 should work.
What error do you get when using the pytorch-inference container?
It cannot find the entrypoint.py file. I checked the docker image; there is only an /opt/ml/model folder, no code files.
Some more observations:
- The logs say "MXNet worker started", which seems odd to me.
- The source code was uploaded to S3 successfully according to the log output; there is a source.tar.gz, which I downloaded and verified.
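For anyone wanting to repeat that verification, the archive's members can be listed with Python's tarfile module. This is a hypothetical sketch: a stand-in source.tar.gz and model_fn stub are built locally so it runs on its own, instead of the real archive downloaded from S3.

```python
import os
import tarfile
import tempfile

# Build a stand-in archive so the sketch is self-contained; with the real
# source.tar.gz you would skip straight to the inspection step below.
workdir = tempfile.mkdtemp()
entry_path = os.path.join(workdir, "entrypoint.py")
with open(entry_path, "w") as f:
    f.write("def model_fn(model_dir):\n    return None\n")

archive = os.path.join(workdir, "source.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(entry_path, arcname="entrypoint.py")

# The verification step: list the archive's members and confirm the
# entry point really made it in.
with tarfile.open(archive, "r:gz") as tar:
    members = tar.getnames()
print("entrypoint.py" in members)  # → True
```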
- You see this message because it uses MMS (MXNet Model Server) to serve the predictions.
- I can't reproduce the issue. The exact code sample, as well as the produced logs, would really help.
I am closing the issue for now since you cannot reproduce it. I will do more experiments and may reopen it once I have more info.
For now, I would like to give it another try. The following is the error message with the pytorch-inference image:
algo-1-pmyh1_1 | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
algo-1-pmyh1_1 | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/mms/service.py", line 108, in predict
algo-1-pmyh1_1 | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ret = self._entry_point(input_batch, self.context)
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py", line 31, in handle
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return self._service.transform(data, context)
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 55, in transform
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.validate_and_initialize()
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 92, in validate_and_initialize
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self._validate_user_module_and_set_functions()
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 103, in _validate_user_module_and_set_functions
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - user_module = importlib.import_module(self._environment.module_name)
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return _bootstrap._gcd_import(name[level:], package, level)
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 994, in _gcd_import
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 971, in _find_and_load
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ModuleNotFoundError: No module named 'handler'
algo-1-pmyh1_1 | 2019-11-11 16:31:06,308 [INFO ] W-9022-model ACCESS_LOG - /172.18.0.1:58992 "POST /invocations HTTP/1.1" 503 8
Thanks!
When do you get this error: on startup, or when trying to run predictions?
When trying to run predictions. The container started successfully; please refer to the following logs from spinning up the container:
algo-1-pmyh1_1 | 2019-11-11 16:30:48,040 [INFO ] W-9031-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.6.6
algo-1-pmyh1_1 | 2019-11-11 16:30:48,056 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
algo-1-pmyh1_1 | 2019-11-11 16:30:48,056 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Management server with: EpollServerSocketChannel.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,059 [INFO ] main com.amazonaws.ml.mms.ModelServer - Management API bind to: http://127.0.0.1:8081
algo-1-pmyh1_1 | Model server started.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9030-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9030.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9015-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9015.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9021-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9021.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9029-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9029.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9012-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9012.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,062 [INFO ] W-9024-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9024.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,062 [INFO ] W-9003-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9003.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9008-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9008.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9016-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9016.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9020-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9020.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9017-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9017.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9027-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9027.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,063 [INFO ] W-9031-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9031.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,063 [INFO ] W-9011-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9011.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9013-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9013.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,064 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9005.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,062 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9022.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9007.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,064 [INFO ] W-9023-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9023.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9002.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,064 [INFO ] W-9018-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9018.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,064 [INFO ] W-9009-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9009.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,064 [INFO ] W-9014-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9014.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9025-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9025.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9004-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9004.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,064 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9001.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9006-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9006.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9019-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9019.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9010-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9010.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9026-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9026.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9028-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9028.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,564 [INFO ] W-9022-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 462
algo-1-pmyh1_1 | 2019-11-11 16:30:48,564 [INFO ] W-9029-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 463
algo-1-pmyh1_1 | 2019-11-11 16:30:48,565 [INFO ] W-9030-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 460
algo-1-pmyh1_1 | 2019-11-11 16:30:48,576 [INFO ] W-9007-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 475
algo-1-pmyh1_1 | 2019-11-11 16:30:48,576 [INFO ] W-9008-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 455
algo-1-pmyh1_1 | 2019-11-11 16:30:48,577 [INFO ] W-9024-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 476
algo-1-pmyh1_1 | 2019-11-11 16:30:48,580 [INFO ] W-9027-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 471
algo-1-pmyh1_1 | 2019-11-11 16:30:48,583 [INFO ] W-9004-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 478
algo-1-pmyh1_1 | 2019-11-11 16:30:48,585 [INFO ] W-9006-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 483
algo-1-pmyh1_1 | 2019-11-11 16:30:48,586 [INFO ] W-9026-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 485
algo-1-pmyh1_1 | 2019-11-11 16:30:48,586 [INFO ] W-9031-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 485
algo-1-pmyh1_1 | 2019-11-11 16:30:48,599 [INFO ] W-9005-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 494
algo-1-pmyh1_1 | 2019-11-11 16:30:48,605 [INFO ] W-9023-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 504
algo-1-pmyh1_1 | 2019-11-11 16:30:48,610 [INFO ] W-9002-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 501
algo-1-pmyh1_1 | 2019-11-11 16:30:48,611 [INFO ] W-9019-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 494
algo-1-pmyh1_1 | 2019-11-11 16:30:48,615 [INFO ] W-9014-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 514
algo-1-pmyh1_1 | 2019-11-11 16:30:48,617 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 516
algo-1-pmyh1_1 | 2019-11-11 16:30:48,618 [INFO ] W-9017-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 520
algo-1-pmyh1_1 | 2019-11-11 16:30:48,624 [INFO ] W-9012-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 523
algo-1-pmyh1_1 | 2019-11-11 16:30:48,624 [INFO ] W-9020-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 519
algo-1-pmyh1_1 | 2019-11-11 16:30:48,625 [INFO ] W-9015-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 520
algo-1-pmyh1_1 | 2019-11-11 16:30:48,631 [INFO ] W-9011-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 522
algo-1-pmyh1_1 | 2019-11-11 16:30:48,633 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 532
algo-1-pmyh1_1 | 2019-11-11 16:30:48,636 [INFO ] W-9003-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 535
algo-1-pmyh1_1 | 2019-11-11 16:30:48,643 [INFO ] W-9025-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 542
algo-1-pmyh1_1 | 2019-11-11 16:30:48,645 [INFO ] W-9009-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 543
algo-1-pmyh1_1 | 2019-11-11 16:30:48,650 [INFO ] W-9018-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 532
algo-1-pmyh1_1 | 2019-11-11 16:30:48,664 [INFO ] W-9028-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 541
algo-1-pmyh1_1 | 2019-11-11 16:30:48,666 [INFO ] W-9013-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 562
algo-1-pmyh1_1 | 2019-11-11 16:30:48,671 [INFO ] W-9021-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 570
algo-1-pmyh1_1 | 2019-11-11 16:30:48,673 [INFO ] W-9016-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 576
algo-1-pmyh1_1 | 2019-11-11 16:30:48,676 [INFO ] W-9010-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 579
INFO:sagemaker.local.entities:Checking if serving container is up, attempt: 10
algo-1-pmyh1_1 | 2019-11-11 16:30:49,982 [INFO ] pool-1-thread-33 ACCESS_LOG - /172.18.0.1:58984 "GET /ping HTTP/1.1" 200 11
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Apologies for the late response.
That specific error happens when attempting to import your entrypoint.py as shown here: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/transformer.py#L143
The entrypoint.py is expected to be in a specific directory, which gets added to the PYTHONPATH: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L103
The specific directory itself is defined by: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/environment.py#L32
The entrypoint.py should be placed in that specific directory by the Python SDK depending on the framework version specified as shown here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/pytorch/model.py#L148
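That import failure is easy to reproduce in plain Python: importing a module by name only succeeds once its directory is on sys.path, which is what extending the PYTHONPATH with the code directory accomplishes. This is an illustrative sketch; the temp directory and the module name entrypoint_demo stand in for the real code directory and entrypoint.py.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical reproduction of the ModuleNotFoundError above: MMS imports the
# user module by name, so the directory holding the entry-point file must be
# importable. "entrypoint_demo" stands in for the real entrypoint.py.
code_dir = tempfile.mkdtemp()
with open(os.path.join(code_dir, "entrypoint_demo.py"), "w") as f:
    f.write("def model_fn(model_dir):\n    return 'loaded'\n")

try:
    importlib.import_module("entrypoint_demo")
    importable_without_path = True
except ModuleNotFoundError:
    # Same failure mode as "No module named 'handler'" in the traceback.
    importable_without_path = False

# Adding the code directory to sys.path (what extending PYTHONPATH with the
# code directory does) makes the import succeed.
sys.path.insert(0, code_dir)
module = importlib.import_module("entrypoint_demo")
print(importable_without_path, module.model_fn("/opt/ml/model"))
```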
Looking at how you are starting the inference jobs, it looks like the framework_version is being omitted, which may cause the conditional not to place the entrypoint.py into the expected directory.
I apologize for the experience, as this is not ideal; however, is there any chance you can retry your job after specifying a framework version higher than 1.2?
Thanks!
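A sketch of what that adjustment could look like, under the assumptions above. The values below are illustrative placeholders standing in for MODEL_PATH, MODEL_NAME, role, and CONTAINER_IMAGE from the earlier snippet; since the actual deploy call needs AWS credentials, only the keyword arguments are assembled here.

```python
# Hypothetical fix: pass framework_version (above 1.2) when constructing
# PyTorchModel so the Python SDK repacks the entry point with the model.
# All values below are placeholders, not real resources.
model_kwargs = {
    "entry_point": "entrypoint.py",
    "model_data": "s3://my-bucket/model.tar.gz",   # stands in for MODEL_PATH
    "name": "my-model",                            # stands in for MODEL_NAME
    "role": "arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-inference:latest",
    "framework_version": "1.3.1",                  # the missing argument
}
# With the sagemaker SDK available, you would then run:
#   predictor = PyTorchModel(**model_kwargs).deploy(
#       instance_type="local", initial_instance_count=1)
print(model_kwargs["framework_version"])
```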
Closing due to inactivity.