Comments (15)
Could you share how to create a training job and then deploy the trained model locally?
Before, we had one container (sagemaker-pytorch) with both training and serving/inference functionality. To reduce the size of the images, we split it into two: pytorch-training and pytorch-inference. The intent is that pytorch-training would only be used for training, and pytorch-inference would be used to deploy a model and run predictions against it.
From the error message you posted, it seems the problem is caused by using the training image to run inference, though I would need more information about how you are training and hosting the model.
from sagemaker-pytorch-training-toolkit.
There is no training; the model is pretrained.
Pseudocode like the following:
pytorch_estimator = PyTorchModel(entry_point='entrypoint.py',
                                 model_data=MODEL_PATH,
                                 name=MODEL_NAME,
                                 role=role,
                                 image=CONTAINER_IMAGE)
predictor = pytorch_estimator.deploy(instance_type='local',
                                     initial_instance_count=1)
Please let me know if you want more details.
What image (CONTAINER_IMAGE) do you use to create the PyTorchModel?
This is a customized image on top of a prebuilt AWS SageMaker image.
For the prebuilt images, I tried:
1. sagemaker-pytorch
2. pytorch-training
3. pytorch-inference
Only 1 works; 2 and 3 failed in different ways.
2 is expected to fail.
1 and 3 should work.
What error do you get when using the pytorch-inference container?
It cannot find the entrypoint.py file. I checked the docker image; there is only an /opt/ml/model folder, no code files.
Some more observations:
- The logs say "MXNet worker started", which seems odd to me.
- The source code was uploaded to S3 successfully according to the log output; there is a source.tar.gz, which I downloaded and verified.
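For anyone wanting to repeat that verification, the archive's members can be listed with Python's tarfile module. This is a hypothetical sketch: a stand-in source.tar.gz and model_fn stub are built locally so it runs on its own, instead of the real archive downloaded from S3.

```python
import os
import tarfile
import tempfile

# Build a stand-in archive so the sketch is self-contained; with the real
# source.tar.gz you would skip straight to the inspection step below.
workdir = tempfile.mkdtemp()
entry_path = os.path.join(workdir, "entrypoint.py")
with open(entry_path, "w") as f:
    f.write("def model_fn(model_dir):\n    return None\n")

archive = os.path.join(workdir, "source.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(entry_path, arcname="entrypoint.py")

# The verification step: list the archive's members and confirm the
# entry point really made it in.
with tarfile.open(archive, "r:gz") as tar:
    members = tar.getnames()
print("entrypoint.py" in members)  # → True
```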
- You see this message because it uses MMS (MXNet Model Server) to serve the predictions.
- I can't reproduce the issue. The exact code sample, as well as the produced logs, would really help.
I am closing the issue for now since you cannot reproduce it. I will do more experiments and may reopen it once I have more info.
For now, I would like to give it another try. The following is the error message with the pytorch-inference image:
algo-1-pmyh1_1 | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
algo-1-pmyh1_1 | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/mms/service.py", line 108, in predict
algo-1-pmyh1_1 | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ret = self._entry_point(input_batch, self.context)
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py", line 31, in handle
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return self._service.transform(data, context)
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 55, in transform
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.validate_and_initialize()
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 92, in validate_and_initialize
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self._validate_user_module_and_set_functions()
algo-1-pmyh1_1 | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 103, in _validate_user_module_and_set_functions
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - user_module = importlib.import_module(self._environment.module_name)
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return _bootstrap._gcd_import(name[level:], package, level)
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 994, in _gcd_import
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 971, in _find_and_load
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
algo-1-pmyh1_1 | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ModuleNotFoundError: No module named 'handler'
algo-1-pmyh1_1 | 2019-11-11 16:31:06,308 [INFO ] W-9022-model ACCESS_LOG - /172.18.0.1:58992 "POST /invocations HTTP/1.1" 503 8
Thanks!
When do you get this error: on startup, or when trying to run predictions?
When trying to run predictions. The container started successfully; please refer to the following logs from spinning up the container:
algo-1-pmyh1_1 | 2019-11-11 16:30:48,040 [INFO ] W-9031-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.6.6
algo-1-pmyh1_1 | 2019-11-11 16:30:48,056 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
algo-1-pmyh1_1 | 2019-11-11 16:30:48,056 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Management server with: EpollServerSocketChannel.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,059 [INFO ] main com.amazonaws.ml.mms.ModelServer - Management API bind to: http://127.0.0.1:8081
algo-1-pmyh1_1 | Model server started.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9030-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9030.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9015-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9015.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9021-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9021.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9029-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9029.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9012-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9012.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,062 [INFO ] W-9024-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9024.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,062 [INFO ] W-9003-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9003.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9008-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9008.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9016-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9016.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9020-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9020.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9017-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9017.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9027-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9027.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,063 [INFO ] W-9031-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9031.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,063 [INFO ] W-9011-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9011.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9013-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9013.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,064 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9005.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,062 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9022.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9007.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,064 [INFO ] W-9023-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9023.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,061 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9002.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,064 [INFO ] W-9018-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9018.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,064 [INFO ] W-9009-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9009.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,064 [INFO ] W-9014-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9014.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9025-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9025.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9004-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9004.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,064 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9001.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9006-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9006.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9019-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9019.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9010-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9010.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9026-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9026.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,065 [INFO ] W-9028-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9028.
algo-1-pmyh1_1 | 2019-11-11 16:30:48,564 [INFO ] W-9022-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 462
algo-1-pmyh1_1 | 2019-11-11 16:30:48,564 [INFO ] W-9029-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 463
algo-1-pmyh1_1 | 2019-11-11 16:30:48,565 [INFO ] W-9030-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 460
algo-1-pmyh1_1 | 2019-11-11 16:30:48,576 [INFO ] W-9007-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 475
algo-1-pmyh1_1 | 2019-11-11 16:30:48,576 [INFO ] W-9008-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 455
algo-1-pmyh1_1 | 2019-11-11 16:30:48,577 [INFO ] W-9024-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 476
algo-1-pmyh1_1 | 2019-11-11 16:30:48,580 [INFO ] W-9027-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 471
algo-1-pmyh1_1 | 2019-11-11 16:30:48,583 [INFO ] W-9004-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 478
algo-1-pmyh1_1 | 2019-11-11 16:30:48,585 [INFO ] W-9006-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 483
algo-1-pmyh1_1 | 2019-11-11 16:30:48,586 [INFO ] W-9026-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 485
algo-1-pmyh1_1 | 2019-11-11 16:30:48,586 [INFO ] W-9031-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 485
algo-1-pmyh1_1 | 2019-11-11 16:30:48,599 [INFO ] W-9005-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 494
algo-1-pmyh1_1 | 2019-11-11 16:30:48,605 [INFO ] W-9023-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 504
algo-1-pmyh1_1 | 2019-11-11 16:30:48,610 [INFO ] W-9002-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 501
algo-1-pmyh1_1 | 2019-11-11 16:30:48,611 [INFO ] W-9019-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 494
algo-1-pmyh1_1 | 2019-11-11 16:30:48,615 [INFO ] W-9014-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 514
algo-1-pmyh1_1 | 2019-11-11 16:30:48,617 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 516
algo-1-pmyh1_1 | 2019-11-11 16:30:48,618 [INFO ] W-9017-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 520
algo-1-pmyh1_1 | 2019-11-11 16:30:48,624 [INFO ] W-9012-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 523
algo-1-pmyh1_1 | 2019-11-11 16:30:48,624 [INFO ] W-9020-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 519
algo-1-pmyh1_1 | 2019-11-11 16:30:48,625 [INFO ] W-9015-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 520
algo-1-pmyh1_1 | 2019-11-11 16:30:48,631 [INFO ] W-9011-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 522
algo-1-pmyh1_1 | 2019-11-11 16:30:48,633 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 532
algo-1-pmyh1_1 | 2019-11-11 16:30:48,636 [INFO ] W-9003-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 535
algo-1-pmyh1_1 | 2019-11-11 16:30:48,643 [INFO ] W-9025-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 542
algo-1-pmyh1_1 | 2019-11-11 16:30:48,645 [INFO ] W-9009-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 543
algo-1-pmyh1_1 | 2019-11-11 16:30:48,650 [INFO ] W-9018-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 532
algo-1-pmyh1_1 | 2019-11-11 16:30:48,664 [INFO ] W-9028-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 541
algo-1-pmyh1_1 | 2019-11-11 16:30:48,666 [INFO ] W-9013-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 562
algo-1-pmyh1_1 | 2019-11-11 16:30:48,671 [INFO ] W-9021-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 570
algo-1-pmyh1_1 | 2019-11-11 16:30:48,673 [INFO ] W-9016-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 576
algo-1-pmyh1_1 | 2019-11-11 16:30:48,676 [INFO ] W-9010-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 579
INFO:sagemaker.local.entities:Checking if serving container is up, attempt: 10
algo-1-pmyh1_1 | 2019-11-11 16:30:49,982 [INFO ] pool-1-thread-33 ACCESS_LOG - /172.18.0.1:58984 "GET /ping HTTP/1.1" 200 11
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Apologies for the late response.
That specific error happens when attempting to import your entrypoint.py as shown here: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/transformer.py#L143
The entrypoint.py is expected to be in a specific directory, which gets added to the PYTHONPATH: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L103
The specific directory itself is defined by: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/environment.py#L32
The entrypoint.py should be placed in that specific directory by the Python SDK depending on the framework version specified as shown here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/pytorch/model.py#L148
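That import failure is easy to reproduce in plain Python: importing a module by name only succeeds once its directory is on sys.path, which is what extending the PYTHONPATH with the code directory accomplishes. This is an illustrative sketch; the temp directory and the module name entrypoint_demo stand in for the real code directory and entrypoint.py.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical reproduction of the ModuleNotFoundError above: MMS imports the
# user module by name, so the directory holding the entry-point file must be
# importable. "entrypoint_demo" stands in for the real entrypoint.py.
code_dir = tempfile.mkdtemp()
with open(os.path.join(code_dir, "entrypoint_demo.py"), "w") as f:
    f.write("def model_fn(model_dir):\n    return 'loaded'\n")

try:
    importlib.import_module("entrypoint_demo")
    importable_without_path = True
except ModuleNotFoundError:
    # Same failure mode as "No module named 'handler'" in the traceback.
    importable_without_path = False

# Adding the code directory to sys.path (what extending PYTHONPATH with the
# code directory does) makes the import succeed.
sys.path.insert(0, code_dir)
module = importlib.import_module("entrypoint_demo")
print(importable_without_path, module.model_fn("/opt/ml/model"))
```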
Looking at how you are starting the inference jobs, it looks like the framework_version is being omitted, which may cause the conditional not to place the entrypoint.py into the expected directory.
I apologize for the experience, as this is not ideal; however, is there any chance you can retry your job after specifying a framework version higher than 1.2?
Thanks!
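A sketch of what that adjustment could look like, under the assumptions above. The values below are illustrative placeholders standing in for MODEL_PATH, MODEL_NAME, role, and CONTAINER_IMAGE from the earlier snippet; since the actual deploy call needs AWS credentials, only the keyword arguments are assembled here.

```python
# Hypothetical fix: pass framework_version (above 1.2) when constructing
# PyTorchModel so the Python SDK repacks the entry point with the model.
# All values below are placeholders, not real resources.
model_kwargs = {
    "entry_point": "entrypoint.py",
    "model_data": "s3://my-bucket/model.tar.gz",   # stands in for MODEL_PATH
    "name": "my-model",                            # stands in for MODEL_NAME
    "role": "arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-inference:latest",
    "framework_version": "1.3.1",                  # the missing argument
}
# With the sagemaker SDK available, you would then run:
#   predictor = PyTorchModel(**model_kwargs).deploy(
#       instance_type="local", initial_instance_count=1)
print(model_kwargs["framework_version"])
```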
Closing due to inactivity.