Comments (15)

nadiaya commented on May 20, 2024

Could you share how you run training and then deploy the trained model locally?

Previously we had one container (sagemaker-pytorch) with both training and serving/inference functionality. To reduce the size of the images, we split it into two: pytorch-training and pytorch-inference. The intent is that pytorch-training is used only for training, while pytorch-inference is used to deploy the model and run predictions against it.

From the error message you posted, it seems that the problem is caused by using the training image to run inference, though I would need more information about how you are training and hosting the model.
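
For illustration, a minimal sketch of the intended split, assuming SageMaker Python SDK v1 parameter names and that no custom image is passed, so the SDK picks pytorch-training for fit() and pytorch-inference for deploy(); the script name and S3 URI are placeholders:

```python
from sagemaker.pytorch import PyTorch

# Training runs in the pytorch-training image.
estimator = PyTorch(entry_point='train.py',            # hypothetical training script
                    role=role,
                    framework_version='1.3.1',
                    train_instance_count=1,
                    train_instance_type='local')
estimator.fit({'training': 's3://my-bucket/train'})    # placeholder channel and URI

# Hosting runs in the pytorch-inference image.
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='local')
```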

ruijianw commented on May 20, 2024

There is no training; the model is pretrained.

Pseudocode like the following:

from sagemaker.pytorch import PyTorchModel

# MODEL_PATH, MODEL_NAME, CONTAINER_IMAGE, and role are placeholders defined elsewhere.
pytorch_estimator = PyTorchModel(entry_point='entrypoint.py',
                                 model_data=MODEL_PATH,
                                 name=MODEL_NAME,
                                 role=role,
                                 image=CONTAINER_IMAGE)

predictor = pytorch_estimator.deploy(instance_type='local',
                                     initial_instance_count=1)

Please let me know if you want more details
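
For completeness, once the local endpoint is up it would typically be exercised and torn down like this (a sketch only; input_data stands in for whatever entrypoint.py's input_fn expects):

```python
result = predictor.predict(input_data)   # input_data: placeholder request payload
print(result)

predictor.delete_endpoint()              # stops the local serving container
```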

nadiaya commented on May 20, 2024

What image (CONTAINER_IMAGE) do you use to create PyTorchModel?

ruijianw commented on May 20, 2024

This is a customized image built on top of a prebuilt AWS SageMaker image.

For prebuilt images, I tried:

  1. sagemaker-pytorch
  2. pytorch-training
  3. pytorch-inference

Only 1 works; 2 and 3 fail in different ways.

nadiaya commented on May 20, 2024

2 is expected to fail.
1 and 3 should work.

What error do you get when using pytorch-inference container?

ruijianw commented on May 20, 2024

It cannot find the entrypoint.py file. I checked the Docker image; there is only an /opt/ml/model folder and no code file.

Some more observations:

  1. The logs say an MXNet worker started, which seems odd to me.
  2. According to the log output, the source code was uploaded to S3 successfully; there is a source.tar.gz, which I downloaded and verified.

nadiaya commented on May 20, 2024

  1. You see this message because it uses MMS (MXNet Model Server) to serve the predictions.
  2. I can't reproduce the issue. The exact code sample as well as the produced logs would really help.

ruijianw commented on May 20, 2024

I am closing the issue for now since you cannot reproduce it. I will do more experiments.

I may reopen it once I have more info.

ruijianw commented on May 20, 2024

For now, I would like to give it another try. The following is the error message with the pytorch-inference image:

```
algo-1-pmyh1_1  | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
algo-1-pmyh1_1  | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/mms/service.py", line 108, in predict
algo-1-pmyh1_1  | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     ret = self._entry_point(input_batch, self.context)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py", line 31, in handle
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return self._service.transform(data, context)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 55, in transform
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self.validate_and_initialize()
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 92, in validate_and_initialize
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self._validate_user_module_and_set_functions()
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 103, in _validate_user_module_and_set_functions
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     user_module = importlib.import_module(self._environment.module_name)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return _bootstrap._gcd_import(name[level:], package, level)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 994, in _gcd_import
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 971, in _find_and_load
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ModuleNotFoundError: No module named 'handler'
algo-1-pmyh1_1  | 2019-11-11 16:31:06,308 [INFO ] W-9022-model ACCESS_LOG - /172.18.0.1:58992 "POST /invocations HTTP/1.1" 503 8
```

nadiaya commented on May 20, 2024

Thanks!

When do you get this error? On startup, or when trying to run predictions?

ruijianw commented on May 20, 2024

When trying to run predictions. The container started successfully; please refer to the following logs from spinning up the container:

```
algo-1-pmyh1_1  | 2019-11-11 16:30:48,040 [INFO ] W-9031-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.6.6
algo-1-pmyh1_1  | 2019-11-11 16:30:48,056 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
algo-1-pmyh1_1  | 2019-11-11 16:30:48,056 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Management server with: EpollServerSocketChannel.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,059 [INFO ] main com.amazonaws.ml.mms.ModelServer - Management API bind to: http://127.0.0.1:8081
algo-1-pmyh1_1  | Model server started.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9030-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9030.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9015-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9015.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9021-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9021.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9029-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9029.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9012-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9012.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,062 [INFO ] W-9024-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9024.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,062 [INFO ] W-9003-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9003.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9008-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9008.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9016-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9016.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9020-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9020.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9017-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9017.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9027-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9027.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,063 [INFO ] W-9031-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9031.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,063 [INFO ] W-9011-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9011.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9013-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9013.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9005-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9005.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,062 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9022.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9007.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9023-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9023.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,061 [INFO ] W-9002-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9002.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9018-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9018.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9009-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9009.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9014-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9014.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9025-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9025.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9004-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9004.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,064 [INFO ] W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9001.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9006-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9006.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9019-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9019.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9010-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9010.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9026-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9026.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,065 [INFO ] W-9028-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9028.
algo-1-pmyh1_1  | 2019-11-11 16:30:48,564 [INFO ] W-9022-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 462
algo-1-pmyh1_1  | 2019-11-11 16:30:48,564 [INFO ] W-9029-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 463
algo-1-pmyh1_1  | 2019-11-11 16:30:48,565 [INFO ] W-9030-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 460
algo-1-pmyh1_1  | 2019-11-11 16:30:48,576 [INFO ] W-9007-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 475
algo-1-pmyh1_1  | 2019-11-11 16:30:48,576 [INFO ] W-9008-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 455
algo-1-pmyh1_1  | 2019-11-11 16:30:48,577 [INFO ] W-9024-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 476
algo-1-pmyh1_1  | 2019-11-11 16:30:48,580 [INFO ] W-9027-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 471
algo-1-pmyh1_1  | 2019-11-11 16:30:48,583 [INFO ] W-9004-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 478
algo-1-pmyh1_1  | 2019-11-11 16:30:48,585 [INFO ] W-9006-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 483
algo-1-pmyh1_1  | 2019-11-11 16:30:48,586 [INFO ] W-9026-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 485
algo-1-pmyh1_1  | 2019-11-11 16:30:48,586 [INFO ] W-9031-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 485
algo-1-pmyh1_1  | 2019-11-11 16:30:48,599 [INFO ] W-9005-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 494
algo-1-pmyh1_1  | 2019-11-11 16:30:48,605 [INFO ] W-9023-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 504
algo-1-pmyh1_1  | 2019-11-11 16:30:48,610 [INFO ] W-9002-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 501
algo-1-pmyh1_1  | 2019-11-11 16:30:48,611 [INFO ] W-9019-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 494
algo-1-pmyh1_1  | 2019-11-11 16:30:48,615 [INFO ] W-9014-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 514
algo-1-pmyh1_1  | 2019-11-11 16:30:48,617 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 516
algo-1-pmyh1_1  | 2019-11-11 16:30:48,618 [INFO ] W-9017-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 520
algo-1-pmyh1_1  | 2019-11-11 16:30:48,624 [INFO ] W-9012-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 523
algo-1-pmyh1_1  | 2019-11-11 16:30:48,624 [INFO ] W-9020-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 519
algo-1-pmyh1_1  | 2019-11-11 16:30:48,625 [INFO ] W-9015-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 520
algo-1-pmyh1_1  | 2019-11-11 16:30:48,631 [INFO ] W-9011-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 522
algo-1-pmyh1_1  | 2019-11-11 16:30:48,633 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 532
algo-1-pmyh1_1  | 2019-11-11 16:30:48,636 [INFO ] W-9003-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 535
algo-1-pmyh1_1  | 2019-11-11 16:30:48,643 [INFO ] W-9025-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 542
algo-1-pmyh1_1  | 2019-11-11 16:30:48,645 [INFO ] W-9009-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 543
algo-1-pmyh1_1  | 2019-11-11 16:30:48,650 [INFO ] W-9018-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 532
algo-1-pmyh1_1  | 2019-11-11 16:30:48,664 [INFO ] W-9028-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 541
algo-1-pmyh1_1  | 2019-11-11 16:30:48,666 [INFO ] W-9013-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 562
algo-1-pmyh1_1  | 2019-11-11 16:30:48,671 [INFO ] W-9021-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 570
algo-1-pmyh1_1  | 2019-11-11 16:30:48,673 [INFO ] W-9016-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 576
algo-1-pmyh1_1  | 2019-11-11 16:30:48,676 [INFO ] W-9010-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 579
INFO:sagemaker.local.entities:Checking if serving container is up, attempt: 10
algo-1-pmyh1_1  | 2019-11-11 16:30:49,982 [INFO ] pool-1-thread-33 ACCESS_LOG - /172.18.0.1:58984 "GET /ping HTTP/1.1" 200 11
```
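
For reference, the ping/invocations round trip in the logs above can be reproduced against the local container directly; a minimal sketch, assuming the default local port mapping of 8080 and a placeholder payload and content type:

```python
import requests

# Health check: returns 200, matching the GET /ping line in the logs.
print(requests.get('http://localhost:8080/ping').status_code)

# Prediction: returns 503 once the handler import fails, matching POST /invocations.
resp = requests.post('http://localhost:8080/invocations',
                     data=b'...',                                   # placeholder payload
                     headers={'Content-Type': 'application/x-npy'})
print(resp.status_code)
```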

stale commented on May 20, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale commented on May 20, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ChoiByungWook commented on May 20, 2024

For now, I would like to give it another try. The following is the error message with the pytorch-inference image:

```
algo-1-pmyh1_1  | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
algo-1-pmyh1_1  | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/mms/service.py", line 108, in predict
algo-1-pmyh1_1  | 2019-11-11 16:31:06,305 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     ret = self._entry_point(input_batch, self.context)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/default_handler_service.py", line 31, in handle
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return self._service.transform(data, context)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 55, in transform
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self.validate_and_initialize()
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 92, in validate_and_initialize
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self._validate_user_module_and_set_functions()
algo-1-pmyh1_1  | 2019-11-11 16:31:06,306 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 103, in _validate_user_module_and_set_functions
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     user_module = importlib.import_module(self._environment.module_name)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     return _bootstrap._gcd_import(name[level:], package, level)
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 994, in _gcd_import
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 971, in _find_and_load
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
algo-1-pmyh1_1  | 2019-11-11 16:31:06,307 [INFO ] W-9022-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ModuleNotFoundError: No module named 'handler'
algo-1-pmyh1_1  | 2019-11-11 16:31:06,308 [INFO ] W-9022-model ACCESS_LOG - /172.18.0.1:58992 "POST /invocations HTTP/1.1" 503 8
```

Apologies for the late response.

That specific error happens when attempting to import your entrypoint.py as shown here: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/transformer.py#L143

The entrypoint.py is expected to be in a specific directory, which is added to the PYTHONPATH: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L103

The specific directory itself is defined by: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/environment.py#L32

The entrypoint.py should be placed in that directory by the Python SDK, depending on the framework version specified, as shown here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/pytorch/model.py#L148
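
To make that expectation concrete, here is a hedged sketch of the layout the toolkit looks for once model.tar.gz is extracted to /opt/ml/model (directory name taken from the linked environment.py), together with how one could repack the artifact by hand if the SDK does not do it for you; the file names are placeholders:

```python
import tarfile

# Expected layout after extraction to /opt/ml/model:
#   model.pth           <- weights loaded by model_fn in entrypoint.py
#   code/entrypoint.py  <- user module the inference toolkit imports
with tarfile.open('model.tar.gz', 'w:gz') as tar:
    tar.add('model.pth')
    tar.add('entrypoint.py', arcname='code/entrypoint.py')
```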

Looking at how you are starting the inference job, it looks like framework_version is being omitted, so the conditional that places entrypoint.py into that directory may never be triggered.

I apologize for the experience, as this is not ideal. However, is there any chance you can retry your job after specifying a framework_version higher than 1.2?
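
Concretely, the retry could look like the following minimal sketch, assuming SDK v1 argument names and that dropping the custom image lets the SDK select the matching pytorch-inference image; MODEL_PATH, MODEL_NAME, and role are the same placeholders as in the original snippet:

```python
from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(entry_point='entrypoint.py',
                             model_data=MODEL_PATH,
                             name=MODEL_NAME,
                             role=role,
                             framework_version='1.3.1',  # >= 1.2 so the SDK repacks model.tar.gz with the entry point
                             py_version='py3')

predictor = pytorch_model.deploy(instance_type='local',
                                 initial_instance_count=1)
```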

Thanks!

nadiaya commented on May 20, 2024

Closing due to inactivity.
