This is the Docker container based on the open source XGBoost framework (https://xgboost.readthedocs.io/en/latest/) that allows customers to use their own XGBoost scripts in SageMaker.

License: Apache License 2.0


SageMaker XGBoost Container

SageMaker XGBoost Container is an open source library for making the XGBoost framework run on Amazon SageMaker.

This repository also contains Dockerfiles which install this library and dependencies for building SageMaker XGBoost Framework images.

The SageMaker team uses this repository to build its official XGBoost Framework image. To use this image on SageMaker, see Python SDK. For end users, this repository is typically of interest if you need implementation details for the official image, or if you want to use it to build your own customized XGBoost Framework image.

Table of Contents

  1. Getting Started
  2. Building your Image
  3. Running the tests

Getting Started

Prerequisites

Make sure you have installed all of the following prerequisites on your development machine:

Note: CMake is required for XGBoost. If you are using macOS, install CMake (for example, pip install cmake).

Building your image

Amazon SageMaker uses Docker containers to run all training jobs and inference endpoints.

The Docker images are built from the Dockerfiles in docker/.

The Dockerfiles are grouped by XGBoost version and separated by Python version and processor type.

The Docker images used to run training and inference jobs are built from both the corresponding "base" and "final" Dockerfiles.

Base Images

The "base" Dockerfile encompasses the installation of the framework and all of its dependencies.

The tagging scheme is based on <SageMaker-XGBoost-version>-cpu-py3 (e.g. 1.7-1-cpu-py3), where

<SageMaker-XGBoost-version> is composed of <XGBoost-version>-<SageMaker-version>.

All "final" Dockerfiles build images using base images that use the tagging scheme above.
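As an illustration of the scheme above, the tag can be assembled from its parts (the version strings below are examples only):

```python
# Assemble an image tag following the <XGBoost-version>-<SageMaker-version>-cpu-py3
# scheme described above. The version values are illustrative examples.
def base_image_tag(xgboost_version: str, sagemaker_version: str) -> str:
    sagemaker_xgboost_version = f"{xgboost_version}-{sagemaker_version}"
    return f"{sagemaker_xgboost_version}-cpu-py3"

print(base_image_tag("1.7", "1"))  # 1.7-1-cpu-py3
```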

If you want to build your own base Docker image, use:

# All build instructions assume you're building from the root directory of the sagemaker-xgboost-container.

# CPU
docker build -t xgboost-container-base:<SageMaker-XGBoost-version>-cpu-py3 -f docker/<SageMaker-XGBoost-version>/base/Dockerfile.cpu .

# Example (CPU)
docker build -t xgboost-container-base:1.7-1-cpu-py3 -f docker/1.7-1/base/Dockerfile.cpu .

Final Images

The "final" Dockerfiles encompass the installation of the SageMaker-specific support code.

All "final" Dockerfiles use base images for building.

These "base" images are specified with the naming convention of xgboost-container-base:<SageMaker-XGBoost-version>-cpu-py3.

Before building "final" images:

Build your "base" image. Make sure it is named and tagged in accordance with your "final" Dockerfile.

# Create the SageMaker XGBoost Container Python package.
cd sagemaker-xgboost-container
python setup.py bdist_wheel --universal

If you want to build "final" Docker images, then use:

# All build instructions assume you're building from the root directory of the sagemaker-xgboost-container.

# CPU
docker build -t <image_name>:<tag> -f docker/<xgboost-version>/final/Dockerfile.cpu .

# Example (CPU)
docker build -t preprod-xgboost-container:1.7-1-cpu-py3 -f docker/1.7-1/final/Dockerfile.cpu .

Running the tests

Running the tests requires installation of the SageMaker XGBoost Framework container code and its test dependencies.

git clone https://github.com/aws/sagemaker-xgboost-container.git
cd sagemaker-xgboost-container
# The below command works as written in bash; in zsh, quote the argument: pip install -e ".[test]"
pip install -e .[test]

Conda is also required and can be installed by following the instructions at https://conda.io/projects/conda/en/latest/user-guide/install/index.html. For convenience, the Linux installation commands are provided as an example.

curl -LO https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -bfp /miniconda3
rm Miniconda3-latest-Linux-x86_64.sh
export PATH=/miniconda3/bin:${PATH}
conda update -y conda

Tests are defined in test/ and include unit, local integration, and SageMaker integration tests.

Unit Tests

If you want to run unit tests, then use:

# All test instructions should be run from the top level directory

pytest test/unit

# or you can use tox to run unit tests as well as flake8 and code coverage

tox
tox -e py3-xgboost1.0,flake8
tox -e py3-xgboost0.90,py3-xgboostlatest
tox -e py3-xgboost0.72

Local Integration Tests

Running the local integration tests requires Docker and AWS credentials, as the local integration tests make calls to a few AWS services. The local integration tests and SageMaker integration tests require configurations specified in their respective conftest.py files.

Before running local integration tests:

  1. Build your Docker image.
  2. Pass in the correct pytest arguments to run tests against your Docker image.

If you want to run local integration tests, then use:

# Required arguments for integration tests are found in test/conftest.py

pytest test/integration/local --docker-base-name <your_docker_image> \
                  --tag <your_docker_image_tag> \
                  --py-version <2_or_3> \
                  --framework-version <xgboost-version>

# Example
pytest test/integration/local --docker-base-name preprod-xgboost-container \
                  --tag 1.7-1-cpu-py3 \
                  --py-version 3 \
                  --framework-version 1.7-1

SageMaker Integration Tests

SageMaker integration tests require your Docker image to be in an Amazon ECR repository (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ECS_Console_Repositories.html).

The Docker base name is your ECR repository namespace (https://docs.aws.amazon.com/AmazonECR/latest/userguide/Repositories.html).

The instance type is the Amazon SageMaker instance type that the SageMaker integration test will run on.

Before running SageMaker integration tests:

  1. Build your Docker image.
  2. Push the image to your ECR repository.
  3. Pass in the correct pytest arguments to run tests on SageMaker against the image within your ECR repository.

If you want to run an end-to-end SageMaker integration test on Amazon SageMaker, use:

# Required arguments for integration tests are found in test/conftest.py

pytest test/integration/sagemaker --aws-id <your_aws_id> \
                       --docker-base-name <your_docker_image> \
                       --instance-type <amazon_sagemaker_instance_type> \
                       --tag <your_docker_image_tag>
# Example
pytest test/integration/sagemaker --aws-id 12345678910 \
                       --docker-base-name preprod-xgboost-container \
                       --instance-type ml.m4.xlarge \
                       --tag 1.0

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

License

SageMaker XGBoost Framework Container is licensed under the Apache 2.0 License. It is copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. The license is available at: http://aws.amazon.com/apache2.0/


sagemaker-xgboost-container's Issues

XGBoost 1.7 multi-instance GPU training with Dask - NCCL error

Hello, we tried out multi-instance GPU training (e.g. 2 x g4dn.2xlarge) with Dask and XGBoost 1.7.4, and we get the following error. The same training setup works if we use 1.5.2. Is any change needed in inter-node communication to make training work with 1.7?
We also don't see this issue with the reg:squarederror objective, but it happens for binary classification tasks.

Hyperparameters used:

hyperparams = {
    "objective": "binary:logistic",
    "num_round": "1000",
    "eval_metric": "error",
    "tree_method": "gpu_hist",
    "verbosity": "3",
}

We're using SageMaker's container for XGBoost, which starts the cluster with Dask CUDA workers on each instance and calls xgboost.dask.train after the data load.
The data load completes and training starts, but immediately after that we see this error. Thanks in advance!

[18:46:49] DEBUG: ../src/tree/updater_gpu_hist.cu:751: [GPU Hist]: Configure
[18:46:50] ======== Monitor (0): SketchContainer ========
[18:46:50] WARNING: ../src/common/timer.cc:43: Timer for AllReduce did not get stopped properly.
[18:46:50] WARNING: ../src/common/timer.cc:43: Timer for MakeCuts did not get stopped properly.
[18:46:50] ScanInput: 0.000816s, 1 calls @ 816us
[18:46:50] ======== Monitor (0): ellpack_page ========
[18:46:50] WARNING: ../src/common/timer.cc:43: Timer for Quantiles did not get stopped properly.
[18:46:50] ======== Monitor (1): SketchContainer ========
[18:46:50] WARNING: ../src/common/timer.cc:43: Timer for AllReduce did not get stopped properly.
[18:46:50] WARNING: ../src/common/timer.cc:43: Timer for MakeCuts did not get stopped properly.
[18:46:50] ScanInput: 0.00082s, 1 calls @ 820us
[18:46:50] ======== Monitor (1): ellpack_page ========
[18:46:50] WARNING: ../src/common/timer.cc:43: Timer for Quantiles did not get stopped properly.
2023-03-16 18:46:52,741 - distributed.scheduler - INFO - Remove client Client-c54c72af-c42a-11ed-802f-123388d7d4a0
2023-03-16 18:46:52,751 - distributed.core - INFO - Received 'close-stream' from tcp://10.0.103.71:58544; closing.
2023-03-16 18:46:52,751 - distributed.scheduler - INFO - Remove client Client-c54c72af-c42a-11ed-802f-123388d7d4a0
2023-03-16 18:46:52,752 - distributed.scheduler - INFO - Close client connection: Client-c54c72af-c42a-11ed-802f-123388d7d4a0
Traceback (most recent call last):
  File "/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/ml/code/dask_entry_script_csv.py", line 174, in <module>
    output = xgb.dask.train(client,
  File "/miniconda3/lib/python3.8/site-packages/xgboost/core.py", line 620, in inner_f
    return func(**kwargs)
  File "/miniconda3/lib/python3.8/site-packages/xgboost/dask.py", line 1057, in train
    return client.sync(
  File "/miniconda3/lib/python3.8/site-packages/distributed/utils.py", line 339, in sync
    return sync(
  File "/miniconda3/lib/python3.8/site-packages/distributed/utils.py", line 406, in sync
    raise exc.with_traceback(tb)
  File "/miniconda3/lib/python3.8/site-packages/distributed/utils.py", line 379, in f
    result = yield future
  File "/miniconda3/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/miniconda3/lib/python3.8/site-packages/xgboost/dask.py", line 993, in _train_async
    results = await map_worker_partitions(
  File "/miniconda3/lib/python3.8/site-packages/xgboost/dask.py", line 529, in map_worker_partitions
    results = await client.gather(futures)
  File "/miniconda3/lib/python3.8/site-packages/distributed/client.py", line 2154, in _gather
    raise exception.with_traceback(traceback)
  File "/miniconda3/lib/python3.8/site-packages/xgboost/dask.py", line 960, in dispatched_train
    booster = worker_train(
  File "/miniconda3/lib/python3.8/site-packages/xgboost/core.py", line 620, in inner_f
    return func(**kwargs)
  File "/miniconda3/lib/python3.8/site-packages/xgboost/training.py", line 185, in train
    bst.update(dtrain, i, obj)
  File "/miniconda3/lib/python3.8/site-packages/xgboost/core.py", line 1918, in update
    _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
  File "/miniconda3/lib/python3.8/site-packages/xgboost/core.py", line 279, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [18:46:50] ../src/tree/updater_gpu_hist.cu:798: Exception in gpu_hist: [18:46:50] ../src/collective/../common/device_helpers.cuh:135: NCCL failure :remote process exited or there was a network error ../src/collective/nccl_device_communicator.cuh(54)
Stack trace:
  [bt] (0) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x45c64d) [0x7ff0cafa964d]
  [bt] (1) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x45e3f9) [0x7ff0cafab3f9]
  [bt] (2) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4602a6) [0x7ff0cafad2a6]
  [bt] (3) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x45c9d3) [0x7ff0cafa99d3]
  [bt] (4) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4bfb47) [0x7ff0cb00cb47]
  [bt] (5) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c08ce) [0x7ff0cb00d8ce]
  [bt] (6) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x469e16) [0x7ff0cafb6e16]
  [bt] (7) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4e9e68) [0x7ff0cb036e68]
  [bt] (8) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4ea6ae) [0x7ff0cb0376ae]
Stack trace:
  [bt] (0) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x71dd19) [0x7ff0cb26ad19]
  [bt] (1) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x73d378) [0x7ff0cb28a378]
  [bt] (2) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x2a0dc9) [0x7ff0cadeddc9]
  [bt] (3) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x2a1b1d) [0x7ff0cadeeb1d]
  [bt] (4) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x2dc5f7) [0x7ff0cae295f7]
  [bt] (5) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7ff0cac7abe0]
  [bt] (6) /miniconda3/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7ff191f5a9dd]
  [bt] (7) /miniconda3/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7ff191f5a067]
  [bt] (8) /miniconda3/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7ff191f731e9]
2023-03-16 18:46:52,822 - distributed.worker - WARNING - Compute Failed
Key:       dispatched_train-13b9cca0-1853-4239-beb0-ee5360cd3dcb
Function:  dispatched_train
args:      ({'verbosity': 3, 'objective': 'binary:logistic', 'tree_method': 'gpu_hist', 'eval_metric': 'error'}, {'DMLC_NUM_WORKER': 2, 'DMLC_TRACKER_URI': '10.0.103.71', 'DMLC_TRACKER_PORT': 46815}, 140630576889760, ['train', 'validation'], [140630576889760, 140630577940320], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': None, 'enable_categorical': False, 'parts': [{'data':        feature-01  feature-02  feature-03  ...  feature-26  feature-27  feature-28
0        0.869293   -0.635082    0.225690  ...    0.721657    0.988751    0.876678
1        0.907542    0.329147    0.359412  ...    0.779732    0.992356    0.798343
2        0.798835    1.470639   -1.635975  ...    0.803252    0.865924    0.780118
3        1.344385   -0.876626    0.935913  ...    0.869200    1.026736    0.957904
4        1.105009    0.321356    1.522401  ...    1.133295    0.872245    0.808486
...           ...         ...         ...  ...         ...         ...         ...
74635    0.9346
kwargs:    {}
Exception: "XGBoostError('[18:46:50] ../src/tree/updater_gpu_hist.cu:798: Exception in gpu_hist: [18:46:50] ../src/collective/../common/device_helpers.cuh:135: NCCL failure :remote process exited or there was a network error ../src/collective/nccl_device_communicator.cuh(54)\\nStack trace:\\n  [bt] (0) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x45c64d) [0x7fb652c3a64d]\\n  [bt] (1) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x45e3f9) [0x7fb652c3c3f9]\\n  [bt] (2) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4602a6) [0x7fb652c3e2a6]\\n  [bt] (3) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x45c9d3) [0x7fb652c3a9d3]\\n  [bt] (4) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4bfb47) [0x7fb652c9db47]\\n  [bt] (5) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c08ce) [0x7fb652c9e8ce]\\n  [bt] (6) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x469e16) [0x7fb652c47e16]\\n  [bt] (7) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4e9e68) [0x7fb652cc7e68]\\n  [bt] (8) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4ea6ae) [0x7fb652cc86ae]\\n\\n\\n\\nStack trace:\\n  [bt] (0) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x71dd19) [0x7fb652efbd19]\\n  [bt] (1) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x73d378) [0x7fb652f1b378]\\n  [bt] (2) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x2a0dc9) [0x7fb652a7edc9]\\n  [bt] (3) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x2a1b1d) [0x7fb652a7fb1d]\\n  [bt] (4) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x2dc5f7) [0x7fb652aba5f7]\\n  [bt] (5) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7fb65290bbe0]\\n  [bt] (6) 
/miniconda3/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fb71dc349dd]\\n  [bt] (7) /miniconda3/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fb71dc34067]\\n  [bt] (8) /miniconda3/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fb71dc4d1e9]\\n\\n')"
[2023-03-16:18:46:53:ERROR] ExecuteUserScriptError:
Command "/miniconda3/bin/python3 -m dask_entry_script_csv --eval_metric error --num_round 1000 --objective binary:logistic --tree_method gpu_hist --use_dask_gpu_training True --verbosity 3"
2023-03-16 18:46:52,735 - distributed.worker - WARNING - Compute Failed
Key:       dispatched_train-2b3c1222-a511-42bd-8712-b1fb113b1778
Function:  dispatched_train
args:      ({'verbosity': 3, 'objective': 'binary:logistic', 'tree_method': 'gpu_hist', 'eval_metric': 'error'}, {'DMLC_NUM_WORKER': 2, 'DMLC_TRACKER_URI': '10.0.103.71', 'DMLC_TRACKER_PORT': 46815}, 140630576889760, ['train', 'validation'], [140630576889760, 140630577940320], {'feature_names': None, 'feature_types': None, 'feature_weights': None, 'missing': None, 'enable_categorical': False, 'parts': [{'data':        feature-01  feature-02  feature-03  ...  feature-26  feature-27  feature-28
0        0.705317    0.328173   -0.949611  ...    1.471880    0.931361    0.814569
1        0.555799   -1.505810   -0.000796  ...    0.791748    0.872126    0.772578
2        0.531824    0.360314    1.485226  ...    2.213101    1.279034    1.058182
3        0.935909   -0.261078    1.397557  ...    0.145946    1.016972    0.850355
4        2.040917    0.534655   -1.072791  ...    0.392952    0.766512    0.806905
...           ...         ...         ...  ...         ...         ...         ...
74386    0.5561
kwargs:    {}
Exception: "XGBoostError('[18:46:50] ../src/tree/updater_gpu_hist.cu:798: Exception in gpu_hist: [18:46:50] ../src/collective/../common/device_helpers.cuh:135: NCCL failure :remote process exited or there was a network error ../src/collective/nccl_device_communicator.cuh(54)\\nStack trace:\\n  [bt] (0) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x45c64d) [0x7ff0cafa964d]\\n  [bt] (1) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x45e3f9) [0x7ff0cafab3f9]\\n  [bt] (2) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4602a6) [0x7ff0cafad2a6]\\n  [bt] (3) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x45c9d3) [0x7ff0cafa99d3]\\n  [bt] (4) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4bfb47) [0x7ff0cb00cb47]\\n  [bt] (5) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4c08ce) [0x7ff0cb00d8ce]\\n  [bt] (6) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x469e16) [0x7ff0cafb6e16]\\n  [bt] (7) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4e9e68) [0x7ff0cb036e68]\\n  [bt] (8) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x4ea6ae) [0x7ff0cb0376ae]\\n\\n\\n\\nStack trace:\\n  [bt] (0) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x71dd19) [0x7ff0cb26ad19]\\n  [bt] (1) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x73d378) [0x7ff0cb28a378]\\n  [bt] (2) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x2a0dc9) [0x7ff0cadeddc9]\\n  [bt] (3) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x2a1b1d) [0x7ff0cadeeb1d]\\n  [bt] (4) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x2dc5f7) [0x7ff0cae295f7]\\n  [bt] (5) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x70) [0x7ff0cac7abe0]\\n  [bt] (6) 
/miniconda3/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7ff191f5a9dd]\\n  [bt] (7) /miniconda3/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7ff191f5a067]\\n  [bt] (8) /miniconda3/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7ff191f731e9]\\n\\n')"
2023-03-16 18:46:53,033 - distributed.core - INFO - Connection to tcp://10.0.103.71:8786 has been closed.
2023-03-16 18:46:53,033 - distributed.worker - INFO - Stopping worker at tcp://10.0.68.6:40789. Reason: worker-handle-scheduler-connection-broken

How to output my own f1 score metric in XGBoost container script mode

I am using XGBoost in script mode via a SageMaker Pipeline and want to output my own f1 score so that SageMaker picks it up and displays it in the Metrics section of my training job. I print my f1 score in the code as print('train-f1:'+str(train_f1)), yet SageMaker does not pick it up and display it for my training job in the console, although it shows up in the logs. I assumed that printing the result would write it to stdout and SageMaker would pick it up from there. Any help on how to do this would be appreciated.
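One likely cause: SageMaker only surfaces metrics whose log output matches a configured metric definition; printing alone is not enough. Below is a minimal sketch of how such a definition's regex extracts the value from a log line. The metric name and regex here are illustrative assumptions, not taken from this repository:

```python
import re

# Illustrative metric definition: SageMaker applies Regex to each log line and
# records the first capture group as the metric value.
metric_definitions = [
    {"Name": "train:f1", "Regex": r"train-f1:([0-9]+\.[0-9]*)"},
]

# A training script would emit a matching line, e.g. via print('train-f1:' + str(train_f1)):
log_line = "train-f1:0.8731"

match = re.search(metric_definitions[0]["Regex"], log_line)
print(match.group(1))  # 0.8731
```

In script mode, a list like this would be passed as metric_definitions when constructing the Estimator, so the training job console can display the extracted values.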

Optimization issues with 0.90 built-in algo

Hi Team,

I trained a model on a single instance, which took 4 hours 42 minutes, while training on 2 instances with the distributed-by-key option (2 plain CSV files) took 6 hours 28 minutes. Are there any optimization techniques that could be applied? Any insight into why training time increases for distributed training would also help.

Custom XGBoost optimization metrics

Hello Team,

It would be great if there were a way to contribute or use custom optimization metrics in the AWS XGBoost implementation. Currently, we are heavy users of recall@precision (e.g. optimize recall at 70% precision), but we don't have a way to leverage the distributed capabilities provided by your implementation with this custom optimization metric. AUCPR should be available in recent versions, but it's not the same thing. Would you be able to either enable the use of custom metrics, or enable this metric as an option in the short term?

Thank you in advance.

Specifying repo_version in SageMaker XGBoost built-in algorithm

I have been creating the XGBoost container using container = get_image_uri(region, 'xgboost') so far. I understand from this doc https://docs.aws.amazon.com/en_pv/sagemaker/latest/dg/xgboost.html that if the user does not specify a repo_version for the get_image_uri API, it points to XGBoost version 0.72.

What should I specify in repo_version if I want to be explicit about it pointing to XGBoost version 0.72? Is it repo_version = '0.72-1'? I can't seem to consistently pass all the trainings with it.

Also do you have a list of repo_version available for SageMaker XGBoost built-in algorithm?

Discrepancy in validation log loss between training and endpoint prediction

I am using the following code to run a binary classification problem in SageMaker:

(screenshot of the code)

During training, this is the log being generated:

(screenshot of the training log)

However, when I check the validation log loss from the predictions generated using the endpoint, I get a completely different result:

(screenshot of the endpoint predictions)

I am not sure if I am extracting the prediction probabilities correctly.

Customizing f1 scoring function in custom_metrics.py

Hi,

Can you please extend the definition of the f1 scoring method in custom_metrics.py such that the user can pass any value in [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted'] as the "average" parameter? In the current implementation, the "average" parameter is hard-coded.

Thanks!
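For context, here is a minimal pure-Python sketch of what the "average" parameter controls, mirroring scikit-learn's f1_score semantics for two of the options. This is illustrative only, not the container's actual implementation:

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 for the positive class only (the 'binary' average)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if 2 * tp + fp + fn == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (the 'macro' average)."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_binary(y_true, y_pred, positive=c) for c in classes) / len(classes)

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(f1_binary(y_true, y_pred))  # 0.8
print(f1_macro(y_true, y_pred))   # mean of per-class F1 scores
```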

n_estimator parameter?

I cannot find a way to pass the n_estimator parameter to the algorithm. Is there a way?

Private docker

Hi,

We are creating an XGBoost model for our client and I'm trying to enable it with hyperparameter tuning. For security reasons, they require that we use their private Docker repository on Artifactory instead of ECR. I've tried numerous approaches using the Estimator class as well as create_hyper_parameter_tuning_job, but I keep getting the following error, and I don't know where I can inject the setting as a parameter or input JSON.

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: TrainingImageConfig with TrainingRepositoryAccessMode set to VPC must be provided when using a training image from a private Docker registry. Please provide TrainingImageConfig and TrainingRepositoryAccessMode set to VPC when using a training image from a private Docker registry.

Below is my current attempt using create_hyper_parameter_tuning_job. It's driving me crazy that I can't inject an image config along with hyperparameter tuning.

import sagemaker
import boto3

from sagemaker import get_execution_role

role = get_execution_role()

smclient = boto3.client("sagemaker")

tuning_job_config = {
    "ParameterRanges": {
        "ContinuousParameterRanges": [
            {
                "MinValue": "1",
                "MaxValue": "10",
                "Name": "min_child_weight",
            },
            {
                "MinValue": "1",
                "MaxValue": "10",
                "Name": "subsample",
            },
            {
                "MinValue": "0",
                "MaxValue": "1",
                "Name": "eta",
            }
        ],
        "IntegerParameterRanges": [
            {
                "MinValue": "1",
                "MaxValue": "10",
                "Name": "max_depth",
            },
            {
                "MinValue": "1",
                "MaxValue": "10",
                "Name": "gamma",
            },
            {
                "MinValue": "1",
                "MaxValue": "5",
                "Name": "reg_alpha",
            }
        ],
    },
    "ResourceLimits": {"MaxNumberOfTrainingJobs": 9, "MaxParallelTrainingJobs": 3},
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {"MetricName": "validation:auc", "Type": "Maximize"}
}

training_config = {
    "AlgorithmSpecification": {
        "TrainingImage": "<private docker registry image>",
        "MetricDefinitions":  [
    {'Name': 'train:f1','Regex':'train-f1:([0-9]+\.[0-9]*)'},
    {'Name': 'validation:f1','Regex':'validation-f1:([0-9]+\.[0-9]*)'},
    {'Name': 'train:auc','Regex':'train-auc:([0-9]+\.[0-9]*)'},
    {'Name': 'validation:auc','Regex':'validation-auc:([0-9]+\.[0-9]*)'},
    {'Name': 'train:accuracy','Regex':'train-accuracy:([0-9]+\.[0-9]*)'},
    {'Name': 'validation:accuracy','Regex':'validation-accuracy:([0-9]+\.[0-9]*)'},
    {'Name': 'train:precision','Regex':'train-precision:([0-9]+\.[0-9]*)'},
    {'Name': 'validation:precision','Regex':'validation-precision:([0-9]+\.[0-9]*)'},
    {'Name': 'train:recall','Regex':'train-recall:([0-9]+\.[0-9]*)'},
    {'Name': 'validation:recall','Regex':'validation-recall:([0-9]+\.[0-9]*)'}],
        "TrainingInputMode": "File",
    },
    "OutputDataConfig": {"S3OutputPath": "s3://attrion-model/output"},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.m4.xlarge","VolumeSizeInGB": 50},
    "RoleArn": role,
    "StaticHyperParameters": {
        "num_round": "100",
        "objective": "binary:logistic",
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 43200},
    "ImageConfig":  {"RepositoryAccessMode":"VPC"}
}

smclient.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="attrition-model",
    HyperParameterTuningJobConfig=tuning_job_config,
    TrainingJobDefinition=training_config)

Missing proper error message when an XGBoost model file can't be found in the tar.gz file

When uploading a model tar.gz file in which the model file is not at the root of the archive, no proper error is logged. This happened to me when I placed the model in a folder rather than at the root of model.tar.gz.
Technically, the next() throws a StopIteration error.
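For reference, a packaging sketch with Python's tarfile module (the file name is an illustrative assumption, not the container's required name); the point is that the model file must sit at the root of the archive, with no directory prefix in its arcname:

```python
import tarfile

# Illustrative model file name; the real name depends on how the booster was saved.
model_file = "xgboost-model"
with open(model_file, "wb") as f:
    f.write(b"serialized booster bytes")

# Correct: the model file is at the root of the archive.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add(model_file, arcname=model_file)  # arcname has no directory prefix

# Incorrect (reproduces the silent failure described above):
# tar.add(model_file, arcname="my_folder/" + model_file)

with tarfile.open("model.tar.gz", "r:gz") as tar:
    print(tar.getnames())  # ['xgboost-model']
```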

Create model always uses estimator's entrypoint

Hi, I split the train and inference logic into two separate entrypoint files. However, when creating a model, the model always carries over the train entrypoint instead of using what's specified in create_model(entry_point='inference.py').

Kindly see the behavior shown in the screenshot, which was taken from a SageMaker notebook instance running in us-east-1.

(screenshot: Screen Shot 2020-05-18 at 07 50 02 PM)

One less column in inference set does not throw an exception when it should

We discovered that if the inference CSV dataset has one fewer column than the training dataset, the exception

Unable to evaluate payload provided: Feature size of csv inference data ... is not consistent with feature size of trained model ...

is not thrown. On the other hand, if it has two fewer columns, the exception is thrown.

This can be dangerous: inference data that is nonsense in relation to the training data may be accepted if the user doesn't implement their own check.

We are using this endpoint: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html, which I assume uses this container.

The issue can be replicated if you train a model with 1 predicted feature + x predictor features and then use this model's artifact to predict on a dataset with 1 ID feature + x-1 predictor features, with these transform endpoint parameters:

   "DataProcessing": { 
      "InputFilter": "$[1:]",
      "JoinSource": "Input",
      "OutputFilter": "$[0, -1]"
}

What happens in my case:

Using the parameters

   "DataProcessing": { 
      "InputFilter": "$[1:]",
      "JoinSource": Input,
      "OutputFilter": "$[0, -1]"
}

It runs through without problems

while if I use

   "DataProcessing": { 
      "InputFilter": "$[2:]",
      "JoinSource": Input,
      "OutputFilter": "$[0, -1]"
}

I get:

Unable to evaluate payload provided: Feature size of csv inference data 96 is not consistent with feature size of trained model 98.

This tells me that using InputFilter index 1 should already give me the error

Unable to evaluate payload provided: Feature size of csv inference data 97 is not consistent with feature size of trained model 98.

but it doesn't.
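
Until the container enforces this consistently, a client-side guard can reject rows whose column count drifts from the training set. A minimal standard-library sketch (the expected count of 98 mirrors the numbers in the error above and is illustrative):

```python
import csv
import io

EXPECTED_FEATURES = 98  # feature size of the trained model (illustrative)

def validate_csv_payload(payload: str, expected: int = EXPECTED_FEATURES):
    """Raise if any CSV row's column count differs from the trained model's."""
    for row in csv.reader(io.StringIO(payload)):
        if len(row) != expected:
            raise ValueError(
                f"Feature size of csv inference data {len(row)} is not "
                f"consistent with feature size of trained model {expected}")

ok_row = ",".join(["0"] * 98)
bad_row = ",".join(["0"] * 97)  # one column short -- currently slips through
validate_csv_payload(ok_row)    # passes silently
try:
    validate_csv_payload(bad_row)
except ValueError as e:
    print(e)
```

Applying this before CreateTransformJob catches the one-column-short case that the container currently misses.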

KeyError: 'S3DistributionType' when trying to use the container via a SageMaker local-mode estimator

Hello,

I built the image with version 1.2-1 and pushed it to ECR and would like to call it via a SageMaker estimator in local mode; something like:

xgb = Estimator(
    image_uri=ECR_URI,
    role='arn:aws:iam::111111111111:role/service-role/role-name',
    instance_count=1,
    hyperparameters={
        "objective": "reg:squarederror",
        "num_round": 10,
    },
    instance_type="local",
    input_mode="File",
    output_path='file://.',
)

train_dataset = TrainingInput(f'file://{local_train}', content_type="text/csv")

xgb.fit({"train": train_dataset}, wait=True)

This syntax works fine with simple custom-built images. For experimentation, I'd also like to use it with an image built from the current repo. But when running this I get:

(...)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
ERROR:sagemaker-containers:Reporting training FAILURE
ERROR:sagemaker-containers:framework error: 
Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
    train(framework.training_env())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
    run_algorithm_mode()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
    checkpoint_config=checkpoint_config
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 115, in sagemaker_train
    validated_data_config = channels.validate(data_config)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 106, in validate
    channel_obj.validate(value)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 52, in validate
    if (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE]) not in self.supported:
KeyError: 'S3DistributionType'
(...)

I tried specifying distribution="FullyReplicated" in the TrainingInput but got the same error. Any guidance on what I may have done wrong?

Thanks a lot in advance

Edit: I'm using SageMaker 2.19.
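
For reference, the crash above is a plain dict lookup on a key that local mode does not populate; a defensive lookup with a FullyReplicated default would sidestep it. A standard-library sketch of the lookup in question (the channel dict is illustrative, not the toolkit's real objects):

```python
# Sketch of the lookup in channel_validation.py that raises KeyError when
# local mode omits S3DistributionType, and a defensive alternative.
S3_DIST_TYPE = "S3DistributionType"

def dist_type_strict(channel: dict) -> str:
    return channel[S3_DIST_TYPE]  # KeyError in local mode

def dist_type_defensive(channel: dict) -> str:
    # Local mode ships no S3 metadata; assume the single-host default.
    return channel.get(S3_DIST_TYPE, "FullyReplicated")

local_mode_channel = {"ContentType": "text/csv", "TrainingInputMode": "File"}
print(dist_type_defensive(local_mode_channel))  # FullyReplicated
```

Whether the default belongs in the container or in sagemaker-containers is a design question, but the `.get` form reproduces what S3-backed training effectively assumes.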

Create functions for comparing content/accept types

There are multiple examples of comparing content types against raw string values (if content_type == 'text/csv'), leading to brittle code.
These should be abstracted into appropriate functions for each supported content type, preferably using enums.
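
A sketch of what that abstraction could look like, using a standard-library enum (the names and the parameter-stripping behavior are illustrative, not the container's actual API):

```python
from enum import Enum

class ContentType(Enum):
    CSV = "text/csv"
    LIBSVM = "text/libsvm"
    JSON = "application/json"

def is_csv(content_type: str) -> bool:
    """Compare against the enum value, ignoring case and MIME parameters
    such as 'text/csv;label_size=0'."""
    base = content_type.split(";")[0].strip().lower()
    return base == ContentType.CSV.value

print(is_csv("text/csv"))               # True
print(is_csv("TEXT/CSV;label_size=0"))  # True
print(is_csv("text/libsvm"))            # False
```

Centralizing the comparison also gives one place to handle parameterized content types consistently.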

Multiclass Classification with multi:softprob objective and auc eval_metric

I'm trying to create a sagemaker training job with the following hyperparameters,

"hyperparameters": {
        "learning_rate": 0.1,
        "num_round": 250,
        "max_depth": 4,
        "objective": "multi:softprob",
        "eval_metric": "auc",  
        "num_class": 3
}

which worked when using xgb directly but failed on SageMaker with the following error:
sagemaker_algorithm_toolkit.exceptions.UserError: Metric 'auc' can only be applied for classification and ranking problems

It seems xgb supports auc as an eval_metric with the multi:softprob objective according to the xgb parameter docs, but the xgb SageMaker container does not. Is there any workaround? Should I use another eval_metric instead?

XGBoost 0.90's `aucpr` evaluation metric is not supported by Sagemaker XGBoost image 0.90

XGBoost 0.90 supports aucpr as a valid value for the eval_metric learning task parameter. Please see: https://github.com/dmlc/xgboost/blob/release_0.90/doc/parameter.rst#learning-task-parameters

However, when we try to use it with SageMaker's XGBoost image (versions 0.90-1 and 0.90-2), training fails with the following message: Hyperparameter eval_metric: value ['aucpr'] not in range ['rmse', 'mae', 'logloss', 'error', 'merror', 'mlogloss', 'auc', 'ndcg', 'map', 'poisson-nloglik', 'gamma-nloglik', 'gamma-deviance', 'tweedie-nloglik', 'accuracy', 'f1', 'mse']

More complete error log:

ERROR:sagemaker-containers:Reporting training FAILURE
ERROR:sagemaker-containers:framework error:
Traceback (most recent call last):
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_trainer.py", line 81, in train
    entrypoint()
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
    train(framework.training_env())
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
    run_algorithm_mode()
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
    checkpoint_config=checkpoint_config
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 88, in sagemaker_train
    validated_train_config = hyperparameters.validate(train_config)
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/hyperparameter_validation.py", line 225, in validate
    self.hyperparameters[hp].validate_range(value)
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/hyperparameter_validation.py", line 176, in validate_range
    raise exc.UserError("Hyperparameter 
{}
: value 
{}
 not in range 
{}
".format(self.name, value, self.range))
sagemaker_algorithm_toolkit.exceptions.UserError: Hyperparameter eval_metric: value ['aucpr'] not in range ['rmse', 'mae', 'logloss', 'error', 'merror', 'mlogloss', 'auc', 'ndcg', 'map', 'poisson-nloglik', 'gamma-nloglik', 'gamma-deviance', 'tweedie-nloglik', 'accuracy', 'f1', 'mse']
Hyperparameter eval_metric: value ['aucpr'] not in range ['rmse', 'mae', 'logloss', 'error', 'merror', 'mlogloss', 'auc', 'ndcg', 'map', 'poisson-nloglik', 'gamma-nloglik', 'gamma-deviance', 'tweedie-nloglik', 'accuracy', 'f1', 'mse']```

Unsupported Hyperparameters???

When you get the image on SageMaker using get_image_uri, it supports the num_parallel_tree and epoch hyperparameters. There is no epoch in the XGBoost documentation, but num_parallel_tree is there. When I tried the mme version of the container (from both the mme and mme_support_xgb1.0 branches), I couldn't find num_parallel_tree, and setting it causes an error in the training job.

I think num_parallel_tree is missing and needs to be added, if possible.

IndexError on endpoint deployment

I'm getting this error when deploying a model to an endpoint on SageMaker and am not sure what might be causing it. Maybe a wrong model directory? Hoping someone can point me in the right direction.

  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/algorithm_mode/serve.py", line 71, in load_model
    cls.booster, cls.format = serve_utils.get_loaded_booster(ScoringService.MODEL_PATH, ensemble)
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/algorithm_mode/serve_utils.py", line 192, in get_loaded_booster
    return (models, model_formats) if ensemble and len(models) > 1 else (models[0], model_formats[0])

IndexError: list index out of range
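
The IndexError suggests get_loaded_booster found no loadable model files, so the models list was empty. A standard-library sketch of a pre-deployment check that fails with a readable message when the extracted model directory is empty (the message wording and file names are illustrative):

```python
import os
import tempfile

def assert_model_present(model_dir: str):
    """Fail with a readable message instead of 'list index out of range'."""
    entries = [f for f in os.listdir(model_dir)
               if os.path.isfile(os.path.join(model_dir, f))]
    if not entries:
        raise FileNotFoundError(
            f"No files found in {model_dir}; check that model.tar.gz "
            "contains the model at its root, not inside a folder")
    return entries

# A directory with a model file passes the check.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "xgboost-model"), "w").close()
    found = assert_model_present(d)
    print(found)  # ['xgboost-model']

# An empty directory produces the readable error.
with tempfile.TemporaryDirectory() as empty_dir:
    try:
        assert_model_present(empty_dir)
        error_message = None
    except FileNotFoundError as e:
        error_message = str(e)
    print(error_message)
```

A wrong model directory (model nested inside a folder in the tar.gz) is a common way to end up in this state.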

Need support for XGBoost 1.5.2

Hi!

Can we please add support for XGBoost 1.5.2 (the latest release as of now)? It has a couple of bug fixes & optimizations.

Thanks!

How to hack the output model by docker?

I am trying to deploy the sagemaker-xgboost-container on my local machine. I have modified final/Dockerfile.cpu to add the training data, hyperparameters, and inputdataconfig directly inside the Docker image, like this:

FROM xgboost-container-base:1.0-1-cpu-py3
ENV SAGEMAKER_XGBOOST_VERSION 1.0-1

########################
# Install dependencies #
########################
COPY requirements.txt /requirements.txt
RUN pip install -r /requirements.txt && rm /requirements.txt


#################
# Training Data #
#################
RUN mkdir -p /opt/ml/input/data/

RUN mkdir -p /opt/ml/input/data/training
ADD newdocker/train_example  /opt/ml/input/data/training/example

RUN mkdir -p /opt/ml/input/data/validation
ADD newdocker/validation_example  /opt/ml/input/data/validation/example

###################################
# inputdataconfig hyperparameters #
###################################
RUN mkdir -p /opt/ml/input/config
ADD newdocker/hyperparameters.json  /opt/ml/input/config/hyperparameters.json

ADD newdocker/inputdataconfig.json  /opt/ml/input/config/inputdataconfig.json

###########################
# Copy wheel to container #
###########################
COPY dist/sagemaker_xgboost_container-2.0-py2.py3-none-any.whl /sagemaker_xgboost_container-1.0-py2.py3-none-any.whl
RUN pip install --no-cache /sagemaker_xgboost_container-1.0-py2.py3-none-any.whl && \
    rm /sagemaker_xgboost_container-1.0-py2.py3-none-any.whl

##############
# DMLC PATCH #
##############
# TODO: remove after making contributions back to xgboost for tracker.py
COPY src/sagemaker_xgboost_container/dmlc_patch/tracker.py \
   /miniconda3/lib/python3.6/site-packages/xgboost/dmlc-core/tracker/dmlc_tracker/tracker.py

# Include DMLC python code in PYTHONPATH to use RabitTracker
ENV PYTHONPATH=$PYTHONPATH:/miniconda3/lib/python3.6/site-packages/xgboost/dmlc-core/tracker

#######
# MMS #
#######
# Create MMS user directory
RUN useradd -m model-server
RUN mkdir -p /home/model-server/tmp && chown -R model-server /home/model-server

# Copy MMS configs
COPY docker/$SAGEMAKER_XGBOOST_VERSION/resources/mms/config.properties.tmp /home/model-server
ENV XGBOOST_MMS_CONFIG=/home/model-server/config.properties

# Copy execution parameters endpoint plugin for MMS
RUN mkdir -p /tmp/plugins
COPY docker/$SAGEMAKER_XGBOOST_VERSION/resources/mms/endpoints-1.0.jar /tmp/plugins
RUN chmod +x /tmp/plugins/endpoints-1.0.jar

The data was generated by https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/scripts/upload_xgboost_mnist_dataset/upload_xgboost_mnist_dataset and downloaded to my local machine.
The inputdataconfig.json is

{
  "train": {
    "ContentType": "text/csv",
    "TrainingInputMode": "File",
    "S3DistributionType": "FullyReplicated",
    "RecordWrapperType": "None"
  },
  "validation": {
    "ContentType": "text/csv",
    "TrainingInputMode": "File",
    "S3DistributionType": "FullyReplicated",
    "RecordWrapperType": "None"
  }
}

The hyperparameters.json is

[
  {
    "name": "max_depth",
    "value": "5"
  },
  {
    "name": "eta",
    "value": "0.2"
  },
  {
    "name": "gamma",
    "value": "4"
  },
  {
    "name": "min_child_weight",
    "value": "6"
  },
  {
    "name": "silent",
    "value": "0"
  },
  {
    "name": "objective",
    "value": "multi:softmax"
  },
  {
    "name": "num_class",
    "value": "10"
  },
  {
    "name": "num_round",
    "value": "10"
  }
]

After building the image and running it without the train argument, I can see these files in the folders where they should be.
But if I run docker run ** train, it exits immediately.
The docker inspect output for the container is:

[
    {
        "Id": "e535e6610cdcc3a6b07b1584b50ac4d0b2627d0ca0bd585024566aff0467ffbb",
        "Created": "2020-08-20T07:38:15.230993766Z",
        "Path": "train",
        "Args": [],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2020-08-20T07:38:15.524214076Z",
            "FinishedAt": "2020-08-20T07:38:16.325803362Z"
        },
        "Image": "sha256:325747bbe10f5963a87039deb5a7483fede3d6a918d741f71db717833f8b47b3",
        "ResolvConfPath": "/var/lib/docker/containers/e535e6610cdcc3a6b07b1584b50ac4d0b2627d0ca0bd585024566aff0467ffbb/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/e535e6610cdcc3a6b07b1584b50ac4d0b2627d0ca0bd585024566aff0467ffbb/hostname",
        "HostsPath": "/var/lib/docker/containers/e535e6610cdcc3a6b07b1584b50ac4d0b2627d0ca0bd585024566aff0467ffbb/hosts",
        "LogPath": "/var/lib/docker/containers/e535e6610cdcc3a6b07b1584b50ac4d0b2627d0ca0bd585024566aff0467ffbb/e535e6610cdcc3a6b07b1584b50ac4d0b2627d0ca0bd585024566aff0467ffbb-json.log",
        "Name": "/awesome_booth",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "docker-default",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": null,
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "default",
            "PortBindings": {},
            "RestartPolicy": {
                "Name": "no",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "CapAdd": null,
            "CapDrop": null,
            "Capabilities": null,
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "private",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": null,
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": [],
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": [],
            "DeviceCgroupRules": null,
            "DeviceRequests": null,
            "KernelMemory": 0,
            "KernelMemoryTCP": 0,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": null,
            "OomKillDisable": false,
            "PidsLimit": null,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": [
                "/proc/asound",
                "/proc/acpi",
                "/proc/kcore",
                "/proc/keys",
                "/proc/latency_stats",
                "/proc/timer_list",
                "/proc/timer_stats",
                "/proc/sched_debug",
                "/proc/scsi",
                "/sys/firmware"
            ],
            "ReadonlyPaths": [
                "/proc/bus",
                "/proc/fs",
                "/proc/irq",
                "/proc/sys",
                "/proc/sysrq-trigger"
            ]
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/3d1feb8aa947bd66853114fba9abe79dcffafa9e787e3eda7828593bbdefff26-init/diff:/var/lib/docker/overlay2/224a477540cacc8dab234039e8f50182048f3bd32f6947f0da6328ab2a0656b6/diff:/var/lib/docker/overlay2/7dccea30bc991c0fcecbacd13bbe3129be78320c18d78ba53c299d0e64b0d15c/diff:/var/lib/docker/overlay2/4227f662706c14df39dd3a2a0ff4693013bd391abafba78f63dae409a19b8bec/diff:/var/lib/docker/overlay2/1d08f0fcfefa265082f9ac33844e959984051833f7ec012962254ae7734dfa4d/diff:/var/lib/docker/overlay2/a323afc819c0fe86282f42ca3cef5bf01cecc4b7d6ff385c5295f170710e0011/diff:/var/lib/docker/overlay2/9908a46f27a4db59fca7577f6127ef6d8dd671dda61c3fb156c807fd4489f147/diff:/var/lib/docker/overlay2/cc4a502b40520f848b54df8ca61da8260b14c3649f609dbc685c88226125ed1b/diff:/var/lib/docker/overlay2/2acd1f448bd9c351d1e7882782776449fb76c85df9d9ebd265d90815307e3b28/diff:/var/lib/docker/overlay2/e184edf4f9d2ff545f0b2744ee20d3bd10bb474f5600337bdb9512e2eca4e66f/diff:/var/lib/docker/overlay2/976d630fed124e819824eff907e6694b6d0abb219750a2e7b3f644fc15977c6f/diff:/var/lib/docker/overlay2/b8bfffca81c6419e4b3e2792c66c6a458ad369212e5e3b1e172a1baf504da0f5/diff:/var/lib/docker/overlay2/1e069e2aa0085520611299644fb450036a5bad2c9f14aca6b407c1ab1c60dd99/diff:/var/lib/docker/overlay2/a844fc0640396eb03f910487041bcd365bae3004bd9f78e1e834a25f3ca95c51/diff:/var/lib/docker/overlay2/df83d1d8694fa2d29198008200bface3da26001cbffd08e4028c70dc0a26dcbf/diff:/var/lib/docker/overlay2/dfc3339e4797f31a0642d21ae8a0dc150626d5040d5caf1bbe49334a2111c7f9/diff:/var/lib/docker/overlay2/25a105da83593b8b1e19c97a4e4a28b34e1c93ba4d92803b03d5bde3b07e0a77/diff:/var/lib/docker/overlay2/426e5f518b7028b6c6c293d125d2609835f184ac948e0b7af9ad1221caf947a0/diff:/var/lib/docker/overlay2/4937413589e8615ea64e20af96394bde76b2905a86ebb188c8c5cb1a6b2df5fb/diff:/var/lib/docker/overlay2/18b8ec97de78123fdc08da0670bd3d20806094d05ecff023d7d753edc3ad4cf2/diff:/var/lib/docker/overlay2/828dbce5d9c8dec2e1e399bcc2c830457a4930001
09dccc55b769a98d9c5f3f9/diff:/var/lib/docker/overlay2/5f48f1359ca91cea9eca5cccfecffea8cd93f1ce7f3f0f5df264bbbb1527e7a9/diff:/var/lib/docker/overlay2/9e07bc3c2dde7c853cc9f1d6d8f27a45ea617191f46d59c9b441cbcad6371b97/diff:/var/lib/docker/overlay2/fbc2a67dfc9ba95e15cd12fda2cf46360ec6cb2e3664d2435fa3a973dc20b2aa/diff:/var/lib/docker/overlay2/56f483511b811a7ca0de41e7ea05cc3857afca8c41d977844bb8cf32761c7922/diff:/var/lib/docker/overlay2/4d4b18f9a88abbd06e3652b24d6a90456c0b2c76f0c8138082a0cc4ad5e4abc8/diff:/var/lib/docker/overlay2/abd84d160b84b572a26379a4fbb4fd1ee19e07f13861bc2799f438c52ebe7ed4/diff:/var/lib/docker/overlay2/669a8bd672061db59d3f4017b05ac7d8599bf2ffb498e88338b9de031d981e48/diff:/var/lib/docker/overlay2/17ff07176618a83e3afe73329a854ad076d522cc61b027e97329afa690f1e614/diff:/var/lib/docker/overlay2/87aa53dcc1a8c065622cbf06f797b35623d0b2e67e9d9944487c2b21737fac3f/diff:/var/lib/docker/overlay2/18e79aeeebdd9d3c7325a52156ea7f40f2366a0ff4101c1a913d7d96ca6759a3/diff:/var/lib/docker/overlay2/e91627a83e914c8bf19d346da45704140bafaab3902cc3ce3fbe2f4d994088f5/diff:/var/lib/docker/overlay2/3c0fc3448b64893f45e74547b966aae2ac1c94b70396d2f347bd3b57be8fb4e5/diff",
                "MergedDir": "/var/lib/docker/overlay2/3d1feb8aa947bd66853114fba9abe79dcffafa9e787e3eda7828593bbdefff26/merged",
                "UpperDir": "/var/lib/docker/overlay2/3d1feb8aa947bd66853114fba9abe79dcffafa9e787e3eda7828593bbdefff26/diff",
                "WorkDir": "/var/lib/docker/overlay2/3d1feb8aa947bd66853114fba9abe79dcffafa9e787e3eda7828593bbdefff26/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [],
        "Config": {
            "Hostname": "e535e6610cdc",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "ExposedPorts": {
                "8080/tcp": {}
            },
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/miniconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "PYTHONDONTWRITEBYTECODE=1",
                "PYTHONUNBUFFERED=1",
                "PYTHONIOENCODING=utf-8",
                "SAGEMAKER_XGBOOST_VERSION=1.0-1",
                "PYTHONPATH=:/miniconda3/lib/python3.6/site-packages/xgboost/dmlc-core/tracker",
                "XGBOOST_MMS_CONFIG=/home/model-server/config.properties",
                "SM_INPUT=/opt/ml/input",
                "SM_INPUT_TRAINING_CONFIG_FILE=/opt/ml/input/config/hyperparameters.json",
                "SM_INPUT_DATA_CONFIG_FILE=/opt/ml/input/config/inputdataconfig.json",
                "SM_CHECKPOINT_CONFIG_FILE=/opt/ml/input/config/checkpointconfig.json",
                "SM_MODEL_DIR=/opt/ml/model",
                "SAGEMAKER_TRAINING_MODULE=sagemaker_xgboost_container.training:main",
                "SAGEMAKER_SERVING_MODULE=sagemaker_xgboost_container.serving:main",
                "TEMP=/home/model-server/tmp"
            ],
            "Cmd": [
                "train"
            ],
            "Image": "local-train-xgboost-container:1.0-1-cpu-py3",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": null,
            "OnBuild": null,
            "Labels": {
                "com.amazonaws.sagemaker.capabilities.accept-bind-to-port": "true",
                "com.amazonaws.sagemaker.capabilities.multi-models": "true"
            }
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "b8c9fa016cbd460136dc108e770c715264fafb6fb879e92d9d70a7127744d0fa",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {},
            "SandboxKey": "/var/run/docker/netns/b8c9fa016cbd",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "bridge": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "NetworkID": "7cd6eed1f6d9cbb25945b0bee73e3e2f7e752e4d14a14cc7f8e9aa01ee607727",
                    "EndpointID": "",
                    "Gateway": "",
                    "IPAddress": "",
                    "IPPrefixLen": 0,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "",
                    "DriverOpts": null
                }
            }
        }
    }
]

It seems like it didn't train at all. What should I do to train it locally and export the model file without using S3 or IAM?

How to pass qid/group in XGBRanker.fit()?

Hi!

I have been trying to use the XGBRanker available in XGBoost via SageMaker.

However, I could not figure out a way to pass qid/group to sagemaker.xgboost.estimator.XGBoost.fit().
Any idea how to pass qid/group to XGBRanker?

Thanks!
Rama
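
In script mode you can load the data yourself and attach groups to the DMatrix with dtrain.set_group(...) before training, since fit() has no channel for them. Computing the group sizes from a qid column is plain Python; a sketch (the qid values are illustrative, and the commented xgboost calls are the part that would run inside your entry point):

```python
from itertools import groupby

def group_sizes(qids):
    """Turn a query-id column (already sorted by query) into the
    per-query group sizes that XGBoost ranking expects."""
    return [len(list(rows)) for _, rows in groupby(qids)]

qids = [1, 1, 1, 2, 2, 3, 3, 3, 3]  # illustrative query ids
sizes = group_sizes(qids)
print(sizes)  # [3, 2, 4]

# Inside a script-mode entry point you would then do something like:
#   dtrain = xgb.DMatrix(features, label=labels)
#   dtrain.set_group(sizes)
#   booster = xgb.train({"objective": "rank:pairwise"}, dtrain)
```

The data must be sorted so that rows of the same query are contiguous, since set_group only takes sizes, not ids.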

XGBoost 1.3.0 in SageMaker

Any plans to update to 1.3? We're using GPU TreeSHAP and it's available in this version. Currently using our custom container but wondering if there's a roadmap for updating the official SageMaker XGB image?

application/json is not an accepted ContentType on tag 1.5-1

I got the following error:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (415) from primary with message "application/json is not an accepted ContentType: csv, libsvm, parquet, recordio-protobuf, text/csv, text/libsvm, text/x-libsvm, application/x-parquet, application/x-recordio-protobuf.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/test-deposit-fraud-20210609-V6-N-endpoint-4 in account 646688815978 for more information.
---------------------------------------------------------------------------
ModelError                                Traceback (most recent call last)
<ipython-input-66-9477d001205b> in <module>
      1 runtime   = boto3.Session().client('sagemaker-runtime')
      2 response = runtime.invoke_endpoint(
----> 3     EndpointName=f'test-{model_name}-endpoint-4', ContentType="application/json", Body=payload)
      4 result       = np.array(
      5     response['Body'].read().decode().strip().split("\n")

/opt/conda/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    528                 )
    529             # The "self" in this scope is referring to the BaseClient.
--> 530             return self._make_api_call(operation_name, kwargs)
    531 
    532         _api_call.__name__ = str(py_operation_name)

/opt/conda/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    958             error_code = parsed_response.get("Error", {}).get("Code")
    959             error_class = self.exceptions.from_code(error_code)
--> 960             raise error_class(parsed_response, operation_name)
    961         else:
    962             return parsed_response

when doing

runtime   = boto3.Session().client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName=f'test-{model_name}-endpoint-4', ContentType="application/json", Body=payload)
result       = np.array(
    response['Body'].read().decode().strip().split("\n")
).astype(float)
result

The payload is in the following format:

payload = json.dumps(dummy_df[features].iloc[:2].to_json())
payload[:1000]
'"{\\"feature_1\\":{\\"0\\":0.0370342818,\\"1\\":0.0813736376},\\"feature_2\\":{\\"0\\":0.3092200105,\\"1\\":0.1938206656},.....
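
Since the 1.5-1 image lists csv among the accepted content types, one workaround is to serialize the rows as headerless CSV and invoke with ContentType="text/csv". A standard-library sketch of the serialization (the feature values are illustrative, and the commented invoke call mirrors the one above):

```python
import csv
import io

def rows_to_csv_payload(rows):
    """Serialize feature rows as a headerless CSV body for text/csv inference."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    return buf.getvalue()

rows = [[0.0370342818, 0.3092200105], [0.0813736376, 0.1938206656]]
payload = rows_to_csv_payload(rows)
print(payload)
# The invoke call would then be:
#   runtime.invoke_endpoint(EndpointName=..., ContentType="text/csv", Body=payload)
```

Note the columns must be in the same order as the training data, since CSV carries no feature names.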

Cannot load models pulled from MLflow

Context:
Models trained in SageMaker but stored in the MLflow model registry have multiple additional files, for example metadata, requirements.txt, etc.

When such a model is zipped and provided as a SageMaker model, this function tries to read all of the files as if they were models:

booster, format = serve_utils.get_loaded_booster(model_dir, serve_utils.is_ensemble_enabled())

But they are not models.

It should have some kind of pattern filter, or a way to provide the file name rather than just the folder. Otherwise it fails trying to load a model from a requirements.txt file, which is just additional information.
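
A sketch of such a pattern filter, using only the standard library (the suffix list is an assumption about what identifies an XGBoost artifact, not what serve_utils actually checks):

```python
# Extensions that plausibly identify an XGBoost model artifact; the exact
# list is illustrative, not the container's real behavior.
MODEL_SUFFIXES = (".model", ".bst", ".json", ".ubj", ".pkl")

def candidate_model_files(filenames):
    """Filter a model directory listing down to likely model files, skipping
    MLflow sidecar files such as requirements.txt or MLmodel metadata."""
    return [f for f in filenames if f.endswith(MODEL_SUFFIXES)]

listing = ["MLmodel", "conda.yaml", "requirements.txt", "model.bst"]
print(candidate_model_files(listing))  # ['model.bst']
```

Filtering before the load loop would let MLflow-packaged models coexist with their metadata files.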

`text/csv;label_size=0` for inference jobs doesn't work anymore

SageMaker's documentation states that we need to use text/csv;label_size=0, which we have been doing successfully for a while.
But starting from today/late yesterday, inference jobs fail with the error message: text/csv;label_size=0 is not an accepted csv ContentType. Optional parameter label_size must be equal to 1. I assume this is because of recent changes in src/sagemaker_xgboost_container/data_utils.py merged to the master branch.
What should we and others using the latest xgboost image do: remove label_size, or stick to the documentation and wait for the xgboost image code to follow it?

AttributeError: 'XGBRegressor' object has no attribute 'set_param'

I am having the following issue when using tag 1.5-1 which has xgboost==1.5.2 installed based on the dockerfile. The model I'm trying to load was trained using version 1.5.0.

Traceback (most recent call last):
  File "/miniconda3/lib/python3.8/site-packages/gunicorn/arbiter.py", line 586, in spawn_worker
    worker.init_process()
  File "/miniconda3/lib/python3.8/site-packages/gunicorn/workers/ggevent.py", line 203, in init_process
    super(GeventWorker, self).init_process()
  File "/miniconda3/lib/python3.8/site-packages/gunicorn/workers/base.py", line 135, in init_process
    self.load_wsgi()
  File "/miniconda3/lib/python3.8/site-packages/gunicorn/workers/base.py", line 144, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/miniconda3/lib/python3.8/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/miniconda3/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 52, in load
    return self.load_wsgiapp()
  File "/miniconda3/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 41, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/miniconda3/lib/python3.8/site-packages/gunicorn/util.py", line 350, in import_app
    __import__(module)
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/serving.py", line 29, in <module>
    from sagemaker_xgboost_container.algorithm_mode import serve
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/algorithm_mode/__init__.py", line 23, in <module>
    serve.load_model()
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/algorithm_mode/serve.py", line 135, in load_model
    return ScoringService.load_model(ensemble=serve_utils.is_ensemble_enabled())
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/algorithm_mode/serve.py", line 71, in load_model
    cls.booster, cls.format = serve_utils.get_loaded_booster(ScoringService.MODEL_PATH, ensemble)
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/algorithm_mode/serve_utils.py", line 188, in get_loaded_booster
    booster.set_param("nthread", 1)
AttributeError: 'XGBRegressor' object has no attribute 'set_param'

Weirdly, when I install 1.5.2 and load the model in a notebook, it works fine.

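
The traceback shows the pickled artifact is an sklearn-API XGBRegressor, while serve_utils expects a raw Booster (which has set_param). A defensive adapter would unwrap the sklearn object via get_booster() when present; a sketch with stand-in classes (the stub classes are illustrative — in real code they would be xgboost.Booster and xgboost.XGBRegressor — and re-saving the model with model.get_booster().save_model(...) also avoids the issue):

```python
class StubBooster:
    """Stands in for xgboost.Booster, which has set_param."""
    def set_param(self, name, value):
        self.last_param = (name, value)

class StubXGBRegressor:
    """Stands in for the sklearn wrapper, which wraps a Booster."""
    def get_booster(self):
        return StubBooster()

def as_booster(loaded):
    """Unwrap sklearn-API models so downstream code can call set_param."""
    return loaded.get_booster() if hasattr(loaded, "get_booster") else loaded

for obj in (StubBooster(), StubXGBRegressor()):
    booster = as_booster(obj)
    booster.set_param("nthread", 1)  # no AttributeError either way
    print(type(booster).__name__)   # StubBooster both times
```

This mirrors the likely fix in serve_utils.get_loaded_booster: normalize whatever was pickled before calling Booster-only methods.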

Custom objective function and receive error

Hello Team,

I'm training an XGBoost model using the SageMaker container. I need to use a customized objective function, but I received an error when running the code. The error message is as follows:
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
ERROR:sagemaker-containers:Reporting training FAILURE
ERROR:sagemaker-containers:framework error:
Traceback (most recent call last):
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
    train(framework.training_env())
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
    run_algorithm_mode()
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
    checkpoint_config=checkpoint_config
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 110, in sagemaker_train
    validated_train_config = hyperparameters.validate(train_config)
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/hyperparameter_validation.py", line 289, in validate
    self.hyperparameters[hp].validate_range(value)
  File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/hyperparameter_validation.py", line 47, in validate_range
    raise exc.UserError("Hyperparameter {}: {} is not in {}".format(self.name, value, self.range))
sagemaker_algorithm_toolkit.exceptions.UserError: Hyperparameter objective: <function log_cosh_quantile_95 at 0x7f92a2dc8d90> is not in ['binary:logistic', 'binary:logitraw', 'binary:hinge', 'count:poisson', 'multi:softmax', 'multi:softprob', 'rank:pairwise', 'rank:ndcg', 'rank:map', 'reg:linear', 'reg:squarederror', 'reg:logistic', 'reg:gamma', 'reg:squaredlogerror', 'reg:tweedie', 'survival:cox']

I tested my objective function using xgboost.XGBRegressor() and it worked well. I'd appreciate any guidance you can provide.

Thanks

Upgrade to multi-model-server 1.1.8

Hello,

Can we expect an upgrade to multi-model-server 1.1.8? It contains, among other things, an upgrade from log4j 1.x to 2.17.1.

For now I have extended the current 1.3-1 Docker image and manually updated to 1.1.8. Everything works as expected except that logs are lost. I believe this is due to the breaking change in the log4j configuration file format between 1.x and 2.x (log4j.properties to log4j2.xml).

model_server.py:

...
MMS_CONFIG_FILE = os.path.join('/etc', 'sagemaker-mms.properties')
DEFAULT_MMS_CONFIG_FILE = pkg_resources.resource_filename(sagemaker_inference.__name__, '/etc/default-mms.properties')
DEFAULT_MMS_LOG_FILE = pkg_resources.resource_filename(sagemaker_inference.__name__, '/etc/log4j.properties')
...
    mxnet_model_server_cmd = ['mxnet-model-server',
                              '--start',
                              '--mms-config', config_file,
                              '--log-config', DEFAULT_MMS_LOG_FILE,
....

I tried to override the default MMS properties file by adding vmargs with a specific log4j configuration file (as described here: https://github.com/awslabs/multi-model-server/blob/v1.1.8/docs/logging.md#modifying-the-behavior-of-the-logs), but unfortunately the --log-config option has higher priority, so it doesn't work.

Thank you.
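For reference, log4j 2.x expects an XML configuration rather than log4j.properties; a minimal console-only log4j2.xml of the kind that could be passed via --log-config might look like the following (a sketch only; the appender name and pattern are assumptions, not the file the image ships):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Appenders>
    <Console name="STDOUT" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{ISO8601} [%-5p] %c - %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="STDOUT"/>
    </Root>
  </Loggers>
</Configuration>
```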

Metrics Not Being Logged In Distributed Mode

I am running the SageMaker XGBoost algorithm in script mode as shown here:

https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone_dist_script_mode.ipynb

When I specify the number of instances as 1, everything works as expected.

However, whenever I specify a number of instances greater than 1, the expected metrics do not get logged anymore.

How do I view metrics like validation:auc when using distributed XGBoost in SageMaker with more than one instance?

Label concatenation fails due to incompatible array dimensions when the last array has one element

This seems to be a bug in

def get_recordio_protobuf_dmatrix(path, is_pipe=False):

When len(data) % BATCH_SIZE == 1 (e.g., len(data) == 10, BATCH_SIZE == 3), the last batch consists of a single element, and an error like the following is thrown:

all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 3 has 0 dimension(s)

Using BATCH_SIZE = 1 instead results in "zero-dimensional arrays cannot be concatenated".

To replicate this, first create a recordio file with something like this:

# From a pypy environment.
import struct

import numpy as np
import sagemaker_containers._recordio as scr
from scipy import sparse

X = np.zeros((10, 10))
for i in range(X.shape[0]):
    X[i,i] = i

y = np.arange(len(X))
print('len(X)', len(X))

with open('test.rec', 'wb') as file:
    scr._write_spmatrix_to_sparse_tensor(file, sparse.csr_matrix(X), y)

Then, modify BATCH_SIZE to 3

# From conda environment.
import mlio
import numpy as np
import xgboost as xgb
from mlio.integ.numpy import as_numpy
from mlio.integ.scipy import to_coo_matrix
from scipy.sparse import vstack as scipy_vstack  # used when the tensor is sparse

def get_recordio_protobuf_dmatrix(path, is_pipe=False):
    """Get Data Matrix from recordio-protobuf data.
    :param path: Path where recordio-protobuf formatted training data resides, either directory, file, or SageMaker pipe
    :param is_pipe: Boolean to indicate if data is being read in pipe mode
    :return: xgb.DMatrix or None
    """
    try:
        if is_pipe:
            dataset = [mlio.SageMakerPipe(path)]
            reader = mlio.RecordIOProtobufReader(dataset=dataset,
                                                 batch_size=3)
        else:
            dataset = mlio.list_files(path)
            reader = mlio.RecordIOProtobufReader(dataset=dataset,
                                                 batch_size=3)

        if reader.peek_example() is not None:
            # recordio-protobuf tensor may be dense (use numpy) or sparse (use scipy)
            if type(reader.peek_example()['values']) is mlio.core.DenseTensor:
                to_matrix = as_numpy
                vstack = np.vstack
            else:
                to_matrix = to_coo_matrix
                vstack = scipy_vstack

            all_features = []
            all_labels = []
            for example in reader:
                features = to_matrix(example['values'])
                all_features.append(features)

                labels = as_numpy(example['label_values']).squeeze()
                all_labels.append(labels)

            all_features = vstack(all_features)
            all_labels = np.concatenate(all_labels)
            dmatrix = xgb.DMatrix(all_features, label=all_labels)
            return dmatrix
        else:
            return None

    except Exception as e:
        raise # Note that I changed this line in order to avoid an import.

and run

get_recordio_protobuf_dmatrix('test.rec')

to see the error as above:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-5734ed91fb93> in <module>
----> 1 get_recordio_protobuf_dmatrix('test.rec')

<ipython-input-34-1c0deb21617c> in get_recordio_protobuf_dmatrix(path, is_pipe)
     34 
     35             all_features = vstack(all_features)
---> 36             all_labels = np.concatenate(all_labels)
     37             dmatrix = xgb.DMatrix(all_features, label=all_labels)
     38             return dmatrix

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 3 has 0 dimension(s)

The error goes away if the squeeze() is removed from the labels. This should be an easy fix: either don't squeeze, squeeze at the end, or explicitly handle the 0-dimensional case.
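The proposed fix can be sketched in isolation: keep every label batch one-dimensional instead of squeezing it, so a single-element final batch stays concatenable.

```python
import numpy as np

# A 10-element label set read in batches of 3: the last batch has one element.
batches = [np.arange(10)[i:i + 3].reshape(-1, 1) for i in range(0, 10, 3)]

# squeeze() would turn the (1, 1) final batch into a 0-d array and break
# np.concatenate; reshape(-1) keeps every batch 1-d instead.
all_labels = np.concatenate([b.reshape(-1) for b in batches])
print(all_labels)  # [0 1 2 3 4 5 6 7 8 9]
```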

How to use XGboost in script mode with CreateTrainingJob API?

I want to invoke the CreateTrainingJob API using the XGBoost container in script mode. Is that even possible? Alternatively, is it feasible to pass an estimator object (sagemaker.xgboost.estimator.XGBoost) to the CreateTrainingJob API? I want to use the XGBoost image with my own entry_point script and invoke the training job from a Lambda function. I do not want to extend the container and build my own Docker image for this purpose. Any help is appreciated. Thanks in advance.
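For what it's worth, the Python SDK enables script mode by setting reserved hyperparameters that the framework container reads at startup, and a raw boto3 CreateTrainingJob request can set the same keys directly. A sketch of just those hyperparameters (the bucket, script name, and tarball path are placeholders; values must be JSON-encoded strings, mirroring what the SDK sends):

```python
import json

# Reserved keys the script-mode container entry point looks for.
hyperparameters = {
    "sagemaker_program": json.dumps("train.py"),
    "sagemaker_submit_directory": json.dumps("s3://my-bucket/code/sourcedir.tar.gz"),
}

# This dict would be passed as HyperParameters in the CreateTrainingJob call,
# alongside the framework image URI in AlgorithmSpecification.
print(json.loads(hyperparameters["sagemaker_program"]))
```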

Generate a release?

Hi - would it be possible to generate a release on GitHub for version 1.0 described in the setup.py? Thanks!

Unable to Install This Repo Locally

I initially tried installing the repo locally like so:

cd sagemaker-xgboost-container
pip install -e .

It seems to have gone through successfully. For example, there are no errors if I do:
from sagemaker_xgboost_container import *

However, when I try to import the following:
from sagemaker_xgboost_container.data_utils import get_dmatrix

It errors out saying there is no module named 'mlio.integer'. After doing some more digging in the documentation, I discovered that this package relies on the 'mlio' package, which is only available on conda. I tried installing that package with conda in a Python 3.6 environment like so:

conda install -c mlio -c conda-forge mlio-py==0.7

But now I receive the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tristan/opt/miniconda3/envs/python36env/lib/python3.6/site-packages/mlio/__init__.py", line 18, in <module>
    import mlio._core
ImportError: dlopen(/Users/tristan/opt/miniconda3/envs/python36env/lib/python3.6/site-packages/mlio/_core.cpython-36m-darwin.so, 2): Library not loaded: @rpath/libtbb.dylib
  Referenced from: /Users/tristan/opt/miniconda3/envs/python36env/lib/libmlio.0.7.0.dylib
  Reason: image not found

How can I fix this? Thanks!

XGB train call failed with exception: You can't mix new and old callback styles

Hello,

A month ago we had a minimum viable prototype for a SageMaker pipeline. Last Thursday our MVP was working fine. But this week the training step broke, meaning the rest of the pipeline is also broken. When we run the same code that was working fine last week, it fails this week with the following:

XGB train call failed with exception: You can't mix new and old callback styles.

This happens from within the training container. Here is the full output from the Estimator object's fit() call:

2022-08-27 00:36:36 Starting - Preparing the instances for trainingProfilerReport-1661560572: InProgress
......
2022-08-27 00:37:42 Downloading - Downloading input data
2022-08-27 00:37:42 Training - Downloading the training image......
2022-08-27 00:38:42 Training - Training image download completed. Training in progress...[2022-08-27 00:38:45.858 ip-10-0-65-114.ec2.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-08-27:00:38:45:INFO] Imported framework sagemaker_xgboost_container.training
[2022-08-27:00:38:45:INFO] Failed to parse hyperparameter objective value reg:squarederror to Json.
Returning the value itself
[2022-08-27:00:38:45:INFO] No GPUs detected (normal if no gpus installed)
[2022-08-27:00:38:45:INFO] Running XGBoost Sagemaker in algorithm mode
[2022-08-27:00:38:45:INFO] Determined delimiter of CSV input is ','
[2022-08-27:00:38:45:INFO] Determined delimiter of CSV input is ','
[2022-08-27:00:38:45:INFO] files path: /opt/ml/input/data/train
[2022-08-27:00:38:45:INFO] Determined delimiter of CSV input is ','
[2022-08-27:00:38:46:INFO] files path: /opt/ml/input/data/validation
[2022-08-27:00:38:46:INFO] Determined delimiter of CSV input is ','
[2022-08-27:00:38:46:INFO] Single node training.
[2022-08-27:00:38:46:INFO] Train matrix has 31829 rows and 125 columns
[2022-08-27:00:38:46:INFO] Validation matrix has 7959 rows
[2022-08-27 00:38:46.121 ip-10-0-65-114.ec2.internal:1 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2022-08-27 00:38:46.121 ip-10-0-65-114.ec2.internal:1 INFO hook.py:200] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2022-08-27 00:38:46.121 ip-10-0-65-114.ec2.internal:1 INFO profiler_config_parser.py:102] User has disabled profiler.
[2022-08-27 00:38:46.122 ip-10-0-65-114.ec2.internal:1 INFO hook.py:255] Saving to /opt/ml/output/tensors
[2022-08-27 00:38:46.122 ip-10-0-65-114.ec2.internal:1 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[2022-08-27:00:38:46:INFO] Debug hook created from config
[2022-08-27:00:38:46:ERROR] Reporting training FAILURE
[2022-08-27:00:38:46:ERROR] framework error: 
Traceback (most recent call last):
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 238, in train_job
    bst = xgb.train(train_cfg, train_dmatrix, num_boost_round=num_round-iteration, evals=watchlist,
  File "/miniconda3/lib/python3.8/site-packages/xgboost/training.py", line 188, in train
    bst = _train_internal(params, dtrain,
  File "/miniconda3/lib/python3.8/site-packages/xgboost/training.py", line 61, in _train_internal
    assert all(isinstance(c, callback.TrainingCallback)
AssertionError: You can't mix new and old callback styles.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/training.py", line 93, in main
    train(framework.training_env())
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/training.py", line 89, in train
    run_algorithm_mode()
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/training.py", line 63, in run_algorithm_mode
    sagemaker_train(
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 178, in sagemaker_train
    train_job(**train_args)
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 304, in train_job
    raise exc.AlgorithmError("{}:\n {}".format(exception_prefix, str(e)))
sagemaker_algorithm_toolkit.exceptions.AlgorithmError: XGB train call failed with exception:
 You can't mix new and old callback styles.
XGB train call failed with exception:
 You can't mix new and old callback styles.

2022-08-27 00:38:59 Uploading - Uploading generated training model
2022-08-27 00:38:59 Failed - Training job failed

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-35-fe7be97cffc7> in <module>
----> 1 estimator.fit(inputs=data_inputs)

/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/pipeline_context.py in wrapper(*args, **kwargs)
    246             return self_instance.sagemaker_session.context
    247 
--> 248         return run_func(*args, **kwargs)
    249 
    250     return wrapper

/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
   1062         self.jobs.append(self.latest_training_job)
   1063         if wait:
-> 1064             self.latest_training_job.wait(logs=logs)
   1065 
   1066     def _compilation_job_name(self):

/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in wait(self, logs)
   2145         # If logs are requested, call logs_for_jobs.
   2146         if logs != "None":
-> 2147             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   2148         else:
   2149             self.sagemaker_session.wait_for_job(self.job_name)

/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3851 
   3852         if wait:
-> 3853             self._check_job_status(job_name, description, "TrainingJobStatus")
   3854             if dot:
   3855                 print()

/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3392                 message=message,
   3393                 allowed_statuses=["Completed", "Stopped"],
-> 3394                 actual_status=status,
   3395             )
   3396 

UnexpectedStatusException: Error for Training job sagemaker-xgboost-2022-08-27-00-36-12-698: Failed. Reason: AlgorithmError: framework error: 
Traceback (most recent call last):
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 238, in train_job
    bst = xgb.train(train_cfg, train_dmatrix, num_boost_round=num_round-iteration, evals=watchlist,
  File "/miniconda3/lib/python3.8/site-packages/xgboost/training.py", line 188, in train
    bst = _train_internal(params, dtrain,
  File "/miniconda3/lib/python3.8/site-packages/xgboost/training.py", line 61, in _train_internal
    assert all(isinstance(c, callback.TrainingCallback)
AssertionError: You can't mix new and old callback styles.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/training.py", line 93, in main
    train(framework.training_env())
  File "/miniconda3/

We had a kernel from last week's session (a kernel that successfully ran the pipeline) that had not been cleaned up, so I was able to compare the output of pip freeze from last week to the output of pip freeze from the notebook container that failed. They were exactly the same. Both runs also point at data sets that we confirmed did not change since last week. The only thing we think could be different between then and now is the XGBoost container image.

I also launched the pipeline from a Base Python 3 image (3.6), and from a Data Science image (3.7), and even the Data Science 2.0 image, with no change in the way the pipeline failed.

It also seems that someone else reported this issue for a SageMaker example notebook yesterday: aws/amazon-sagemaker-examples#3578

(I should also mention, downgrading to previous versions of XGBoost is not possible, as we have issues with the input data format with older versions.)

TypeError: predict() got an unexpected keyword argument 'pred_contribs' with xgboost v0.90

Hi @eitansela

I am using the inference.py file and have trained my model using xgboost v0.90.

from xgboost import XGBRegressor 
model = XGBRegressor()

However, when I run the script and invoke the endpoint to make a prediction, I run into the error. Here's what my inference.py code looks like:

import json
import os
from io import BytesIO
import pickle as pkl
import numpy as np
import sagemaker_xgboost_container.encoder as xgb_encoders
import xgboost as xgb
from os import listdir
from scipy import sparse


# Load your model
def model_fn(model_dir):
    """
    Deserialize and return fitted model.
    """
    model_file = "xgboost-model"
    booster = pkl.load(open(os.path.join(model_dir, model_file), "rb"))
    
    return booster



def input_fn(request_body, request_content_type):
    """
    The SageMaker XGBoost model server receives the request data body and the content type,
    and invokes the `input_fn`.
    
    Return a DMatrix (an object that can be passed to predict_fn).
    """
        
    if request_content_type == "text/csv":
       
        values = [i for i in request_body.split(',')]

        values = [val.strip() for val in values]
        
        # to 2-d numpy array
        npa = np.array(values).reshape(-1,1)
       
        return npa
    
    if request_content_type == "text/libsvm":
        
        return xgb_encoders.libsvm_to_dmatrix(request_body)
    
    else:
        raise ValueError("Content type {} is not supported.".format(request_content_type))

        
# Run Predictions
def predict_fn(input_data, model):
    """
    SageMaker XGBoost model server invokes `predict_fn` on the return value of `input_fn`.

    Return a two-dimensional NumPy array where the first columns are predictions
    and the remaining columns are the feature contributions (SHAP values) for that prediction.
    """
    
    names = model.get_booster().feature_names
    
    prediction = model.predict(input_data, validate_features=False)
    
    feature_contribs = model.predict(input_data, pred_contribs=True, validate_features=False)
   
    output = np.hstack((prediction[:, np.newaxis], feature_contribs))
    
    return output


def output_fn(predictions, content_type):
    """
    After invoking predict_fn, the model server invokes `output_fn`.
    """
    if content_type == "text/csv":
        return ",".join(str(x) for x in predictions[0])
    else:
        raise ValueError("Content type {} is not supported.".format(content_type))

It looks like pred_contribs is not an argument for a model trained using XGBRegressor. It appears to work for a model trained with xgb.train on an xgb.DMatrix(x, y).

https://xgboost.readthedocs.io/en/latest/python/examples/update_process.html?highlight=contribs

latest image and monotonicity constraint

Hi all,

I assume that when I use:

image_uri=sagemaker.image_uris.retrieve("xgboost", region_name, "latest")

it pulls the latest image built from this repo?

It can then be used in code along these lines:

xgb = sagemaker.estimator.Estimator(image_uri=image_uri, role=role_arn,
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge',
                                          #instance_type='local', 
                                          volume_size=1, # 1 GB 
                                          output_path=output_path, base_job_name="test-monotonic")

# https://github.com/aws/sagemaker-xgboost-container/issues/120
xgb.set_hyperparameters(
        max_depth=10,
        num_round=30,
        nthread=2,
        seed=42,
        objective='count:poisson',
        # https://xgboost.readthedocs.io/en/stable/treemethod.html
        tree_method='hist', 
        eval_metric='rmse',
        # https://xgboost.readthedocs.io/en/stable/tutorials/monotonic.html
        # bottom of page
        monotone_constraints='(0,-1,0)')

xgb.fit({'train': s3_input_train, 'validation': s3_input_val})

This works fine. However, when I try to load the produced model (downloaded locally) like this:

local_model_path = "model.tar.gz"
with tarfile.open(local_model_path) as tar:
    tar.extractall()

model = xgb.XGBRegressor() 
model.load_model('xgboost-model') 

I get:

  File "C:\Python\Python310\lib\site-packages\xgboost\sklearn.py", line 736, in load_model
    self.get_booster().load_model(fname)
  File "C:\Python\Python310\lib\site-packages\xgboost\core.py", line 2249, in load_model
    _check_call(_LIB.XGBoosterLoadModel(
  File "C:\Python\Python310\lib\site-packages\xgboost\core.py", line 203, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: string too long

I am not sure what is wrong. Maybe my local XGBoost package version is out of sync, or something has changed in how the model is persisted (there was a different way of doing it in the past).

Any pointers very much appreciated. Thanks!
