
lightgbm-transform's Introduction

LightGBM Transformation Library

The LightGBM transformation library aims to provide a flexible, automatic way to apply feature transformation when using LightGBM. Compared to performing the transformation separately, this approach has several advantages:

  • More efficient. Data preprocessing happens while each line is parsed and naturally takes advantage of LightGBM's multi-processing, so there is no need to store the whole transformed dataset in a file or in memory.
  • More convenient for development and iteration. Because the built-in transformation is saved and loaded along with the model, offline and online behavior stay consistent.

In this repo, users can learn:

  • How to customize your own parser using the LightGBM Parser interface.
  • How to use FreeForm2Parser, a powerful and efficient parser built into the repo, during model training. Instead of transforming the data yourself, you only need to prepare a feature spec containing the feature name, transform type, and expression, and make slight changes between experiment iterations (see the sketch after this list).
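
As a rough sketch of how the feature spec plugs into training (based on the usage shown in the issues below), the config file is passed via the parser_config_file dataset parameter; the file names, parameter values, and objective here are placeholders, and the contents of the spec follow the format documented in the repo rather than anything defined here:

import lightgbm as lgb

# The feature spec (parser config) is attached to the Dataset; transformations
# are then applied while each line of the raw data is parsed.
train_data = lgb.Dataset(
    "train.tsv",  # placeholder raw data file
    params={"parser_config_file": "parser_config.json",  # placeholder feature spec
            "header": True})

params = {"objective": "lambdarank"}  # illustrative objective
bst = lgb.train(params, train_data)

# The transformation is saved with the model, so prediction on raw data
# re-applies it and stays consistent with training.
pred = bst.predict("train.tsv")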

Get Started and Documentation

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

lightgbm-transform's People

Contributors

chjinche, ltxtech, ltxtech1999, manishadhingra, microsoft-github-operations[bot], microsoftopensource


lightgbm-transform's Issues

Support for Continue Train

The LightGBM framework allows continuing training from a previous model using the init_model parameter
(docs here).

Since we often want to fine-tune rankers to accommodate seasonal effects, this is a critical feature for us.
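
For reference, this is what continued training looks like in plain LightGBM via init_model; whether the same call behaves correctly once a parser_config_file is attached to the Dataset is exactly what this request is about. File names below are placeholders:

import lightgbm as lgb

params = {"objective": "lambdarank"}
train_data = lgb.Dataset(
    "train.tsv",
    params={"parser_config_file": "parser_config.json", "header": True})

# Initial training round, saved to disk.
bst = lgb.train(params, train_data, num_boost_round=100)
bst.save_model("model_round1.txt")

# Later: fine-tune starting from the saved model.
bst_continued = lgb.train(params, train_data, num_boost_round=50,
                          init_model="model_round1.txt")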

Issue picking the label name for the ranking objective

For the ranking problem, the library is not able to use label or query column names; it works only when the label and query are provided by column index.

The following code can be added to tests/python_package_test to reproduce the problem.

# Assumes the fixtures/helpers already defined in tests/python_package_test
# (params, rank_ds, generate_ds_with_header) are available in this module.
import json

import lightgbm as lgb
import numpy as np
import pytest


@pytest.fixture
def rank_ds_with_header(tmp_path):
    return generate_ds_with_header(rank_ds, tmp_path)


def test_ranker_data_with_header(params, rank_ds_with_header):
    verify_file_contents_to_debug(rank_ds_with_header)

    train_data = lgb.Dataset(rank_ds_with_header.data, params={
        "parser_config_file": rank_ds_with_header.parser_config, "header": True})
    bst = lgb.train(params, train_data, valid_sets=[train_data])
    pred = bst.predict(rank_ds_with_header.data)
    print(pred)
    np.testing.assert_allclose(pred[:5], np.array([0.83267298, 0.388454, 0.35369267, 0.60330376, -1.24218415]))

def test_rank_data_with_header_and_label_name(params, rank_ds_with_header):
    verify_file_contents_to_debug(rank_ds_with_header)

    # Works halfway: picks the right index but fails to train.
    # params['label'] = "name:Rating"
    # params['query'] = "0"

    # Doesn't work at all.
    params['label'] = "name:Rating"
    params['query'] = "name:DocId"

    train_data = lgb.Dataset(rank_ds_with_header.data, params={
        "parser_config_file": rank_ds_with_header.parser_config, "header": True})  # .construct()

    bst = lgb.train(params, train_data, valid_sets=[train_data])
    pred = bst.predict(rank_ds_with_header.data)
    np.testing.assert_allclose(pred[:5], np.array([0.83267298, 0.388454, 0.35369267, 0.60330376, -1.24218415]))


def verify_file_contents_to_debug(ds):
    # Verify that the header is actually added to the data file.
    with open(ds.data, 'r') as f:
        print(f.read()[:400])

    with open(ds.parser_config) as f:
        parser = json.load(f)
    print("parser.keys():")
    print(parser.keys())


Freeforms issue

Description

When trying to leverage freeforms in a learning-to-rank scenario, they appear to work only when the features used inside them are also defined in the parser config file as standalone linear transforms (last example in the attached notebook). When a freeform uses features that are not defined as standalone transforms, the ranker does not seem to use it at all.

Reproducible example

ff_example.zip

Environment info

lightgbm-transform version 3.3.1

Command(s) you used to install lightgbm-transform

Using a Docker image with:

RUN pip install --upgrade pip setuptools wheel && \
    pip install 'cmake==3.21.0' && \
    pip install lightgbm==3.3.1 --install-option=--mpi && \
    pip install lightgbm-transform==3.3.1

Additional Comments

lightgbm.basic.LightGBMError: Label should be the first column in a LibSVM file

If both train data and valid data exist, the exception below is raised when constructing the valid data.
It succeeds if only train data exists.

 File "/opt/miniconda/lib/python3.7/site-packages/lightgbm/engine.py", line 275, in train
    booster.add_valid(valid_set, name_valid_set)
  File "/opt/miniconda/lib/python3.7/site-packages/lightgbm/basic.py", line 2945, in add_valid
    data.construct().handle))
  File "/opt/miniconda/lib/python3.7/site-packages/lightgbm/basic.py", line 1811, in construct
    feature_name=self.feature_name, params=self.params)
  File "/opt/miniconda/lib/python3.7/site-packages/lightgbm/basic.py", line 1528, in _lazy_init
    ctypes.byref(self.handle)))
  File "/opt/miniconda/lib/python3.7/site-packages/lightgbm/basic.py", line 132, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Label should be the first column in a LibSVM file
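
A minimal sketch of the pattern that triggers the error, mirroring the Dataset usage from the tests above (file names are placeholders and both files share the same parser config):

import lightgbm as lgb

ds_params = {"parser_config_file": "parser_config.json", "header": True}
train_data = lgb.Dataset("train.tsv", params=ds_params)
valid_data = lgb.Dataset("valid.tsv", params=ds_params, reference=train_data)

# Training with only the train set succeeds; adding the valid set raises
# "Label should be the first column in a LibSVM file" while it is constructed.
bst = lgb.train({"objective": "lambdarank"}, train_data, valid_sets=[valid_data])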

Adding Group Information for 'lambdarank' Model

Hello,
I have a question about training a lambdarank model. I want to pass group information using the 'query' parameter, but I receive the error
"Number of rows ... exceeds upper limit of 10000 for a query".
It seems like my model does not use the query information. I am not sure whether the other group column parameters in LightGBM are supported, since all of the examples in this repo use the 'query' parameter. This is how I use it:

params = {
    # other params for training
    'query': index,  # where index is the column index I want to use for grouping
}

Then I pass params to training. Do you have any idea why my group information is not accepted, or is there an alternative, supported way to provide group information?
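
For comparison, this is how I would pass group sizes directly in plain LightGBM (the sizes and file name below are made up purely for illustration); I am not sure whether this path is supported together with parser_config_file:

import lightgbm as lgb

train_data = lgb.Dataset(
    "train.tsv",
    params={"parser_config_file": "parser_config.json", "header": True})
# Group sizes: the number of consecutive rows belonging to each query;
# they must sum to the total number of rows.
train_data.set_group([10, 25, 8])  # made-up sizes

bst = lgb.train({"objective": "lambdarank"}, train_data)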

Thank you!

Update lightgbm-transform to support lightgbm v4.1.0

Description

Update lightgbm-transform to support lightgbm v4.1.0

Reproducible example

Environment info

lightgbm-transform version or commit hash: 3.3.2

Command(s) you used to install lightgbm-transform

I have set up a Docker container using the instructions in the README, installed LightGBM (the supported version) and lightgbm-transform, and ran pytest.

With LightGBM 3.3.0: [screenshot of pytest results]

With LightGBM 4.1: [screenshot of pytest results]

Additional Comments
