
lightgbm-transform's Introduction

LightGBM Transformation Library

The LightGBM transformation library aims to provide a flexible, automatic way to apply feature transformation when using LightGBM. Compared to performing the transformation separately, this approach has several advantages:

  • More efficient. Data preprocessing happens while each line is parsed and naturally takes advantage of LightGBM's multi-processing, so there is no need to store the whole transformed dataset in a file or in memory.
  • More convenient for development and iteration. Because the built-in transformation is saved and loaded along with the model, offline and online behavior stay consistent.

In this repo, users can learn:

  • How to customize your own parser using the LightGBM Parser interface.
  • How to use FreeForm2Parser, a powerful and efficient parser built into the repo, during model training. Instead of transforming the data yourself, you only need to prepare a feature spec containing the feature name, transform type, and expression, and make slight changes between experiment iterations (see the sketch after this list).
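
As a rough sketch of how the feature spec plugs into training (based on the usage shown in the issues below), the config file is passed via the parser_config_file dataset parameter; the file names, parameter values, and objective here are placeholders, and the contents of the spec follow the format documented in the repo rather than anything defined here:

import lightgbm as lgb

# The feature spec (parser config) is attached to the Dataset; transformations
# are then applied while each line of the raw data is parsed.
train_data = lgb.Dataset(
    "train.tsv",  # placeholder raw data file
    params={"parser_config_file": "parser_config.json",  # placeholder feature spec
            "header": True})

params = {"objective": "lambdarank"}  # illustrative objective
bst = lgb.train(params, train_data)

# The transformation is saved with the model, so prediction on raw data
# re-applies it and stays consistent with training.
pred = bst.predict("train.tsv")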

Get Started and Documentation

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

lightgbm-transform's People

Contributors

chjinche, ltxtech, ltxtech1999, manishadhingra, microsoft-github-operations[bot], microsoftopensource


lightgbm-transform's Issues

Support for Continue Train

The LightGBM framework allows continuing training from a previous model using the init_model parameter
(docs here).

Since we often want to fine-tune rankers to accommodate seasonal effects, this is a critical feature for us.
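
For reference, this is what continued training looks like in plain LightGBM via init_model; whether the same call behaves correctly once a parser_config_file is attached to the Dataset is exactly what this request is about. File names below are placeholders:

import lightgbm as lgb

params = {"objective": "lambdarank"}
train_data = lgb.Dataset(
    "train.tsv",
    params={"parser_config_file": "parser_config.json", "header": True})

# Initial training round, saved to disk.
bst = lgb.train(params, train_data, num_boost_round=100)
bst.save_model("model_round1.txt")

# Later: fine-tune starting from the saved model.
bst_continued = lgb.train(params, train_data, num_boost_round=50,
                          init_model="model_round1.txt")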

Issue picking the label name for the ranking objective

For the ranking problem, the library is not able to use label or query column names; it works only when the label and query are provided by column index.

The following code can be added to tests/python_package_test to reproduce the problem.

# Assumes the fixtures/helpers already defined in tests/python_package_test
# (params, rank_ds, generate_ds_with_header) are available in this module.
import json

import lightgbm as lgb
import numpy as np
import pytest


@pytest.fixture
def rank_ds_with_header(tmp_path):
    return generate_ds_with_header(rank_ds, tmp_path)


def test_ranker_data_with_header(params, rank_ds_with_header):
    verify_file_contents_to_debug(rank_ds_with_header)

    train_data = lgb.Dataset(rank_ds_with_header.data, params={
        "parser_config_file": rank_ds_with_header.parser_config, "header": True})
    bst = lgb.train(params, train_data, valid_sets=[train_data])
    pred = bst.predict(rank_ds_with_header.data)
    print(pred)
    np.testing.assert_allclose(pred[:5], np.array([0.83267298, 0.388454, 0.35369267, 0.60330376, -1.24218415]))

def test_rank_data_with_header_and_label_name(params, rank_ds_with_header):
    verify_file_contents_to_debug(rank_ds_with_header)

    # Works halfway: picks the right index but fails to train.
    # params['label'] = "name:Rating"
    # params['query'] = "0"

    # Doesn't work at all.
    params['label'] = "name:Rating"
    params['query'] = "name:DocId"

    train_data = lgb.Dataset(rank_ds_with_header.data, params={
        "parser_config_file": rank_ds_with_header.parser_config, "header": True})  # .construct()

    bst = lgb.train(params, train_data, valid_sets=[train_data])
    pred = bst.predict(rank_ds_with_header.data)
    np.testing.assert_allclose(pred[:5], np.array([0.83267298, 0.388454, 0.35369267, 0.60330376, -1.24218415]))


def verify_file_contents_to_debug(ds):
    # Verify that the header is actually added to the data file.
    with open(ds.data, 'r') as f:
        print(f.read()[:400])

    with open(ds.parser_config) as f:
        parser = json.load(f)
    print("parser.keys():")
    print(parser.keys())


Freeforms issue

Description

When trying to leverage freeforms in a learning-to-rank scenario, they appear to work only when the features used inside them are also defined in the parser config file as standalone linear transforms (last example in the attached notebook). When a freeform uses features that are not defined as standalone transforms, the ranker does not seem to use it at all.

Reproducible example

ff_example.zip

Environment info

lightgbm-transform version 3.3.1

Command(s) you used to install lightgbm-transform

Using a Docker image with:

RUN pip install --upgrade pip setuptools wheel && \
    pip install 'cmake==3.21.0' && \
    pip install lightgbm==3.3.1 --install-option=--mpi && \
    pip install lightgbm-transform==3.3.1

Additional Comments

lightgbm.basic.LightGBMError: Label should be the first column in a LibSVM file

If both train data and valid data exist, the exception below is raised when constructing the valid data.
It succeeds if only train data exists.

 File "/opt/miniconda/lib/python3.7/site-packages/lightgbm/engine.py", line 275, in train
    booster.add_valid(valid_set, name_valid_set)
  File "/opt/miniconda/lib/python3.7/site-packages/lightgbm/basic.py", line 2945, in add_valid
    data.construct().handle))
  File "/opt/miniconda/lib/python3.7/site-packages/lightgbm/basic.py", line 1811, in construct
    feature_name=self.feature_name, params=self.params)
  File "/opt/miniconda/lib/python3.7/site-packages/lightgbm/basic.py", line 1528, in _lazy_init
    ctypes.byref(self.handle)))
  File "/opt/miniconda/lib/python3.7/site-packages/lightgbm/basic.py", line 132, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Label should be the first column in a LibSVM file
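
A minimal sketch of the pattern that triggers the error, mirroring the Dataset usage from the tests above (file names are placeholders and both files share the same parser config):

import lightgbm as lgb

ds_params = {"parser_config_file": "parser_config.json", "header": True}
train_data = lgb.Dataset("train.tsv", params=ds_params)
valid_data = lgb.Dataset("valid.tsv", params=ds_params, reference=train_data)

# Training with only the train set succeeds; adding the valid set raises
# "Label should be the first column in a LibSVM file" while it is constructed.
bst = lgb.train({"objective": "lambdarank"}, train_data, valid_sets=[valid_data])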

Adding Group Information for 'lambdarank' Model

Hello,
I have a question about training a lambdarank model. I want to pass group information using the 'query' parameter, but I receive the error
"Number of rows ... exceeds upper limit of 10000 for a query".
It seems like my model does not use the query information. I am not sure whether the other group column parameters in LightGBM are supported, since all of the examples in this repo use the 'query' parameter. This is how I use it:

params = {
    # other params for training
    'query': index,  # where index is the column index I want to use for grouping
}

Then I pass params to training. Do you have any idea why my group information is not accepted, or is there an alternative, supported way to provide group information?
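
For comparison, this is how I would pass group sizes directly in plain LightGBM (the sizes and file name below are made up purely for illustration); I am not sure whether this path is supported together with parser_config_file:

import lightgbm as lgb

train_data = lgb.Dataset(
    "train.tsv",
    params={"parser_config_file": "parser_config.json", "header": True})
# Group sizes: the number of consecutive rows belonging to each query;
# they must sum to the total number of rows.
train_data.set_group([10, 25, 8])  # made-up sizes

bst = lgb.train({"objective": "lambdarank"}, train_data)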

Thank you!

Update lightgbm-transform to support lightgbm v4.1.0

Description

Update lightgbm-transform to support lightgbm v4.1.0

Reproducible example

Environment info

lightgbm-transform version or commit hash: 3.3.2

Command(s) you used to install lightgbm-transform

I have set up a Docker container using the instructions in the README, installed LightGBM (the supported version) and lightgbm-transform, and ran pytest.

With LightGBM 3.3.0: [screenshot of pytest results]

With LightGBM 4.1: [screenshot of pytest results]

Additional Comments
