Giter Site home page Giter Site logo

jiangnanboy / learning_to_rank Goto Github PK

View Code? Open in Web Editor NEW
245.0 2.0 72.0 2.05 MB

利用lightgbm做(learning to rank)排序学习,包括数据处理、模型训练、模型决策可视化、模型可解释性以及预测等。Use LightGBM to learn ranking, including data processing, model training, model decision visualization, model interpretability and prediction, etc.

Python 100.00%
lightgbm learning-to-rank python shap

learning_to_rank's Introduction

利用lightgbm做learning to rank 排序,主要包括:

  • 数据预处理
  • 模型训练
  • 模型决策可视化
  • 预测
  • ndcg评估
  • 特征重要度
  • SHAP特征贡献度解释
  • 样本的叶结点输出

(要求安装lightgbm、graphviz、shap等)

一.data format (raw data -> (feats.txt, group.txt))

python lgb_ltr.py -process
1.raw_train.txt

0 qid:10002 1:0.007477 2:0.000000 ... 45:0.000000 46:0.007042 #docid = GX008-86-4444840 inc = 1 prob = 0.086622

0 qid:10002 1:0.603738 2:0.000000 ... 45:0.333333 46:1.000000 #docid = GX037-06-11625428 inc = 0.0031586555555558 prob = 0.0897452 ...

2.feats.txt:

0 1:0.007477 2:0.000000 ... 45:0.000000 46:0.007042

0 1:0.603738 2:0.000000 ... 45:0.333333 46:1.000000 ...

3.group.txt:

8

8

8

8

8

16

8

118

16

8

...

二.model train (feats.txt, group.txt) -> train -> model.mod

python lgb_ltr.py -train
train params = {
        'task': 'train',  # 执行的任务类型
        'boosting_type': 'gbrt',  # 基学习器
        'objective': 'lambdarank',  # 排序任务(目标函数)
        'metric': 'ndcg',  # 度量的指标(评估函数)
        'max_position': 10,  # @NDCG 位置优化
        'metric_freq': 1,  # 每隔多少次输出一次度量结果
        'train_metric': True,  # 训练时就输出度量结果
        'ndcg_at': [10],
        'max_bin': 255,  # 一个整数,表示最大的桶的数量。默认值为 255。lightgbm 会根据它来自动压缩内存。如max_bin=255 时,则lightgbm 将使用uint8 来表示特征的每一个值。
        'num_iterations': 200,  # 迭代次数,即生成的树的棵数
        'learning_rate': 0.01,  # 学习率
        'num_leaves': 31,  # 叶子数
        'max_depth':6,
        'tree_learner': 'serial',  # 用于并行学习,‘serial’: 单台机器的tree learner
        'min_data_in_leaf': 30,  # 一个叶子节点上包含的最少样本数量
        'verbose': 2  # 显示训练时的信息
    }
  • docs:7796
  • groups:380
  • consume time : 4 seconds
  • training's ndcg@10: 0.940891
1.model.mod(model的格式在data/model/mode.mod)

训练时的输出:

  • [LightGBM] [Info] Total Bins 9171
  • [LightGBM] [Info] Number of data: 7796, number of used features: 40
  • [LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 9
  • [1] training's ndcg@10: 0.791427
  • [LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 12
  • [2] training's ndcg@10: 0.828608
  • [LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 10
  • ...
  • ...
  • ...
  • [198] training's ndcg@10: 0.941018
  • [LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
  • [199] training's ndcg@10: 0.941038
  • [LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
  • [200] training's ndcg@10: 0.940891
  • consume time : 4 seconds

三.模型决策过程的可视化生成

可指定树的索引进行可视化生成,便于分析决策过程。

python lgb_ltr.py -plottree

image

四.predict 数据格式如feats.txt,当然可以在每行后面加一个标识(如文档编号,商品编码等)作为排序的输出,这里我直接从test.txt中得到feats与comment作为predict

python lgb_ltr.py -predict
1.predict results
  • ['docid = GX252-32-5579630 inc = 1 prob = 0.190849'
  • 'docid = GX108-43-5342284 inc = 0.188670948386237 prob = 0.103576'
  • 'docid = GX039-85-6430259 inc = 1 prob = 0.300191' ...,
  • 'docid = GX009-50-15026058 inc = 1 prob = 0.082903'
  • 'docid = GX065-08-0661325 inc = 0.012907717401617 prob = 0.0312699'
  • 'docid = GX012-13-5603768 inc = 1 prob = 0.0961297']

五.validate ndcg 数据来自test.txt(data from test.txt)

python lgb_ltr.py -ndcg

all qids average ndcg: 0.761044123343

六.features 打印特征重要度(features importance)

python lgb_ltr.py -feature

模型中的特征是"Column_number",这里打印重要度时可以映射到真实的特征名,比如本测试用例是46个feature

1.features importance
  • feat0name : 228 : 0.038
  • feat1name : 22 : 0.0036666666666666666
  • feat2name : 27 : 0.0045
  • feat3name : 11 : 0.0018333333333333333
  • feat4name : 198 : 0.033
  • feat10name : 160 : 0.02666666666666667
  • ...
  • ...
  • ...
  • feat37name : 188 : 0.03133333333333333
  • feat38name : 434 : 0.07233333333333333
  • feat39name : 286 : 0.04766666666666667
  • feat40name : 169 : 0.028166666666666666
  • feat41name : 348 : 0.058
  • feat43name : 304 : 0.050666666666666665
  • feat44name : 283 : 0.04716666666666667
  • feat45name : 220 : 0.03666666666666667

七.利用SHAP值解析模型中特征重要度

python lgb_ltr.py -shap

这里不同于六中特征重要度的计算,而是利用博弈论的方法--SHAP(SHapley Additive exPlanations)来解析模型。 利用SHAP可以进行特征总体分析、多维特征交叉分析以及单特征分析等。

1.总体分析

image

image

2.多维特征交叉分析

image

3.单特征分析

image

八.利用模型得到样本叶结点的one-hot表示,可以用于像gbdt+lr这种模型的训练

python lgb_ltr.py -leaf

这里测试用例是test/leaf.txt 5个样本

[

  • [ 0. 1. 0. ..., 0. 0. 1.]
  • [ 1. 0. 0. ..., 0. 0. 0.]
  • [ 0. 0. 1. ..., 0. 0. 1.]
  • [ 0. 1. 0. ..., 0. 1. 0.]
  • [ 0. 0. 0. ..., 1. 0. 0.] ]

九.REFERENCES

https://github.com/microsoft/LightGBM

https://github.com/jma127/pyltr

https://github.com/slundberg/shap

contact

如有搜索、推荐、nlp以及大数据挖掘等问题或合作,可联系我:

1、我的github项目介绍:https://github.com/jiangnanboy

2、我的博客园技术博客:https://www.cnblogs.com/little-horse/

3、我的QQ号:2229029156

learning_to_rank's People

Contributors

jiangnanboy avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

learning_to_rank's Issues

maybe wrong

In predict,
test_ X, test_ y, test_ qids, comments = load_ data_ from_ raw(raw_ data_path)
Should it be changed to
test_ X, test_ y, test_ qids, comments = load_ data_ from_ raw(predict_ data_ path)?

数据据

这个数据集应该怎么构造,用自己的数据

lgbmranker如何應用於每個客戶推薦

您好,想請問lgbmranker的outcome -- probability, 是針對該物品和什麼的相關性分數?
另想請問,模型predict不包含group的概念,那我該如何針對每個用戶不同的歷史行為進行客製化的推薦?謝謝您

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.