
Entity-Duet Neural Ranking Model

This repository contains the source code for the Entity-Duet Neural Ranking Model (EDRM) paper.

(Figure: EDRM model architecture)

Baselines

This directory contains the code for our two main baselines: K-NRM and Conv-KNRM.

EDRM

This directory contains the code for EDRM, which builds on Conv-KNRM.

Results

Ranking results for each model. All result files are in TREC format.
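
For reference, each line of a TREC-format run file lists the query id, the literal string Q0, the document id, the rank, the model score, and a run name, e.g. `1 Q0 sogou-14837 1 12.3456 EDRM-CKNRM` (the ids and score here are illustrative).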

Method       Testing-SAME        Testing-DIFF        Testing-RAW
             NDCG@1    NDCG@10   NDCG@1    NDCG@10   MRR
K-NRM        0.2645    0.4197    0.3000    0.4228    0.3447
Conv-KNRM    0.3357    0.4810    0.3384    0.4318    0.3582
EDRM-KNRM    0.3096    0.4547    0.3327    0.4341    0.3616
EDRM-CKNRM   0.3397    0.4821    0.3708    0.4513    0.3892

Results on ClueWeb09 and ClueWeb12. All models are trained on anchor-document pairs from ClueWeb. These results leverage only entity embeddings and entity descriptions. For the English version of EDRM, please refer to our OpenMatch toolkit.

ClueWeb09:

Method       NDCG@20   ERR@20
Conv-KNRM    0.2893    0.1521
EDRM         0.2922    0.1642

ClueWeb12:

Method       NDCG@20   ERR@20
Conv-KNRM    0.1142    0.0930
EDRM         0.1183    0.0968

Citation

@inproceedings{liu2018EntityDuetNR,
  title={Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval},
  author={Zhenghao Liu and Chenyan Xiong and Maosong Sun and Zhiyuan Liu},
  booktitle={Proceedings of ACL},
  year={2018}
}

Contact

If you have questions, suggestions, or bug reports, please contact us by email.

Contributors

edwardzh, zkt12

Issues

Add classical ranking signal

I have noticed that most deep-learning matching papers focus on semantic matching, which makes sense because embeddings bring a new way to find support in snippet words that are not present in the query (but are still similar).

However, for some strange reason, other signals are totally forgotten in papers, even when, in the benchmarks, SVMRank with classical features is used as a baseline.

Since 2016 I have seen many RecSys papers using these classical signals in neural networks. They bucket the continuous values, making them categorical, and then assign an embedding to each categorical value. After that it depends on the model, but an MLP (Linear + ReLU + Dropout) seems like an OK solution.

The most important signals in search are the age/date of the publication, past popularity (short- and long-term clicks, for instance), and the type of content. The first two are continuous variables requiring bucketing.

I am wondering whether, in your work, you plan to focus on adding these signals?
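
For reference, a minimal sketch of the bucket-then-embed approach described above, in hypothetical PyTorch; the module names, bucket edges, and dimensions are all illustrative, not from this repository:

```python
import torch
import torch.nn as nn

class BucketedSignal(nn.Module):
    """Bucket a continuous signal, then embed the bucket id."""
    def __init__(self, boundaries, dim=16):
        super().__init__()
        self.register_buffer("boundaries", boundaries)  # sorted bucket edges
        self.embed = nn.Embedding(len(boundaries) + 1, dim)

    def forward(self, x):
        # torch.bucketize maps each float to the index of its bucket
        return self.embed(torch.bucketize(x, self.boundaries))

# illustrative bucket edges for document age (days) and click counts
age = BucketedSignal(torch.tensor([1.0, 7.0, 30.0, 365.0]))
clicks = BucketedSignal(torch.tensor([10.0, 100.0, 1000.0]))
mlp = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Dropout(0.5))

feats = torch.cat([age(torch.tensor([12.0])),       # 12-day-old doc
                   clicks(torch.tensor([250.0]))],  # 250 past clicks
                  dim=-1)                           # shape (1, 32)
signal_vec = mlp(feats)  # concatenate with matching features downstream
```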

Hello, where can I get the content of the corresponding "sogou-id"?

Hi, as the picture shows, I have printed the content of the file "test_raw.json" hosted on the Tsinghua cloud drive:

(screenshot of the printed file contents)

However, it seems that a lot of the content corresponding to the "sogou-id" fields cannot be found in any of the files stored at https://cloud.tsinghua.edu.cn/d/1bfff521fd784b95ac45/?p=%2Fjson&mode=list

So I want to ask: where can I find the content corresponding to a "sogou-id", e.g. "sogou-14837"?

Thanks in advance.

Question about the train_ent_des_expansion file format

Hello, your description of the format of the train_ent_des_expansion file is:
query ids \t document ids \t query entities \t document entities
But after downloading the file, I found the data in the dataset actually looks like this (one example):
11347,11347,8 11347,11347,5057,8 11347,11347,8,407,435,917,56,562,56,1927,56,5058,56,2262,56,59,90,155 -0.195402 蹦蹦网 蹦,蹦 蹦蹦网,购物,电影,信息平台

So I cannot tell what your description means. Can it be understood like this:
11347,11347,8 are the query word ids
11347,11347,5057,8 are the negative-sample document word ids
11347,11347,8,407,435,917,56,562,56,1927,56,5058,56,2262,56,59,90,155 are the positive-sample document word ids
-0.195402 is a document score? You did not explain where this score comes from.
蹦蹦网 is the entity in the query
蹦,蹦 are the entities in the negative-sample document
蹦蹦网,购物,电影,信息平台 are the entities in the positive-sample document

Is this the correct reading of the data format?
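
If that reading is correct, each line could be parsed like this (a hypothetical sketch; the field order and the tab separator are assumptions based on the format description above):

```python
def parse_line(line):
    # seven fields, assuming the interpretation in the question above
    (q_words, neg_words, pos_words,
     score, q_ents, neg_ents, pos_ents) = line.rstrip("\n").split("\t")
    return {
        "query_word_ids":   [int(i) for i in q_words.split(",")],
        "neg_doc_word_ids": [int(i) for i in neg_words.split(",")],
        "pos_doc_word_ids": [int(i) for i in pos_words.split(",")],
        "score":            float(score),  # origin of this score unclear
        "query_entities":   q_ents.split(","),
        "neg_doc_entities": neg_ents.split(","),
        "pos_doc_entities": pos_ents.split(","),
    }
```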

Add batch norm for a nice boost

As explained in previous messages, my training stops learning quite rapidly (after seeing a few thousand queries).
Performance is good, but I had the feeling I was overfitting, and regularization might help.
I tried many tricks, including, as said in a previous message, dropout.
Finally I added batch norm after each CNN: +8 absolute points on MAP!! (I manually checked the results afterwards.)
I of course have absolutely no idea how well this generalizes to other datasets. I saw a small boost on the WikiQA dataset, but the training loss reached 0 during the first epoch, so I think it is slightly too strong a regularizer for such a small dataset.
Maybe you want to try it on the Sogou and Bing logs...
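
A minimal sketch of this change, assuming a Conv-KNRM-style n-gram convolution; the class and parameter names are illustrative, not the repository's code:

```python
import torch.nn as nn

class ConvBN(nn.Module):
    """n-gram convolution followed by batch norm and ReLU."""
    def __init__(self, emb_dim, n_filters, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size)
        # BatchNorm1d normalizes each filter over the batch and positions
        self.bn = nn.BatchNorm1d(n_filters)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, emb_dim, seq_len)
        return self.act(self.bn(self.conv(x)))
```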

When running train.py, it fails with "No such file or directory: 'TRAIN'"

Hi, I have downloaded all the files provided at https://cloud.tsinghua.edu.cn/d/1f57be663018465ab0ad/?p=%2F&mode=list. Then I ran "python2 train.py -data ~/data/ednr/data.pt -train TRAIN -valid VALID -test_data ~/data/ednr/test.pt -save_model ~/data/ednr/save -save_mode best", but it raises "IOError: [Errno 2] No such file or directory: 'TRAIN'". The files from the data website are data.pt, tag_result, test.pt, and title.emb. Is data.pt the file that train.py expects? I copied data.pt to a file named "TRAIN", but that also failed.

How the loss manages labels from click models

In the examples provided with the code, the labels are 1 or 0.
In the three papers (K-NRM, Conv-KNRM, Entity-Duet), you train and test on logs where relevance is computed by a click model, meaning labels are no longer 1 or 0 but floats between 0 and 1. However, the three papers say the loss is standard pairwise learning to rank (and the code matches that point).

I am wondering how you take the relevance labels into account when they are floats between 0 and 1?

The K-NRM paper says that the relevance scores are mapped to relevance grades. Does that mean you generate pairs like (grade-4 document as positive vs. grade-3 document as negative)?
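
For concreteness, a hypothetical sketch of pairwise training over graded labels; the pair-generation rule is the questioner's guess above, not the authors' confirmed method:

```python
import torch

def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    # standard pairwise hinge: max(0, margin - (s_pos - s_neg))
    return torch.clamp(margin - (score_pos - score_neg), min=0.0).mean()

# hypothetical pair generation from graded labels: any document with a
# strictly higher grade becomes the positive of the pair
def make_pairs(docs):  # docs: list of (doc_id, grade) for one query
    return [(a, b) for a, ga in docs for b, gb in docs if ga > gb]
```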

CUDA out of memory

Hello: running your code with PyTorch 1.4, I find that after a few thousand steps it raises CUDA out of memory.
(screenshot of the error)
Could you share the environment configuration you used for this code?

Freezing embedding layer

I am running K-NRM and Conv-KNRM on our own logs.
I learned the embeddings on my own dataset (the real documents) with fastText.

As in your experiments (and other papers), I get a nice boost from Conv-KNRM compared to K-NRM.
However, the result is approximately the same when the first layer is frozen.
Moreover, the maximum performance is reached on dev very rapidly (during the first epoch) and then stays approximately the same for several epochs (whether the embedding layer is frozen or not).

(plot: dev in blue, test in red)

Because of a large bias in our logs, I needed to limit them to 100K queries (with 20 docs per SERP).

Have you noticed the same behaviour on your log datasets (a rapid plateau, and no effect from freezing the first layer)? Have you tried adding more regularization? (I tried 50% dropout on the embedding layers without real effect.)
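
For reference, a minimal sketch of freezing a pretrained embedding layer in PyTorch; the matrix here is a stand-in for weights loaded from a fastText .vec file:

```python
import torch
import torch.nn as nn

# stand-in for the real pretrained matrix (vocab 50000, dim 300)
pretrained = torch.randn(50000, 300)
emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

# equivalent for an existing layer:
# model.embedding.weight.requires_grad = False
# then give the optimizer only the trainable parameters:
# optim = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```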

Some questions about EDRM's input

Hi, I have some questions about EDRM's input. I've downloaded CN-DBpedia and found it contains triple data. How can I get the entity descriptions and entity types? Also, in "EDRM/preprocess.py", what is "-ent_car", and what data format does it need?
Looking forward to your reply, and thanks!

Environment configuration

Could you provide the detailed environment configuration?
Specifically, the PyTorch and Python version information.

Next work -> snippet?

I have noticed that, like a few other teams, you focus on finding entities and matching entities from the query to entities from the document.
It seems to me that no one is working on snippets, because the datasets everybody works on already have them, or teams just use document titles.

In my own experiments with your implementations, I have noticed that the way I build snippets has a large impact on performance. In particular, snippets that are too long or too short carry too much or too little information, and it's quite obvious that the words around the matching ones can provide a lot of signal. Of course, this is pure feature engineering for a supposedly end-to-end learned model (it requires defining where end-to-end starts).
In a production environment you mostly have access to the full document, and when you build a snippet you decide, in some way, how much contextual information you are adding; this has (in my case) a lot of impact.
Maybe for your next work, if it is still about ranking, you may want to work on this aspect :-)

Separate matching on title and on Snippet

I have separated the matching of query vs. title and query vs. snippet.
This increased the inference time on a 10-core CPU from 50ms to 67ms (still manageable).
MAP (for the model with CNN) improved from 0.31 to 0.36 (measured with raw clicks on a search engine using BM25).
For what it's worth, the model without CNN (and without separating title and snippet text) had a 0.27 MAP, meaning that, in my case, separating the matching on title and snippet improves performance about as much as adding the CNN.

You may want to test this approach on the Sogou / Bing logs.
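
A minimal sketch of this two-field setup; `matcher_title` and `matcher_snippet` stand for any query-document feature extractor (e.g. a Conv-KNRM kernel-pooling tower), and all names are hypothetical, not the repository's code:

```python
import torch
import torch.nn as nn

class TwoFieldRanker(nn.Module):
    """Separate matching towers for title and snippet, one scoring layer."""
    def __init__(self, matcher_title, matcher_snippet, n_features):
        super().__init__()
        self.matcher_title = matcher_title      # its own weights
        self.matcher_snippet = matcher_snippet  # separate weights
        self.score = nn.Linear(2 * n_features, 1)

    def forward(self, query, title, snippet):
        f = torch.cat([self.matcher_title(query, title),
                       self.matcher_snippet(query, snippet)], dim=-1)
        return self.score(f)
```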
