
Entity-Duet Neural Ranking Model

This repository contains the source code for the Entity-Duet Neural Ranking Model (EDRM) paper.

(Figure: EDRM model architecture)

Baselines

This directory contains the code for our two main baselines: K-NRM and Conv-KNRM.

EDRM

This directory contains the code for EDRM, which builds on Conv-KNRM.

Results

Ranking results for each model. All result files are in TREC format.
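
For reference, each line of a TREC-format run file lists the query id, the literal string Q0, the document id, the rank, the model score, and a run name, e.g. `1 Q0 sogou-14837 1 12.3456 EDRM-CKNRM` (the ids and score here are illustrative).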

Method       Testing-SAME        Testing-DIFF        Testing-RAW
             NDCG@1    NDCG@10   NDCG@1    NDCG@10   MRR
K-NRM        0.2645    0.4197    0.3000    0.4228    0.3447
Conv-KNRM    0.3357    0.4810    0.3384    0.4318    0.3582
EDRM-KNRM    0.3096    0.4547    0.3327    0.4341    0.3616
EDRM-CKNRM   0.3397    0.4821    0.3708    0.4513    0.3892

Results on ClueWeb09 and ClueWeb12. All models are trained on anchor-document pairs from ClueWeb. These results leverage only entity embeddings and entity descriptions. For the English version of EDRM, please refer to our OpenMatch toolkit.

ClueWeb09:

Method       NDCG@20   ERR@20
Conv-KNRM    0.2893    0.1521
EDRM         0.2922    0.1642

ClueWeb12:

Method       NDCG@20   ERR@20
Conv-KNRM    0.1142    0.0930
EDRM         0.1183    0.0968

Citation

@inproceedings{liu2018EntityDuetNR,
  title={Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval},
  author={Zhenghao Liu and Chenyan Xiong and Maosong Sun and Zhiyuan Liu},
  booktitle={Proceedings of ACL},
  year={2018}
}

Contact

If you have questions, suggestions, or bug reports, please contact us by email.

Contributors

edwardzh, zkt12

Issues

Add classical ranking signal

I have noticed that most deep-learning matching papers focus on semantic matching, which makes sense because embeddings bring a new way to find support in snippet words that are not present in the query (but are still similar).

However, for some strange reason, other signals are totally forgotten in papers, even when, in the benchmarks, SVMRank with classical features is used as a baseline.

Since 2016 I have seen many RecSys papers using these classical signals in neural networks. They bucket the continuous values, making them categorical, and then assign an embedding to each categorical value. After that it depends on the model, but an MLP (Linear + ReLU + Dropout) seems like an OK solution.

The most important signals in search are the age/date of the publication, past popularity (short- and long-term clicks, for instance), and the type of content. The first two are continuous variables requiring bucketing.

I am wondering whether, in your work, you plan to focus on adding these signals?
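
For reference, a minimal sketch of the bucket-then-embed approach described above, in hypothetical PyTorch; the module names, bucket edges, and dimensions are all illustrative, not from this repository:

```python
import torch
import torch.nn as nn

class BucketedSignal(nn.Module):
    """Bucket a continuous signal, then embed the bucket id."""
    def __init__(self, boundaries, dim=16):
        super().__init__()
        self.register_buffer("boundaries", boundaries)  # sorted bucket edges
        self.embed = nn.Embedding(len(boundaries) + 1, dim)

    def forward(self, x):
        # torch.bucketize maps each float to the index of its bucket
        return self.embed(torch.bucketize(x, self.boundaries))

# illustrative bucket edges for document age (days) and click counts
age = BucketedSignal(torch.tensor([1.0, 7.0, 30.0, 365.0]))
clicks = BucketedSignal(torch.tensor([10.0, 100.0, 1000.0]))
mlp = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Dropout(0.5))

feats = torch.cat([age(torch.tensor([12.0])),       # 12-day-old doc
                   clicks(torch.tensor([250.0]))],  # 250 past clicks
                  dim=-1)                           # shape (1, 32)
signal_vec = mlp(feats)  # concatenate with matching features downstream
```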

Hello, where can I get the content of the corresponding "sogou-id"?

Hi, as the picture shows, I have printed the content of the file "test_raw.json" hosted on the Tsinghua cloud drive:

(screenshot of the printed file contents)

However, it seems that a lot of the content corresponding to the "sogou-id" fields cannot be found in any of the files stored at https://cloud.tsinghua.edu.cn/d/1bfff521fd784b95ac45/?p=%2Fjson&mode=list

So I want to ask: where can I find the content corresponding to a "sogou-id", e.g. "sogou-14837"?

Thanks in advance.

Question about the train_ent_des_expansion file format

Hello, your description of the format of the train_ent_des_expansion file is:
query ids \t document ids \t query entities \t document entities
But after downloading the file, I found the data in the dataset actually looks like this (one example):
11347,11347,8 11347,11347,5057,8 11347,11347,8,407,435,917,56,562,56,1927,56,5058,56,2262,56,59,90,155 -0.195402 蹦蹦网 蹦,蹦 蹦蹦网,购物,电影,信息平台

So I cannot tell what your description means. Can it be understood like this:
11347,11347,8 are the query word ids
11347,11347,5057,8 are the negative-sample document word ids
11347,11347,8,407,435,917,56,562,56,1927,56,5058,56,2262,56,59,90,155 are the positive-sample document word ids
-0.195402 is a document score? You did not explain where this score comes from.
蹦蹦网 is the entity in the query
蹦,蹦 are the entities in the negative-sample document
蹦蹦网,购物,电影,信息平台 are the entities in the positive-sample document

Is this the correct reading of the data format?
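
If that reading is correct, each line could be parsed like this (a hypothetical sketch; the field order and the tab separator are assumptions based on the format description above):

```python
def parse_line(line):
    # seven fields, assuming the interpretation in the question above
    (q_words, neg_words, pos_words,
     score, q_ents, neg_ents, pos_ents) = line.rstrip("\n").split("\t")
    return {
        "query_word_ids":   [int(i) for i in q_words.split(",")],
        "neg_doc_word_ids": [int(i) for i in neg_words.split(",")],
        "pos_doc_word_ids": [int(i) for i in pos_words.split(",")],
        "score":            float(score),  # origin of this score unclear
        "query_entities":   q_ents.split(","),
        "neg_doc_entities": neg_ents.split(","),
        "pos_doc_entities": pos_ents.split(","),
    }
```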

Add batch norm for a nice boost

As explained in previous messages, my training stops learning quite rapidly (after seeing a few thousand queries).
Performance is good, but I had the feeling I was overfitting, and regularization might help.
I tried many tricks, including, as said in a previous message, dropout.
Finally I added batch norm after each CNN: +8 absolute points on MAP!! (I manually checked the results afterwards.)
I of course have absolutely no idea how well this generalizes to other datasets. I saw a small boost on the WikiQA dataset, but the training loss reached 0 during the first epoch, so I think it is slightly too strong a regularizer for such a small dataset.
Maybe you want to try it on the Sogou and Bing logs...
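
A minimal sketch of this change, assuming a Conv-KNRM-style n-gram convolution; the class and parameter names are illustrative, not the repository's code:

```python
import torch.nn as nn

class ConvBN(nn.Module):
    """n-gram convolution followed by batch norm and ReLU."""
    def __init__(self, emb_dim, n_filters, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size)
        # BatchNorm1d normalizes each filter over the batch and positions
        self.bn = nn.BatchNorm1d(n_filters)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, emb_dim, seq_len)
        return self.act(self.bn(self.conv(x)))
```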

When running train.py, it fails with "No such file or directory: 'TRAIN'"

Hi, I have downloaded all the files provided at https://cloud.tsinghua.edu.cn/d/1f57be663018465ab0ad/?p=%2F&mode=list. Then I ran "python2 train.py -data ~/data/ednr/data.pt -train TRAIN -valid VALID -test_data ~/data/ednr/test.pt -save_model ~/data/ednr/save -save_mode best", but it raises "IOError: [Errno 2] No such file or directory: 'TRAIN'". The files from the data website are data.pt, tag_result, test.pt, and title.emb. Is data.pt the file that train.py expects? I copied data.pt to a file named "TRAIN", but that also failed.

How the loss manages labels from click models

In the examples provided with the code, the labels are 1 or 0.
In the three papers (K-NRM, Conv-KNRM, Entity-Duet), you train and test on logs where relevance is computed by a click model, meaning labels are no longer 1 or 0 but floats between 0 and 1. However, the three papers say the loss is standard pairwise learning to rank (and the code matches that point).

I am wondering how you take the relevance labels into account when they are floats between 0 and 1?

The K-NRM paper says that the relevance scores are mapped to relevance grades. Does that mean you generate pairs like (grade-4 document as positive vs. grade-3 document as negative)?
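
For concreteness, a hypothetical sketch of pairwise training over graded labels; the pair-generation rule is the questioner's guess above, not the authors' confirmed method:

```python
import torch

def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    # standard pairwise hinge: max(0, margin - (s_pos - s_neg))
    return torch.clamp(margin - (score_pos - score_neg), min=0.0).mean()

# hypothetical pair generation from graded labels: any document with a
# strictly higher grade becomes the positive of the pair
def make_pairs(docs):  # docs: list of (doc_id, grade) for one query
    return [(a, b) for a, ga in docs for b, gb in docs if ga > gb]
```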

CUDA out of memory

Hello: running your code with PyTorch 1.4, I find that after a few thousand steps it raises CUDA out of memory.
(screenshot of the error)
Could you share the environment configuration you used for this code?

Freezing embedding layer

I am running K-NRM and Conv-KNRM on our own logs.
I learned the embeddings on my own dataset (the real documents) with fastText.

As in your experiments (and other papers), I get a nice boost from Conv-KNRM compared to K-NRM.
However, the result is approximately the same when the first layer is frozen.
Moreover, the maximum performance is reached on dev very rapidly (during the first epoch) and then stays approximately the same for several epochs (whether the embedding layer is frozen or not).

(plot: dev in blue, test in red)

Because of a large bias in our logs, I needed to limit them to 100K queries (with 20 docs per SERP).

Have you noticed the same behaviour on your log datasets (a rapid plateau, and no effect from freezing the first layer)? Have you tried adding more regularization? (I tried 50% dropout on the embedding layers without real effect.)
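
For reference, a minimal sketch of freezing a pretrained embedding layer in PyTorch; the matrix here is a stand-in for weights loaded from a fastText .vec file:

```python
import torch
import torch.nn as nn

# stand-in for the real pretrained matrix (vocab 50000, dim 300)
pretrained = torch.randn(50000, 300)
emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

# equivalent for an existing layer:
# model.embedding.weight.requires_grad = False
# then give the optimizer only the trainable parameters:
# optim = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```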

Some questions about EDRM's input

Hi, I have some questions about EDRM's input. I've downloaded CN-DBpedia and found it contains triple data. How can I get the entity descriptions and entity types? Also, in "EDRM/preprocess.py", what is "-ent_car", and what data format does it need?
Looking forward to your reply, and thanks!

Environment configuration

Could you provide the detailed environment configuration?
Specifically, the PyTorch and Python version information.

Next work -> snippet?

I have noticed that, like a few other teams, you focus on finding entities and matching entities from the query to entities from the document.
It seems to me that no one is working on snippets, because the datasets everybody works on already have them, or teams just use document titles.

In my own experiments with your implementations, I have noticed that the way I build snippets has a large impact on performance. In particular, snippets that are too long or too short carry too much or too little information, and it's quite obvious that the words around the matching ones can provide a lot of signal. Of course, this is pure feature engineering for a supposedly end-to-end learned model (it requires defining where end-to-end starts).
In a production environment you mostly have access to the full document, and when you build a snippet you decide, in some way, how much contextual information you are adding; this has (in my case) a lot of impact.
Maybe for your next work, if it is still about ranking, you may want to work on this aspect :-)

Separate matching on title and on Snippet

I have separated the matching of query vs. title and query vs. snippet.
This increased the inference time on a 10-core CPU from 50ms to 67ms (still manageable).
MAP (for the model with CNN) improved from 0.31 to 0.36 (measured with raw clicks on a search engine using BM25).
For what it's worth, the model without CNN (and without separating title and snippet text) had a 0.27 MAP, meaning that, in my case, separating the matching on title and snippet improves performance about as much as adding the CNN.

You may want to test this approach on the Sogou / Bing logs.
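
A minimal sketch of this two-field setup; `matcher_title` and `matcher_snippet` stand for any query-document feature extractor (e.g. a Conv-KNRM kernel-pooling tower), and all names are hypothetical, not the repository's code:

```python
import torch
import torch.nn as nn

class TwoFieldRanker(nn.Module):
    """Separate matching towers for title and snippet, one scoring layer."""
    def __init__(self, matcher_title, matcher_snippet, n_features):
        super().__init__()
        self.matcher_title = matcher_title      # its own weights
        self.matcher_snippet = matcher_snippet  # separate weights
        self.score = nn.Linear(2 * n_features, 1)

    def forward(self, query, title, snippet):
        f = torch.cat([self.matcher_title(query, title),
                       self.matcher_snippet(query, snippet)], dim=-1)
        return self.score(f)
```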
