Official software repository of the ACM SIGIR 2022 resource paper: "The Istella22 Dataset: Bridging Traditional and Neural Learning to Rank Evaluation" by D. Dato et al.
First of all, thank you for this interesting work (I enjoyed reading the paper a lot)! After cloning the project and executing `run_monot5.py`, we obtained the following results from the cached run files in `monoT5/runs`:
| name | P@1 | P@5 | P@10 | nDCG@10 | nDCG@20 | RR | AP |
|------|-----|-----|------|---------|---------|----|----|
| MonoT5 fine-tuned title+url | 0.8412 | 0.5991 | 0.3914 | 0.6858 | 0.7087 | 0.9025 | 0.7396 |
| MonoT5 fine-tuned title+url+text | 0.8581 | 0.5945 | 0.3910 | 0.7034 | 0.7268 | 0.9132 | 0.7462 |
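For reference, this is how we computed the metrics from the run files. The sketch below is a minimal, self-contained reimplementation of nDCG@k (it is not the repository's evaluation code, and the paper's official setup may differ, e.g. in the gain function or handling of unjudged documents); it uses the linear gain of trec_eval's `ndcg_cut` and dict-based run/qrels structures as an assumption:

```python
import math

def dcg(gains, k):
    # Discounted cumulative gain with linear gain, as in trec_eval's ndcg_cut
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(run, qrels, k=10):
    """Mean nDCG@k over all judged queries.

    run:   {qid: {docid: retrieval score}}
    qrels: {qid: {docid: graded relevance label}}
    """
    per_query = []
    for qid, judged in qrels.items():
        docs = run.get(qid, {})
        # rank documents by descending retrieval score
        ranking = sorted(docs, key=docs.get, reverse=True)
        gains = [judged.get(did, 0) for did in ranking]
        ideal = sorted(judged.values(), reverse=True)
        idcg = dcg(ideal, k)
        if idcg > 0:
            per_query.append(dcg(gains, k) / idcg)
    return sum(per_query) / len(per_query) if per_query else 0.0

# toy example: one query whose ranking matches the ideal ordering
qrels = {"q1": {"d1": 4, "d2": 2, "d3": 0}}
run = {"q1": {"d1": 3.0, "d2": 2.0, "d3": 1.0}}
print(round(ndcg_at_k(run, qrels, k=10), 4))  # 1.0
```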
These nDCG values are notably higher than the ones reported in the paper (which are ≈0.45). A student of mine also re-ran the T5 models published on Hugging Face without the run caching and reported similarly diverging values.
I just wanted to highlight this finding. Do you have any idea where these values are coming from?