ProxLogPRF: A Proximity-based Log-logistic Feedback Model for Pseudo-relevance Feedback

This is the official repository of the manuscript "ProxLogPRF: A Proximity-based Log-logistic Feedback Model for Pseudo-relevance Feedback" submitted to Information Processing & Management (IP&M).

Updates

  • Jan 23, 2022: the project has been released to GitHub.

Requirements

  • Java 1.7 -- development environment
  • the required *.jar packages -- included in the 'lib' folder.
  • pandas -- used by trecEval.py

Experimentation instructions

  • Step 1: create the index for each data collection (e.g. WT2G).

    $ javac -cp "lib/*:" ./*.java                                             # compile
    $ java -cp "lib/*:" index.IndexTREC -docs datasets/WT2G/ -data WT2G       # create index for WT2G
    
  • Step 2: retrieve documents with each model (e.g. BM25). Note: this step writes the ranking results to a file named *-report.txt under 'ProxLogPRF/result'. It also generates *.txt files containing the metric results (i.e. MAP and P@k) under 'ProxLogPRF/result'.

    $ javac -cp "lib/*:" ./*.java                                             # compile
    $ java -cp "lib/*:" ProxPRF.BM25 -k1 1.2 -b 0.35                          # use BM25 model to retrieve documents
    $ java -cp "lib/*:" ProxPRF.BM25 -h                                       # show the usage of each argument
    
  • Step 3: evaluate the retrieval model. We use trecEval.py to evaluate model performance via MAP, P@k, nDCG and nDCG@k.

    $ python trecEval.py result/BM25/WT2G-BM25-1.2-0.35-report.txt query-judge/qrels.WT2G result/WT2G-BM25-1.2-0.35.xls
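
The metrics named in Step 3 can be illustrated with a minimal sketch. This is not the repository's trecEval.py: it assumes the run file uses the standard six-column TREC format (`qid Q0 docid rank score tag`) and the qrels file the standard four-column format (`qid iter docid relevance`), and all function names below are hypothetical. MAP is averaged over the queries that appear in both files, which is also an assumption.

```python
import math
from collections import defaultdict


def parse_qrels(lines):
    """Parse qrels lines (assumed format: qid iter docid relevance)."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _, docid, rel = parts
        qrels[qid][docid] = int(rel)
    return dict(qrels)


def parse_run(lines):
    """Parse run lines (assumed format: qid Q0 docid rank score tag)."""
    scored = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue
        qid, _, docid, _, score, _ = parts
        scored[qid].append((float(score), docid))
    # Rank each query's documents by descending score.
    return {q: [d for _, d in sorted(items, key=lambda x: -x[0])]
            for q, items in scored.items()}


def average_precision(ranking, rels):
    """Average of precision values at the rank of each relevant document."""
    num_rel = sum(1 for r in rels.values() if r > 0)
    if num_rel == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, docid in enumerate(ranking, start=1):
        if rels.get(docid, 0) > 0:
            hits += 1
            total += hits / rank
    return total / num_rel


def precision_at_k(ranking, rels, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(1 for d in ranking[:k] if rels.get(d, 0) > 0) / k


def ndcg_at_k(ranking, rels, k):
    """nDCG@k with linear gain and a log2 rank discount."""
    dcg = sum(max(rels.get(d, 0), 0) / math.log2(rank + 1)
              for rank, d in enumerate(ranking[:k], start=1))
    ideal = sorted((r for r in rels.values() if r > 0), reverse=True)[:k]
    idcg = sum(r / math.log2(rank + 1) for rank, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_average_precision(run, qrels):
    """MAP over the queries present in both the run and the qrels."""
    qids = [q for q in run if q in qrels]
    if not qids:
        return 0.0
    return sum(average_precision(run[q], qrels[q]) for q in qids) / len(qids)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from any TREC collection.
    qrels = parse_qrels(["1 0 d1 1", "1 0 d2 0", "1 0 d3 1"])
    run = parse_run(["1 Q0 d1 1 3.0 demo", "1 Q0 d2 2 2.0 demo", "1 Q0 d3 3 1.0 demo"])
    print("MAP:", mean_average_precision(run, qrels))
    print("P@2:", precision_at_k(run["1"], qrels["1"], 2))
    print("nDCG@3:", ndcg_at_k(run["1"], qrels["1"], 3))
```

Results from this sketch may differ slightly from those reported by trecEval.py, for example in how ties, unjudged documents, or graded relevance are handled.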
    

Package structure

  • analyzer
    • MyStopAndStemmingAnalyzer.java -- stopword removal and stemming
  • common
    • ByWeightComparator.java -- numerical comparator
    • MyQQParser.java -- simplistic quality query parser
    • MyTrecParser.java -- TREC document analyzer
    • QualityStats.java -- computes the results (MAP, P@k and MRR) of a quality benchmark run for a single query or for a set of queries
    • StaTools.java -- implementation of some basic statistical functions
  • datasets -- directory containing the data collections
  • index
    • IndexTREC.java -- create index for data collections
  • indices -- directory containing the index files of each data collection
  • lib -- directory containing all the *.jar packages used by the project
  • models -- directory containing all the retrieval models (i.e. BM25, DLM, LL, LLPRF, LLEXPStar (LL+EXP*), PRoc2, PRoc3 and ProxLogPRF)
  • query-judge -- directory containing the query topics and relevance judgments (e.g. qrels.WT2G)
  • result -- directory containing the experimental results
  • stopwords.txt -- stopwords used in our experiments
  • trecEval.py -- evaluation metrics
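
For readers unfamiliar with the `-k1` and `-b` arguments passed to BM25 in Step 2, the following is a minimal sketch of textbook Okapi BM25 scoring over tokenized documents. It is not the code in the models directory: the function name `bm25_scores` is hypothetical, the non-negative IDF form used here is one common variant chosen as an assumption, and the repository's implementation may differ in its IDF and normalization details.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.35):
    """Score each tokenized document in `docs` against `query_terms`.

    k1 controls term-frequency saturation; b controls how strongly the
    score is normalized by document length (b=0 disables normalization).
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # term absent from this document
            # Assumed IDF variant: always non-negative.
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    # Toy documents for illustration only.
    docs = [
        ["proximity", "feedback", "model"],
        ["feedback", "feedback", "log"],
        ["web", "noise"],
    ]
    print(bm25_scores(["proximity", "feedback"], docs))
```

With the values used in Step 2 (`k1 = 1.2`, `b = 0.35`), document length has a moderate effect on the score: between two documents with the same term frequency, the shorter one is ranked higher.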

Data collections

We tested the baselines, state-of-the-art proximity-based PRF models and our model variants on eight standard TREC collections: AP (Associated Press 1988-90), DISK1&2, DISK4&5, ROBUST04 (TREC Robust Track 2004), WSJ (Wall Street Journal), WT2G (TREC Web Track 2000), WT10G (TREC Web Track 2001-2002) and GOV2. AP, DISK1&2, DISK4&5, ROBUST04 and WSJ are popular newswire collections in which noise is rare, whereas WT2G, WT10G and GOV2 consist of web documents with inherent noise.

Acknowledgments

This research is supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, the York Research Chairs (YRC) program, an NSERC CREATE award and an ORF-RE (Ontario Research Fund Research Excellence) award in BRAIN Alliance.
