castorini / anserini
Anserini is a Lucene toolkit for reproducible information retrieval research
Home Page: http://anserini.io/
License: Apache License 2.0
For the Twitter NRTS demo, we should probably use a proper templating system like mustache to avoid the System.out.println nightmare: https://github.com/spullara/mustache.java
We probably need some relevance feedback model... RM3 is probably our best bet.
@LuchenTan @xeniaqian94 Let's start with a simple two-feature LTR implementation for Tweets:
Let's build a LTR implementation that just has two features: RM3 score + number of hashtags. Inside your new reranker, you already have the RM3 score; use getField on the document to pull out the text, and then just count the number of hashtags. Print out a line like this:
1 325263 0.432 3
Topic 1, docid 325263, RM3 score of 0.432, 3 hashtags. Dump this information for all docs.
You'll need to take this file and join it with qrels to get the relevance judgments (i.e., write a simple Python script to do it). So you'll end up with a file like:
1 325263 0.432 3 1
The final column is the relevance judgment. Now you can run learning to rank using http://sourceforge.net/p/lemur/wiki/RankLib/
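The join itself is simple: load the qrels into a map keyed by (topic, docid), then append the judgment to each feature line, defaulting unjudged documents to 0. A minimal sketch of that logic (in Java rather than the suggested Python script; the class and method names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: join dumped feature lines with TREC qrels.
class JoinQrels {
  // Qrels line: "topic iteration docid judgment".
  static Map<String, Integer> parseQrels(Iterable<String> lines) {
    Map<String, Integer> qrels = new HashMap<>();
    for (String line : lines) {
      String[] q = line.trim().split("\\s+");
      qrels.put(q[0] + ":" + q[2], Integer.parseInt(q[3]));
    }
    return qrels;
  }

  // Feature line: "topic docid score hashtags"; appends the judgment column.
  static String join(String featureLine, Map<String, Integer> qrels) {
    String[] f = featureLine.trim().split("\\s+");
    int judgment = qrels.getOrDefault(f[0] + ":" + f[1], 0);  // unjudged -> 0
    return featureLine + " " + judgment;
  }
}
```

The same few lines translate directly to Python if that's more convenient for a one-off script.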
We should probably merge all segments into one, perhaps with a -optimize flag.
@xeniaqian94 It would be great for you to get some experience running end-to-end ad hoc experiments, which is a core activity of IR research. Let's start with something simple, like playing with different analyzers - currently, the tweet indexing uses PorterStemFilter. Try removing it and see what the effect is.
It would be nice to also know the effect of indexing only English tweets, using the same procedure as above.
In selective search, the document collection is divided into different partitions (e.g., by clustering). Write an indexer that takes a cluster mapping (docid to clusterid mapping) and builds the right indexes - i.e., puts the documents in the appropriate partition index.
@iorixxx Please check out my branch cw09b-refactoring
I've pulled in your edits and started putting classes in the "right" package hierarchy, following the general layout of Lucene's package hierarchy. Can you please take a look at the Args class in IndexClueWeb09b (we should just be using commons-cli), and in general, make the logging, cmdline options, etc. consistent? Thanks!
Get basic indexing/retrieval working on TREC Microblog track data from 2011 to 2014. Let's start with TREC 2011 and TREC 2012 microblog data since the corpus is smaller...
The CACM collection is small enough that we can include it in the repository... so we can have indexing/retrieval experiments completely integrated in with the system.
Specifically:
sh target/appassembler/bin/DumpTweetsLtrData -index tweets2011-index/ -topics src/main/resources/topics-and-qrels/topics.microblog2011.txt -output ltr.data.txt -qrels src/main/resources/topics-and-qrels/qrels.microblog2011.txt -ql
We should be dividing by the length of the vector here:
https://github.com/lintool/Anserini/blob/master/src/main/java/io/anserini/rerank/rm3/FeatureVector.java#L167
Since RM3 uses this, RM3 results are probably wrong.
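If the intended normalization is division by the vector's L2 norm (its Euclidean length), the computation looks like the following. This is a standalone sketch of the math, not Anserini's actual FeatureVector API:

```java
// Sketch: L2 normalization of a weight vector (standalone, not the real FeatureVector).
class VectorNorm {
  static double l2Norm(double[] weights) {
    double sumSq = 0.0;
    for (double w : weights) sumSq += w * w;
    return Math.sqrt(sumSq);  // Euclidean length of the vector
  }

  static double[] l2Normalize(double[] weights) {
    double norm = l2Norm(weights);
    double[] out = new double[weights.length];
    for (int i = 0; i < weights.length; i++) {
      out[i] = weights[i] / norm;  // divide each weight by the length
    }
    return out;
  }
}
```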
Luchen is working on a Twitter tokenizer.
Write an indexer that indexes tweets into multiple partitions - simple round robin strategy would be a reasonable start.
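The routing logic for round robin is trivial: the indexer opens one IndexWriter per partition and cycles through them. A minimal sketch of just the partition assignment (class name hypothetical):

```java
// Hypothetical sketch: assign each incoming document to a partition, round robin.
class RoundRobinPartitioner {
  private final int numPartitions;
  private long counter = 0;

  RoundRobinPartitioner(int numPartitions) {
    this.numPartitions = numPartitions;
  }

  // Returns the partition index the next document should be written to.
  int nextPartition() {
    return (int) (counter++ % numPartitions);
  }
}
```

A cluster-based partitioner (for the selective search issue above) would replace the counter with a docid-to-clusterid lookup.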
@claclark The feature data that the LTR module generates needs to have access to the qrels so the relevance grade can be folded directly into the output.
Let's implement baselines for ClueWeb09b, and then push it to all of ClueWeb09.
Take advantage of classes for frequency distributions in Lintools: https://github.com/lintool/tools
Before we get too far into hacking on Anserini, we should probably decide on how we want to deal with comments.
Do we want to do Javadoc? Something else?
We're currently using QueryParser to parse TREC topics, which means that symbols in the topics like parentheses and quotes get interpreted as query operators... this isn't the desired behavior.
Quite impressively, I was able to index all of ClueWeb09 (English):
nohup sh target/appassembler/bin/IndexClueWeb09b \
-input /scratch1/collections/ClueWeb09.English/data/ \
-index lucene-index.cw09.cnt -threads 32 -optimize >& log.cw09.cnt.txt &
Took ~18 hours:
2015-10-16 07:51:04,775 INFO [main] index.IndexClueWeb09b (IndexClueWeb09b.java:298) - Total 503781465 documents indexed in 18:01:04
Index size (note: no positions):
$ du -h lucene-index.cw09.cnt/
254G lucene-index.cw09.cnt/
@yb1 You probably want to dump out the cleaned text in a simple text format, something like this:
URL1 document1 ....
URL2 document2 ...
And write an indexer for it. Look at IndexTweets.java and IndexWebCollection.java here:
https://github.com/lintool/Anserini/tree/master/src/main/java/io/anserini/index
The tweets indexer should be fairly easy to understand - it's single-threaded so it's slower. IndexWebCollection is multi-threaded and thus much faster. I would start with a single-threaded implementation. Call the class IndexPlainText or something like that.
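Assuming one document per line with the URL as the first token, parsing a record is just a split on the first space; IndexPlainText would wrap this in a loop that feeds a Lucene IndexWriter. A sketch (class name hypothetical):

```java
// Hypothetical sketch: parse one line of the "URL doc-text ..." format.
class PlainTextRecord {
  final String url;
  final String contents;

  PlainTextRecord(String line) {
    int split = line.indexOf(' ');  // URL is everything up to the first space
    if (split < 0) {                // degenerate line: URL only, no body
      this.url = line;
      this.contents = "";
    } else {
      this.url = line.substring(0, split);
      this.contents = line.substring(split + 1);
    }
  }
}
```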
We're currently being stupid and recreating a reranker cascade for every query. Don't do this. Create a cascade at the beginning and point at the context per query.
@claclark I assume you have no objections?
According to @iorixxx
EnglishAnalyzer: PorterStemmer is aggressive, and stop word removal would make certain queries ("the wall", "the current", "the sun", "to be or not to be") meaningless. I think analysis should be minimal.
We should play with different analyzers and evaluate impact on effectiveness.
Let's implement baselines for ClueWeb12-B13, and then push it to all of ClueWeb12.
For now everything is based on WARC-formatted records. We'll have other types of records too, e.g., TREC text, or maybe other types in the future. It would be better to have a base record class that everything else inherits from.
@claclark Ranklib is here: http://sourceforge.net/p/lemur/wiki/RankLib/
Massage our LTR data dumper to produce data files that can be directly read by Ranklib so we can have an end-to-end training/cross-validation pipeline.
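RankLib reads SVMlight-style lines of the form `<label> qid:<id> <featureid>:<value> ... # <comment>`. Converting a joined line from our dumper (topic, docid, RM3 score, hashtag count, judgment) is a one-liner; a sketch (class name hypothetical):

```java
// Hypothetical sketch: convert one joined line "topic docid score hashtags judgment"
// into RankLib's SVMlight-style input line.
class RankLibFormatter {
  static String toRankLib(String joinedLine) {
    String[] f = joinedLine.trim().split("\\s+");
    // label qid:topic 1:rm3score 2:hashtags # docid
    return f[4] + " qid:" + f[0] + " 1:" + f[2] + " 2:" + f[3] + " # " + f[1];
  }
}
```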
From @aroegies
If you tell me what fields are desired from:
https://github.com/trec-kba/streamcorpus/blob/master/if/streamcorpus-v0_3_0.thrift
then I should be able to quickly put together a script to re-crawl, format, and encode in JSON the documents.
Open questions: likely we just want to use the entire KBA dataset rather than the TST subset, but whatever.
Indexing Gov2 on streeling at UMD, I get 24899563 docs.
Luchen reports 24900602 docs indexing on hops.
Weird - some non-determinism in the multi-threading?
Not that important if we can replicate effective results on standard test collections, but worth noting.
It seems what we need is a generic document reranking interface: takes a document ranking and spits another document ranking back out. This would implement a standard multi-stage retrieval pipeline: e.g., BM25 (or QL) + 1st stage reranker + 2nd stage reranker, etc.
Like what we show in README.md for Gov2, ClueWeb09, and ClueWeb12, it would be good to have baselines available for Disks 1-5 and AQUAINT for reference.
We need to add query features from Macdonald et al., CIKM 2012
"On the Usefulness of Query Features for Learning to Rank"
http://www.dcs.gla.ac.uk/~craigm/publications/macdonald12queryf.pdf
We probably also want to implement some type of phrasal query model... SDM seems like a good start.
Instead of calling/invoking thread.start() manually, we can switch to ThreadPoolExecutor, which usually provides improved performance when executing large numbers of asynchronous tasks.
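A minimal sketch of the switch using the standard library's Executors factory (the task body here is a stand-in for whatever unit of indexing work each thread currently does):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: submit indexing tasks to a fixed-size pool instead of
// creating and starting threads by hand.
class PooledIndexing {
  static int runTasks(int numTasks, int numThreads) {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    AtomicInteger completed = new AtomicInteger();
    for (int i = 0; i < numTasks; i++) {
      pool.execute(completed::incrementAndGet);  // stand-in for indexing one file/segment
    }
    pool.shutdown();  // stop accepting tasks, let the queue drain
    try {
      pool.awaitTermination(1, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return completed.get();
  }
}
```

The pool amortizes thread creation across many tasks and bounds concurrency, which is the main win over manual thread.start() calls.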
For the Gov2 indexing class, we should probably add an option to force the user to specify whether they want a count or positional index, so as to reduce confusion.
Luchen needs this to build query relevance profiles.
@aroegies @xeniaqian94 Can you two coordinate on making this happen?
RTS mobile push broker: https://github.com/aroegies/trecrts-tools
It would make sense to create a RerankerCascade abstraction for running a whole sequence of rerankers. Something like:
RerankerCascade cascade = new RerankerCascade(context).add(foo).add(bar);
cascade.run(docs);
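One way to flesh that out: a Reranker interface plus a cascade that threads each stage's output into the next. This is a sketch guessing at the details behind the snippet above, not the actual Anserini API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed abstraction; signatures are guesses, not Anserini's API.
interface Reranker<T> {
  List<T> rerank(List<T> docs, Object context);
}

class RerankerCascade<T> {
  private final Object context;
  private final List<Reranker<T>> rerankers = new ArrayList<>();

  RerankerCascade(Object context) {
    this.context = context;
  }

  RerankerCascade<T> add(Reranker<T> reranker) {
    rerankers.add(reranker);
    return this;  // fluent style, as in the snippet above
  }

  // Each stage consumes the previous stage's ranking.
  List<T> run(List<T> docs) {
    List<T> current = docs;
    for (Reranker<T> r : rerankers) {
      current = r.rerank(current, context);
    }
    return current;
  }
}
```

Building the cascade once and passing a per-query context into run() also addresses the "recreating a reranker cascade for every query" issue above.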
Lucene query parser gives the following error if the query has wildcard characters in it:
'*' or '?' not allowed as first character in WildcardQuery
Ex: Cannot parse 'where is the Eldorado Casino in Reno ?': '*' or '?' not allowed as first character in WildcardQuery.
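One common fix is to escape query-syntax special characters before parsing; Lucene ships QueryParserBase.escape for exactly this. A dependency-free sketch of the same idea (not Lucene's implementation):

```java
// Sketch: backslash-escape Lucene query syntax characters so TREC topic text
// is treated literally. Lucene's QueryParserBase.escape does essentially this.
class QueryEscaper {
  private static final String SPECIAL = "+-!(){}[]^\"~*?:\\/&|";

  static String escape(String query) {
    StringBuilder sb = new StringBuilder();
    for (char c : query.toCharArray()) {
      if (SPECIAL.indexOf(c) >= 0) {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }
}
```

Escaping keeps QueryParser usable; the alternative is to bypass the parser entirely and build term queries from analyzed tokens.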
The TwitterStream class in twitter4j abstracts over the live Twitter stream. We need to find a way to mock the object so that we can "replay" a stream from previously-stored tweets.
The current implementation of TweetSearcherServer parses raw HTTP requests.
We should use a proper servlet container like Jetty:
http://www.eclipse.org/jetty/
Example of embedded servlet:
http://www.eclipse.org/jetty/documentation/9.1.4.v20140401/embedded-examples.html#embedded-minimal-servlet
Build interface for feature extractor shared across collections.
@xeniaqian94 take a look at this:
https://dev.twitter.com/web/embedded-tweets
In the NRTS demo search results, let's embed tweets for presentation. Check out this as an example:
http://lintool.github.io/JScene/search-demo.html
@LuchenTan IndexCounter code doesn't compile, so master is currently broken right now. It references the Args class, which has been removed. See IndexGov2 for an example of how to use args4j.
DumpDocids or something like that? @lintool @LuchenTan
Most LTR features are floats anyway, so we should switch to just returning an array of floats, rather than ints. Let's do it now, rather than mess around later.
@iorixxx Do you mind if we agree on code indentation being two spaces, just to be consistent? If so, can you please reformat your code? I'd rather you do it so we better retain history for git blame. Please send a pull request. Thanks!
Implement baselines for the Wt10g collection.
I have a bunch of code for indexing/searching Wikipedia:
https://github.com/lintool/wiki-tools
Should pull into this repo...
We should develop a generic mechanism to store and use Waterloo spam scores, PageRank, HITS, and other static priors.
@iorixxx Do you have some code to contribute along these lines?
Current implementation of RM3 works for Tweets... let's see if it works for Gov2.
Need to build a Gov2 index that stores doc vectors.