castorini / anserini
Anserini is a Lucene toolkit for reproducible information retrieval research
Home Page: http://anserini.io/
License: Apache License 2.0
For the Twitter NRTS demo, we should probably use a proper templating system like mustache to avoid the System.out.println nightmare: https://github.com/spullara/mustache.java
We probably need some relevance feedback model... RM3 is probably our best bet.
@LuchenTan @xeniaqian94 Let's start with a simple two-feature LTR implementation for Tweets:
Let's build a LTR implementation that just has two features: RM3 score + number of hashtags. Inside your new reranker, you already have the RM3 score; use getField on the document to pull out the text, and then just count the number of hashtags. Print out a line like this:
1 325263 0.432 3
Topic 1, docid 325263, RM3 score of 0.432, 3 hashtags. Dump this information for all docs.
You'll need to take this file and join it with qrels to get the relevance judgments (i.e., write a simple Python script to do it). So you'll end up with a file like:
1 325263 0.432 3 1
The final column is the relevance judgment. Now you can run learning to rank using http://sourceforge.net/p/lemur/wiki/RankLib/
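The join itself is simple: load the qrels into a map keyed by (topic, docid), then append the judgment to each feature line, defaulting unjudged documents to 0. A minimal sketch of that logic (in Java rather than the suggested Python script; the class and method names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: join dumped feature lines with TREC qrels.
class JoinQrels {
  // Qrels line: "topic iteration docid judgment".
  static Map<String, Integer> parseQrels(Iterable<String> lines) {
    Map<String, Integer> qrels = new HashMap<>();
    for (String line : lines) {
      String[] q = line.trim().split("\\s+");
      qrels.put(q[0] + ":" + q[2], Integer.parseInt(q[3]));
    }
    return qrels;
  }

  // Feature line: "topic docid score hashtags"; appends the judgment column.
  static String join(String featureLine, Map<String, Integer> qrels) {
    String[] f = featureLine.trim().split("\\s+");
    int judgment = qrels.getOrDefault(f[0] + ":" + f[1], 0);  // unjudged -> 0
    return featureLine + " " + judgment;
  }
}
```

The same few lines translate directly to Python if that's more convenient for a one-off script.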
We should probably merge all segments into one, perhaps with a -optimize flag.
@xeniaqian94 It would be great for you to get some experience running end-to-end ad hoc experiments, which is a core activity of IR research. Let's start with something simple, like playing with different analyzers - currently, the tweet indexing uses PorterStemFilter. Try removing it and see what the effect is.
It would be nice to also know the effect of indexing only English tweets, using the same procedure as above.
In selective search, the document collection is divided into different partitions (e.g., by clustering). Write an indexer that takes a cluster mapping (docid to clusterid mapping) and builds the right indexes - i.e., puts the documents in the appropriate partition index.
@iorixxx Please check out my branch cw09b-refactoring
I've pulled in your edits and started putting classes in the "right" package hierarchy, following the general layout of Lucene's package hierarchy. Can you please take a look at the Args class in IndexClueWeb09b (we should just be using commons-cli), and in general, make the logging, cmdline options, etc. consistent? Thanks!
Get basic indexing/retrieval working on TREC Microblog track data from 2011 to 2014. Let's start with TREC 2011 and TREC 2012 microblog data since the corpus is smaller...
The CACM collection is small enough that we can include it in the repository... so we can have indexing/retrieval experiments completely integrated in with the system.
Specifically:
sh target/appassembler/bin/DumpTweetsLtrData -index tweets2011-index/ -topics src/main/resources/topics-and-qrels/topics.microblog2011.txt -output ltr.data.txt -qrels src/main/resources/topics-and-qrels/qrels.microblog2011.txt -ql
We should be dividing by the length of the vector here:
https://github.com/lintool/Anserini/blob/master/src/main/java/io/anserini/rerank/rm3/FeatureVector.java#L167
Since RM3 uses this, RM3 results are probably wrong.
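If the intended normalization is division by the vector's L2 norm (its Euclidean length), the computation looks like the following. This is a standalone sketch of the math, not Anserini's actual FeatureVector API:

```java
// Sketch: L2 normalization of a weight vector (standalone, not the real FeatureVector).
class VectorNorm {
  static double l2Norm(double[] weights) {
    double sumSq = 0.0;
    for (double w : weights) sumSq += w * w;
    return Math.sqrt(sumSq);  // Euclidean length of the vector
  }

  static double[] l2Normalize(double[] weights) {
    double norm = l2Norm(weights);
    double[] out = new double[weights.length];
    for (int i = 0; i < weights.length; i++) {
      out[i] = weights[i] / norm;  // divide each weight by the length
    }
    return out;
  }
}
```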
Luchen is working on a Twitter tokenizer.
Write an indexer that indexes tweets into multiple partitions - simple round robin strategy would be a reasonable start.
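The routing logic for round robin is trivial: the indexer opens one IndexWriter per partition and cycles through them. A minimal sketch of just the partition assignment (class name hypothetical):

```java
// Hypothetical sketch: assign each incoming document to a partition, round robin.
class RoundRobinPartitioner {
  private final int numPartitions;
  private long counter = 0;

  RoundRobinPartitioner(int numPartitions) {
    this.numPartitions = numPartitions;
  }

  // Returns the partition index the next document should be written to.
  int nextPartition() {
    return (int) (counter++ % numPartitions);
  }
}
```

A cluster-based partitioner (for the selective search issue above) would replace the counter with a docid-to-clusterid lookup.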
@claclark The feature data that the LTR module generates needs to have access to the qrels so the relevance grade can be folded directly into the output.
Let's implement baselines for ClueWeb09b, and then push it to all of ClueWeb09.
Take advantage of classes for frequency distributions in Lintools: https://github.com/lintool/tools
Before we get too far into hacking on Anserini, we should probably decide on how we want to deal with comments.
Do we want to do Javadoc? Something else?
We're currently using QueryParser to parse TREC topics, which means that symbols in the topics like parentheses and quotes get interpreted as query operators... this isn't the desired behavior.
Quite impressively, I was able to index all of ClueWeb09 (English):
nohup sh target/appassembler/bin/IndexClueWeb09b \
-input /scratch1/collections/ClueWeb09.English/data/ \
-index lucene-index.cw09.cnt -threads 32 -optimize >& log.cw09.cnt.txt &
Took ~18 hours:
2015-10-16 07:51:04,775 INFO [main] index.IndexClueWeb09b (IndexClueWeb09b.java:298) - Total 503781465 documents indexed in 18:01:04
Index size (note: no positions):
$ du -h lucene-index.cw09.cnt/
254G lucene-index.cw09.cnt/
@yb1 You probably want to dump out the cleaned text in a simple text format, something like this:
URL1 document1 ....
URL2 document2 ...
And write an indexer for it. Look at IndexTweets.java and IndexWebCollection.java here:
https://github.com/lintool/Anserini/tree/master/src/main/java/io/anserini/index
The tweets indexer should be fairly easy to understand - it's single-threaded so it's slower. IndexWebCollection is multi-threaded and thus much faster. I would start with a single-threaded implementation. Call the class IndexPlainText or something like that.
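Assuming one document per line with the URL as the first token, parsing a record is just a split on the first space; IndexPlainText would wrap this in a loop that feeds a Lucene IndexWriter. A sketch (class name hypothetical):

```java
// Hypothetical sketch: parse one line of the "URL doc-text ..." format.
class PlainTextRecord {
  final String url;
  final String contents;

  PlainTextRecord(String line) {
    int split = line.indexOf(' ');  // URL is everything up to the first space
    if (split < 0) {                // degenerate line: URL only, no body
      this.url = line;
      this.contents = "";
    } else {
      this.url = line.substring(0, split);
      this.contents = line.substring(split + 1);
    }
  }
}
```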
We're currently being stupid and recreating a reranker cascade for every query. Don't do this. Create a cascade at the beginning and point at the context per query.
@claclark I assume you have no objections?
According to @iorixxx
EnglishAnalyzer: PorterStemmer is aggressive, and stop word removal would make certain queries ("the wall", "the current", "the sun", "to be or not to be") meaningless. I think analysis should be minimal.
We should play with different analyzers and evaluate impact on effectiveness.
Let's implement baselines for ClueWeb12-B13, and then push it to all of ClueWeb12.
For now everything is based on WARC-formatted records. We'll have other types of records too, e.g., TREC text, or maybe other types in the future. It would be better to have a base record class that everything else inherits from.
@claclark Ranklib is here: http://sourceforge.net/p/lemur/wiki/RankLib/
Massage our LTR data dumper to produce data files that can be directly read by Ranklib so we can have an end-to-end training/cross-validation pipeline.
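RankLib reads SVMlight-style lines of the form `<label> qid:<id> <featureid>:<value> ... # <comment>`. Converting a joined line from our dumper (topic, docid, RM3 score, hashtag count, judgment) is a one-liner; a sketch (class name hypothetical):

```java
// Hypothetical sketch: convert one joined line "topic docid score hashtags judgment"
// into RankLib's SVMlight-style input line.
class RankLibFormatter {
  static String toRankLib(String joinedLine) {
    String[] f = joinedLine.trim().split("\\s+");
    // label qid:topic 1:rm3score 2:hashtags # docid
    return f[4] + " qid:" + f[0] + " 1:" + f[2] + " 2:" + f[3] + " # " + f[1];
  }
}
```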
From @aroegies
If you tell me what fields are desired from:
https://github.com/trec-kba/streamcorpus/blob/master/if/streamcorpus-v0_3_0.thrift
then I should be able to quickly put together a script to re-crawl, format, and encode in JSON the documents.
Open questions: likely we just want to use the entire KBA dataset rather than the TST subset, but whatever.
Indexing Gov2 on streeling at UMD, I get 24899563 docs.
Luchen reports 24900602 docs indexing on hops.
Weird - some non-determinism in the multi-threading?
Not that important if we can replicate effective results on standard test collections, but worth noting.
It seems what we need is a generic document reranking interface: takes a document ranking and spits another document ranking back out. This would implement a standard multi-stage retrieval pipeline: e.g., BM25 (or QL) + 1st stage reranker + 2nd stage reranker, etc.
Like what we show in README.md for Gov2, ClueWeb09, and ClueWeb12, it would be good to have baselines available for Disks 1-5 and AQUAINT for reference.
We need to add query features from Macdonald et al., CIKM 2012
"On the Usefulness of Query Features for Learning to Rank"
http://www.dcs.gla.ac.uk/~craigm/publications/macdonald12queryf.pdf
We probably also want to implement some type of phrasal query model... SDM seems like a good start.
Instead of calling/invoking thread.start() manually, we can switch to ThreadPoolExecutor, which usually provides improved performance when executing large numbers of asynchronous tasks.
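A minimal sketch of the switch using the standard library's Executors factory (the task body here is a stand-in for whatever unit of indexing work each thread currently does):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: submit indexing tasks to a fixed-size pool instead of
// creating and starting threads by hand.
class PooledIndexing {
  static int runTasks(int numTasks, int numThreads) {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    AtomicInteger completed = new AtomicInteger();
    for (int i = 0; i < numTasks; i++) {
      pool.execute(completed::incrementAndGet);  // stand-in for indexing one file/segment
    }
    pool.shutdown();  // stop accepting tasks, let the queue drain
    try {
      pool.awaitTermination(1, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return completed.get();
  }
}
```

The pool amortizes thread creation across many tasks and bounds concurrency, which is the main win over manual thread.start() calls.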
For the Gov2 indexing class, we should probably add an option to force the user to specify whether they want a count or positional index, so as to reduce confusion.
Luchen needs this to build query relevance profiles.
@aroegies @xeniaqian94 Can you two coordinate on making this happen?
RTS mobile push broker: https://github.com/aroegies/trecrts-tools
It would make sense to create a RerankerCascade abstraction for running a whole sequence of rerankers. Something like:
RerankerCascade cascade = new RerankerCascade(context).add(foo).add(bar);
cascade.run(docs);
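One way to flesh that out: a Reranker interface plus a cascade that threads each stage's output into the next. This is a sketch guessing at the details behind the snippet above, not the actual Anserini API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed abstraction; signatures are guesses, not Anserini's API.
interface Reranker<T> {
  List<T> rerank(List<T> docs, Object context);
}

class RerankerCascade<T> {
  private final Object context;
  private final List<Reranker<T>> rerankers = new ArrayList<>();

  RerankerCascade(Object context) {
    this.context = context;
  }

  RerankerCascade<T> add(Reranker<T> reranker) {
    rerankers.add(reranker);
    return this;  // fluent style, as in the snippet above
  }

  // Each stage consumes the previous stage's ranking.
  List<T> run(List<T> docs) {
    List<T> current = docs;
    for (Reranker<T> r : rerankers) {
      current = r.rerank(current, context);
    }
    return current;
  }
}
```

Building the cascade once and passing a per-query context into run() also addresses the "recreating a reranker cascade for every query" issue above.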
Lucene query parser gives the following error if the query has wildcard characters in it:
'*' or '?' not allowed as first character in WildcardQuery
Ex: Cannot parse 'where is the Eldorado Casino in Reno ?': '*' or '?' not allowed as first character in WildcardQuery.
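One common fix is to escape query-syntax special characters before parsing; Lucene ships QueryParserBase.escape for exactly this. A dependency-free sketch of the same idea (not Lucene's implementation):

```java
// Sketch: backslash-escape Lucene query syntax characters so TREC topic text
// is treated literally. Lucene's QueryParserBase.escape does essentially this.
class QueryEscaper {
  private static final String SPECIAL = "+-!(){}[]^\"~*?:\\/&|";

  static String escape(String query) {
    StringBuilder sb = new StringBuilder();
    for (char c : query.toCharArray()) {
      if (SPECIAL.indexOf(c) >= 0) {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }
}
```

Escaping keeps QueryParser usable; the alternative is to bypass the parser entirely and build term queries from analyzed tokens.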
The TwitterStream class in twitter4j abstracts over the live Twitter stream. We need to find a way to mock the object so that we can "replay" a stream from previously-stored tweets.
The current implementation of TweetSearcherServer parses raw HTTP requests.
We should use a proper servlet container like Jetty:
http://www.eclipse.org/jetty/
Example of embedded servlet:
http://www.eclipse.org/jetty/documentation/9.1.4.v20140401/embedded-examples.html#embedded-minimal-servlet
Build interface for feature extractor shared across collections.
@xeniaqian94 take a look at this:
https://dev.twitter.com/web/embedded-tweets
In the NRTS demo search results, let's embed tweets for presentation. Check out this as an example:
http://lintool.github.io/JScene/search-demo.html
@LuchenTan IndexCounter code doesn't compile, so master is currently broken right now. It references the Args class, which has been removed. See IndexGov2 for an example of how to use args4j.
DumpDocids or something like that? @lintool @LuchenTan
Most LTR features are floats anyway, so we should switch to just returning an array of floats, rather than ints. Let's do it now, rather than mess around later.
@iorixxx Do you mind if we agree on code indentation being two spaces, just to be consistent? If so, can you please reformat your code? I'd rather you do it so we better retain history for git blame. Please send a pull request. Thanks!
Implement baselines for the Wt10g collection.
I have a bunch of code for indexing/searching Wikipedia:
https://github.com/lintool/wiki-tools
Should pull into this repo...
We should develop a generic mechanism to store and use Waterloo spam scores, PageRank, HITS, and other static priors.
@iorixxx Do you have some code to contribute along these lines?
Current implementation of RM3 works for Tweets... let's see if it works for Gov2.
Need to build a Gov2 index that stores doc vectors.