Comments (8)
I redid the Gov2 indexing on Hops at UWaterloo and got the same number of documents: 24900602 docs indexed.
from anserini.
1000 documents don't appear from nowhere
As Charlie suggested in email (and the sensible thing to do), we should dump out the docids and diff.
@LuchenTan so you can start learning the Lucene APIs, etc., can you write a simple program that does this?
Fork the repo, add in your code, and send a pull request. The ultimate product should be a command I can copy and paste to generate the output (documented in the README).
Thanks!
OK, I'll do it.
Update on this: as it turned out, there was a minor corruption in the copy of Gov2 at UMD. After fixing this and rerunning the indexer, I get 24900602 docs, which matches @LuchenTan's number.
Yes, exactly the same number as mine.
On Thu, Oct 15, 2015 at 2:02 PM, Jimmy Lin wrote:
Update on this: as it turned out, there was a minor corruption in the copy
of Gov2 at UMD. After fixing this and rerunning the indexer, I get 24900602
docs, which matches @LuchenTan's number.
Luchen Tan
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
Resolving issue... basically, no issue.
Merged in Luchen's code to dump out docids in an index:
commit 491dd9e
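For anyone repeating the check discussed in this thread, the "diff" half can be sketched roughly as below. This is not the code from the commit above. It assumes the docids of each index have already been dumped to a plain-text file, one docid per line, and the class name `DocidDiff` and its methods are hypothetical names chosen for illustration. It holds both lists in memory, which may need a generous heap for a collection the size of Gov2.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for comparing two docid dumps; not part of Anserini.
class DocidDiff {

    // Returns the docids that occur in `left` but not in `right`, in sorted order.
    static SortedSet<String> onlyIn(Collection<String> left, Collection<String> right) {
        SortedSet<String> result = new TreeSet<>(left);
        // Copy `right` into a HashSet so membership checks are constant time.
        result.removeAll(new HashSet<>(right));
        return result;
    }

    // Reads one docid per line, trimming whitespace and skipping blank lines.
    static List<String> readDocids(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.map(String::trim)
                        .filter(s -> !s.isEmpty())
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java DocidDiff <docids-a.txt> <docids-b.txt>");
            return;
        }
        List<String> a = readDocids(Paths.get(args[0]));
        List<String> b = readDocids(Paths.get(args[1]));

        // "<" marks docids found only in the first dump, ">" only in the second.
        for (String id : onlyIn(a, b)) {
            System.out.println("< " + id);
        }
        for (String id : onlyIn(b, a)) {
            System.out.println("> " + id);
        }
    }
}
```

If the two dumps come from identical collections, the program prints nothing. Any lines it does print point at the documents that one copy has and the other lacks.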