Comments (16)
This is probably the right way to implement this: https://lucene.apache.org/core/8_11_0/core/org/apache/lucene/document/FeatureField.html
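A minimal sketch of what that could look like, assuming the spam percentile is indexed as a static feature and folded into ranking at query time. The field/feature names (`features`, `spam`), the saturation scoring function, and all document contents here are illustrative assumptions, not anything settled in this thread:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SpamFeatureDemo {
  // hypothetical field/feature names; nothing in the thread fixes these
  static final String FEATURES = "features";
  static final String SPAM = "spam";

  static void add(IndexWriter w, String docid, String body, float spamPercentile)
      throws Exception {
    Document doc = new Document();
    doc.add(new StringField("id", docid, Field.Store.YES));
    doc.add(new TextField("body", body, Field.Store.NO));
    // FeatureField values must be positive, so clamp a 0 percentile to 1
    doc.add(new FeatureField(FEATURES, SPAM, Math.max(1f, spamPercentile)));
    w.addDocument(doc);
  }

  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      add(w, "spammy", "hello world", 5f);  // low percentile = spammy
      add(w, "clean", "hello world", 95f);  // high percentile = ham
    }
    try (DirectoryReader r = DirectoryReader.open(dir)) {
      IndexSearcher s = new IndexSearcher(r);
      // text relevance MUST match; the spam feature only adjusts the score,
      // so no hard cutoff is needed anywhere
      Query q = new BooleanQuery.Builder()
          .add(new TermQuery(new Term("body", "hello")), BooleanClause.Occur.MUST)
          .add(FeatureField.newSaturationQuery(FEATURES, SPAM), BooleanClause.Occur.SHOULD)
          .build();
      for (ScoreDoc sd : s.search(q, 10).scoreDocs)
        System.out.println(s.doc(sd.doc).get("id"));
    }
  }
}
```

With identical body text, the document with the higher feature value should rank first, which is the "spam as a feature, no cutoff" behavior discussed below.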
from anserini.
Here is what I do for the spam rankings: I split the huge (15 GB) spamFusion file into chunks, and these chunks (of spam scores) are saved into a directory structure identical to the ClueWeb09B *.warc directory structure.
To perform this operation, I rely on Voldemort to store the (docID, spamScore) pairs of category B.
For category A, however, Voldemort could serve as the only source during indexing; there would be no need to dump the data into files in a directory structure identical to CW09B.
What do you think about storing spam scores in a key-value database such as Voldemort, for fast retrieval during indexing and/or searching?
Is this feasible, or generic enough?
Hrm... that's pretty heavyweight and requires an external dependency. I suppose for catB everything can fit in memory. Perhaps we can assume the same for catA? 500m * ( 2 bytes for value + 4 bytes for key) = 30 GB... reasonable on a server?
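The 2+4-byte figure counts only the payload; as a Java String, each docid key also carries object and array headers, UTF-16 chars, and hash-table slot overhead. A rough back-of-envelope sketch, where every per-object cost is an assumed round number (64-bit JVM, pre-compact-strings layout), not a measurement:

```java
public class SpamMapMemoryEstimate {
  // Assumed JVM costs; real numbers vary by JVM version and flags.
  static final long STRING_HEADER = 16;  // String object header + fields (approx.)
  static final long ARRAY_HEADER = 16;   // backing char[] header (approx.)
  static final long CHAR_BYTES = 2;      // UTF-16 char

  /** Approximate heap bytes for one docid key like "clueweb09-en0000-00-35369". */
  static long bytesPerKey(int docidLength) {
    long chars = ARRAY_HEADER + CHAR_BYTES * docidLength;  // backing char[]
    return STRING_HEADER + 8 /* reference to the char[] */ + chars;
  }

  public static void main(String[] args) {
    long entries = 500_000_000L;
    int docidLen = "clueweb09-en0000-00-35369".length();  // 25 chars
    // key object + int value + assumed ~12 bytes of hash-table slot overhead
    long perEntry = bytesPerKey(docidLen) + 4 + 12;
    System.out.printf("~%.0f GB for %d entries%n", entries * perEntry / 1e9, entries);
  }
}
```

Under these assumptions the total lands in the tens of gigabytes per 500M entries, i.e. several times the raw payload size.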
My ClueWeb09B_SpamFusion directory (containing the chunks) is 1.4 GB in size. The indexer loads a single chunk file per warc file, so memory won't be a problem. But preparing these chunks (aligned with the warc files) is heavyweight; I again rely on Voldemort for producing the chunks.
By "chunk" here I mean a miniature spam-ranking file for a single warc file.
It looks like we can resolve the warc folder path for a given docid deterministically,
e.g. docid = clueweb09-en0000-00-35369
path = ClueWeb09_English_1/en0000/00.warc.gz
Then we could create the miniature fusion files from clueweb09spam.Fusion directly.
The spam scores would be used for skipping documents (given a threshold) during indexing?
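A sketch of that deterministic resolution. Splitting the docid into its segment and file parts follows the example above; the top-level disk directory is hard-wired to ClueWeb09_English_1 as a simplifying assumption, since the real mapping from segment number to disk is not spelled out in this thread:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WarcPathResolver {
  // docids look like: clueweb09-en0000-00-35369
  private static final Pattern DOCID =
      Pattern.compile("clueweb09-(en\\d{4})-(\\d{2})-\\d+");

  /**
   * Resolve the warc file path for a docid. A full implementation would map
   * the segment (en0000, en0001, ...) onto the right ClueWeb09_English_N disk;
   * here the first disk is assumed.
   */
  static String resolve(String docid) {
    Matcher m = DOCID.matcher(docid);
    if (!m.matches())
      throw new IllegalArgumentException("not a ClueWeb09 docid: " + docid);
    String segment = m.group(1);  // e.g. en0000
    String file = m.group(2);     // e.g. 00
    return "ClueWeb09_English_1/" + segment + "/" + file + ".warc.gz";
  }

  public static void main(String[] args) {
    System.out.println(resolve("clueweb09-en0000-00-35369"));
    // prints: ClueWeb09_English_1/en0000/00.warc.gz
  }
}
```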
I'd rather index everything and use spam as a feature during retrieval. That way we don't need to develop a cutoff.
Aha, I see. So you just want to percolate the result list?
Then we need the ability to query an arbitrary document id,
and I cannot think of a solution without a key-value database or something similar.
How about indexing the spam rankings with Lucene, for arbitrary lookup?
Just a big hashmap we load into memory at startup? Using fastutil, for example?
Let me try fastutil tomorrow. If it doesn't blow the memory, that would be the best solution.
I played with Object2IntOpenHashMap<String>;
however, the following program, run with java -server -Xmx20g,
resulted in an out-of-memory error. I think that even if we don't insert into a map, just sequentially traversing this big file will take time. What is the preferred course of action here?
import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Try to load the clueweb09spam.Fusion (15 GB) file into memory.
 *
 * @param clueweb09spam spam file name
 * @throws IOException if the file cannot be read
 */
public static void loadSpamFusion(String clueweb09spam) throws IOException {
  Path clueweb09spamFusion = Paths.get(clueweb09spam);
  if (!Files.isRegularFile(clueweb09spamFusion) || !Files.isReadable(clueweb09spamFusion))
    throw new IllegalArgumentException(clueweb09spamFusion + " does not exist or is not a readable file");

  Object2IntOpenHashMap<String> map = new Object2IntOpenHashMap<>();
  try (BufferedReader reader = Files.newBufferedReader(clueweb09spamFusion, StandardCharsets.US_ASCII)) {
    String line;
    while ((line = reader.readLine()) != null) {
      // lines have the format: percentile-score clueweb-docid
      String[] parts = line.split("\\s+");
      map.put(parts[1], Integer.parseInt(parts[0]));
    }
  }
  System.out.println(map.size() + " entries loaded into the map");
  map.clear();
}
How much memory do you have on your machine?
The machine I use at UMD has 0.75 TB RAM :)
I have 64 GB :) Is there a maximum -Xmx value we should aim for here?
Can you try the loading code? I wonder how much heap it will take.
Try using max heap?
With an 80 GB heap, 503,903,810 entries were loaded into the map in 00:42:27. If you think this resource usage is reasonable, I can replace Voldemort with a fastutil map in the code that percolates the TREC submission file.
I found a better data structure for the task: ReferenceOpenHashSet<String>.
I am abandoning Voldemort myself too. The program will take three arguments: a spam threshold, a submission file, and the Waterloo spam scores file/folder. It will then remove the spammiest documents from the submission file. Does this sound reasonable?
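A sketch of that three-argument filter, using only a plain stdlib HashMap rather than the fastutil/Voldemort options discussed above. The spam-file layout (percentile-score then docid) follows the format quoted earlier in the thread, while the TREC run-line layout (docid in the third whitespace-separated field) and the class/method names are assumptions:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class SpamFilterRun {
  /** Load "percentile-score clueweb-docid" lines into a map. */
  static Map<String, Integer> loadScores(Path spamFile) throws IOException {
    Map<String, Integer> scores = new HashMap<>();
    try (Stream<String> lines = Files.lines(spamFile, StandardCharsets.US_ASCII)) {
      lines.forEach(line -> {
        String[] parts = line.split("\\s+");
        scores.put(parts[1], Integer.parseInt(parts[0]));
      });
    }
    return scores;
  }

  /** Emit only run lines whose docid meets the spam-percentile threshold. */
  static void filter(int threshold, Path runFile, Map<String, Integer> scores,
                     Appendable out) throws IOException {
    try (Stream<String> lines = Files.lines(runFile, StandardCharsets.US_ASCII)) {
      for (String line : (Iterable<String>) lines::iterator) {
        String docid = line.split("\\s+")[2];  // topic Q0 docid rank score tag
        // docids without a score are kept; low-percentile (spammy) ones dropped
        if (scores.getOrDefault(docid, 100) >= threshold)
          out.append(line).append('\n');
      }
    }
  }

  public static void main(String[] args) throws IOException {
    int threshold = Integer.parseInt(args[0]);
    Map<String, Integer> scores = loadScores(Paths.get(args[2]));
    filter(threshold, Paths.get(args[1]), scores, System.out);
  }
}
```

One detail this sketch glosses over: after dropping lines, the surviving ranks are no longer contiguous, so a real tool would also renumber the rank column per topic.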
Hi @iorixxx sorry for the late reply - was at TREC and starting to dig out of a backlog. Yes, this seems reasonable!