We should develop a generic mechanism to store and use Waterloo spam scores, PageRank,

Integrate Waterloo spam scores and other static priors into index about anserini HOT 16 CLOSED

castorini commented on May 27, 2024

Integrate Waterloo spam scores and other static priors into index

from anserini.

Comments (16)

lintool commented on May 27, 2024 1

This is probably the right way to implement this: https://lucene.apache.org/core/8_11_0/core/org/apache/lucene/document/FeatureField.html

iorixxx commented on May 27, 2024

Here is what I do for the spam rankings: I split the huge (15 GB) spamFusion file into chunks and save these chunks (spam scores) into a directory structure identical to the ClueWeb09B *.warc directory structure.

To do this, I rely on Voldemort to store (docID, spamScore) pairs for category B.

For category A, however, Voldemort can be the only source during indexing, so there is no need to dump the data into files in a directory structure identical to CW09B's.

What do you think about storing spam scores in a key-value database such as Voldemort,
for fast retrieval during indexing and/or searching?

Is this feasible? Is it generic enough?

lintool commented on May 27, 2024

Hrm... that's pretty heavyweight and requires an external dependency. I suppose for catB everything can fit in memory. Perhaps we can assume the same for catA? 500m * (2 bytes for value + 4 bytes for key) = 3 GB... reasonable on a server?
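
As a quick check on this estimate: 500 million entries at 6 bytes each come to roughly 3 GB of raw payload. The sketch below reproduces that arithmetic and contrasts it with keeping each docid as 25 ASCII characters (the length of an id such as `clueweb09-en0000-00-35369`). The class and method names are hypothetical, and the figures deliberately ignore JVM object and hash-table overhead, which can dominate when keys are `String` objects.

```java
final class SpamScoreMemoryEstimate {

    /** Raw payload in bytes for a number of entries with fixed-width keys and values. */
    static long rawBytes(long entries, int keyBytes, int valueBytes) {
        return entries * (keyBytes + valueBytes);
    }

    public static void main(String[] args) {
        long entries = 500_000_000L; // approximate size of ClueWeb09 category A

        // Packed layout from the estimate: 4-byte key, 2-byte value.
        long packed = rawBytes(entries, 4, 2);

        // Assumption: docids kept as 25 ASCII characters, one byte each.
        long asciiKeys = rawBytes(entries, 25, 2);

        System.out.println("packed keys (bytes): " + packed);
        System.out.println("ASCII keys  (bytes): " + asciiKeys);
    }
}
```

The packed figure only holds if docids can be mapped to 4-byte integers; a map keyed by `String` pays per-object overhead on top of the raw characters.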

iorixxx commented on May 27, 2024

My ClueWeb09B_SpamFusion directory (which contains the chunks) is 1.4 GB in size. The indexer loads a single chunk file per warc file, so memory won't be a problem. But preparing these chunks (aligned with the warc files) is heavy. I again rely on Voldemort for producing the chunks.

By "chunk" I mean a miniature spam ranking file for a single warc file.

iorixxx commented on May 27, 2024

It looks like we can resolve the warc file path for a given docid deterministically.
e.g. docid = clueweb09-en0000-00-35369
path = ClueWeb09_English_1/en0000/00.warc.gz

Then we can create the miniature fusion files directly from clueweb09spam.Fusion?
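
A minimal sketch of that docid-to-path resolution, assuming docids always have the four-part form `clueweb09-<segment>-<file>-<record>` shown in the example. The class name `ClueWebPathResolver` is hypothetical, and the top-level disk directory (here `ClueWeb09_English_1`) is passed in as a parameter because the docid alone does not name it.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class ClueWebPathResolver {

    /**
     * Maps a docid such as clueweb09-en0000-00-35369 to
     * <diskRoot>/en0000/00.warc.gz.
     */
    static Path resolve(Path diskRoot, String docid) {
        String[] parts = docid.split("-");
        if (parts.length != 4 || !parts[0].equals("clueweb09")) {
            throw new IllegalArgumentException("Unexpected docid format: " + docid);
        }
        // parts[1] is the segment directory, parts[2] the warc file number.
        return diskRoot.resolve(parts[1]).resolve(parts[2] + ".warc.gz");
    }

    public static void main(String[] args) {
        Path root = Paths.get("ClueWeb09_English_1");
        System.out.println(resolve(root, "clueweb09-en0000-00-35369"));
    }
}
```

Grouping the lines of the spam file by the resolved path would then yield one miniature score file per warc file.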

Will the spam scores be used to skip documents (below a given threshold) during indexing?

lintool commented on May 27, 2024

I'd rather index everything and use spam as a feature during retrieval. That way we don't need to develop a cutoff.

iorixxx commented on May 27, 2024

Aha, I see. So you just want to percolate the result list?
Then we need the ability to look up arbitrary document ids.
I cannot think of a solution without a key-value database or something similar.
How about indexing the spam rankings with Lucene, for arbitrary lookup?

lintool commented on May 27, 2024

Just a big hashmap we load into memory at startup? Using fastutil, for example?

iorixxx commented on May 27, 2024

Let me try fastutil tomorrow. If it does not blow up the memory, that would be the best solution.

iorixxx commented on May 27, 2024

I played with Object2IntOpenHashMap<String>; however, running the following program with java -server -Xmx20g resulted in an out-of-memory error. I think that even if we don't insert into a map, just sequentially traversing this big file will take time. What is the preferred course of action here?

import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Try to load the clueweb09spam.Fusion (15 GB) file into memory.
 *
 * @param clueweb09spam spam file name
 * @throws IOException if the file cannot be read
 */
public static void loadSpamFusion(String clueweb09spam) throws IOException {

    Object2IntOpenHashMap<String> map = new Object2IntOpenHashMap<>();

    Path clueweb09spamFusion = Paths.get(clueweb09spam);

    // isRegularFile() already implies that the file exists
    if (!Files.isRegularFile(clueweb09spamFusion) || !Files.isReadable(clueweb09spamFusion))
        throw new IllegalArgumentException(clueweb09spamFusion + " does not exist, is not a regular file, or is not readable");

    try (BufferedReader reader = Files.newBufferedReader(clueweb09spamFusion, StandardCharsets.US_ASCII)) {

        String line;
        while ((line = reader.readLine()) != null) {
            // lines have the following format: percentile-score clueweb-docid
            String[] parts = line.split("\\s+");
            map.put(parts[1], Integer.parseInt(parts[0]));
        }
    }

    System.out.println(map.size() + " entries loaded into the map");
    map.clear();
}

lintool commented on May 27, 2024

How much memory do you have on your machine?
The machine I use at UMD has 0.75 TB RAM :)

iorixxx commented on May 27, 2024

I have 64 GB :) Is there a maximum -Xmx value we should aim for here?
Can you try the loading code? I wonder how much heap it will take.

lintool commented on May 27, 2024

Try using max heap?

iorixxx commented on May 27, 2024

With 80 GB of heap, 503903810 entries were loaded into the map in 00:42:27. If you think this resource requirement is reasonable, I can replace Voldemort with a fastutil map in the code that percolates the TREC submission file.

iorixxx commented on May 27, 2024

I found a better data structure for the task: ReferenceOpenHashSet<String>. I am abandoning Voldemort for myself too. The program will take three arguments: spam threshold, submission file, and Waterloo spam scores file/folder. It will then remove the spammiest documents from the submission file. Does this sound reasonable?
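
The filtering step described here could look roughly like the sketch below. It is not the program from this thread: it uses a plain `java.util.HashSet` rather than fastutil, works on in-memory lists instead of streaming files, and the names `SpamFilter`, `spamDocids`, and `filterRun` are hypothetical. It assumes score lines have the `percentile-score clueweb-docid` format mentioned earlier, that a lower percentile means a spammier document, and that the submission file uses the usual six-column TREC run layout with the docid in the third column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SpamFilter {

    /** Collects docids whose percentile score is strictly below the threshold. */
    static Set<String> spamDocids(List<String> scoreLines, int threshold) {
        Set<String> spam = new HashSet<>();
        for (String line : scoreLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) continue; // skip blank or malformed lines
            if (Integer.parseInt(parts[0]) < threshold) spam.add(parts[1]);
        }
        return spam;
    }

    /** Keeps every run line whose docid (third column) is not in the spam set. */
    static List<String> filterRun(List<String> runLines, Set<String> spam) {
        List<String> kept = new ArrayList<>();
        for (String line : runLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 3 && spam.contains(parts[2])) continue;
            kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> scores = Arrays.asList(
                "10 clueweb09-en0000-00-00001",
                "95 clueweb09-en0000-00-00002");
        List<String> run = Arrays.asList(
                "1 Q0 clueweb09-en0000-00-00001 1 9.5 demo",
                "1 Q0 clueweb09-en0000-00-00002 2 9.1 demo");
        filterRun(run, spamDocids(scores, 70)).forEach(System.out::println);
    }
}
```

Only docids below the threshold are held in memory, which keeps the set much smaller than a map of all scores. The sketch leaves the rank column untouched, so ranks in the filtered run may contain gaps.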

lintool commented on May 27, 2024

Hi @iorixxx sorry for the late reply - was at TREC and starting to dig out of a backlog. Yes, this seems reasonable!
