
Comments (16)

lintool commented on May 27, 2024

This is probably the right way to implement this: https://lucene.apache.org/core/8_11_0/core/org/apache/lucene/document/FeatureField.html

from anserini.

iorixxx commented on May 27, 2024

Here is what I do for the spam rankings: I split the huge (15 GB) spamFusion file into chunks and save these chunks (spam scores) in a directory structure identical to the ClueWeb09B *.warc directory structure.

To do this, I rely on voldemort to store the (docID, spamScore) pairs of category B.

However, when dealing with category A, voldemort can be the only source during indexing, with no need to dump data into files in a directory structure identical to CW09B.

What do you think about storing spam scores in a key-value database such as voldemort, for fast retrieval during indexing and/or searching?

Is this feasible? Or generic enough?


lintool commented on May 27, 2024

Hrm... that's pretty heavyweight and requires an external dependency. I suppose for catB everything can fit in memory. Perhaps we can assume the same for catA? 500m * (2 bytes for value + 4 bytes for key) = 3 GB... reasonable on a server?
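Working through that back-of-the-envelope estimate: 500 million entries at 6 bytes each works out to roughly 3 GB of raw payload. The sketch below only reproduces the arithmetic; the class and method names are hypothetical. It assumes docids have already been mapped to 4-byte integers and ignores hash-table overhead. Keeping the docids as Java strings would need considerably more memory.

```java
final class SpamMapMemoryEstimate {

    /** Raw payload size in bytes for the given number of entries, ignoring hash-table overhead. */
    static long payloadBytes(long entries, int keyBytes, int valueBytes) {
        return entries * (keyBytes + valueBytes);
    }

    public static void main(String[] args) {
        // Assumption: 4-byte integer keys and 2-byte scores, as in the estimate above.
        long bytes = payloadBytes(500_000_000L, 4, 2);
        System.out.println(bytes + " bytes");
    }
}
```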


iorixxx commented on May 27, 2024

My ClueWeb09B_SpamFusion directory (which contains the chunks) is 1.4 GB in size. The indexer loads a single chunk file per warc file, so memory won't be a problem. But preparing these chunks (aligned with the warc files) is heavy. Again, I rely on voldemort to produce the chunks.

By chunk, I mean a miniature spam ranking file for a single warc file.


iorixxx commented on May 27, 2024

It looks like we can resolve the warc file path for a given docid deterministically, e.g.
docid = clueweb09-en0000-00-35369
path = ClueWeb09_English_1/en0000/00.warc.gz

Then we can create miniature fusion files from clueweb09spam.Fusion directly?
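The docid-to-path mapping in the example above could be sketched as follows. This is a minimal illustration rather than code from the thread: the class and method names are hypothetical, and because the docid alone does not say which top-level directory (such as ClueWeb09_English_1) holds a segment, the sketch takes that directory as a parameter.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class ClueWebPaths {

    /**
     * Resolves the warc file for a docid of the form clueweb09-<segment>-<file>-<record>.
     * The top-level collection directory is supplied by the caller (an assumption of this sketch).
     */
    static Path warcPathFor(Path collectionPart, String docid) {
        String[] parts = docid.split("-");
        if (parts.length != 4 || !parts[0].equals("clueweb09")) {
            throw new IllegalArgumentException("Unexpected docid format: " + docid);
        }
        // parts[1] is the segment directory (e.g. en0000), parts[2] the warc file number (e.g. 00).
        return collectionPart.resolve(parts[1]).resolve(parts[2] + ".warc.gz");
    }

    public static void main(String[] args) {
        Path path = warcPathFor(Paths.get("ClueWeb09_English_1"), "clueweb09-en0000-00-35369");
        System.out.println(path);
    }
}
```

Grouping the lines of the spam file by the resolved path would then give one miniature file per warc file.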

Will spam scores be used for skipping documents (given a threshold) during indexing?


lintool commented on May 27, 2024

I'd rather index everything and use spam as a feature during retrieval. That way we don't need to develop a cutoff.


iorixxx commented on May 27, 2024

Aha, I see. So you just want to percolate the result list?
Then we need the ability to look up an arbitrary document id.
I cannot think of a solution without a key-value database or something similar.
How about indexing the spam rankings with Lucene, for arbitrary lookup?


lintool commented on May 27, 2024

Just a big hashmap we load into memory at startup? Using fastutil, for example?


iorixxx commented on May 27, 2024

Let me try fastutil tomorrow. If it does not exhaust memory, that would be the best solution.


iorixxx commented on May 27, 2024

I played with Object2IntOpenHashMap<String>; however, running the following program with java -server -Xmx20g resulted in an out-of-memory error. I think even if we don't insert into a map, just sequentially traversing this big file will take time. What is the preferred course of action here?

import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Tries to load the clueweb09spam.Fusion (15 GB) file into memory.
 *
 * @param clueweb09spam path to the spam file
 * @throws IOException if the file cannot be read
 */
public static void loadSpamFusion(String clueweb09spam) throws IOException {

    Object2IntOpenHashMap<String> map = new Object2IntOpenHashMap<>();

    Path clueweb09spamFusion = Paths.get(clueweb09spam);

    // isRegularFile already returns false for a missing path, so no separate exists() check is needed
    if (!Files.isRegularFile(clueweb09spamFusion) || !Files.isReadable(clueweb09spamFusion))
        throw new IllegalArgumentException(clueweb09spamFusion + " does not exist, is not a regular file, or is not readable");

    try (BufferedReader reader = Files.newBufferedReader(clueweb09spamFusion, StandardCharsets.US_ASCII)) {

        String line;
        while ((line = reader.readLine()) != null) {
            // each line has the format: percentile-score clueweb-docid
            String[] parts = line.split("\\s+");
            map.put(parts[1], Integer.parseInt(parts[0]));
        }
    }

    System.out.println(map.size() + " entries loaded into the map");
    map.clear();
}


lintool commented on May 27, 2024

How much memory do you have on your machine?
The machine I use at UMD has 0.75 TB RAM :)


iorixxx commented on May 27, 2024

I have 64 GB :) Is there a maximum -Xmx value we should aim for here?
Can you try the loading code? I wonder how much heap it will take.


lintool commented on May 27, 2024

Try using max heap?


iorixxx commented on May 27, 2024

With 80 GB, 503903810 entries were loaded into the map in 00:42:27. If you think this resource usage is reasonable, I can replace voldemort with a fastutil map in the code that percolates the TREC submission file.


iorixxx commented on May 27, 2024

I found a better data structure for the task: ReferenceOpenHashSet<String>. I am abandoning voldemort for myself too. The program will take three arguments: spam threshold, submission file, and Waterloo spam scores file/folder. It will then remove the spammiest documents from the submission file. Does this sound reasonable?
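The filtering step described above might look roughly like the sketch below. It is not the program from this thread: the class and method names are hypothetical, a plain java.util.Map stands in for the fastutil structure, and it makes three assumptions. Submission lines follow the usual TREC run layout (query id, Q0, docid, rank, score, tag), a lower percentile score means a spammier document, and documents without a score are kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SpamFilterSketch {

    /**
     * Returns the submission lines whose document has a spam percentile at or above the threshold.
     * Documents with no known score are kept; lines with fewer than three columns are dropped.
     */
    static List<String> filterRun(List<String> runLines, Map<String, Integer> spamPercentile, int threshold) {
        List<String> kept = new ArrayList<>();
        for (String line : runLines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 3) {
                continue; // malformed line: no docid column to look up
            }
            Integer score = spamPercentile.get(cols[2]); // third column is the docid
            if (score == null || score >= threshold) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

Note that this sketch leaves the rank column untouched, so the surviving lines would still need renumbering if consecutive ranks matter.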


lintool commented on May 27, 2024

Hi @iorixxx sorry for the late reply - was at TREC and starting to dig out of a backlog. Yes, this seems reasonable!

