This project forked from bitfunnel/mg4j-workbench

License: GNU Lesser General Public License v3.0

mg4j-workbench

Java tools for evaluating BitFunnel performance compared to an mg4j baseline.

Building

Windows

choco install java
choco install maven
mvn package

TODO: set JAVA_HOME?

Linux

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install maven
mvn package

TODO: set JAVA_HOME?

OSX

Coming soon.

IntelliJ

Import pom.xml. Build -> Build Project

TODO: Describe step-by-step.
TODO: Add pictures.

Creating an mg4j collection.

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar \
     it.unimi.di.big.mg4j.document.TRECDocumentCollection \
     -f HtmlDocumentFactory -p encoding=iso-8859-1 d:\data\work\out2.collection d:\data\gov2\gx000\gx000\00.txt

TODO: -z parameter for gz files.
TODO: substitute <GOV2 Files ...>

Creating a BitFunnel chunk file from an mg4j collection.

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar \
     org.bitfunnel.reproducibility.GenerateBitFunnelChunks \
      -S <collection file> <chunk file>

Building an mg4j index.

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar \
     it.unimi.di.big.mg4j.tool.IndexBuilder \
      --keep-batches --downcase -S d:\data\work\out2.collection d:\data\work\out2

TODO: Substitute
TODO: Add document filter parameter.

Processing a query log.

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar \
     org.bitfunnel.reproducibility.QueryLogRunner \
     <index base name> <query log file> <output file> [-t threadCount]

Exporting a Partitioned Elias-Fano Index

The mg4j index can be exported in a format usable by the Partitioned Elias-Fano Index project. The optional --index flag exports the index. The optional --queries flag converts a query log file for consumption by the Partitioned Elias-Fano Index and generates two query files. The first contains queries whose terms have been replaced by their integer term id values; queries with terms that are not in the index (and therefore have no term id values) are filtered out. The second contains the plain text queries corresponding to those in the file of term id queries.

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar \
     org.bitfunnel.reproducibility.IndexExporter \
     <index base name> <output base name> [--index] [--queries <query log file>]
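
The query conversion described above can be illustrated with a small, self-contained sketch. This is not the IndexExporter source. The class name QueryLogFilterSketch, the whitespace tokenization, the in-memory term map, and the space-separated id output are all assumptions made for illustration; the real tool reads its term ids from the mg4j index and may tokenize and format differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the IndexExporter implementation. */
class QueryLogFilterSketch {

    /** The two parallel outputs: term id queries and the matching plain text queries. */
    static final class Result {
        final List<String> termIdQueries = new ArrayList<>();
        final List<String> textQueries = new ArrayList<>();
    }

    /**
     * Keeps only queries whose terms all have an id in termIds.
     * Assumption: terms are separated by whitespace and ids are written space-separated.
     */
    static Result filter(List<String> queries, Map<String, Integer> termIds) {
        Result result = new Result();
        for (String query : queries) {
            String trimmed = query.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines in the log
            }
            StringBuilder ids = new StringBuilder();
            boolean allKnown = true;
            for (String term : trimmed.split("\\s+")) {
                Integer id = termIds.get(term);
                if (id == null) {
                    allKnown = false; // term not in the index: drop the whole query
                    break;
                }
                if (ids.length() > 0) {
                    ids.append(' ');
                }
                ids.append(id);
            }
            if (allKnown) {
                result.termIdQueries.add(ids.toString());
                result.textQueries.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up term ids standing in for the index's term map.
        Map<String, Integer> termIds = new HashMap<>();
        termIds.put("bit", 0);
        termIds.put("funnel", 1);
        termIds.put("index", 2);

        List<String> queries = Arrays.asList("bit funnel", "funnel index", "unknown index");
        Result r = filter(queries, termIds);
        System.out.println(r.termIdQueries); // prints [0 1, 1 2]
        System.out.println(r.textQueries);   // prints [bit funnel, funnel index]
    }
}
```

The two lists stay aligned line by line, which mirrors the pairing between the term id query file and the plain text query file.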

Filtering Query Logs

The IndexExporter, described in the previous section, can also generate a filtered query log that contains only those queries whose terms all appear in the index. Include the --queries parameter and omit the --index parameter.

Contributors

mikehopcroft, danluu
