qdr

Query-Document Relevance ranking functions

This repository implements a few query-document similarity functions, commonly used in information retrieval applications. It supports:

  • TF-IDF
  • Okapi BM25
  • Language Model

This implementation includes pure Python code for iteratively training models from a large corpus, and a C++ implementation of the scoring functions with Cython wrappers for fast evaluation.
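As a rough illustration of what one of these ranking functions computes, here is a minimal pure-Python sketch of Okapi BM25. It is not the code in this repository: the function name `bm25_score`, its parameters, the smoothed IDF variant, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_score(query, document, doc_freq, total_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Okapi BM25 score of `document` for `query` (both lists of tokens).

    doc_freq maps token -> number of training documents containing it.
    k1 and b are illustrative defaults, not necessarily the values
    hard coded in qdr.
    """
    term_counts = Counter(document)
    doc_len = len(document)
    score = 0.0
    for term in query:
        tf = term_counts.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freq.get(term, 0)
        # Smoothed IDF variant that stays non-negative for common terms.
        idf = math.log(1.0 + (total_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation with document-length normalization.
        norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score
```

The corpus statistics (`doc_freq`, `total_docs`, `avg_doc_len`) are the kind of quantities a training pass has to collect before any scoring can happen.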

Each of these ranking functions has a few "magic" constants. Currently these are hard coded to values recommended in the literature, but they can be made configurable if the need arises. Relevant references:

Usage

All tokenization and word normalization is handled client side, and all methods that accept queries or documents assume they are lists of byte strings, not unicode.
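Since tokenization is left to the caller, a client needs some function that turns raw text into a list of byte strings. The sketch below shows one simple possibility. The `tokenize` helper, the lowercasing step, and the choice of UTF-8 are all assumptions made for this example, not part of qdr.

```python
import re

# One or more word characters; punctuation and whitespace act as separators.
_TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def tokenize(text):
    """Lowercase `text`, split on non-word characters, and encode each
    token as UTF-8 bytes (hypothetical helper, not part of qdr)."""
    return [tok.encode("utf-8") for tok in _TOKEN_RE.findall(text.lower())]
```

Whatever normalization is chosen, it should be applied identically to the training corpus, the documents, and the queries, otherwise tokens will not match at scoring time.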

There are two separate steps to using the ranking functions: training and scoring.

Training

The Trainer class supports incremental training from a large corpus, combining separately trained models for map-reduce type data flows, pruning infrequent tokens from large models, and serialization. Typical usage:

from qdr import Trainer

# load corpus -- it's an iterable of documents, each document is a
# list of byte strings
model = Trainer()
model.train(corpus)

# the train method adds documents incrementally so it can be updated with
# additional documents by calling train again
model.train(another_corpus)

# write to a file
model.serialize_to_file(outputfile)

For map-reduce type work, the method update_counts_from_trained will merge the contents of two Trainer instances:

# map step -- typically this is parallelized
for k, corpus in enumerate(corpus_chunks):
    model = Trainer()
    model.train(corpus)
    model.serialize_to_file("file%s.gz" % k)

# reduce step
model = Trainer.load_from_file("file0.gz")
for k in range(1, len(corpus_chunks)):
    model2 = Trainer.load_from_file("file%s.gz" % k)
    model.update_counts_from_trained(model2)

# prune the final model if needed
model.prune(min_word_count, min_doc_count)
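To see why separately trained models can be merged, it helps to picture a model as a set of additive counts. The sketch below mimics the map, reduce, and prune steps with plain `collections.Counter` objects. It is only an illustration and does not reflect how Trainer stores its data: the helpers `count_corpus`, `merge_counts`, and `prune_counts` are hypothetical, and treating the thresholds as inclusive minimums is an assumption.

```python
from collections import Counter


def count_corpus(corpus):
    """Return (token counts, document frequencies, number of documents)."""
    word_counts, doc_counts, n_docs = Counter(), Counter(), 0
    for document in corpus:
        word_counts.update(document)      # every occurrence counts
        doc_counts.update(set(document))  # at most once per document
        n_docs += 1
    return word_counts, doc_counts, n_docs


def merge_counts(a, b):
    """Reduce step: counts from disjoint corpus chunks simply add up."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def prune_counts(counts, min_word_count, min_doc_count):
    """Drop tokens below either threshold (assumed inclusive minimums)."""
    word_counts, doc_counts, n_docs = counts
    keep = [
        tok for tok in word_counts
        if word_counts[tok] >= min_word_count
        and doc_counts[tok] >= min_doc_count
    ]
    return (
        Counter({tok: word_counts[tok] for tok in keep}),
        Counter({tok: doc_counts[tok] for tok in keep}),
        n_docs,
    )
```

Because addition is associative and commutative, the chunks can be merged in any order and still give the same totals as a single pass over the whole corpus.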

Scoring

Typical usage:

from qdr import QueryDocumentRelevance

scorer = QueryDocumentRelevance.load_from_file('trained_model.gz')
# document, query are lists of byte strings
relevance_scores = scorer.score(document, query)

For scoring batches of queries against a single document, the score_batch method is more efficient than calling score repeatedly:

# queries is a list of queries, each query is a list of tokens:
relevance_scores = scorer.score_batch(document, queries)

Installing

sudo pip install -r requirements.txt
sudo make install

Contributing

Contributions welcome! Fork, commit, then open a pull request.
