simsearch's Introduction

This module is an implementation of Bayesian Sets. Bayesian Sets is a new framework for information retrieval in which a query consists of a set of items which are examples of some concept. The result is a set of items which attempts to capture the example concept given by the query.

For example, for the query with the two animated movies, "Lilo & Stitch" and "Up", Bayesian Sets would return other similar animated movies, like "Toy Story".
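
To make the idea concrete, here is a minimal pure-Python sketch of Bayesian Sets scoring for binary features, using the closed-form log score from the original Bayesian Sets formulation by Ghahramani and Heller. It is not this module's code: the function name `bayesian_sets_scores`, the toy movie features, and the prior scale `c=2.0` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bayesian_sets_scores(items, query_ids, c=2.0):
    """Return {item_id: log score} for binary features.

    items maps an item id to its set of features; query_ids lists the
    example items. Names and defaults here are illustrative assumptions.
    """
    n_items = len(items)
    n_query = len(query_ids)
    # How many items (overall and in the query) contain each feature.
    totals = Counter(f for feats in items.values() for f in feats)
    in_query = Counter(f for qid in query_ids for f in items[qid])

    const = 0.0
    weights = {}
    for feature, total in totals.items():
        mean = total / n_items
        if mean >= 1.0:
            continue  # a feature present in every item carries no signal
        # Beta prior scaled from the empirical feature mean.
        alpha, beta = c * mean, c * (1.0 - mean)
        # Posterior hyperparameters after observing the query set.
        alpha_q = alpha + in_query[feature]
        beta_q = beta + n_query - in_query[feature]
        const += (math.log(alpha + beta) - math.log(alpha + beta + n_query)
                  + math.log(beta_q) - math.log(beta))
        weights[feature] = (math.log(alpha_q) - math.log(alpha)
                            - math.log(beta_q) + math.log(beta))

    # The log score is linear in the features an item contains.
    return {item_id: const + sum(weights.get(f, 0.0) for f in feats)
            for item_id, feats in items.items()}


# Toy data: the feature sets are made up for this example.
movies = {
    "Lilo & Stitch": {"animation", "disney", "family"},
    "Up": {"animation", "pixar", "family"},
    "Toy Story": {"animation", "pixar", "family"},
    "Alien": {"horror", "scifi"},
    "Heat": {"crime", "thriller"},
    "Blade Runner": {"scifi", "thriller"},
}
query = ["Lilo & Stitch", "Up"]
scores = bayesian_sets_scores(movies, query)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because the log score is a constant plus a sum of per-feature weights, scoring every item amounts to one sparse matrix-vector product, which is what makes the approach practical on large collections.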

This module also adds the novel ability to combine full-text search with item-based search. For example, a query can combine items and full-text search keywords. In this case the results match the keywords but are re-ranked by how similar they are to the queried items.
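
A rough way to picture this combination: keep only the items that match the keywords, then order them by item-based similarity. The sketch below assumes the log scores have already been computed. The function `rerank`, the sample texts, and the score values are hypothetical and are not part of this module.

```python
def rerank(documents, keywords, log_scores):
    """Filter documents by keywords, then sort by similarity to the query items.

    documents maps an item id to its text; log_scores maps an item id to a
    precomputed item-based log score (higher means more similar).
    """
    terms = [k.lower() for k in keywords]
    # Keep only the documents that contain every keyword as a whole word.
    matches = [item_id for item_id, text in documents.items()
               if all(t in text.lower().split() for t in terms)]
    # Items without a score sort last.
    return sorted(matches, key=lambda i: log_scores.get(i, float("-inf")),
                  reverse=True)


documents = {
    1: "a robot cleans up the earth",
    2: "a robot uprising in the future",
    3: "toys come to life",
}
log_scores = {1: -1.7, 2: 2.3, 3: 1.9}  # hypothetical values
print(rerank(documents, ["robot"], log_scores))  # [2, 1]
```

Item 3 has a high similarity score but is dropped because it does not match the keyword, while the two matching items are ordered by similarity rather than by text relevance.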

This implementation has been tested on datasets with millions of documents and hundreds of thousands of features. It has become an integral part of Cloud Mining. At the moment only bag-of-words features are supported. However, it is fairly easy to change the code to make it work on other feature types.

This module works as follows:

  1. First, a configuration file has to be written (have a look at tools/sample_config.py). The most important variable holds the list of features to index. Those are indexed with SQL queries of the type:

    sql_features = ['select id as item_id, word as feature from table']

Note that id and word must be aliased as item_id and feature respectively.

  2. Now use tools/index_features.py on the configuration file to index those features.

    python tools/index_features.py config.py

The indexer will create a computed index named index.dat in your working directory. A computed index is a pickled file with all its hyperparameters already computed and with the matrix in CSR format.

  3. You can now test this index:

    python tools/query_index.py index.dat

  4. The script query_index.py loads the index into memory each time it runs. To load it only once, you can serve the index with some client/server code (see client_server code). The index can also be loaded alongside the web application. In webpy, the web.config dictionary can be used for this purpose.
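
As a rough illustration of the indexer's input and output, the sketch below turns (item_id, feature) rows, the shape returned by the sql_features queries, into a binary matrix in CSR layout and round-trips it through pickle. The dictionary layout and the function name `build_csr_index` are assumptions made for this example. They do not reflect the actual contents of index.dat, which also stores the precomputed hyperparameters.

```python
import pickle


def build_csr_index(rows):
    """Build a binary item-by-feature matrix in CSR layout.

    rows is a list of (item_id, feature) pairs. Since every stored value
    is 1, only the column indices and row pointers are kept.
    """
    item_ids = sorted({item for item, _ in rows})
    features = sorted({feat for _, feat in rows})
    item_pos = {item: n for n, item in enumerate(item_ids)}
    feat_pos = {feat: n for n, feat in enumerate(features)}

    # Collect the distinct column positions of each row.
    per_item = [set() for _ in item_ids]
    for item, feat in rows:
        per_item[item_pos[item]].add(feat_pos[feat])

    # indptr[i]:indptr[i + 1] delimits the columns of row i in indices.
    indptr, indices = [0], []
    for cols in per_item:
        indices.extend(sorted(cols))
        indptr.append(len(indices))
    return {"item_ids": item_ids, "features": features,
            "indptr": indptr, "indices": indices}


def features_of(index, item_id):
    """Return the feature names stored for one item."""
    row = index["item_ids"].index(item_id)
    start, end = index["indptr"][row], index["indptr"][row + 1]
    return [index["features"][col] for col in index["indices"][start:end]]


rows = [(1, "space"), (1, "alien"), (2, "alien"), (3, "toy"), (3, "space")]
index = build_csr_index(rows)

# Serialize once, then reload, as one would with a file on disk.
restored = pickle.loads(pickle.dumps(index))
print(features_of(restored, 3))  # ['space', 'toy']
```

Keeping the unpickled object in a long-lived process (a server, or the web application itself) avoids paying the deserialization cost on every query.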

This module relies on Sphinx and fSphinx to combine full-text and item-based search. A regular Sphinx client is wrapped together with a computed index, and a function called setup_sphinx is called upon similarity search. This function resets the Sphinx client if an item-based query is encountered.

Here is an example of a setup_sphinx function:

# this is only used for sim_sphinx (see doc)
def sphinx_setup(cl):
    import sphinxapi
    
    # custom sorting function for the search
    # we always make sure highly ranked items with a log score are at the top.
    cl.SetSortMode(sphinxapi.SPH_SORT_EXPR, '@weight * log_score_attr')
    
    # custom grouping function for the facets
    group_func = 'sum(log_score_attr)'
    
    # setup sorting and ordering of each facet 
    for f in cl.facets:
        # group by a custom function
        f.SetGroupFunc(group_func)

Note that the log scores are stored in the Sphinx attribute log_score_attr. This attribute must be set to 1 and declared as a float in your Sphinx configuration file:

# log_score_attr must be set to 1
sql_query            = \
    select *,\
        1 as log_score_attr\
    from table

# log_score_attr will hold the log scores after item-based search
sql_attr_float = log_score_attr
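
One plausible reading of the default value of 1 can be checked by simulating the sort expression outside Sphinx. With log_score_attr equal to 1 for every document, '@weight * log_score_attr' reduces to the plain full-text weight. Once item-based scores replace that value, the same expression re-ranks the matches. The sketch below is plain Python, not Sphinx code, and the function `sort_by_expr`, the weights, and the scores are hypothetical.

```python
def sort_by_expr(weights, log_scores=None):
    """Order doc ids by weight * log_score_attr, highest first.

    weights maps a doc id to its full-text weight; log_scores optionally
    replaces the default attribute value of 1.0 for some documents.
    """
    log_scores = log_scores or {}
    expr = {doc: w * log_scores.get(doc, 1.0) for doc, w in weights.items()}
    return sorted(expr, key=expr.get, reverse=True)


weights = {"a": 3.0, "b": 2.0, "c": 1.0}  # hypothetical full-text weights
print(sort_by_expr(weights))                        # ['a', 'b', 'c']
print(sort_by_expr(weights, {"a": 0.5, "c": 4.0}))  # ['c', 'b', 'a']
```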

There is a nice blog post about item-based search with Bayesian Sets. Feel free to read through it.

That's it for the documentation. Have fun playing with item-based search and don't forget to leave feedback.

simsearch's People

Contributors

alexksikes
