
This project forked from gibranfp/sampled-minhashing


A method to mine beyond-pairwise relationships using Min-Hashing for large-scale pattern discovery


Sampled-MinHashing

Sampled Min-Hashing (SMH) is a simple and scalable method to discover patterns from large-scale dyadic data (e.g. bag of words). SMH relies on Min-Hashing to efficiently mine beyond-pairwise relationships which are clustered to form the final discovered patterns. SMH has been successfully applied to the discovery of objects from image collections and topics from text corpora. This repository includes a C implementation of SMH together with SWIG Python bindings.
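The core idea can be illustrated with a small pure-Python sketch. It is not the repository's C implementation: the function names and the parameters tuple_size and num_tables are hypothetical, and the final clustering of overlapping sets is omitted. It assumes each item (e.g. a term) is represented by the set of document IDs in which it occurs, and groups items whose tuples of min-hash values collide in at least one hash table.

```python
import random
from collections import defaultdict


def minhash_tuple(doc_ids, permutations):
    """Return the min-hash value of a set of document IDs under each permutation."""
    return tuple(min(perm[d] for d in doc_ids) for perm in permutations)


def cooccurring_sets(inverted, num_docs, tuple_size=2, num_tables=50, seed=0):
    """Collect groups of items whose min-hash tuples collide in some table."""
    rng = random.Random(seed)
    found = set()
    for _ in range(num_tables):
        # Each table uses tuple_size independent random permutations of the documents.
        permutations = []
        for _ in range(tuple_size):
            perm = list(range(num_docs))
            rng.shuffle(perm)
            permutations.append(perm)
        # Items with identical tuples fall into the same bucket.
        buckets = defaultdict(list)
        for item, doc_ids in inverted.items():
            buckets[minhash_tuple(doc_ids, permutations)].append(item)
        # Keep only buckets holding more than one item.
        for items in buckets.values():
            if len(items) > 1:
                found.add(frozenset(items))
    return found


# Toy data (made up for illustration): item -> set of document IDs.
inverted = {
    'a': {0, 1, 2}, 'b': {0, 1, 2}, 'c': {0, 1, 2},
    'd': {5, 6}, 'e': {5, 6},
    'f': {3}, 'g': {0, 1, 2, 3},
}
for group in cooccurring_sets(inverted, num_docs=7):
    print(sorted(group))
```

Items with identical document sets (here a, b, c and d, e) always share a bucket, items with disjoint document sets never do, and partially overlapping items such as g collide only in some tables, with a probability that grows with the overlap.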

Installation

Install the dependencies:

sudo apt-get install cmake python swig libpython-dev

Clone and compile the library:

git clone https://github.com/gibranfp/Sampled-MinHashing.git
cd Sampled-MinHashing
mkdir build
cd build
cmake ..
make

To do a system-wide installation:

sudo make install

Alternatively, you can use it locally by adding the absolute path of the bin directory inside Sampled-MinHashing to the system path:

export PATH=$PATH:[absolute_path_to_sampled_minhashing]/bin

Then add the absolute path of the python/smh directory inside build to Python's path:

export PYTHONPATH=$PYTHONPATH:[absolute_path_to_sampled_minhashing]/build/python/smh

To uninstall the library from your system, run:

sudo make uninstall

Example Usage

Getting NIPS Corpus

To discover topics from the NIPS corpus using Sampled-MinHashing, first download and extract the corpus to a given location:

wget http://arbylon.net/projects/nips/nips-20110223.zip
unzip nips-20110223.zip

This creates the directory knowceans-ilda/nips where the corpus is located. The file nips.corpus inside this directory contains a database of N lists corresponding to the bag-of-words of the N documents in the corpus. The format of the file is as follows:

size_of_list_1 item1_1:freq1_1 item2_1:freq2_1 ...
size_of_list_2 item1_2:freq1_2 item2_2:freq2_2 ...
...                        ...
size_of_list_N item1_N:freq1_N item2_N:freq2_N ...

For example, if you have a corpus of 5 documents with a vocabulary of 19 different terms, the file could look like this:

6 3:9 4:8 7:5 12:1 16:5 18:5 
3 2:7 3:4 8:5
4 1:9 2:10 16:8 17:10
4 10:10 11:4 15:8 16:3
3 0:1 14:9 15:10
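As a minimal sketch of how this format can be read without the library, the following pure-Python parser turns each line into a list of (item, frequency) pairs and checks the declared list size. The function names parse_listdb_line and parse_listdb are hypothetical and are not part of the smh bindings.

```python
def parse_listdb_line(line):
    """Parse one 'size item:freq item:freq ...' line into (item, freq) pairs."""
    fields = line.split()
    size = int(fields[0])
    pairs = []
    for field in fields[1:]:
        item, freq = field.split(':')
        pairs.append((int(item), int(freq)))
    # The leading number must match the number of item:freq pairs on the line.
    if len(pairs) != size:
        raise ValueError('declared size %d but found %d pairs' % (size, len(pairs)))
    return pairs


def parse_listdb(text):
    """Parse a whole database, one list per non-empty line."""
    return [parse_listdb_line(line) for line in text.splitlines() if line.strip()]


example = """6 3:9 4:8 7:5 12:1 16:5 18:5
3 2:7 3:4 8:5
4 1:9 2:10 16:8 17:10
4 10:10 11:4 15:8 16:3
3 0:1 14:9 15:10
"""

docs = parse_listdb(example)
print(len(docs))  # 5
print(docs[1])    # [(2, 7), (3, 4), (8, 5)]
```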

Creating Inverted File Structure

To perform topic discovery you'll need to load the corpus and create the inverted file structure. This can be done using the standalone command smhcmd:

smhcmd ifindex knowceans-ilda/nips/nips.corpus knowceans-ilda/nips/nips.ifs

Or from Python:

import smh
corpus = smh.listdb_load('knowceans-ilda/nips/nips.corpus')
ifs = corpus.invert()
ifs.save('knowceans-ilda/nips/nips.ifs')
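Conceptually, inverting the corpus swaps the roles of documents and items: instead of one list of items per document, there is one list of documents per item. The sketch below shows the idea in pure Python on the 5-document example above. It is only an illustration under that assumption, not the code behind corpus.invert(), and the function name invert_lists is hypothetical.

```python
from collections import defaultdict


def invert_lists(docs):
    """Map each item ID to the (doc_id, freq) pairs of the documents containing it."""
    inverted = defaultdict(list)
    for doc_id, pairs in enumerate(docs):
        for item, freq in pairs:
            inverted[item].append((doc_id, freq))
    return dict(inverted)


# The 5-document example corpus as lists of (item, freq) pairs.
docs = [
    [(3, 9), (4, 8), (7, 5), (12, 1), (16, 5), (18, 5)],
    [(2, 7), (3, 4), (8, 5)],
    [(1, 9), (2, 10), (16, 8), (17, 10)],
    [(10, 10), (11, 4), (15, 8), (16, 3)],
    [(0, 1), (14, 9), (15, 10)],
]

ifs = invert_lists(docs)
print(ifs[16])  # [(0, 5), (2, 8), (3, 3)]
print(ifs[3])   # [(0, 9), (1, 4)]
```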

Discovering Topics

Once you have the inverted file, discover topics with the standalone smhcmd command:

smhcmd discover ~/knowceans-ilda/nips/nips.ifs ~/knowceans-ilda/nips/nips.models

From Python:

import smh
corpus = smh.listdb_load('knowceans-ilda/nips/nips.corpus')
ifs = smh.listdb_load('knowceans-ilda/nips/nips.ifs')
discoverer = smh.SMHDiscoverer()
models = discoverer.fit(ifs, expand = corpus)
models.save('knowceans-ilda/nips/nips.models')

To visualize the topics as sets of terms, load the vocabulary file and map term IDs to terms:

# Each line of the vocabulary file has the form "term = id".
vocabulary = {}
with open('knowceans-ilda/nips/nips.vocab', 'r') as f:
    content = f.readlines()
    for line in content:
        tokens = line.split(' = ')
        vocabulary[int(tokens[1])] = tokens[0]

# Replace every term ID in each discovered model with its term.
topics = []
for m in models.ldb:
    terms = []
    for j in m:
        terms.append(vocabulary[j.item])
    topics.append(terms)

Finally, save the lists of terms to a file, one topic per line:

# Binary mode is used because the encoded strings are written as bytes.
with open('knowceans-ilda/nips/nips.terms', 'wb') as f:
    for t in topics:
        f.write(' '.join(t).encode('utf8'))
        f.write('\n'.encode('utf8'))
