node-minhash

A simple command-line tool for comparing text files using the MinHash algorithm, with the exact Jaccard index reported alongside for contrast.

Installation

If you have just cloned this repository, run the following:

npm install
npm link

Or, if you would like to install globally:

npm install https://github.com/sjhorn/node-minhash -g

Command line tool usage

Using node

minhash file1.txt file2.txt

minhash https://file.com/page1.html https://file.com/page2.html

Using lib

var minhash = require('node-minhash');

minhash.summary(string1, string2);

Methods

.summary(file1, file2)

Compare two text strings using both MinHash and the Jaccard index, and print a summary.

.compare(file1, file2)

Compare two text strings using both MinHash and the Jaccard index.
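
The README does not describe how the MinHash estimate is computed, so the following is only a minimal sketch of the general technique, not this library's implementation. It assumes shingles are already available as arrays of strings, and it uses a seeded FNV-1a-style hash as a simple stand-in for a family of hash functions. The names `seededHash`, `signature` and `estimateSimilarity` are hypothetical.

```javascript
// Seeded 32-bit FNV-1a-style hash (an assumption: a simple, fairly weak
// hash family chosen for illustration, not what node-minhash uses).
function seededHash(text, seed) {
  let h = (0x811c9dc5 ^ Math.imul(seed + 1, 0x9e3779b1)) >>> 0;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// MinHash signature: for each hash function, keep the smallest hash value
// seen over all items in the set.
function signature(items, numHashes = 128) {
  const sig = [];
  for (let seed = 0; seed < numHashes; seed++) {
    let min = Infinity;
    for (const item of items) {
      min = Math.min(min, seededHash(item, seed));
    }
    sig.push(min);
  }
  return sig;
}

// The fraction of positions where two signatures agree estimates the
// Jaccard index of the underlying sets.
function estimateSimilarity(sigA, sigB) {
  let matches = 0;
  for (let i = 0; i < sigA.length; i++) {
    if (sigA[i] === sigB[i]) matches++;
  }
  return matches / sigA.length;
}

const a = ['the quick', 'quick brown', 'brown fox'];
const b = ['the quick', 'quick brown', 'brown dog'];
console.log(estimateSimilarity(signature(a), signature(b)));
```

The estimate becomes more accurate as `numHashes` grows, at the cost of longer signatures. The signatures can be stored and compared without keeping the original shingle sets.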

.shingles(string, words_per_shingle=2)

Convert a string to a set of shingles, using a default of 2 words per shingle. The string is tokenised with the natural library's default tokeniser.
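
As a rough illustration of word shingling, the sketch below builds overlapping word windows from a string. It is not the library's code: it assumes a simple lowercase split on non-alphanumeric characters instead of the natural tokeniser, and the helper name `wordShingles` is hypothetical.

```javascript
// Hypothetical helper: build a set of overlapping n-word shingles.
// Assumption: lowercase the text and split on non-alphanumeric runs,
// which may differ from how the natural tokeniser behaves.
function wordShingles(text, wordsPerShingle = 2) {
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  const result = new Set();
  for (let i = 0; i + wordsPerShingle <= words.length; i++) {
    result.add(words.slice(i, i + wordsPerShingle).join(' '));
  }
  return result;
}

console.log([...wordShingles('the quick brown fox')]);
// [ 'the quick', 'quick brown', 'brown fox' ]
```

A string with fewer words than the shingle size produces an empty set under these assumptions.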

.jaccardIndex(string1, string2)

Compare two strings by tokenising each into shingles and then dividing the size of the intersection of the two shingle sets by the size of their union.
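
The ratio described above can be sketched as follows. This is an illustration under assumptions, not the library's implementation: it uses a simple regex split in place of the natural tokeniser, and the names `toShingles` and `jaccard` are hypothetical.

```javascript
// Hypothetical shingling helper (assumption: lowercase regex split,
// 2 words per shingle by default).
function toShingles(text, n = 2) {
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  const out = new Set();
  for (let i = 0; i + n <= words.length; i++) {
    out.add(words.slice(i, i + n).join(' '));
  }
  return out;
}

// Jaccard index: |A ∩ B| / |A ∪ B|, defined here as 0 when both sets are empty.
function jaccard(textA, textB) {
  const a = toShingles(textA);
  const b = toShingles(textB);
  let intersection = 0;
  for (const s of a) {
    if (b.has(s)) intersection++;
  }
  const union = a.size + b.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

console.log(jaccard('the quick brown fox', 'the quick brown dog')); // 0.5
```

In this sketch, two-word inputs that differ in one word each yield a single, different shingle, so their index is 0 even though they share a word.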

.shingleHashList(set)

Convert a set of shingles to a set of CRC-32 hashes.
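
To show what hashing shingles with CRC-32 involves, here is a self-contained sketch using the standard table-driven CRC-32 (reflected polynomial 0xEDB88320). The README does not say which CRC-32 implementation the library uses, so the functions `crc32` and `hashShingles` are hypothetical stand-ins.

```javascript
// Build the standard CRC-32 lookup table (reflected polynomial 0xEDB88320).
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
  }
  CRC_TABLE[n] = c >>> 0;
}

// CRC-32 of a string's UTF-8 bytes, returned as an unsigned 32-bit integer.
function crc32(text) {
  let crc = 0xFFFFFFFF;
  for (const byte of Buffer.from(text, 'utf8')) {
    crc = CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >>> 8);
  }
  return (crc ^ 0xFFFFFFFF) >>> 0;
}

// Hypothetical counterpart to shingleHashList: map each shingle to its hash.
function hashShingles(shingleSet) {
  return new Set([...shingleSet].map(crc32));
}

console.log(crc32('123456789').toString(16)); // cbf43926
console.log(hashShingles(new Set(['the quick', 'quick brown'])));
```

Replacing shingles with fixed-width integers keeps later set operations and min-value comparisons cheap, at the cost of rare hash collisions.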

node-minhash's Issues

Does not work

var minhash = require('node-minhash');
minhash.summary("huhu a", "huhu b");

Minhash similarity is 0 (0% similar)
Jaccard index is 0 (0% similar)
