Giter Site home page Giter Site logo

bigmlcom / sketchy Goto Github PK

View Code? Open in Web Editor NEW
146.0 22.0 18.0 151 KB

Sketching Algorithms for Clojure (bloom filter, min-hash, hyper-loglog, count-min sketch)

License: Other

Clojure 89.99% Java 4.20% Shell 5.81%
hashing sketching bloom-filter minhash count-min-sketch hyperloglog clojure

sketchy's Introduction

Sketching Algorithms in Clojure

Installation

sketchy is available as a Maven artifact from Clojars.

Clojars Project

Overview

This library contains various sketching/hash-based algorithms useful for building compact summaries of large datasets.

All the sketches are composed using vanilla Clojure data structures. That means immutability and easy serialization but humble performance. stream-lib is a good alternative for those in need of speed.

General Utilities:

Sketching/hash-based algorithms:

As we review each section, feel free to follow along in the REPL. Note that bigml.sketchy.test.demo loads Hamlet and A Midsummer Night's Dream into memory for our code examples.

user> (ns test
        (:use [bigml.sketchy.test.demo])
        (:require (bigml.sketchy [murmur :as murmur]
                                 [bits :as bits]
                                 [bloom :as bloom]
                                 [min-hash :as min-hash]
                                 [hyper-loglog :as hll]
                                 [count-min :as count-min])))

MurmurHash

The bigml.sketchy.murmur namespace makes it easy to generate seeded Murmur hashes. Murmur hashes are popular as they are reasonably quick to produce and adequately random.

These Murmur hashes are all produced as 64 bit longs. A simple example hashing the string "foo" to a long:

test> (murmur/hash "foo")
6231696022289519434

Anything that clojure.core/hash accepts may also be used with this hash fn:

test> (murmur/hash {:foo "bar"})
-7720779806311024803

An optional seed parameter selects a unique hashing function. Anything that's hashable by clojure.core/hash is valid as a seed.

test> (murmur/hash "foo" 0)
6231696022289519434
test> (murmur/hash "foo" 42)
-8820575662888368925
test> (murmur/hash "foo" :bar)
-8527955061573093315

The truncate function can be used to truncate the number of bits (must be less than 64 and more than 0).

test> (murmur/truncate (murmur/hash "foo") 32)
3972535114
test> (murmur/truncate (murmur/hash "foo") 16)
4938
test> (murmur/truncate (murmur/hash "foo") 8)
74

If you need multiple unique hashes for a value, hash-seq is a convenience function for that. It applies an infinite sequence of unique hash functions (always in the same order), so take as many as you need.

test> (take 3 (murmur/hash-seq "foo"))
(6231696022289519434 -1965669315023635442 -4826411765733908310)

Immutable Bitset

Besides being my favorite name for a namespace, bigml.sketchy.bits provides an immutable bitset supporting bit-level operations for any number of bits. The bitset is backed by a vector of longs.

The create function builds a bitset given the desired number of bits. Every bit will be initialized as clear (all zero).

The set function sets the bits at the given indicies. The test function returns true if the bit at the given index is set.

test> (def my-bits (-> (bits/create 256)
                       (bits/set 2 48 58 184 233)))
test> (bits/test my-bits 47)
false
test> (bits/test my-bits 48)
true

The set-seq function returns the indicies of every set bit. Alternatively, clear-seq returns all the clear bits.

test> (bits/set-seq my-bits)
(2 48 58 184 233)

The clear function complements set by clearing the bits for the given indices. Similarly, the flip function reverses a bit's state.

test> (bits/set-seq (bits/clear my-bits 48))
(2 58 184 233)
test> (bits/set-seq (bits/flip my-bits 48))
(2 58 184 233)

Moreover, the namespace offers functions to and and or two bitsets. You can also measure hamming-distance, jaccard-similarity, or cosine-similarity.

Bloom Filter

bigml.sketchy.bloom contains an implementation of a Bloom filter, useful for testing set membership. When checking set membership for an item, false positives are possible but false negatives are not.

You may create a Bloom filter by providing the expected number of items to be inserted into the filter and the acceptable false positive rate.

After creating the filter, you may either insert individual items or add an entire collection of items into the Bloom filter.

test> (def hamlet-bloom
        (reduce bloom/insert
                (bloom/create (count hamlet-tokens) 0.01)
                hamlet-tokens))

test> (def midsummer-bloom
        (bloom/into (bloom/create (count midsummer-tokens) 0.01)
                    midsummer-tokens))

Item membership is tested with contains?.

test> (bloom/contains? hamlet-bloom "puck")
false
test> (bloom/contains? midsummer-bloom "puck")
true

The Bloom filters are also merge friendly as long as they are initialized with the same parameters.

test> (def summerham-bloom
        (let [total (+ (count hamlet-tokens) (count midsummer-tokens))]
          (bloom/merge (bloom/into (bloom/create total 0.01) midsummer-tokens)
                       (bloom/into (bloom/create total 0.01) hamlet-tokens))))
test> (bloom/contains? summerham-bloom "puck")
true
test> (bloom/contains? summerham-bloom "yorick")
true
test> (bloom/contains? summerham-bloom "henry")
false

Min-Hash

bigml.sketchy.min-hash contains an implementation of the MinHash algorithm, useful for comparing the Jaccard similarity of two sets.

This implementation includes the improvements recommended in "Improved Densification of One Permutation Hashing", which greatly reduces the algorithmic complexity for building a MinHash.

To create a MinHash, you may provide a target error rate for similarity (default is 0.05). After that, you can either insert individual values or add collections into the MinHash.

In the following example we break A Midsummer Night's Dream into two halves (midsummer-part1 and midsummer-part2) and build a MinHash for each. We then compare the two parts together to see if they are more similar than a MinHash of Hamlet.

As we'd expect, the two halves of A Midsummer Night's Dream are more alike than Hamlet.

test> (def hamlet-hash (min-hash/into (min-hash/create) hamlet-tokens))
test> (def midsummer1-hash (min-hash/into (min-hash/create) midsummer-part1))
test> (def midsummer2-hash (min-hash/into (min-hash/create) midsummer-part2))
test> (min-hash/jaccard-similarity midsummer1-hash midsummer2-hash)
0.2852
test> (min-hash/jaccard-similarity midsummer1-hash hamlet-hash)
0.2012

The MinHashes are merge friendly as long as they're initialized with the same target error rate.

test> (def midsummer-hash (min-hash/into (min-hash/create) midsummer-tokens))
test> (min-hash/jaccard-similarity midsummer-hash
                                   (min-hash/merge midsummer1-hash
                                                   midsummer2-hash))
1.0

Hyper-LogLog

bigml.sketchy.hyper-loglog contains an implementation of the HyperLogLog sketch, useful for estimating the number of distinct items in a set. This is a technique popular for tracking unique visitors over time.

To create a HyperLogLog sketch, you may provide a target error rate for distinct item estimation (default is 0.05). After that, you can either insert individual values or add collections into the sketch.

test> (def hamlet-hll (hll/into (hll/create 0.01) hamlet-tokens))
test> (def midsummer-hll (hll/into (hll/create 0.01) midsummer-tokens))
test> (count (distinct hamlet-tokens)) ;; actual
4793
test> (hll/distinct-count hamlet-hll)  ;; estimated
4868
test> (count (distinct midsummer-tokens)) ;; actual
3034
test> (hll/distinct-count midsummer-hll) ;; estimated
3018

HyperLogLog sketches may be merged if they're initialized with the same error rate.

test> (count (distinct (concat hamlet-tokens midsummer-tokens))) ;; actual
6275
test> (hll/distinct-count (hll/merge hamlet-hll midsummer-hll)) ;; estimated
6312

Similar to MinHash, HyperLogLog sketches can also provide an estimate of the Jaccard similarity between two sets.

test> (def midsummer1-hll (hll/into (hll/create 0.01) midsummer-part1))
test> (def midsummer2-hll (hll/into (hll/create 0.01) midsummer-part2))
test> (hll/jaccard-similarity midsummer1-hll midsummer2-hll)
0.2833001988071571
test> (hll/jaccard-similarity midsummer1-hll hamlet-hll)
0.201231310466139

Count-Min

bigml.sketchy.count-min provides an implementation of the Count-Min sketch, useful for estimating frequencies of arbritrary items in a stream.

To create a count-min sketch you may define the desired number of hash-bits and the number of independent hash functions. The total number of counters maintained by the sketch will be (2^hash-bits)*hashers, so choose these values carefully.

After creating a sketch, you may either insert individual values or add collections into the sketch.

In the example below we build a Count-Min sketch that uses 1500 counters to estimate frequencies for the 4800 unique tokens in Hamlet.

test> (def hamlet-cm (count-min/into (count-min/create :hash-bits 9)
                                     hamlet-tokens))
test> (count (:counters hamlet-cm))
1536
test> ((frequencies hamlet-tokens) "hamlet")
77
test> (count-min/estimate-count hamlet-cm "hamlet")
87
test> ((frequencies hamlet-tokens) "rosencrantz")
7
test> (count-min/estimate-count hamlet-cm "rosencrantz")
15

As with the other sketching algorithms, Count-Min sketches may be merged if they're initialized with the same parameters.

test> (def midsummer1-cm (count-min/into (count-min/create :hash-bits 9)
                                         midsummer-part1))
test> (def midsummer2-cm (count-min/into (count-min/create :hash-bits 9)
                                         midsummer-part2))
test> ((frequencies midsummer-tokens) "love") ;; actual count
98
test> (count-min/estimate-count (count-min/merge midsummer1-cm midsummer2-cm)
                                "love")
104

Contributing to this project

See doc/contributing.md

License

Copyright (C) 2013 BigML Inc.

Distributed under the Apache License, Version 2.0.

sketchy's People

Contributors

ashenfad avatar ieugen avatar jaor avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

sketchy's Issues

Min-hash constrained by Integer/MAX_VALUE

Hi guys,

When computing min-hashes, for many inputs, the signature component defaults to Integer/MAX_VALUE because the input hashes (generated by 64bit murmur hash) are all larger that Integer/MAX_VALUE.

Have you faced this issue before or is it just me?!

Thanks
Gireesh

A thank you "issue"

This isn't really an "issue", I just wanted to use the issues mechanism to thank you for making this library available. It is very useful, especially given its small size and amazingly simple implementations (it is clear you've done some creative thinking to come up with these, I've spent some time reading the code to understand it and make sure it is valid, and I am impressed).

Thanks again!

HLL: add 1 to the number of zeros?

Ok, I realize this might be wasting your time, but I can't let go and I have to ask:

In hll/insert*, line 24

zeros (Long/numberOfTrailingZeros (bit-shift-right hv offset))]
you do not add 1 to the number of trailing zeros. Reading the original HLL paper, as well as the subsequent "Understanding the HyperLogLog", it seems they always add 1 (to quote from the original paper: "equivalently one plus the length of the initial run of 0โ€™s").

This makes a big difference, since the buckets are initialized with zeros, so not adding 1 means that in many cases we won't even update the buckets.

I have to ask, because yes, of course I added inc to that line, and quickly discovered that when I do, the tests do not pass and the results are incorrect. So the code works correctly, but why?

sketchy incompatible with hadoop because of guava depenency

Hi - just wanted to let you know that I considered using this awesome looking library for a project, but had to abort because of a hadoop / guava incompatibility. This is a well known problem that you can find refrences to all / over / the / interwebs.

Maven has a solution for this called relocating classes (via the maven shade plugin) which nicely takes care of this problem. Leiningen has support for maven-shade functionality in :uberjar-merge-with which is covered well in technomancy/leiningen#973, but for whatever reason doesn't seem to support relocation.

Just wanted to share this in case anyone else has the same issue. Since for now I just needed a HyperLogLog, I'm going to instead grab AddThis' stream-lib HyperLogLog.java, but hope to come back to sketchy in the future if I can figure out how to shadow guava's siphashing classes.

0.3.0 to clojars?

Hallo,

it appears the the project version is 0.3.0 release, but the latest on clojars is 0.2.0, could you push 0.3.0 up?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.