go2starr / lshhdc Goto Github PK

View Code? Open in Web Editor NEW

79.0 79.0 42.0 356 KB

LSH based high dimensional clustering for sets and points

License: MIT License

Python 100.00%

lshhdc's People

Contributors

Stargazers

Watchers

lshhdc's Issues

Bug in unionfind.sets()

Sorry to flood this repo, but I think I found a bug in the unionfind.sets() function. It's easy to reproduce:

uf = UnionFind()
uf.union(0, 1)
uf.union(2, 3)
uf.union(3, 0)
assert uf.sets() == [[0, 1, 2, 3]] # rather gives [[0], [1, 2, 3]]

The bug is not in the original code, it's in the added sets() function of this repo, which incorrectly assumes that the set trees are always flat. Here's my fix:

cjauvin@1fd5d44

@escherba, @cjdd3b and @shreyu86, if you confirm that this is a bug, you'll have to update your forked repos. Ideally I would issue a PR for this base repo, but @go2starr doesn't seem very responsive.

Some questions

Hi! Thank you for this code, I've been studying it thoroughly and it is a very useful and helpful companion to the theory and algorithm sketches found in the MMD book. I have a few questions about some aspects of simhashing theory that are not entirely clear to me, would it be all right if I asked them here?

Interesting improvement (constrained clustering)

Hi fellow MinHashers (@go2starr, @escherba, @cjdd3b, @shreyu86),

I've been studying quite extensively the problem of clustering short strings (names and addresses mainly) with minhashing and I found that it is quite a hard problem. My first intuition was something like a two-pass algorithm in which the first pass would retrieve coarse clusters in an efficient way (i.e. minhashing with a loose similarity threshold). The second pass would then refine the coarse clusters and could afford to be much less efficient (possibly even brute-force-ish) because it would operate inside (hopefully small) clusters. However there are problems with that approach, and the most important one I found is that a loose sim threshold can yield massive cluster accumulation, resulting from a high number of false positives necessarily arising on a big dataset.

I then got an interesting idea, tested it, and it's been producing some very interesting results in my particular context.

As the MMD book explains, whenever a minhash collision occurs, the two objects can be considered a candidate pair, but it doesn't necessarily entail that they are similar within the desired criteria (i.e. it could be a false positive). The problem with the current implementation is that a candidate pair is considered a match, and it simply stops there. The new class I created, ConstrainedCluster, works in a similar way to the Cluster class, except that whenever two objects are to be merged together, it imposes the extra condition that they must pass an explicit similarity check. If this check doesn't pass, the objects are simply not merged. The nice thing about this similarity check is that it is user-defined, and thus can be adapted to a particular context. The default is that the Jaccard Similarity must be as high as the initial threhold of the hasher, but in my context (i.e. short strings), I've been experimenting with the Levenshtein Ratio instead, based on the troubling observation that:

s1 = 'abcdef'
s2 = 'abdcef'
print jaccard_sim(set(shingle(s1, 2)), set(shingle(s2, 2))) # 0.25
print Levenshtein.ratio(s1, s2)                             # 0.83

Moreover, I found that although more expensive than the non-constrained version (a list of possible candidates has to be scanned every time a merge is considered), it's still surprisingly efficient, I suspect probably way more than any two-pass method would ever hope to be anyway.

The ConstrainedCluster class is meant to be used in exactly the same way as its parent, and the internal implementation differences are explained in the source:

https://github.com/cjauvin/lshhdc/tree/c16f2ee6e8b913c14760752d54629967a7526f93

Like I said I've been getting excellent results with this technique, but I'd be very interested to hear your thoughts about it of course.

License?

@go2starr what is the license for your code?

hshingle function

Hi!
First of all, very good LSH implementation.
I have one question, where you use hshingle function? If just not, how You handled signatures?
At the beginning, algorythm creates shingles from strings/documents and use them to build matrix (with 0/1 values). Finally, using hash functions signature matrix is created (and so on). So, I cant see this step.
Thanks for Your help :)

Collisions constrained to be within a band, or not

In the second paragraph of Section 3.4.1 of the MMD book, it says:

We can use the same hash function for all the bands, but we use a separate bucket
array for each band, so columns with the same vector in diﬀerent bands will not
hash to the same bucket.

As I understand it, it's not the way your code works, as the (single) band-hashing function does not distinguish between bands, and the first band of a document could collide with the second band of another, for instance (at least in theory, because I don't know if it's really likely to happen). In other words, you seem to only have one set of buckets (for all the bands), whereas the book suggests to have as many sets as there are bands. Does that make sense?

go2starr / lshhdc Goto Github PK

lshhdc's People

Contributors

Stargazers

Watchers

Forkers

lshhdc's Issues

Bug in unionfind.sets()

Some questions

Interesting improvement (constrained clustering)

License?

hshingle function

Collisions constrained to be within a band, or not

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent