Giter Site home page Giter Site logo

lshhdc's People

Contributors

cjauvin avatar go2starr avatar mattdennewitz avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

lshhdc's Issues

Bug in unionfind.sets()

Sorry to flood this repo, but I think I found a bug in the unionfind.sets() function. It's easy to reproduce:

uf = UnionFind()
uf.union(0, 1)
uf.union(2, 3)
uf.union(3, 0)
assert uf.sets() == [[0, 1, 2, 3]] # rather gives [[0], [1, 2, 3]]

The bug is not in the original code, it's in the added sets() function of this repo, which incorrectly assumes that the set trees are always flat. Here's my fix:

cjauvin@1fd5d44

@escherba, @cjdd3b and @shreyu86, if you confirm that this is a bug, you'll have to update your forked repos. Ideally I would issue a PR for this base repo, but @go2starr doesn't seem very responsive.

Some questions

Hi! Thank you for this code, I've been studying it thoroughly and it is a very useful and helpful companion to the theory and algorithm sketches found in the MMD book. I have a few questions about some aspects of simhashing theory that are not entirely clear to me, would it be all right if I asked them here?

Interesting improvement (constrained clustering)

Hi fellow MinHashers (@go2starr, @escherba, @cjdd3b, @shreyu86),

I've been studying quite extensively the problem of clustering short strings (names and addresses mainly) with minhashing and I found that it is quite a hard problem. My first intuition was something like a two-pass algorithm in which the first pass would retrieve coarse clusters in an efficient way (i.e. minhashing with a loose similarity threshold). The second pass would then refine the coarse clusters and could afford to be much less efficient (possibly even brute-force-ish) because it would operate inside (hopefully small) clusters. However there are problems with that approach, and the most important one I found is that a loose sim threshold can yield massive cluster accumulation, resulting from a high number of false positives necessarily arising on a big dataset.

I then got an interesting idea, tested it, and it's been producing some very interesting results in my particular context.

As the MMD book explains, whenever a minhash collision occurs, the two objects can be considered a candidate pair, but it doesn't necessarily entail that they are similar within the desired criteria (i.e. it could be a false positive). The problem with the current implementation is that a candidate pair is considered a match, and it simply stops there. The new class I created, ConstrainedCluster, works in a similar way to the Cluster class, except that whenever two objects are to be merged together, it imposes the extra condition that they must pass an explicit similarity check. If this check doesn't pass, the objects are simply not merged. The nice thing about this similarity check is that it is user-defined, and thus can be adapted to a particular context. The default is that the Jaccard Similarity must be as high as the initial threhold of the hasher, but in my context (i.e. short strings), I've been experimenting with the Levenshtein Ratio instead, based on the troubling observation that:

s1 = 'abcdef'
s2 = 'abdcef'
print jaccard_sim(set(shingle(s1, 2)), set(shingle(s2, 2))) # 0.25
print Levenshtein.ratio(s1, s2)                             # 0.83

Moreover, I found that although more expensive than the non-constrained version (a list of possible candidates has to be scanned every time a merge is considered), it's still surprisingly efficient, I suspect probably way more than any two-pass method would ever hope to be anyway.

The ConstrainedCluster class is meant to be used in exactly the same way as its parent, and the internal implementation differences are explained in the source:

https://github.com/cjauvin/lshhdc/tree/c16f2ee6e8b913c14760752d54629967a7526f93

Like I said I've been getting excellent results with this technique, but I'd be very interested to hear your thoughts about it of course.

hshingle function

Hi!
First of all, very good LSH implementation.
I have one question, where you use hshingle function? If just not, how You handled signatures?
At the beginning, algorythm creates shingles from strings/documents and use them to build matrix (with 0/1 values). Finally, using hash functions signature matrix is created (and so on). So, I cant see this step.
Thanks for Your help :)

Collisions constrained to be within a band, or not

In the second paragraph of Section 3.4.1 of the MMD book, it says:

We can use the same hash function for all the bands, but we use a separate bucket
array for each band, so columns with the same vector in different bands will not
hash to the same bucket.

As I understand it, it's not the way your code works, as the (single) band-hashing function does not distinguish between bands, and the first band of a document could collide with the second band of another, for instance (at least in theory, because I don't know if it's really likely to happen). In other words, you seem to only have one set of buckets (for all the bands), whereas the book suggests to have as many sets as there are bands. Does that make sense?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.