ekzhu / datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
Home Page: https://ekzhu.github.io/datasketch
License: MIT License
Hi - I am planning to use the MinHash algorithm in Java. Has anyone done it before?
What do you recommend?
Hi,
How can I send a string and get the raw hash values from a LeanMinHash? I want to use them in a custom way.
Please write a sample for this.
Thanks
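For context, a MinHash signature is just an array of per-permutation minimum hash values, so the "raw" values are the signature itself. Below is a toy stdlib-only sketch of how such raw values arise (illustrative only, not the datasketch API; datasketch exposes the comparable array as the hashvalues attribute, to the best of my knowledge):

```python
import hashlib
import struct

def signature(tokens, num_perm=8):
    """Toy MinHash: one salted SHA-1 hash family per 'permutation', keep the minimum."""
    sig = []
    for i in range(num_perm):
        salt = struct.pack("I", i)  # a different salt simulates a different permutation
        sig.append(min(
            int.from_bytes(hashlib.sha1(salt + t.encode("utf8")).digest()[:8], "big")
            for t in tokens))
    return sig

raw = signature(["hello", "world"])
print(len(raw))  # one raw hash value per permutation
```

The raw list is deterministic for a given input, so it can be stored or compared in any custom way.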
Please add support for Redis storage in MinHashLSHForest.
Hi, after switching from LSH to LSHForest, I found the results were still quite different, even though I set the same num_perm for both. For example, the top 40 I got from LSHForest is totally different from the sorted top 40 I got from LSH (i.e. I take the list of results from one LSH query, compute the weighted Jaccard similarity between the query item and each result, sort the results from most to least similar, and take the top 40). Could you check the performance of LSHForest on weighted MinHash?
Besides that, if I use MinHashLSH instead, can I control the number of results returned for each query? Our dataset is very large, so the number of results is also very large, but we only need the top few similar items. Computing the distance between the query item and every returned result is quite costly.
Hey,
I see this has been raised already in #42 , but even on what seems like a very simple example, I'm getting some odd behavior from the MinHash LSH Forest.
I'm trying to do simple approximate string matching to collect citation data, and as you can see in the example, even with just 10 data points, the top-k performance deviates from the directly estimated Jaccard similarity.
Am I missing something maybe?
Thanks
Hi ekzhu,
I was looking into your code and have a question regarding the calculation of hashvalues.
In your weighted_minhash.py,
datasketch/datasketch/weighted_minhash.py
Line 134 in ae9edad
I was trying to use lshforest to get top-k matching results for a list of floats.
m = [MinHash(num_perm=256) for i in range(len(features))]
forest = MinHashLSHForest(num_perm=256)
for i, ID, feature in zip(range(len(features)), keys, features):
    m[i].update(struct.pack("{}f".format(len(feature)), *feature))
    forest.add(ID, m[i])
forest.index()
Here, ID is a string and feature is a list of floats. I saw #26 and converted the floats to bytes.
Below is the code for the query, in another Python file.
minhash = MinHash(num_perm=256)
minhash.update(struct.pack("{}f".format(len(query_features)), *query_features))
result = forest.query(minhash, 50)
The query_features has the same length as feature.
But the result is an empty list. Is there anything wrong with my code? Thanks
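One possible cause (an assumption, not a confirmed diagnosis): each call to update() adds a single element to the underlying set, so packing a whole float vector into one update yields a one-element set, and Jaccard similarity between one-element sets is all-or-nothing. A plain-Python illustration of the difference:

```python
import struct

query = [1.0, 2.0, 3.0]
stored = [1.0, 2.0, 4.0]

# Packing the entire vector into one buffer yields a single set element:
one_shot_a = {struct.pack("3f", *query)}
one_shot_b = {struct.pack("3f", *stored)}
jac_one_shot = len(one_shot_a & one_shot_b) / len(one_shot_a | one_shot_b)

# Packing each component separately yields a graded similarity:
per_elem_a = {struct.pack("f", v) for v in query}
per_elem_b = {struct.pack("f", v) for v in stored}
jac_per_elem = len(per_elem_a & per_elem_b) / len(per_elem_a | per_elem_b)

print(jac_one_shot, jac_per_elem)  # 0.0 vs 0.5
```

If that is what is happening here, updating once per vector component (or per token) would give the forest something meaningful to match on.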
Hi, I want to get the top-k elements with MinHashLSH but failed. For example, I set k=3, but I got ('result: ', ['21', '28', '51', '1', '82', '3', '91', '69', '86', '85']), whose length is larger than 3. My demo is below:
def query_topk(l, query_doc, k):
    forest = MinHashLSHForest(num_perm=256)
    count = 0
    for i in l:
        forest.add(str(count), i)
        count += 1
    forest.index()
    result = forest.query(query_doc, k)
    return result
l : list of MinHash, query_doc: a MinHash
Is there anything wrong?
By the way, does the input have to be a list of strings? What if my input is a vector?
Thanks for your patience,
And another question: does this implementation only support text? If each of my inputs is a list of floats, e.g. [[1,2,3],[1.2,2.3,2.1]], can this work perfectly?
Sincerely,
I am working with Redis and LSH and was going through your docs. Below is my code:
from datasketch import MinHash, MinHashLSH
set1 = {'minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for', 'estimating', 'the', 'similarity', 'between',
'datasets'}
set2 = {'minhash', 'is', 'a', 'probability', 'data', 'structure', 'for', 'estimating', 'the', 'similarity', 'between',
'documents'}
set3 = {'minhash', 'is', 'probability', 'data', 'structure', 'for', 'estimating', 'the', 'similarity', 'between',
'documents'}
m1 = MinHash(num_perm=128)
m2 = MinHash(num_perm=128)
m3 = MinHash(num_perm=128)
for d in set1:
    m1.update(d.encode('utf8'))
for d in set2:
    m2.update(d.encode('utf8'))
for d in set3:
    m3.update(d.encode('utf8'))
# Create LSH index
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("m2", m2)
lsh.insert("m3", m3)
result = lsh.query(m1)
print("Approximate neighbours with Jaccard similarity > 0.5", result)
lsh = MinHashLSH(
    threshold=0.5, num_perm=128, storage_config={
        'type': 'redis',
        'redis': {'host': 'localhost', 'port': 6379}
    })
data_list = [("m1", m1), ("m2", m2), ("m3", m3)]
with lsh.insertion_session() as session:
    for key, minhash in data_list:
        session.insert(key, minhash)
Below is what got printed to the console:
redis.exceptions.ResponseError: Command # 2 (RPUSH b'dvpuojwmwwe_keys\x80\x03K\x00.'
(in between, many lines of encoded output, and then:)
of pipeline caused error: wrong number of arguments for 'rpush' command
Hi - I am planning to set up an ML pipeline in Spark, and I am using this library for similarity checks.
Is it possible to use this Python package with PySpark?
In your documentation, there is code like:
with lsh.insertion_session() as session:
    for key, minhash in data_list:
        session.insert(key, minhash)
But when I want to change my key, how do I change it?
I've recently found out about SuperMinHash. Are there any plans to provide a reference implementation here? I would volunteer to add it if there are no official plans.
Hello,
as the LeanMinHash objects are immutable, can you implement the special method __hash__
on them? That would allow them to be used easily in collections that require hashable objects.
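For reference, a toy sketch of the idea (illustrative only, not the datasketch implementation): an immutable sketch can derive __hash__ from its frozen signature, which makes it usable in sets and as dict keys.

```python
class FrozenSketch:
    """Toy immutable sketch: hashable because its signature never changes."""
    __slots__ = ("seed", "values")

    def __init__(self, seed, values):
        self.seed = seed
        self.values = tuple(values)  # freeze the signature

    def __eq__(self, other):
        return isinstance(other, FrozenSketch) and \
            (self.seed, self.values) == (other.seed, other.values)

    def __hash__(self):
        # Hash derived from the immutable state, consistent with __eq__
        return hash((self.seed, self.values))

a = FrozenSketch(1, [3, 5, 7])
b = FrozenSketch(1, [3, 5, 7])
print(len({a, b}))  # equal sketches collapse to one set member -> 1
```

The key requirement is that __hash__ agrees with __eq__, which immutability makes easy to guarantee.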
I encounter the following error while using pickle.
Traceback (most recent call last):
  File "srj.py", line 14, in <module>
    x1 = pickle.loads(x)
  File "/usr/lib/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1217, in load_build
    setstate(state)
  File "/usr/local/lib/python2.7/dist-packages/datasketch-0.1.26-py2.7.egg/datasketch/minhash.py", line 165, in __setstate__
    seed, num_perm = struct.unpack_from('qi', buffer, 0)
TypeError: unpack_from() argument 1 must be string or read-only buffer, not bytearray
From this question, I understand that it's an issue in struct.unpack_from:
http://stackoverflow.com/questions/15467009/struct-unpack-from-doesnt-work-with-bytearray
There is a bug in the serialization/deserialization: hashobj is not stored, so it always reverts to sha1 (the init default argument) when the class is deserialized.
I want a rough clustering of the input; that is, input sets that are similar to each other in terms of the Jaccard coefficient should be grouped together. I think LSH may be able to accomplish this, since it hashes similar items into the same bucket. I wonder if I can check which items are grouped together. I tried the following, but I don't think it is a good approach:
lsh.hashtables[6]._dict.values()
Any help or suggestion is much appreciated.
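For context, a plain-Python sketch of the banding idea behind such bucket inspection (toy code with made-up signatures, not datasketch internals): keys whose signatures agree on a whole band land in the same bucket, so a rough clustering can be read off the multi-member buckets.

```python
from collections import defaultdict

# Made-up 4-value signatures for three documents
signatures = {
    "doc1": (11, 22, 33, 44),
    "doc2": (11, 22, 99, 44),  # shares the first band (11, 22) with doc1
    "doc3": (55, 66, 77, 88),
}

band_size = 2
buckets = defaultdict(set)
for key, sig in signatures.items():
    for start in range(0, len(sig), band_size):
        band = (start, sig[start:start + band_size])  # band index + band values
        buckets[band].add(key)

# Buckets with more than one member are the rough clusters
groups = [keys for keys in buckets.values() if len(keys) > 1]
print(groups)  # [{'doc1', 'doc2'}]
```

Reading private fields like hashtables[i]._dict works the same way in spirit, but it depends on internals that may change between versions.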
Error in the MinHash method in README.md, line 51: m1.jacard(m2)
should be: m1.jaccard(m2)
The MinHash.jaccard method calculates the size of the intersection divided by the number of hash values. Shouldn't that be the size of the intersection divided by the size of the union? See: https://en.wikipedia.org/wiki/Jaccard_index
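For context on why that estimator is used: for a random hash permutation, the probability that two sets share the same minimum hash value equals their Jaccard similarity, so the fraction of matching signature positions is an unbiased estimate of |A ∩ B| / |A ∪ B|. A toy stdlib-only demonstration (not the datasketch implementation):

```python
import hashlib
import struct

def toy_signature(items, num_perm=256):
    # One salted hash family per 'permutation'; keep the minimum per family.
    return [min(int.from_bytes(hashlib.sha1(struct.pack("I", i) + x.encode()).digest()[:8], "big")
                for x in items)
            for i in range(num_perm)]

A = {"a", "b", "c", "d"}
B = {"a", "b", "c", "e"}
exact = len(A & B) / len(A | B)  # 3/5 = 0.6

sa, sb = toy_signature(A), toy_signature(B)
estimate = sum(x == y for x, y in zip(sa, sb)) / len(sa)
print(exact, round(estimate, 2))  # estimate fluctuates around 0.6
```

So dividing the match count by the number of hash values is not a bug; the union never has to be materialized.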
When I execute the example code from the Weighted MinHash documentation, I get:
wm2 = wmg.minhash(v2)
/usr/local/lib/python3.5/dist-packages/datasketch/weighted_minhash.py:92: RuntimeWarning: divide by zero encountered in log
t = np.floor((np.log(v) / self.rs[i]) + self.betas[i])
Missing times.append(duration) at the end of benchmark_groud_truth.
It would also be great to rename the method to "benchmark_ground_truth" for consistency. :)
An error happened when I just copied the example from the official site.
https://ekzhu.github.io/datasketch/documentation.html#minhash
traceback:
Traceback (most recent call last):
  File "/Users/caijianyu/sites/scouter/script/proj/minhash.py", line 10, in <module>
    m1 = MinHash(num_perm=128)
  File "/Library/Python/2.7/site-packages/datasketch/minhash.py", line 82, in __init__
    for _ in range(num_perm)], dtype=np.uint64).T
  File "mtrand.pyx", line 812, in mtrand.RandomState.randint (numpy/random/mtrand/mtrand.c:6557)
TypeError: randint() got an unexpected keyword argument 'dtype'
stefano@ubuntu:~/datasketch-master$ nosetests -s test/
............................................E
======================================================================
ERROR: test_pickle (minhash_test.TestbBitMinHash)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/stefano/datasketch-master/test/minhash_test.py", line 217, in test_pickle
    bm2 = pickle.loads(pickle.dumps(bm))
  File "/usr/lib/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1217, in load_build
    setstate(state)
  File "/home/stefano/datasketch-master/datasketch/b_bit_minhash.py", line 105, in __setstate__
    struct.unpack_from(self._serial_fmt_params, buffer, 0)
TypeError: unpack_from() argument 1 must be string or read-only buffer, not bytearray
----------------------------------------------------------------------
Ran 45 tests in 1.067s
FAILED (errors=1)
stefano@ubuntu:~/datasketch-master$ python
Python 2.7.3 (default, Jun 22 2015, 19:33:41)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
Hi, how can I generate a similarity matrix using MinHash LSH? MinHash seems to compute only the pairwise Jaccard comparison, while MinHash LSH outputs a list of candidates according to the similarity threshold set. I would like to use the similarity matrix for further clustering, and would like to know if this is possible with this package.
I was able to add MinHashes into a Redis-backed LSH. But when I try to reload the LSH from another session, all the keys/values I added are no longer there. Is it possible that Redis is not persisting the keys/values? I was under the impression that the purpose of the Redis backend is to be able to reuse the index from other processes. Thanks.
lsh = MinHashLSH(
    threshold=0.5, num_perm=128, storage_config={
        'type': 'redis',
        'base_name': 'unique_name_6ac4fg',
        'redis': {'host': 'localhost', 'port': 6379}
    })
doc = "Tales of Berseria "
mh = MinHash(num_perm=128)
for d in doc.split():
    mh.update(d.encode('utf8'))
lsh.insert("d1", mh)
lsh.__contains__("d1")
True
Now load the LSH from another session, and the keys/values are gone:
lsh = MinHashLSH(
    threshold=0.5, num_perm=128, storage_config={
        'type': 'redis',
        'base_name': 'unique_name_6ac4fg',
        'redis': {'host': 'localhost', 'port': 6379}
    })
lsh.__contains__("d1")
False
Hi, I found that in your implementation, the only distance measures are Jaccard distance and weighted Jaccard distance.
But for the weighted Jaccard distance, the vectors being compared must not be negative, and in some cases the vectors may contain negative values (e.g. after compressing a very high-dimensional vector using PCA).
Do you have any plans to add more distance measures, like Euclidean and cosine distance? Thanks.
Hi, thanks for the amazing package!
Are there any ways to speed up indexing for similarity search over hundreds of millions of MinHashes?
I do it like this:
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
for i, row in enumerate(spamreader):
    i += 1
    if i % 50000 == 0:
        with lsh.insertion_session() as session:
            for key, minhash in collector:
                session.insert(key, minhash)
        collector.clear()
        print("Inserted ", i)
    id = row[0]
    items = row[1].strip().split(",")  # typically a short list
    m = MinHash(num_perm=num_perm)
    for item in items:
        m.update(item.encode("utf-8"))
    collector.append((id, m))
...and it takes quite a while, which doesn't allow me to carry out my experiments.
NUM_PERM = 128
THRESHOLD = 0.6
When running the example code on the site, I get the following error:
OverflowError: Python int too large for C long
the environment is
windows 7 64 bit, Python 3.5.1 |Anaconda 4.0.0
library version 0.2.3
Hi,
the query method of MinHashLSHForest returns only the keys; is it possible to also get the Jaccard similarity between the given MinHash and each of the top-k results?
Thanks :)
Is it possible to hash data encoded as binary "array strings" (not sure if this makes sense in Python; in C I'd encode it using integer or maybe boolean arrays)? For example:
a = '1000101001'
b = '1000100101'
Now I would like to get the Jaccard distance between those two data points. How would I minhash those values? Would this need a change in the implementation?
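One workable framing (a sketch, under the assumption that no library change is needed): treat each bit position holding a 1 as a set element; those position labels can then be fed to any MinHash as ordinary tokens, e.g. update(str(i).encode()).

```python
a = '1000101001'
b = '1000100101'

# Treat each position holding a '1' as a set element:
set_a = {i for i, bit in enumerate(a) if bit == '1'}
set_b = {i for i, bit in enumerate(b) if bit == '1'}

# Exact Jaccard similarity over the position sets:
jaccard = len(set_a & set_b) / len(set_a | set_b)
print(set_a, set_b, jaccard)  # {0, 4, 6, 9} {0, 4, 9, 7} 0.6
```

Whether index sets are the right representation depends on the data: for dense bit vectors where 0s are as meaningful as 1s, Hamming-style measures may fit better than Jaccard.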
I wanted to query a MinHash LSH Forest index from multiple processes to obtain the top-1 result, but it appears that only MinHash LSH supports the storage_config option, which I assume is what would allow me to do this, from reading https://ekzhu.github.io/datasketch/lsh.html#minhash-lsh-at-scale. What is the correct way to query an LSH index in parallel? My current workflow uses Python Dask (a parallel version of Pandas). It goes something like this:
Creating the LSH index:
lsh = MinHashLSHForest(num_perm=128)
minhashes = {}
for c, i in enumerate(entity_names):
    minhash = MinHash(num_perm=128)
    for item in i:
        minhash.update(item.encode('utf-8'))
    lsh.add(c, minhash)
    minhashes[c] = [i, minhash]
lsh.index()
Then define the function which Dask calls to return results from LSH:
def getnn(querystring):
    qhash = MinHash(num_perm=128)
    for shingle in querystring:
        qhash.update(shingle.encode('utf-8'))
    idx = lsh.query(qhash, 1)[0]
    result = (minhashes[idx][0], int(round(qhash.jaccard(minhashes[idx][1]) * 100)))
    return result
Then apply the function across a series in Dask:
ddf['lsh_result'] = ddf.input_name.apply(lambda x: getnn(x),meta=('lsh_result',object))
This operation should distribute across ~40 processes, but currently it only runs in 1 process.
This might seem an odd issue, but I am unable to install the library in an environment where the use of wheels is prohibited (i.e. all Python packages are installed from source). Pip says there are no available source distributions except for version 0.2.6: https://gist.github.com/anonymous/0643b6edf0def8d8f300dbe0b5f0b976. Other libraries usually upload sources and install fine. Can you please upload a source distribution to PyPI?
There is a typo in the class specified in the title. When assigning the value of 'newsgroup', you should remove "headers" instead of "header"; otherwise they remain.
(benchmark/lsh_benchmark.py, line 21)
When I use Python 3.6, I get an error message like the one below: "expected a bytes-like object, str found". When I put b in front of the basename I was able to create the LSH, but when I query it, nothing is there from the previous additions. Any help is appreciated. Thanks.
lsh = MinHashLSH(
    threshold=0.9, num_perm=128, storage_config={
        'type': 'redis',
        'basename': 'unique_name_6ac4fg',
        'redis': {'host': 'nodejs-01.deepindex.info', 'port': '6379'}
    })
lsh = MinHashLSH(
    threshold=0.9, num_perm=128, storage_config={
        'type': 'redis',
        'basename': b'unique_name_6ac4fg',
        'redis': {'host': 'nodejs-01.deepindex.info', 'port': '6379'}
    })
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-43dabc764ea9> in <module>()
15 'type': 'redis',
16 'basename': 'ij',
---> 17 'redis': {'host': 'nodejs-01.deepindex.info', 'port': '6379'}
18 })
19
~\Anaconda2\envs\py36\lib\site-packages\datasketch\lsh.py in __init__(self, threshold, num_perm, weights, params, storage_config, prepickle)
119 self.hashtables = [
120 unordered_storage(storage_config, name=b''.join([basename, b'_bucket_', bytes([i])]))
--> 121 for i in range(self.b)]
122 self.hashranges = [(i*self.r, (i+1)*self.r) for i in range(self.b)]
123 self.keys = ordered_storage(storage_config, name=b''.join([basename, b'_keys']))
~\Anaconda2\envs\py36\lib\site-packages\datasketch\lsh.py in <listcomp>(.0)
119 self.hashtables = [
120 unordered_storage(storage_config, name=b''.join([basename, b'_bucket_', bytes([i])]))
--> 121 for i in range(self.b)]
122 self.hashranges = [(i*self.r, (i+1)*self.r) for i in range(self.b)]
123 self.keys = ordered_storage(storage_config, name=b''.join([basename, b'_keys']))
TypeError: sequence item 0: expected a bytes-like object, str found
I want to find duplicates, so I need to pop a hash and then compare the remaining hashes.
Or should I make a copy of the original data in Redis?
Or should I just not use the Redis storage, wait until the dataset is fully loaded, and then use MinHashLSH?
Also, will a different hashobj (like hashlib.sha1
or hashlib.md5)
cause a huge difference in performance? Will it affect CPU usage or the resulting clusters?
thanks
Would it be possible to include in the documentation the methods for each of the modules in datasketch and how they are called?
I need to build an image search database, using a neural network to extract features and using weighted MinHash + LSH to find similarity. How can I use Elasticsearch for storing hash keys? I'm really new to this, so please give me some guidance. May I submit a pull request for an ES storage option?
Hey, nice to meet you!
In your code, there is some code to create an LSH like:
lsh = MinHashLSH(threshold=0.91, num_perm=128, storage_config={
    'type': 'redis',
    'redis': {'host': '10.8.35.7', 'port': 6379, 'db': 9},
    'name': 'abcdefghij'}, prepickle=True)
If such an LSH has been created, and some data has already been added to it, how can I change this IP?
Thanks for the well-written library.
The library is using too much memory (>1 GB) for around 300,000 LeanMinHashes.
a. You use a maximum hash value of 32 bits, but your np arrays use dtype=uint64.
I don't know how numpy memory allocation works, but I think it may allocate double the needed size.
b. You have an implementation of b-bit MinHash. Does it work? If yes, can you give me an example?
Hi,
First of all thanks for this great library :)
I wanted to know if there is any way to set an expiry time for entries being added into an LSH connected to Redis? Or any way to keep a default expiry time for all keys?
Should I instead set a maxmemory policy for my Redis with an allkeys-lru or allkeys-random eviction policy? Is this recommended or will it cause issues with the LSH?
Regards,
Spandan
The following example is not working with 1.2.7, even after changing the import to
from datasketch.experimental.aio.lsh import AsyncMinHashLSH
I am getting an error, invalid syntax for await. I am using Python 3.6.5.
lsh = await AsyncMinHashLSH(storage_config=_storage, threshold=0.5, num_perm=16)
^
SyntaxError: invalid syntax
https://ekzhu.github.io/datasketch/lsh.html#minhash-lsh-at-scale
from datasketch.experimental.aio.lsh import AsyncMinHashLSH
from datasketch import MinHash, MinHashLSH
_storage = {'type': 'aioredis', 'redis': {'host': 'localhost', 'port': 6379}}
lsh = await AsyncMinHashLSH(storage_config=_storage, threshold=0.5, num_perm=16)
m1 = MinHash(16)
m1.update('a'.encode('utf8'))
m2 = MinHash(16)
m2.update('b'.encode('utf8'))
await lsh.insert('a', m1)
await lsh.insert('b', m2)
print(await lsh.query(m1))
print(await lsh.query(m2))
lsh.close()
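For context on the SyntaxError above: await is only valid inside an async def coroutine, so running the example at module level (or in a plain REPL) fails before the library is even imported. A minimal stdlib-only sketch of the usual pattern, with asyncio.sleep standing in for the library's awaitable calls:

```python
import asyncio

async def main():
    # All awaited calls (e.g. lsh = await AsyncMinHashLSH(...),
    # await lsh.insert(...), await lsh.query(...)) belong inside a coroutine.
    await asyncio.sleep(0)  # stand-in for awaitable library calls
    return "done"

# Python 3.7+; on 3.6 use: asyncio.get_event_loop().run_until_complete(main())
result = asyncio.run(main())
print(result)  # done
```

Top-level await only works in special contexts such as the asyncio REPL; inside an ordinary script, the coroutine-plus-event-loop wrapper above is required.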
@ekzhu Hi, all my data is inserted into MongoDB with aiomongo.
Is there any way to use a dynamic threshold after all the data is inserted into Mongo?
Thanks a lot.
Hello,
I stumbled across your code while looking up MinHashing. I was wondering if there is a way to access the MinHash object to retrieve the hashes?
thanks,
ara
set1 = set(['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for', 'estimating', 'the', 'similarity', 'between', 'datasets'])
m1 = MinHash(num_perm=128)
for d in set1:
    m1.update(d.encode('utf8'))
lsh = MinHashLSH(threshold=0.9, num_perm=128, storage_config={'type': 'redis', 'redis': {'host': 'localhost', 'port': 6379, 'db': 1}, 'name': 1})
lsh1 = MinHashLSH(threshold=0.9, num_perm=128, storage_config={'type': 'redis', 'redis': {'host': 'localhost', 'port': 6379, 'db': 1}, 'name': 1})
lsh.insert("m1", m1)
lsh.keys.size() is 1, but lsh1.keys.size() is 0.
Why?
When I add more data into the LSH, I get 'Error 111 connecting to localhost:6379. Connection refused'. What is the problem and how can I solve it? Thank you so much.
Hello, I'm really liking this library, but I have recently started running into RAM limitations trying to handle billions of MinHashLSH index members on a 16 GB quad-core macOS machine.
Right now, I'm essentially looping over my objects, hashing each one, and adding the hash result (of class MinHash) to one large MinHashLSH index for subsequent querying. Is there a less memory-intensive way of accomplishing a similar goal? Ideally, I'd like to post the MinHash results to a DB and then query the DB for matches, so as to afford much higher scalability.
I'm happy to help put some of this in place if it's not already a part of this toolkit, but wanted to ask in the meantime to see if your team had any immediate thoughts. In any event, thanks again for this awesome work.
Hello!
Thank you for the great lib!
Could you please share any hints on the optimal batch size for inserting hashes into MinHashLSH with the Redis backend? Indexing is pretty slow with my current settings:
NUM_PERM = 128
THRESHOLD = 0.3
if count % 10000 == 0:
    print("Loaded", count)
    with self.lsh.insertion_session() as session:
        for key, minhash in collector:
            session.insert(key, minhash)
This class would be far easier to use if the objects were hashable. As this is a hashing class, you would only have to expose the hash value through the Python interface with __hash__.
>>> from datasketch import MinHash
>>> m=MinHash()
>>> set([m])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'MinHash'
Hi - I am using MinHashLSH for querying duplicate documents. This is the exact flow of my use case.
First, I calculate MinHashes for each document and then add them to the LSH object below:
MinHashLSH(num_perm=perms, threshold=0.9)
The issue is that this MinHash creation process itself takes around 470 seconds, so I can't recreate it every time I query a new document.
So I am planning to store this LSH object on disk for future queries and update it maybe once a day/week.
Did anyone try this before, or know a better way to handle this type of use case?
Thank You.
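One common approach (a sketch, under the assumption that the built index object is picklable, as plain in-memory Python objects typically are): serialize the index to disk once, then reload it for later queries instead of rebuilding from scratch.

```python
import os
import pickle
import tempfile

# Stand-in for a built index; a picklable LSH object would take its place.
index = {"doc1": [3, 5, 7], "doc2": [2, 5, 9]}

path = os.path.join(tempfile.mkdtemp(), "lsh_index.pkl")
with open(path, "wb") as f:
    pickle.dump(index, f)  # one-time save after the expensive build

with open(path, "rb") as f:
    reloaded = pickle.load(f)  # fast reload for subsequent query sessions

print(reloaded == index)  # True: the index survives the round trip
```

The daily/weekly update then amounts to loading, inserting the new documents, and dumping again; a Redis-backed storage_config is the alternative when multiple processes need the same index live.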
As stated in https://guides.github.com/activities/citable-code/ there is a way to get a DOI for your open source project on GitHub. Please do! I am citing datasketch in my paper and really need the DOI.