ekzhu / datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
Home Page: https://ekzhu.github.io/datasketch
License: MIT License
Hi - I am planning to use the MinHash algorithm in Java. Has anyone done it before?
What do you recommend?
Hi,
How can I send a string and get the raw hash values from a LeanMinHash? I want to use them in a custom way.
Please write a sample for this.
Thanks
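For context, a MinHash signature is just an array of per-permutation minimum hash values, so the "raw" values are the signature itself. Below is a toy stdlib-only sketch of how such raw values arise (illustrative only, not the datasketch API; datasketch exposes the comparable array as the hashvalues attribute, to the best of my knowledge):

```python
import hashlib
import struct

def signature(tokens, num_perm=8):
    """Toy MinHash: one salted SHA-1 hash family per 'permutation', keep the minimum."""
    sig = []
    for i in range(num_perm):
        salt = struct.pack("I", i)  # a different salt simulates a different permutation
        sig.append(min(
            int.from_bytes(hashlib.sha1(salt + t.encode("utf8")).digest()[:8], "big")
            for t in tokens))
    return sig

raw = signature(["hello", "world"])
print(len(raw))  # one raw hash value per permutation
```

The raw list is deterministic for a given input, so it can be stored or compared in any custom way.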
Please add support for Redis storage in MinHashLSHForest.
Hi, after switching from LSH to LSHForest, I found the results were still quite different, even though I set the same num_perm for both. For example, the top 40 I got from LSHForest is totally different from the sorted top 40 I got from LSH (i.e. I take the list of results from one LSH query, compute the weighted Jaccard similarity between the query item and each result, sort the results from most to least similar, and take the top 40). Could you check the performance of LSHForest on weighted MinHash?
Besides that, if I use MinHashLSH instead, can I control the number of results returned for each query? Our dataset is very large, so the number of results is also very large, but we only need the top few similar items. Computing the distance between the query item and every returned result is quite costly.
Hey,
I see this has been raised already in #42 , but even on what seems like a very simple example, I'm getting some odd behavior from the MinHash LSH Forest.
I'm trying to do simple approximate string matching to collect citation data, and as you can see in the example, even with just 10 data points, the top-k performance deviates from the directly estimated Jaccard similarity.
Am I missing something maybe?
Thanks
Hi ekzhu,
I was looking into your code and have a question regarding the calculation of hashvalues.
In your weighted_minhash.py,
datasketch/datasketch/weighted_minhash.py
Line 134 in ae9edad
I was trying to use lshforest to get top-k matching results for a list of floats.
m = [MinHash(num_perm=256) for i in range(len(features))]
forest = MinHashLSHForest(num_perm=256)
for i, ID, feature in zip(range(len(features)), keys, features):
    m[i].update(struct.pack("{}f".format(len(feature)), *feature))
    forest.add(ID, m[i])
forest.index()
Here, ID is a string and feature is a list of floats. I saw #26 and converted the floats to bytes.
Below is the code for the query, in another Python file.
minhash = MinHash(num_perm=256)
minhash.update(struct.pack("{}f".format(len(query_features)), *query_features))
result = forest.query(minhash, 50)
The query_features has the same length as feature.
But the result is an empty list. Is there anything wrong with my code? Thanks
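One possible cause (an assumption, not a confirmed diagnosis): each call to update() adds a single element to the underlying set, so packing a whole float vector into one update yields a one-element set, and Jaccard similarity between one-element sets is all-or-nothing. A plain-Python illustration of the difference:

```python
import struct

query = [1.0, 2.0, 3.0]
stored = [1.0, 2.0, 4.0]

# Packing the entire vector into one buffer yields a single set element:
one_shot_a = {struct.pack("3f", *query)}
one_shot_b = {struct.pack("3f", *stored)}
jac_one_shot = len(one_shot_a & one_shot_b) / len(one_shot_a | one_shot_b)

# Packing each component separately yields a graded similarity:
per_elem_a = {struct.pack("f", v) for v in query}
per_elem_b = {struct.pack("f", v) for v in stored}
jac_per_elem = len(per_elem_a & per_elem_b) / len(per_elem_a | per_elem_b)

print(jac_one_shot, jac_per_elem)  # 0.0 vs 0.5
```

If that is what is happening here, updating once per vector component (or per token) would give the forest something meaningful to match on.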
Hi, I want to get the top-k elements with MinHashLSH but failed. For example, I set k=3, but I got ('result: ', ['21', '28', '51', '1', '82', '3', '91', '69', '86', '85']), whose length is larger than 3. My demo is below:
def query_topk(l, query_doc, k):
    forest = MinHashLSHForest(num_perm=256)
    count = 0
    for i in l:
        forest.add(str(count), i)
        count += 1
    forest.index()
    result = forest.query(query_doc, k)
    return result
l : list of MinHash, query_doc: a MinHash
Is there anything wrong?
By the way, does the input have to be a list of strings? What if my input is a vector?
Thanks for your patience,
And another question: does this implementation only support text? If each of my inputs is a list of floats, e.g. [[1,2,3],[1.2,2.3,2.1]], can this work perfectly?
Sincerely,
I am working with Redis and LSH and was going through your docs. Below is my code:
from datasketch import MinHash, MinHashLSH
set1 = {'minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for', 'estimating', 'the', 'similarity', 'between',
'datasets'}
set2 = {'minhash', 'is', 'a', 'probability', 'data', 'structure', 'for', 'estimating', 'the', 'similarity', 'between',
'documents'}
set3 = {'minhash', 'is', 'probability', 'data', 'structure', 'for', 'estimating', 'the', 'similarity', 'between',
'documents'}
m1 = MinHash(num_perm=128)
m2 = MinHash(num_perm=128)
m3 = MinHash(num_perm=128)
for d in set1:
    m1.update(d.encode('utf8'))
for d in set2:
    m2.update(d.encode('utf8'))
for d in set3:
    m3.update(d.encode('utf8'))
# Create LSH index
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("m2", m2)
lsh.insert("m3", m3)
result = lsh.query(m1)
print("Approximate neighbours with Jaccard similarity > 0.5", result)
lsh = MinHashLSH(
    threshold=0.5, num_perm=128, storage_config={
        'type': 'redis',
        'redis': {'host': 'localhost', 'port': 6379}
    })
data_list = [("m1", m1), ("m2", m2), ("m3", m3)]
with lsh.insertion_session() as session:
    for key, minhash in data_list:
        session.insert(key, minhash)
Below is what got printed to the console:
redis.exceptions.ResponseError: Command # 2 (RPUSH b'dvpuojwmwwe_keys\x80\x03K\x00.'
(in between, many lines of encoded output, and then:)
of pipeline caused error: wrong number of arguments for 'rpush' command
Hi - I am planning to set up an ML pipeline in Spark, and I am using this library for similarity checks.
Is it possible to use this Python package with PySpark?
In your documentation, there is code like:
with lsh.insertion_session() as session:
    for key, minhash in data_list:
        session.insert(key, minhash)
But when I want to change my key, how do I change it?
I've recently found out about SuperMinHash. Are there any plans to provide a reference implementation here? I would volunteer to add it if there are no official plans.
Hello,
as the LeanMinHash objects are immutable, can you implement the special method __hash__
on them? That would allow them to be used easily in collections that require hashable objects.
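For reference, a toy sketch of the idea (illustrative only, not the datasketch implementation): an immutable sketch can derive __hash__ from its frozen signature, which makes it usable in sets and as dict keys.

```python
class FrozenSketch:
    """Toy immutable sketch: hashable because its signature never changes."""
    __slots__ = ("seed", "values")

    def __init__(self, seed, values):
        self.seed = seed
        self.values = tuple(values)  # freeze the signature

    def __eq__(self, other):
        return isinstance(other, FrozenSketch) and \
            (self.seed, self.values) == (other.seed, other.values)

    def __hash__(self):
        # Hash derived from the immutable state, consistent with __eq__
        return hash((self.seed, self.values))

a = FrozenSketch(1, [3, 5, 7])
b = FrozenSketch(1, [3, 5, 7])
print(len({a, b}))  # equal sketches collapse to one set member -> 1
```

The key requirement is that __hash__ agrees with __eq__, which immutability makes easy to guarantee.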
I encounter the following error while using pickle.
Traceback (most recent call last):
  File "srj.py", line 14, in <module>
    x1 = pickle.loads(x)
  File "/usr/lib/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1217, in load_build
    setstate(state)
  File "/usr/local/lib/python2.7/dist-packages/datasketch-0.1.26-py2.7.egg/datasketch/minhash.py", line 165, in __setstate__
    seed, num_perm = struct.unpack_from('qi', buffer, 0)
TypeError: unpack_from() argument 1 must be string or read-only buffer, not bytearray
From this question, I understand that it's an issue in struct.unpack_from:
http://stackoverflow.com/questions/15467009/struct-unpack-from-doesnt-work-with-bytearray
There is a bug in the serialization/deserialization: hashobj is not stored, so it always reverts to sha1 (the init default argument) when the class is deserialized.
I want a rough clustering of the input; that is, input sets that are similar to each other in terms of the Jaccard coefficient should be grouped together. I think LSH may be able to accomplish this, since it hashes similar items into the same bucket. I wonder if I can check which items are grouped together. I tried the following, but I don't think it is a good approach:
lsh.hashtables[6]._dict.values()
Any help or suggestion is much appreciated.
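For context, a plain-Python sketch of the banding idea behind such bucket inspection (toy code with made-up signatures, not datasketch internals): keys whose signatures agree on a whole band land in the same bucket, so a rough clustering can be read off the multi-member buckets.

```python
from collections import defaultdict

# Made-up 4-value signatures for three documents
signatures = {
    "doc1": (11, 22, 33, 44),
    "doc2": (11, 22, 99, 44),  # shares the first band (11, 22) with doc1
    "doc3": (55, 66, 77, 88),
}

band_size = 2
buckets = defaultdict(set)
for key, sig in signatures.items():
    for start in range(0, len(sig), band_size):
        band = (start, sig[start:start + band_size])  # band index + band values
        buckets[band].add(key)

# Buckets with more than one member are the rough clusters
groups = [keys for keys in buckets.values() if len(keys) > 1]
print(groups)  # [{'doc1', 'doc2'}]
```

Reading private fields like hashtables[i]._dict works the same way in spirit, but it depends on internals that may change between versions.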
Error in the MinHash method in README.md, line 51: m1.jacard(m2)
should be: m1.jaccard(m2)
The MinHash.jaccard method calculates the size of the intersection divided by the number of hash values. Shouldn't that be the size of the intersection divided by the size of the union? See: https://en.wikipedia.org/wiki/Jaccard_index
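For context on why that estimator is used: for a random hash permutation, the probability that two sets share the same minimum hash value equals their Jaccard similarity, so the fraction of matching signature positions is an unbiased estimate of |A ∩ B| / |A ∪ B|. A toy stdlib-only demonstration (not the datasketch implementation):

```python
import hashlib
import struct

def toy_signature(items, num_perm=256):
    # One salted hash family per 'permutation'; keep the minimum per family.
    return [min(int.from_bytes(hashlib.sha1(struct.pack("I", i) + x.encode()).digest()[:8], "big")
                for x in items)
            for i in range(num_perm)]

A = {"a", "b", "c", "d"}
B = {"a", "b", "c", "e"}
exact = len(A & B) / len(A | B)  # 3/5 = 0.6

sa, sb = toy_signature(A), toy_signature(B)
estimate = sum(x == y for x, y in zip(sa, sb)) / len(sa)
print(exact, round(estimate, 2))  # estimate fluctuates around 0.6
```

So dividing the match count by the number of hash values is not a bug; the union never has to be materialized.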
When I execute the example code from the Weighted MinHash documentation, I get:
wm2 = wmg.minhash(v2)
/usr/local/lib/python3.5/dist-packages/datasketch/weighted_minhash.py:92: RuntimeWarning: divide by zero encountered in log
t = np.floor((np.log(v) / self.rs[i]) + self.betas[i])
Missing times.append(duration) at the end of benchmark_groud_truth.
It would also be great to rename the method to "benchmark_ground_truth" for consistency. :)
An error happened when I just copied the example from the official site.
https://ekzhu.github.io/datasketch/documentation.html#minhash
traceback:
Traceback (most recent call last):
  File "/Users/caijianyu/sites/scouter/script/proj/minhash.py", line 10, in <module>
    m1 = MinHash(num_perm=128)
  File "/Library/Python/2.7/site-packages/datasketch/minhash.py", line 82, in __init__
    for _ in range(num_perm)], dtype=np.uint64).T
  File "mtrand.pyx", line 812, in mtrand.RandomState.randint (numpy/random/mtrand/mtrand.c:6557)
TypeError: randint() got an unexpected keyword argument 'dtype'
stefano@ubuntu:~/datasketch-master$ nosetests -s test/
............................................E
======================================================================
ERROR: test_pickle (minhash_test.TestbBitMinHash)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/stefano/datasketch-master/test/minhash_test.py", line 217, in test_pickle
    bm2 = pickle.loads(pickle.dumps(bm))
  File "/usr/lib/python2.7/pickle.py", line 1382, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1217, in load_build
    setstate(state)
  File "/home/stefano/datasketch-master/datasketch/b_bit_minhash.py", line 105, in __setstate__
    struct.unpack_from(self._serial_fmt_params, buffer, 0)
TypeError: unpack_from() argument 1 must be string or read-only buffer, not bytearray
----------------------------------------------------------------------
Ran 45 tests in 1.067s
FAILED (errors=1)
stefano@ubuntu:~/datasketch-master$ python
Python 2.7.3 (default, Jun 22 2015, 19:33:41)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
Hi, how can I generate a similarity matrix using MinHash LSH? MinHash seems to compute only the pairwise Jaccard comparison, while MinHash LSH outputs a list of candidates according to the similarity threshold set. I would like to use the similarity matrix for further clustering, and would like to know if this is possible with this package.
I was able to add MinHashes into a Redis-backed LSH. But when I try to reload the LSH from another session, all the keys/values I added are no longer there. Is it possible that Redis is not persisting the keys/values? I was under the impression that the purpose of the Redis backend is to be able to reuse the index from other processes. Thanks.
lsh = MinHashLSH(
    threshold=0.5, num_perm=128, storage_config={
        'type': 'redis',
        'base_name': 'unique_name_6ac4fg',
        'redis': {'host': 'localhost', 'port': 6379}
    })
doc = "Tales of Berseria "
mh = MinHash(num_perm=128)
for d in doc.split():
    mh.update(d.encode('utf8'))
lsh.insert("d1", mh)
lsh.__contains__("d1")
True
Now load the LSH from another session, and the keys/values are gone:
lsh = MinHashLSH(
    threshold=0.5, num_perm=128, storage_config={
        'type': 'redis',
        'base_name': 'unique_name_6ac4fg',
        'redis': {'host': 'localhost', 'port': 6379}
    })
lsh.__contains__("d1")
False
Hi, I found that in your implementation, the only distance measures are Jaccard distance and weighted Jaccard distance.
But for the weighted Jaccard distance, the vectors being compared must not be negative, and in some cases the vectors may contain negative values (e.g. after compressing a very high-dimensional vector using PCA).
Do you have any plans to add more distance measures, like Euclidean and cosine distance? Thanks.
Hi, thanks for the amazing package!
Are there any ways to speed up indexing for similarity search over hundreds of millions of MinHashes?
I do it like this:
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
for i, row in enumerate(spamreader):
    i += 1
    if i % 50000 == 0:
        with lsh.insertion_session() as session:
            for key, minhash in collector:
                session.insert(key, minhash)
        collector.clear()
        print("Inserted ", i)
    id = row[0]
    items = row[1].strip().split(",")  # typically a short list
    m = MinHash(num_perm=num_perm)
    for item in items:
        m.update(item.encode("utf-8"))
    collector.append((id, m))
...and it takes quite a while, which doesn't allow me to carry out my experiments.
NUM_PERM = 128
THRESHOLD = 0.6
When running the example code on the site, I get the following error:
OverflowError: Python int too large for C long
the environment is
windows 7 64 bit, Python 3.5.1 |Anaconda 4.0.0
library version 0.2.3
Hi,
the query method of MinHashLSHForest returns only the keys; is it possible to also get the Jaccard similarity between the given MinHash and each of the top-k results?
Thanks :)
Is it possible to hash data encoded as binary "array strings" (not sure if this makes sense in Python; in C I'd encode it using integer or maybe boolean arrays)? For example:
a = '1000101001'
b = '1000100101'
Now I would like to get the Jaccard distance between those two data points. How would I minhash those values? Would this need a change in the implementation?
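One workable framing (a sketch, under the assumption that no library change is needed): treat each bit position holding a 1 as a set element; those position labels can then be fed to any MinHash as ordinary tokens, e.g. update(str(i).encode()).

```python
a = '1000101001'
b = '1000100101'

# Treat each position holding a '1' as a set element:
set_a = {i for i, bit in enumerate(a) if bit == '1'}
set_b = {i for i, bit in enumerate(b) if bit == '1'}

# Exact Jaccard similarity over the position sets:
jaccard = len(set_a & set_b) / len(set_a | set_b)
print(set_a, set_b, jaccard)  # {0, 4, 6, 9} {0, 4, 9, 7} 0.6
```

Whether index sets are the right representation depends on the data: for dense bit vectors where 0s are as meaningful as 1s, Hamming-style measures may fit better than Jaccard.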
I wanted to query a MinHash LSH Forest index from multiple processes to obtain the top-1 result, but it appears that only MinHash LSH supports the storage_config option, which I assume is what would allow me to do this, from reading https://ekzhu.github.io/datasketch/lsh.html#minhash-lsh-at-scale. What is the correct way to query an LSH index in parallel? My current workflow uses Python Dask (a parallel version of Pandas). It goes something like this:
Creating the LSH index:
lsh = MinHashLSHForest(num_perm=128)
minhashes = {}
for c, i in enumerate(entity_names):
    minhash = MinHash(num_perm=128)
    for item in i:
        minhash.update(item.encode('utf-8'))
    lsh.add(c, minhash)
    minhashes[c] = [i, minhash]
lsh.index()
Then define the function which Dask calls to return results from LSH:
def getnn(querystring):
    qhash = MinHash(num_perm=128)
    for shingle in querystring:
        qhash.update(shingle.encode('utf-8'))
    idx = lsh.query(qhash, 1)[0]
    result = (minhashes[idx][0], int(round(qhash.jaccard(minhashes[idx][1]) * 100)))
    return result
Then apply the function across a series in Dask:
ddf['lsh_result'] = ddf.input_name.apply(lambda x: getnn(x),meta=('lsh_result',object))
This operation should distribute across ~40 processes, but currently it only runs in 1 process.
This might seem an odd issue, but I am unable to install the library in an environment where the use of wheels is prohibited (i.e. all Python packages are installed from source). Pip says there are no available source distributions except for version 0.2.6: https://gist.github.com/anonymous/0643b6edf0def8d8f300dbe0b5f0b976. Other libraries usually upload sources and install fine. Can you please upload a source distribution to PyPI?
There is a typo in the class specified in the title. When assigning the value of 'newsgroup', you should remove "headers" instead of "header"; otherwise they remain.
(benchmark/lsh_benchmark.py, line 21)
When I use Python 3.6, I get an error message like the one below: "expected a bytes-like object, str found". When I put b in front of the basename I was able to create the LSH, but when I query it, nothing is there from the previous additions. Any help is appreciated. Thanks.
lsh = MinHashLSH(
    threshold=0.9, num_perm=128, storage_config={
        'type': 'redis',
        'basename': 'unique_name_6ac4fg',
        'redis': {'host': 'nodejs-01.deepindex.info', 'port': '6379'}
    })
lsh = MinHashLSH(
    threshold=0.9, num_perm=128, storage_config={
        'type': 'redis',
        'basename': b'unique_name_6ac4fg',
        'redis': {'host': 'nodejs-01.deepindex.info', 'port': '6379'}
    })
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-43dabc764ea9> in <module>()
15 'type': 'redis',
16 'basename': 'ij',
---> 17 'redis': {'host': 'nodejs-01.deepindex.info', 'port': '6379'}
18 })
19
~\Anaconda2\envs\py36\lib\site-packages\datasketch\lsh.py in __init__(self, threshold, num_perm, weights, params, storage_config, prepickle)
119 self.hashtables = [
120 unordered_storage(storage_config, name=b''.join([basename, b'_bucket_', bytes([i])]))
--> 121 for i in range(self.b)]
122 self.hashranges = [(i*self.r, (i+1)*self.r) for i in range(self.b)]
123 self.keys = ordered_storage(storage_config, name=b''.join([basename, b'_keys']))
~\Anaconda2\envs\py36\lib\site-packages\datasketch\lsh.py in <listcomp>(.0)
119 self.hashtables = [
120 unordered_storage(storage_config, name=b''.join([basename, b'_bucket_', bytes([i])]))
--> 121 for i in range(self.b)]
122 self.hashranges = [(i*self.r, (i+1)*self.r) for i in range(self.b)]
123 self.keys = ordered_storage(storage_config, name=b''.join([basename, b'_keys']))
TypeError: sequence item 0: expected a bytes-like object, str found
I want to find duplicates, so I need to pop a hash and then compare the remaining hashes.
Or should I make a copy of the original data in Redis?
Or should I just not use the Redis storage, wait until the dataset is fully loaded, and then use MinHashLSH?
Also, will a different hashobj (like hashlib.sha1
or hashlib.md5)
cause a huge difference in performance? Will it affect CPU usage or the resulting clusters?
thanks
Would it be possible to include in the documentation the methods for each of the modules in datasketch and how they are called?
I need to build an image search database, using a neural network to extract features and using weighted MinHash + LSH to find similarity. How can I use Elasticsearch for storing hash keys? I'm really new to this, so please give me some guidance. May I submit a pull request for an ES storage option?
Hey, nice to meet you!
In your code, there is some code to create an LSH like:
lsh = MinHashLSH(threshold=0.91, num_perm=128, storage_config={
    'type': 'redis',
    'redis': {'host': '10.8.35.7', 'port': 6379, 'db': 9},
    'name': 'abcdefghij'}, prepickle=True)
If such an LSH has been created, and some data has already been added to it, how can I change this IP?
Thanks for the well-written library.
The library is using too much memory (>1 GB) for around 300,000 LeanMinHashes.
a. You use a maximum hash value of 32 bits, but your np arrays use dtype=uint64.
I don't know how numpy memory allocation works, but I think it may allocate double the needed size.
b. You have an implementation of b-bit MinHash. Does it work? If yes, can you give me an example?
Hi,
First of all thanks for this great library :)
I wanted to know if there is any way to set an expiry time for entries being added into an LSH connected to Redis? Or any way to keep a default expiry time for all keys?
Should I instead set a maxmemory policy for my Redis with an allkeys-lru or allkeys-random eviction policy? Is this recommended or will it cause issues with the LSH?
Regards,
Spandan
The following example is not working with 1.2.7, even after changing the import to
from datasketch.experimental.aio.lsh import AsyncMinHashLSH
I am getting an error, invalid syntax for await. I am using Python 3.6.5.
lsh = await AsyncMinHashLSH(storage_config=_storage, threshold=0.5, num_perm=16)
^
SyntaxError: invalid syntax
https://ekzhu.github.io/datasketch/lsh.html#minhash-lsh-at-scale
from datasketch.experimental.aio.lsh import AsyncMinHashLSH
from datasketch import MinHash, MinHashLSH
_storage = {'type': 'aioredis', 'redis': {'host': 'localhost', 'port': 6379}}
lsh = await AsyncMinHashLSH(storage_config=_storage, threshold=0.5, num_perm=16)
m1 = MinHash(16)
m1.update('a'.encode('utf8'))
m2 = MinHash(16)
m2.update('b'.encode('utf8'))
await lsh.insert('a', m1)
await lsh.insert('b', m2)
print(await lsh.query(m1))
print(await lsh.query(m2))
lsh.close()
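For context on the SyntaxError above: await is only valid inside an async def coroutine, so running the example at module level (or in a plain REPL) fails before the library is even imported. A minimal stdlib-only sketch of the usual pattern, with asyncio.sleep standing in for the library's awaitable calls:

```python
import asyncio

async def main():
    # All awaited calls (e.g. lsh = await AsyncMinHashLSH(...),
    # await lsh.insert(...), await lsh.query(...)) belong inside a coroutine.
    await asyncio.sleep(0)  # stand-in for awaitable library calls
    return "done"

# Python 3.7+; on 3.6 use: asyncio.get_event_loop().run_until_complete(main())
result = asyncio.run(main())
print(result)  # done
```

Top-level await only works in special contexts such as the asyncio REPL; inside an ordinary script, the coroutine-plus-event-loop wrapper above is required.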
@ekzhu Hi, all my data is inserted into MongoDB with aiomongo.
Is there any way to use a dynamic threshold after all the data is inserted into Mongo?
Thanks a lot.
Hello,
I stumbled across your code while looking up MinHashing. I was wondering if there is a way to access the MinHash object to retrieve the hashes?
thanks,
ara
set1 = set(['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for', 'estimating', 'the', 'similarity', 'between', 'datasets'])
m1 = MinHash(num_perm=128)
for d in set1:
    m1.update(d.encode('utf8'))
lsh = MinHashLSH(threshold=0.9, num_perm=128, storage_config={'type': 'redis', 'redis': {'host': 'localhost', 'port': 6379, 'db': 1}, 'name': 1})
lsh1 = MinHashLSH(threshold=0.9, num_perm=128, storage_config={'type': 'redis', 'redis': {'host': 'localhost', 'port': 6379, 'db': 1}, 'name': 1})
lsh.insert("m1", m1)
lsh.keys.size() is 1, but lsh1.keys.size() is 0.
Why?
When I add more data into the LSH, I get 'Error 111 connecting to localhost:6379. Connection refused'. What is the problem and how can I solve it? Thank you so much.
Hello, I'm really liking this library, but I have recently started running into RAM limitations trying to handle billions of MinHashLSH index members on a 16 GB quad-core macOS machine.
Right now, I'm essentially looping over my objects, hashing each one, and adding the hash result (of class MinHash) to one large MinHashLSH index for subsequent querying. Is there a less memory-intensive way of accomplishing a similar goal? Ideally, I'd like to post the MinHash results to a DB and then query the DB for matches, so as to afford much higher scalability.
I'm happy to help put some of this in place if it's not already a part of this toolkit, but wanted to ask in the meantime to see if your team had any immediate thoughts. In any event, thanks again for this awesome work.
Hello!
Thank you for the great lib!
Could you please share any hints on the optimal batch size for inserting hashes into MinHashLSH with the Redis backend? Indexing is pretty slow with my current settings:
NUM_PERM = 128
THRESHOLD = 0.3
if count % 10000 == 0:
    print("Loaded", count)
    with self.lsh.insertion_session() as session:
        for key, minhash in collector:
            session.insert(key, minhash)
This class would be far easier to use if the objects were hashable. As this is a hashing class, you would only have to expose the hash value through the Python interface with __hash__.
>>> from datasketch import MinHash
>>> m=MinHash()
>>> set([m])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'MinHash'
Hi - I am using MinHashLSH for querying duplicate documents. This is the exact flow of my use case.
First, I calculate MinHashes for each document and then add them to the LSH object below:
MinHashLSH(num_perm=perms, threshold=0.9)
The issue is that this MinHash creation process itself takes around 470 seconds, so I can't recreate it every time I query a new document.
So I am planning to store this LSH object on disk for future queries and update it maybe once a day/week.
Did anyone try this before, or know a better way to handle this type of use case?
Thank You.
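One common approach (a sketch, under the assumption that the built index object is picklable, as plain in-memory Python objects typically are): serialize the index to disk once, then reload it for later queries instead of rebuilding from scratch.

```python
import os
import pickle
import tempfile

# Stand-in for a built index; a picklable LSH object would take its place.
index = {"doc1": [3, 5, 7], "doc2": [2, 5, 9]}

path = os.path.join(tempfile.mkdtemp(), "lsh_index.pkl")
with open(path, "wb") as f:
    pickle.dump(index, f)  # one-time save after the expensive build

with open(path, "rb") as f:
    reloaded = pickle.load(f)  # fast reload for subsequent query sessions

print(reloaded == index)  # True: the index survives the round trip
```

The daily/weekly update then amounts to loading, inserting the new documents, and dumping again; a Redis-backed storage_config is the alternative when multiple processes need the same index live.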
As stated in https://guides.github.com/activities/citable-code/ there is a way to get a DOI for your open source project on GitHub. Please do! I am citing datasketch in my paper and really need the DOI.