kayzhu / lshash
A fast Python implementation of locality sensitive hashing.
License: MIT License
Yeah, I'm getting the same errors as everyone else. Is it possible that PyPI wasn't updated with your latest edits?
When I set LSHash's input_dim to 300 dimensions, the result of lsh.query is [].
Sorry to bother you; this might be a stupid question. I really don't know how to set the hash size. Does it depend on my data size? In the quick start example, I tried changing LSHash(6, 8) to LSHash(3, 8) and got the same result:
>>> lsh2 = LSHash(3, 8)
>>> lsh2.index([1,2,3,4,5,6,7,8])
>>> lsh2.index([2,3,4,5,6,7,8,9])
>>> lsh2.index([10,12,99,1,5,31,2,3])
>>> lsh2.query([1,2,3,4,5,6,7,7])
[((1, 2, 3, 4, 5, 6, 7, 8), 1), ((2, 3, 4, 5, 6, 7, 8, 9), 11)]
Thanks!
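(A quick experiment that may help, assuming the published LSHash(hash_size, input_dim) API: hash_size controls how finely the space is bucketed, not what is stored, so fewer bits means more indexed points collide with the query and come back as candidates. With only two or three indexed vectors, as in the quick start, the difference is invisible.)

# Minimal sketch, assuming the documented LSHash API: smaller hash_size
# gives coarser buckets, so more indexed points share the query's bucket.
import numpy as np
from lshash import LSHash

np.random.seed(0)
points = np.random.randn(1000, 8)

for hash_size in (3, 6, 12):
    lsh = LSHash(hash_size, 8)
    for p in points:
        lsh.index(p)
    # candidate count should shrink as the number of hash bits grows
    print(hash_size, len(lsh.query(points[0])))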
I've been learning the LSH algorithm recently and found your implementation. Quite useful to me! But as far as I know, there are different LSH families for different distance measures. It seems the index method you use is random planes for cosine distance, am I right?
I'd like to ask: is it possible to use this package with a sparse matrix, i.e. without converting it to a dense matrix via something like
np.ndarray.flatten(vector.toarray())?
I actually want to use locality sensitive hashing on text data, and I currently convert text to vectors with scikit-learn's feature extraction module.
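(On both points: the index step does appear to be random-hyperplane, i.e. signed random projection, hashing, the classic LSH family for cosine distance, and that projection is just a dot product, which scipy sparse rows support without densifying. A standalone sketch of the idea follows; it is not lshash's internals, and lshash itself may still convert to dense.)

# Sketch of random-hyperplane (signed random projection) hashing, the
# classic LSH family for cosine distance. The dot product works directly
# on a scipy.sparse row, so the hashing step needs no dense conversion.
import numpy as np
from scipy.sparse import csr_matrix

dim, hash_size = 300, 10
rng = np.random.default_rng(42)
planes = rng.standard_normal((hash_size, dim))  # one hyperplane per bit

def signature(vector) -> str:
    # One bit per hyperplane: 1 if the vector is on its positive side.
    projections = vector @ planes.T  # works for dense and sparse rows
    projections = np.asarray(projections).ravel()
    return "".join("1" if p > 0 else "0" for p in projections)

sparse_vec = csr_matrix(([1.0, 2.0], ([0, 0], [5, 120])), shape=(1, dim))
print(signature(sparse_vec))  # e.g. '1001010110'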
>>> lsh = LSHash(6, 8)
>>> lsh.index([1,2,3,4,5,6,7,8])
Is there any way to obtain the LSH signature from the index(...) method?
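(One possible workaround, relying on lshash internals, namely the uniform_planes attribute and the private _hash helper, so this is version-dependent and unofficial:)

# Unofficial workaround using lshash internals (version-dependent):
# LSHash keeps its random planes in `uniform_planes` and has a private
# `_hash` helper, so you can recompute the signature that index() stored.
from lshash import LSHash

lsh = LSHash(6, 8)
point = [1, 2, 3, 4, 5, 6, 7, 8]
lsh.index(point)

# One signature per hash table (here there is a single table).
signatures = [lsh._hash(planes, point) for planes in lsh.uniform_planes]
print(signatures)  # e.g. ['110101'], the bucket key(s) for this point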
The code uses np.random.randn() times the input vector. In the LSH survey papers, the hash is either (Gaussian vector · input + bias) / w or (uniform vector · input). I was wondering if we should change the distribution to uniform in the code?
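(For reference, the two families mentioned target different metrics; a standalone sketch of both hash functions, with hypothetical parameter names, not lshash's code:)

# Sketch of the two hash families being discussed. Signed Gaussian
# projections approximate cosine similarity; floor((a.x + b) / w) with
# Gaussian `a` is the p-stable family for Euclidean distance. Which one
# is "right" depends on the distance the index should be sensitive to.
import numpy as np

rng = np.random.default_rng(0)
dim, w = 8, 4.0
a = rng.standard_normal(dim)  # Gaussian projection direction
b = rng.uniform(0, w)         # offset drawn uniformly from [0, w)

def cosine_bit(x):
    # Random-hyperplane hash: one bit, locality-sensitive for cosine.
    return int(a @ x > 0)

def euclidean_bucket(x):
    # p-stable hash: an integer bucket, locality-sensitive for L2.
    return int(np.floor((a @ x + b) / w))

x = np.arange(1, 9, dtype=float)
print(cosine_bit(x), euclidean_bucket(x))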
c:\Python33\Scripts>pip install lshash
Downloading/unpacking lshash
Could not find a version that satisfies the requirement lshash (from versions:
0.0.2dev, 0.0.3dev, 0.0.4dev)
Cleaning up...
No distributions matching the version for lshash
Storing complete log in C:\Users\t-hanans\pip\pip.log
Hi,
I was wondering if there is any way to reduce the query time. For example, for my use case with the following parameters, each query takes 0.2 s, which would be too slow for querying my whole dataset:
lsh = LSHash(10, 300)
lsh.query(example_vector, num_results=5) (changing num_results has no effect on the run time)
Any suggestions would be appreciated!
Thank you
In its current form, it seems that LSHash.query simply computes a hash for the query vector and then computes the Hamming distance manually against every point in every hash table. That means we are doing num_hashtables * num_indexed_points distance computations for every query.
Specifically, the problem is here:
https://github.com/kayzh/LSHash/blob/master/lshash/lshash.py#L237
We should not need to iterate through all keys in the index just to query one vector.
From what I understand, the whole point of LSH is to do hashed lookups rather than scanning all our data. Here we are computing hashes and then not using the one feature that makes them so fast: the fact that we can throw them into an index and do lookups in O(1), or at least O(log n). Lookups in O(n) are what LSH is supposed to be replacing. This implementation may still be fast enough for some use cases, since the Hamming distance computations can be cheaper than dot products on large vectors, but it does not scale in the number of samples in the index, which I see as a serious problem.
Any thoughts on this? Please correct me if I got anything wrong here. I don't mean to complain, since this is a library I enjoy using, but it seems that a core part of it could be made more efficient.
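(To make the proposal concrete, here is a hypothetical sketch of a bucket-lookup query against a dict-of-lists index; this is not a patch to lshash, just the shape of the fix being suggested:)

# Hypothetical sketch of the proposed fix: hash the query once per table
# and look up only its own bucket, so the cost is O(num_tables + candidates)
# rather than a scan over every stored key.
import numpy as np

class TinyLSH:
    def __init__(self, hash_size, dim, num_tables=3, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = [rng.standard_normal((hash_size, dim))
                       for _ in range(num_tables)]
        self.tables = [{} for _ in range(num_tables)]  # bucket key -> points

    def _key(self, planes, x):
        return tuple((planes @ x > 0).astype(int))

    def index(self, x):
        for planes, table in zip(self.planes, self.tables):
            table.setdefault(self._key(planes, x), []).append(x)

    def query(self, x, num_results=None):
        candidates = []
        for planes, table in zip(self.planes, self.tables):
            candidates.extend(table.get(self._key(planes, x), []))  # O(1) lookup
        # A real version would deduplicate candidates found in several tables.
        ranked = sorted(candidates, key=lambda c: np.linalg.norm(c - x))
        return ranked[:num_results]

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 32))
lsh = TinyLSH(hash_size=10, dim=32)
for row in data:
    lsh.index(row)
print(len(lsh.query(data[0], num_results=5)))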
How do I delete an indexed item? I can add an item to the index, but how do I delete one?
But when I print lsh.query([3,4,5,3,4,5,3,4], distance_func="euclidean"), the result is the same as with the original euclidean_dist. It seems this is not the correct way to change the distance function. Could anyone tell me how to change the distance function?
Thanks a lot!
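(A hedged guess at what is going on, using the documented query(point, num_results=..., distance_func=...) signature: distance_func only changes how candidates pulled from the matching buckets are ranked, not which buckets are searched, so small examples often look identical under different metrics:)

# Assumes the documented query API. The candidate set comes from the hash
# buckets either way; distance_func only re-ranks it, so with few indexed
# points the output tuples can be identical under different metrics.
from lshash import LSHash

lsh = LSHash(6, 8)
for v in ([1, 2, 3, 4, 5, 6, 7, 8],
          [2, 3, 4, 5, 6, 7, 8, 9],
          [10, 12, 99, 1, 5, 31, 2, 3]):
    lsh.index(v)

q = [3, 4, 5, 3, 4, 5, 3, 4]
print(lsh.query(q, distance_func="euclidean"))
print(lsh.query(q, distance_func="cosine"))  # same candidates, re-ranked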
I am trying to install this package. When I run pip install, it gives me this error:
pip install lshash==0.0.4dev
Collecting lshash==0.0.4dev
Using cached lshash-0.0.4dev.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 20, in <module>
  File "/private/var/folders/3y/98pmr_g51g988x_v957qmh100000gn/T/pip-build-vgvzxrjr/lshash/setup.py", line 3, in <module>
    import lshash
  File "/private/var/folders/3y/98pmr_g51g988x_v957qmh100000gn/T/pip-build-vgvzxrjr/lshash/lshash/__init__.py", line 12, in <module>
    from lshash import LSHash
ImportError: cannot import name 'LSHash'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/3y/98pmr_g51g988x_v957qmh100000gn/T/pip-build-vgvzxrjr/lshash
I pip-installed lshash and attempted to import LSHash; however, I get the above error.
versions:
python - 3.6.7
numpy - 1.15.4
I am operating within a conda environment.
Hi,
I was wondering if there is any way to also obtain the indices of the results of a query, since often we need to go back to some sort of embedding and see which element a result represents.
Thank you so much for the amazing work!
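(If I remember the API correctly, index() takes an extra_data argument that is stored with the point and returned by query(), which can carry exactly this kind of back-reference; a sketch under that assumption:)

# Sketch assuming lshash's documented `extra_data` parameter: anything
# passed to index() comes back from query() alongside the matched point,
# so it can carry a row index into your embedding matrix.
import numpy as np
from lshash import LSHash

embeddings = np.random.randn(100, 8)
lsh = LSHash(6, 8)
for row_id, vec in enumerate(embeddings):
    lsh.index(vec, extra_data=f"row-{row_id}")  # string tag per row

for (point, tag), dist in lsh.query(embeddings[0], num_results=3):
    print(tag, dist)  # tag maps back to a row of `embeddings`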
I will continue to maintain LSHash in https://github.com/guofei9987/pyLSHash
I want to use this on a bunch of word vectors and find the similar ones. Should I first index all of the vectors, and then query each one again to find its bucket number?
Many LSH implementations use Jaccard similarity to return matching results above a certain threshold, say an 80% match. Is it possible to implement the same in this library?
Am I right? Thank you.
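(For context, the standard LSH family for Jaccard similarity is MinHash, a different scheme from the random-projection hashing used here. A standalone sketch, not part of this library, of thresholding an estimated Jaccard similarity:)

# Standalone MinHash sketch: the LSH family for Jaccard similarity.
# The fraction of matching signature positions is an unbiased estimate
# of the Jaccard similarity of the two sets, so thresholding the
# estimate at 0.8 approximates an "80% match" filter.
import random

NUM_HASHES = 128
PRIME = (1 << 61) - 1  # modulus for the h(x) = (a*x + b) mod p family
random.seed(1)
params = [(random.randrange(1, PRIME), random.randrange(PRIME))
          for _ in range(NUM_HASHES)]

def minhash(items):
    return [min((a * hash(x) + b) % PRIME for x in items) for a, b in params]

def estimated_jaccard(sig1, sig2):
    return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / NUM_HASHES

s1 = {"lsh", "hashing", "python", "fast", "nearest"}
s2 = {"lsh", "hashing", "python", "fast", "neighbor"}
sim = estimated_jaccard(minhash(s1), minhash(s2))
print(sim >= 0.8, sim)  # true Jaccard here is 4/6, about 0.67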
Hi everyone. I don't know what's going on, but my index only returns points that exactly match ones I have already indexed. For example:
>>> import numpy as np
>>> import lshash
>>> lsh = lshash.LSHash(100,100)
>>> sample = np.zeros(100)
>>> sample[13]=1
>>> sample[43]=1
>>> sample[73]=1
>>> lsh.index(sample)
>>> sample[93]=1
>>> lsh.index(sample)
>>> sample = np.zeros(100)
>>> sample[33]=1
>>> sample[32]=1
>>> lsh.index(sample)
>>> sample[33]=0
>>> sample[93]=1
>>> lsh.query(sample)
[]
>>> sample = np.zeros(100)
>>> sample[33]=1
>>> sample[32]=1
>>> lsh.query(sample)
[((0.0, 0.0, 0.0, ..., 0.0, 0.0), 0.0)]
>>>
Am I doing something wrong?
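(A possible explanation, offered tentatively: LSHash(100, 100) uses 100-bit bucket keys, so two vectors share a bucket only if they fall on the same side of all 100 random hyperplanes, which essentially never happens unless they are identical. A smaller hash_size plus a few hash tables gives near-misses a chance to collide, assuming the documented num_hashtables parameter:)

# Hedged guess at the cause: 100-bit keys mean only (near-)identical
# vectors collide. Fewer bits and several tables let similar-but-not-equal
# vectors land in the same bucket in at least one table.
import numpy as np
from lshash import LSHash

lsh = LSHash(8, 100, num_hashtables=4)

sample = np.zeros(100)
sample[[13, 43, 73]] = 1
lsh.index(sample)

probe = np.zeros(100)
probe[[13, 43]] = 1      # similar to `sample`, but not identical
print(lsh.query(probe))  # now has a real chance of finding `sample`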
Hi all,
is there any way to limit the number of points in each bucket?
Thanks in advance.
Matteo
Hi, I call r = lsh.query(inputs[200], distance_func='hamming', num_results=20), where the number of requested results is 20, but sometimes I get only one result (or fewer than 20 items). Can you tell me the reason, or how I should fix it?
Thanks a lot,
Sincerely
It would be nice to have a way to specify how to handle NaNs in the data, for example by ignoring them in distance calculations.
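(As a stopgap until the library supports it, a NaN-ignoring distance is easy to write externally; a minimal sketch, not an lshash feature:)

# Sketch of the requested behaviour: a Euclidean distance that ignores
# coordinates where either vector is NaN.
import numpy as np

def nan_euclidean(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    mask = ~(np.isnan(x) | np.isnan(y))  # keep fully observed dims only
    if not mask.any():
        return np.inf                    # nothing comparable
    return float(np.linalg.norm(x[mask] - y[mask]))

print(nan_euclidean([1.0, np.nan, 3.0], [1.0, 2.0, 6.0]))  # 3.0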
I have installed Redis successfully, and I am providing the arguments
storage_config={"redis": {"host": 'localhost', "port": 6379}},
matrices_filename="/home/username/filename.npz"
but filename.npz is never created, nor is the hash table stored in Redis.
The program and query run successfully and give the output vector, but no new .npz file is saved and the hash tables are not stored.