kayzhu / lshash
A fast Python implementation of locality sensitive hashing.
License: MIT License
Yeah, I'm getting the same errors as everyone else. Is it possible that PyPI wasn't updated with your latest edits?
When I set LSHash's input_dim to 300 dimensions, the result of lsh.query is [].
Sorry to bother you; this might be a stupid question. I really don't know how to set the hash size. Does it depend on my data size? In the quick start example, I tried changing LSHash(6, 8) to LSHash(3, 8) and got the same result:
>>> lsh2 = LSHash(3, 8)
>>> lsh2.index([1,2,3,4,5,6,7,8])
>>> lsh2.index([2,3,4,5,6,7,8,9])
>>> lsh2.index([10,12,99,1,5,31,2,3])
>>> lsh2.query([1,2,3,4,5,6,7,7])
[((1, 2, 3, 4, 5, 6, 7, 8), 1), ((2, 3, 4, 5, 6, 7, 8, 9), 11)]
Thanks!
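(A quick experiment that may help, assuming the published LSHash(hash_size, input_dim) API: hash_size controls how finely the space is bucketed, not what is stored, so fewer bits means more indexed points collide with the query and come back as candidates. With only two or three indexed vectors, as in the quick start, the difference is invisible.)

# Minimal sketch, assuming the documented LSHash API: smaller hash_size
# gives coarser buckets, so more indexed points share the query's bucket.
import numpy as np
from lshash import LSHash

np.random.seed(0)
points = np.random.randn(1000, 8)

for hash_size in (3, 6, 12):
    lsh = LSHash(hash_size, 8)
    for p in points:
        lsh.index(p)
    # candidate count should shrink as the number of hash bits grows
    print(hash_size, len(lsh.query(points[0])))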
I've been learning the LSH algorithm recently and found your implementation. Quite useful to me! But as far as I know, there are different LSH families for different distance measures. It seems the index method you use is random planes for cosine distance, am I right?
I'd like to ask: is it possible to use this package with a sparse matrix, i.e. without converting it to a dense matrix via something like
np.ndarray.flatten(vector.toarray())?
I actually want to use locality sensitive hashing on text data, and I currently convert text to vectors with scikit-learn's feature extraction module.
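(On both points: the index step does appear to be random-hyperplane, i.e. signed random projection, hashing, the classic LSH family for cosine distance, and that projection is just a dot product, which scipy sparse rows support without densifying. A standalone sketch of the idea follows; it is not lshash's internals, and lshash itself may still convert to dense.)

# Sketch of random-hyperplane (signed random projection) hashing, the
# classic LSH family for cosine distance. The dot product works directly
# on a scipy.sparse row, so the hashing step needs no dense conversion.
import numpy as np
from scipy.sparse import csr_matrix

dim, hash_size = 300, 10
rng = np.random.default_rng(42)
planes = rng.standard_normal((hash_size, dim))  # one hyperplane per bit

def signature(vector) -> str:
    # One bit per hyperplane: 1 if the vector is on its positive side.
    projections = vector @ planes.T  # works for dense and sparse rows
    projections = np.asarray(projections).ravel()
    return "".join("1" if p > 0 else "0" for p in projections)

sparse_vec = csr_matrix(([1.0, 2.0], ([0, 0], [5, 120])), shape=(1, dim))
print(signature(sparse_vec))  # e.g. '1001010110'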
>>> lsh = LSHash(6, 8)
>>> lsh.index([1,2,3,4,5,6,7,8])
Is there any way to obtain the LSH signature from the index(...) method?
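(One possible workaround, relying on lshash internals, namely the uniform_planes attribute and the private _hash helper, so this is version-dependent and unofficial:)

# Unofficial workaround using lshash internals (version-dependent):
# LSHash keeps its random planes in `uniform_planes` and has a private
# `_hash` helper, so you can recompute the signature that index() stored.
from lshash import LSHash

lsh = LSHash(6, 8)
point = [1, 2, 3, 4, 5, 6, 7, 8]
lsh.index(point)

# One signature per hash table (here there is a single table).
signatures = [lsh._hash(planes, point) for planes in lsh.uniform_planes]
print(signatures)  # e.g. ['110101'], the bucket key(s) for this point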
The code uses np.random.randn() times the input vector. In the LSH survey papers, the hash is either (Gaussian vector · input + bias) / w or (uniform vector · input). I was wondering if we should change the distribution to uniform in the code?
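(For reference, the two families mentioned target different metrics; a standalone sketch of both hash functions, with hypothetical parameter names, not lshash's code:)

# Sketch of the two hash families being discussed. Signed Gaussian
# projections approximate cosine similarity; floor((a.x + b) / w) with
# Gaussian `a` is the p-stable family for Euclidean distance. Which one
# is "right" depends on the distance the index should be sensitive to.
import numpy as np

rng = np.random.default_rng(0)
dim, w = 8, 4.0
a = rng.standard_normal(dim)  # Gaussian projection direction
b = rng.uniform(0, w)         # offset drawn uniformly from [0, w)

def cosine_bit(x):
    # Random-hyperplane hash: one bit, locality-sensitive for cosine.
    return int(a @ x > 0)

def euclidean_bucket(x):
    # p-stable hash: an integer bucket, locality-sensitive for L2.
    return int(np.floor((a @ x + b) / w))

x = np.arange(1, 9, dtype=float)
print(cosine_bit(x), euclidean_bucket(x))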
c:\Python33\Scripts>pip install lshash
Downloading/unpacking lshash
Could not find a version that satisfies the requirement lshash (from versions:
0.0.2dev, 0.0.3dev, 0.0.4dev)
Cleaning up...
No distributions matching the version for lshash
Storing complete log in C:\Users\t-hanans\pip\pip.log
Hi,
I was wondering if there is any way to reduce the query time. For example, for my use case with the following parameters, each query takes 0.2 s, which would be too slow for querying my whole dataset:
lsh = LSHash(10, 300)
lsh.query(example_vector, num_results=5) (changing num_results has no effect on the run time)
Any suggestions would be appreciated!
Thank you
In its current form, it seems that LSHash.query simply computes a hash for the query vector and then computes the Hamming distance manually against every point in every hash table. That means we are doing num_hashtables * num_indexed_points distance computations for every query.
Specifically, the problem is here:
https://github.com/kayzh/LSHash/blob/master/lshash/lshash.py#L237
We should not need to iterate through all keys in the index just to query one vector.
From what I understand, the whole point of LSH is to do hashed lookups rather than scanning all our data. Here we are computing hashes and then not using the one feature that makes them so fast: the fact that we can throw them into an index and do lookups in O(1), or at least O(log n). Lookups in O(n) are what LSH is supposed to be replacing. This implementation may still be fast enough for some use cases, since the Hamming distance computations can be cheaper than dot products on large vectors, but it does not scale in the number of samples in the index, which I see as a serious problem.
Any thoughts on this? Please correct me if I got anything wrong here. I don't mean to complain, since this is a library I enjoy using, but it seems that a core part of it could be made more efficient.
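(To make the proposal concrete, here is a hypothetical sketch of a bucket-lookup query against a dict-of-lists index; this is not a patch to lshash, just the shape of the fix being suggested:)

# Hypothetical sketch of the proposed fix: hash the query once per table
# and look up only its own bucket, so the cost is O(num_tables + candidates)
# rather than a scan over every stored key.
import numpy as np

class TinyLSH:
    def __init__(self, hash_size, dim, num_tables=3, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = [rng.standard_normal((hash_size, dim))
                       for _ in range(num_tables)]
        self.tables = [{} for _ in range(num_tables)]  # bucket key -> points

    def _key(self, planes, x):
        return tuple((planes @ x > 0).astype(int))

    def index(self, x):
        for planes, table in zip(self.planes, self.tables):
            table.setdefault(self._key(planes, x), []).append(x)

    def query(self, x, num_results=None):
        candidates = []
        for planes, table in zip(self.planes, self.tables):
            candidates.extend(table.get(self._key(planes, x), []))  # O(1) lookup
        # A real version would deduplicate candidates found in several tables.
        ranked = sorted(candidates, key=lambda c: np.linalg.norm(c - x))
        return ranked[:num_results]

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 32))
lsh = TinyLSH(hash_size=10, dim=32)
for row in data:
    lsh.index(row)
print(len(lsh.query(data[0], num_results=5)))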
How do I delete an indexed item? I can add an item to the index, but how do I delete one?
But when I print lsh.query([3,4,5,3,4,5,3,4], distance_func="euclidean"), the result is the same as with the original euclidean_dist. It seems this is not the correct way to change the distance function. Could anyone tell me how to change the distance function?
Thanks a lot!
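(A hedged guess at what is going on, using the documented query(point, num_results=..., distance_func=...) signature: distance_func only changes how candidates pulled from the matching buckets are ranked, not which buckets are searched, so small examples often look identical under different metrics:)

# Assumes the documented query API. The candidate set comes from the hash
# buckets either way; distance_func only re-ranks it, so with few indexed
# points the output tuples can be identical under different metrics.
from lshash import LSHash

lsh = LSHash(6, 8)
for v in ([1, 2, 3, 4, 5, 6, 7, 8],
          [2, 3, 4, 5, 6, 7, 8, 9],
          [10, 12, 99, 1, 5, 31, 2, 3]):
    lsh.index(v)

q = [3, 4, 5, 3, 4, 5, 3, 4]
print(lsh.query(q, distance_func="euclidean"))
print(lsh.query(q, distance_func="cosine"))  # same candidates, re-ranked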
I am trying to install this package. When I run pip install, it gives me this error:
pip install lshash==0.0.4dev
Collecting lshash==0.0.4dev
Using cached lshash-0.0.4dev.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 20, in <module>
  File "/private/var/folders/3y/98pmr_g51g988x_v957qmh100000gn/T/pip-build-vgvzxrjr/lshash/setup.py", line 3, in <module>
    import lshash
  File "/private/var/folders/3y/98pmr_g51g988x_v957qmh100000gn/T/pip-build-vgvzxrjr/lshash/lshash/__init__.py", line 12, in <module>
    from lshash import LSHash
ImportError: cannot import name 'LSHash'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/3y/98pmr_g51g988x_v957qmh100000gn/T/pip-build-vgvzxrjr/lshash
I pip-installed lshash and attempted to import LSHash; however, I get the above error.
versions:
python - 3.6.7
numpy - 1.15.4
I am operating within a conda environment.
Hi,
I was wondering if there is any way to also obtain the indices of the results of a query, since often we need to go back to some sort of embedding and see which element a result represents.
Thank you so much for the amazing work!
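(If I remember the API correctly, index() takes an extra_data argument that is stored with the point and returned by query(), which can carry exactly this kind of back-reference; a sketch under that assumption:)

# Sketch assuming lshash's documented `extra_data` parameter: anything
# passed to index() comes back from query() alongside the matched point,
# so it can carry a row index into your embedding matrix.
import numpy as np
from lshash import LSHash

embeddings = np.random.randn(100, 8)
lsh = LSHash(6, 8)
for row_id, vec in enumerate(embeddings):
    lsh.index(vec, extra_data=f"row-{row_id}")  # string tag per row

for (point, tag), dist in lsh.query(embeddings[0], num_results=3):
    print(tag, dist)  # tag maps back to a row of `embeddings`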
I will continue to maintain LSHash in https://github.com/guofei9987/pyLSHash
I want to use this on a bunch of word vectors and find the similar ones. Should I first index all of the vectors, and then query each one again to find its bucket number?
Many LSH implementations use Jaccard similarity to return matching results above a certain threshold, say an 80% match. Is it possible to implement the same in this library?
Am I right? Thank you.
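(For context, the standard LSH family for Jaccard similarity is MinHash, a different scheme from the random-projection hashing used here. A standalone sketch, not part of this library, of thresholding an estimated Jaccard similarity:)

# Standalone MinHash sketch: the LSH family for Jaccard similarity.
# The fraction of matching signature positions is an unbiased estimate
# of the Jaccard similarity of the two sets, so thresholding the
# estimate at 0.8 approximates an "80% match" filter.
import random

NUM_HASHES = 128
PRIME = (1 << 61) - 1  # modulus for the h(x) = (a*x + b) mod p family
random.seed(1)
params = [(random.randrange(1, PRIME), random.randrange(PRIME))
          for _ in range(NUM_HASHES)]

def minhash(items):
    return [min((a * hash(x) + b) % PRIME for x in items) for a, b in params]

def estimated_jaccard(sig1, sig2):
    return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / NUM_HASHES

s1 = {"lsh", "hashing", "python", "fast", "nearest"}
s2 = {"lsh", "hashing", "python", "fast", "neighbor"}
sim = estimated_jaccard(minhash(s1), minhash(s2))
print(sim >= 0.8, sim)  # true Jaccard here is 4/6, about 0.67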
Hi everyone. I don't know what's going on, but my index only returns points that exactly match ones I have already indexed. For example:
>>> import numpy as np
>>> import lshash
>>> lsh = lshash.LSHash(100,100)
>>> sample = np.zeros(100)
>>> sample[13]=1
>>> sample[43]=1
>>> sample[73]=1
>>> lsh.index(sample)
>>> sample[93]=1
>>> lsh.index(sample)
>>> sample = np.zeros(100)
>>> sample[33]=1
>>> sample[32]=1
>>> lsh.index(sample)
>>> sample[33]=0
>>> sample[93]=1
>>> lsh.query(sample)
[]
>>> sample = np.zeros(100)
>>> sample[33]=1
>>> sample[32]=1
>>> lsh.query(sample)
[((0.0, 0.0, 0.0, ..., 0.0, 0.0), 0.0)]
>>>
Am I doing something wrong?
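(A possible explanation, offered tentatively: LSHash(100, 100) uses 100-bit bucket keys, so two vectors share a bucket only if they fall on the same side of all 100 random hyperplanes, which essentially never happens unless they are identical. A smaller hash_size plus a few hash tables gives near-misses a chance to collide, assuming the documented num_hashtables parameter:)

# Hedged guess at the cause: 100-bit keys mean only (near-)identical
# vectors collide. Fewer bits and several tables let similar-but-not-equal
# vectors land in the same bucket in at least one table.
import numpy as np
from lshash import LSHash

lsh = LSHash(8, 100, num_hashtables=4)

sample = np.zeros(100)
sample[[13, 43, 73]] = 1
lsh.index(sample)

probe = np.zeros(100)
probe[[13, 43]] = 1      # similar to `sample`, but not identical
print(lsh.query(probe))  # now has a real chance of finding `sample`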
Hi all,
is there any way to limit the number of points in each bucket?
Thanks in advance.
Matteo
Hi, I call r = lsh.query(inputs[200], distance_func='hamming', num_results=20), where the number of requested results is 20, but sometimes I get only one result (or fewer than 20 items). Can you tell me the reason, or how I should fix it?
Thanks a lot,
Sincerely
It would be nice to have a way to specify how to handle NaNs in the data, for example by ignoring them in distance calculations.
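(As a stopgap until the library supports it, a NaN-ignoring distance is easy to write externally; a minimal sketch, not an lshash feature:)

# Sketch of the requested behaviour: a Euclidean distance that ignores
# coordinates where either vector is NaN.
import numpy as np

def nan_euclidean(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    mask = ~(np.isnan(x) | np.isnan(y))  # keep fully observed dims only
    if not mask.any():
        return np.inf                    # nothing comparable
    return float(np.linalg.norm(x[mask] - y[mask]))

print(nan_euclidean([1.0, np.nan, 3.0], [1.0, 2.0, 6.0]))  # 3.0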
I have installed Redis successfully, and I am providing the arguments
storage_config={"redis": {"host": 'localhost', "port": 6379}},
matrices_filename="/home/username/filename.npz"
but filename.npz is never created, nor is the hash table stored in Redis.
The program and query run successfully and give the output vector, but no new .npz file is saved and the hash tables are not stored.