mattandahalfew / levenshtein_search

Python search module for fast approximate string matching

License: GNU General Public License v3.0

C 85.72% Python 11.44% Shell 2.85%

levenshtein_search's Issues

cibuildwheel

hi @mattandahalfew ,

we’ve been using pypa’s cibuildwheel project to make it a lot easier to build binary wheels for all the variations of operating systems and machine platforms.

would you like a pull request that added this?

SIGSEGV when using Unicode strings

import Levenshtein_search
Levenshtein_search.populate_wordset(-1, [u'ä'])

>>> Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
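
Until the crash is fixed, a caller might screen input before it reaches the extension. The following is only a sketch of such a guard in pure Python. It assumes the crash is triggered by non-ASCII characters, which the report suggests but does not confirm, and the helper name split_ascii is hypothetical, not part of Levenshtein_search.

```python
def split_ascii(words):
    """Partition words into (ascii_only, non_ascii) lists, preserving order.

    Assumption: only the ASCII-only list is safe to hand to
    populate_wordset. The report does not confirm the exact trigger.
    """
    safe, unsafe = [], []
    for word in words:
        # A word counts as ASCII-only when every code point is below 128.
        if all(ord(ch) < 128 for ch in word):
            safe.append(word)
        else:
            unsafe.append(word)
    return safe, unsafe


safe, unsafe = split_ascii(["abc", "ä", "naïve", "xyz"])
```

This is a workaround, not a fix: words in the second list are simply left out of the index.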

Benchmarking

@mattandahalfew -- I had a question on Twitter about benchmarking performance. Have you considered running benchmarks against PostgreSQL?

Passing a set to Levenshtein_search.populate_wordset results in SystemError

I ran into an issue trying out this package and passing in a set to populate_wordset.

wordset = Levenshtein_search.populate_wordset(-1, {'abc', 'abg'})

Somewhere later in the code, it failed:

Traceback (most recent call last):
  File "tst.py", line 38, in <module>
    for i, (dat, doc) in enumerate(zip(dats, docs)):
SystemError: ../Objects/listobject.c:169: bad argument to internal function

Would be great if either populate_wordset fails first or converts to list.
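
On the caller's side, the second option (convert to a list) could look roughly like the sketch below. The wrapper name populate_from_iterable is hypothetical. The populate function is passed in as a parameter so the sketch does not depend on the extension being installed; in real use it would be Levenshtein_search.populate_wordset, called with the same (index, list) arguments as in the report.

```python
def populate_from_iterable(populate, words, index=-1):
    """Call populate(index, list(words)) after basic input checks.

    `populate` stands in for Levenshtein_search.populate_wordset.
    """
    # A bare string is iterable too, but would be split into characters.
    if isinstance(words, (str, bytes)):
        raise TypeError("expected an iterable of strings, not a single string")
    # Sets, tuples, and generators are all materialized as a list here.
    return populate(index, list(words))


# Intended use (assumes the package is installed):
# import Levenshtein_search
# wordset = populate_from_iterable(
#     Levenshtein_search.populate_wordset, {'abc', 'abg'})
```

Note that a set has no defined order, so the order in which words are added is arbitrary.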

remove multiple items causes hang

Hi @mattandahalfew, just uncovered a bug in remove_string

If you remove multiple items, the code hangs indefinitely.

import Levenshtein_search
index_key = Levenshtein_search.populate_wordset(-1, [])

docs = ['russian/german', 'mexican', 'italian', 'southern',
        'french (new)', 'vegetarian', 'and 212/614-9345 asian', 
        'spanish', 'hot dogs', 'delis', 'peanut butter']

for doc in docs:
    Levenshtein_search.add_string(index_key, doc)

# will hang at some point
for i, doc in enumerate(docs):
    Levenshtein_search.remove_string(index_key, doc)
    print(i)

This is on OS X 10.13.6.

Hello!
I'm using this library and am trying to avoid compiling the wheel myself. I am currently using it on 64-bit Windows. If anyone has a wheel for Python 3.9 that I can download and install, I'd appreciate it. Thanks!

Release manylinux1 wheels to pypi

Hi @mattandahalfew, could you release manylinux1 wheels to pypi? Thanks!

Segmentation fault

Hi @mattandahalfew,

I'm getting a segmentation fault with your code:

(.env) fgregg@forest-tmkf:~/public/dedupe$ gdb python
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.
(gdb) run tests/canonical_matching.py 
Starting program: /home/fgregg/public/dedupe/.env/bin/python tests/canonical_matching.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff2ff2700 (LWP 7859)]
[New Thread 0x7ffff07f1700 (LWP 7860)]
[New Thread 0x7fffedff0700 (LWP 7861)]
INFO:root:Generating grammar tables from /usr/lib/python3.5/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.5/lib2to3/PatternGrammar.txt
number of known duplicate pairs 112

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
compare_letters (curr_letter=0x10a0b90, d_x=d_x@entry=0, q_x=q_x@entry=0, 
    c_dist=c_dist@entry=0, maxdist=maxdist@entry=2, 
    query_word=query_word@entry=0x7fffe5d12720 "sushisay", qwordlength=8, 
    letterssofar=0x0, wordlist=0x108d970) at Lev_search.c:535
535	Lev_search.c: No such file or directory.

To reproduce,

git clone https://github.com/datamade/dedupe.git
cd dedupe
pip install -e .
python tests/canonical_matching.py 
(then rerun)
python tests/canonical_matching.py

Version 1.4.5 is no longer on PyPI. Is it possible to re-upload it?

Version 1.4.5 of Levenshtein_search is no longer listed on PyPI.

Since other packages explicitly require this version, such as https://github.com/dedupeio/dedupe, would it be possible to re-upload it?

Appveyor builds

Hi Matt,

For some reason, there hasn't been an AppVeyor build since last year. https://ci.appveyor.com/project/mattandahalfew/levenshtein-search

Would you mind looking into that? It'd be great to get Python 3.7 builds for Windows too.

manylinux builds

Hi @mattandahalfew,

For dedupe, I'd like all the dependencies to have manylinux1 wheels available. Would you be interested in a PR that adds this for Levenshtein_search?

Here's an example of how I've been using Travis to build manylinux wheels:

Add license package metadata

Hi, we have a small automated tool for verifying the licenses of installed dependencies (to make sure we can use them in our project), and Levenshtein_search does not provide license metadata (official docs), meaning we have to use the GitHub API to retrieve it.

Could you please specify the branch against which we should open a PR with the change? There are currently two branches in this project, with develop being the more recent.

Thank you for your time.
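
For context, a check of this kind can be sketched with the standard library's importlib.metadata. This is not the reporter's tool, and the function name declared_license is hypothetical. It assumes that an empty field or the placeholder "UNKNOWN", which older setuptools versions wrote when no license was given, both mean "not declared".

```python
from importlib import metadata


def declared_license(dist_name):
    """Return the License field of an installed distribution, or None.

    None covers three cases: the distribution is not installed, the
    field is absent, or it holds the "UNKNOWN" placeholder.
    """
    try:
        meta = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return None
    value = meta.get("License")
    if value in (None, "", "UNKNOWN"):
        return None
    return value


# Intended use (the result depends on what is installed locally):
# print(declared_license("Levenshtein_search"))
```

A stricter checker might also inspect the "License ::" trove classifiers, which this sketch ignores.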

Add support to release aarch64 wheels

Problem

On aarch64, ‘pip install Levenshtein-search’ builds the wheel from source and then installs it. This requires the user to have a development environment installed on their system. Building the wheel also takes longer than downloading and extracting a prebuilt wheel from PyPI.

Resolution

On aarch64, ‘pip install Levenshtein-search’ should download the wheel from PyPI.

@mattandahalfew Please let me know whether you are interested in releasing aarch64 wheels. I can help with this. Is there any plan to move to Travis-ci.com? If not, could you please describe the steps or CI you currently use to release wheels on PyPI?

Returning index of each matched string

Hi,

Thanks for this great tool!

I had one question: is it possible to also return the index in the wordset of each word that a query matches? For example, using the example in the README, can I do the following:

import Levenshtein_search

excerpt1 = ["We","went","to","the","fire","Mother","said","Is","he","cold","Versh","Nome","Versh","said","Take","his","overcoat","and","overshoes","off","Mother","said","How","many","times","do","I","have","to","tell","you","not","to","bring","him","into","the","house","with","his","overshoes","on"]

first_wordset = Levenshtein_search.populate_wordset(-1,excerpt1)

q = "overshoes"
maxdist = 4
results1 = Levenshtein_search.lookup(first_wordset,q,maxdist)

And somehow get the indices of all matches (without using the results1 output and iterating through the excerpt1 list again)? Iterating again through the excerpt1 list would be very slow for my large applications of this.

Thanks!

Keshav
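
One possible workaround, sketched below, is to build a word-to-positions map once, so that each query costs one dictionary access per matched word instead of another pass over the list. This is plain Python and not a feature of Levenshtein_search. The function names are hypothetical, and the sketch assumes the matched words can be extracted from the lookup results; the exact shape of those results should be checked against the README.

```python
from collections import defaultdict


def build_position_index(words):
    """Map each distinct word to the list of positions where it occurs."""
    positions = defaultdict(list)
    for i, word in enumerate(words):
        positions[word].append(i)
    return positions


def positions_for_matches(matched_words, positions):
    """Look up the stored positions for each matched word."""
    # .get() avoids creating empty entries for words not in the index.
    return {word: positions.get(word, []) for word in matched_words}


excerpt = ["his", "overcoat", "and", "overshoes", "off",
           "his", "overshoes", "on"]
positions = build_position_index(excerpt)  # built once, reused per query

# Hypothetical stand-in for the words pulled out of a lookup() result.
matched = ["overshoes", "overcoat"]
hits = positions_for_matches(matched, positions)
```

Building the map takes a single pass over the list, and it has to be kept in sync if strings are later added or removed.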

License of the project silently changed

Until March 9, 2021, the master branch was MIT-licensed, so the resulting PIP package for Python_levenshtein was MIT-licensed too.

With the merge of the develop branch into master, the license of the master branch silently changed to GPL 3, which is much stricter than MIT: #21 (comment) and ed8c0f7

This change is not announced in the README, and it has not been communicated to the PIP packages depending on Levenshtein_search, nor is there a HOWTO for pinning to an older package version with the permissive license. Because the original master branch also lacked PIP metadata for the license, this change may cause trouble for downstream users.

Could you please reconsider using the original license (MIT) for the current codebase?

Or could you please at least communicate the change in a way that doesn't silently break downstream users? Ideally, this would mean reverting the merge of develop into master, releasing a PIP package with fixed license metadata (still MIT-licensed), and adding an import-time deprecation warning that the PIP package is now unmaintained, plus the same warning in the README so that it is visible on PyPI. The code with the new license would then be merged to master and released under its own PIP package name with a bumped major version, so that downstream users cannot accidentally run into licensing issues via pip install --upgrade.