mattandahalfew / levenshtein_search
Python search module for fast approximate string matching
License: GNU General Public License v3.0
Hi @mattandahalfew,
We've been using PyPA's cibuildwheel project, which makes it much easier to build binary wheels for all the variations of operating systems and machine platforms.
Would you like a pull request that adds this?
import Levenshtein_search
Levenshtein_search.populate_wordset(-1, [u'ä'])
>>> Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
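Until the crash is fixed, one possible workaround is to keep non-ASCII strings out of the index. The sketch below assumes, based only on the reproduction above, that non-ASCII input is what triggers the segfault. split_ascii is a hypothetical helper and not part of Levenshtein_search.

```python
def split_ascii(words):
    """Partition words into (ascii_only, non_ascii), preserving order.

    Hypothetical helper: assumes non-ASCII input is what crashes the
    extension, which the report above suggests but does not prove.
    """
    ascii_only, non_ascii = [], []
    for word in words:
        # str.isascii() is available from Python 3.7 onwards.
        (ascii_only if word.isascii() else non_ascii).append(word)
    return ascii_only, non_ascii


safe, skipped = split_ascii(["abc", "ä", "abg"])
# Only `safe` would then be handed to the library, e.g.:
# Levenshtein_search.populate_wordset(-1, safe)
```

The skipped words are returned rather than silently dropped, so the caller can log them or handle them some other way.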
@mattandahalfew -- I had a question on Twitter about benchmarking performance. Have you considered running benchmarks against PostgreSQL?
I ran into an issue when trying out this package and passing a set to populate_wordset.
wordset = Levenshtein_search.populate_wordset(-1, {'abc', 'abg'})
Somewhere later in the code, it failed:
Traceback (most recent call last):
  File "tst.py", line 38, in <module>
    for i, (dat, doc) in enumerate(zip(dats, docs)):
SystemError: ../Objects/listobject.c:169: bad argument to internal function
It would be great if populate_wordset either failed immediately or converted its argument to a list.
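In the meantime, the conversion can be done on the caller's side. This is a minimal sketch, assuming that a plain list of str is what populate_wordset expects (the set example above suggests so). as_word_list is a hypothetical wrapper, not part of the library.

```python
def as_word_list(words):
    """Return `words` as a list of str, raising early on bad input.

    Hypothetical helper: it fails fast in Python instead of letting a
    set or other non-list object reach the C extension.
    """
    if isinstance(words, (str, bytes)):
        raise TypeError("expected an iterable of strings, not a single string")
    word_list = list(words)  # sets, tuples, and generators all become lists
    for word in word_list:
        if not isinstance(word, str):
            raise TypeError(f"expected str, got {type(word).__name__}")
    return word_list


words = as_word_list({"abc", "abg"})
# The list can then be passed on as usual, e.g.:
# wordset = Levenshtein_search.populate_wordset(-1, words)
```

Note that a set has no defined order, so the order of the resulting list is arbitrary.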
Hi @mattandahalfew, I just uncovered a bug in remove_string:
if you remove multiple items, the code hangs indefinitely.
import Levenshtein_search
index_key = Levenshtein_search.populate_wordset(-1, [])
docs = ['russian/german', 'mexican', 'italian', 'southern',
        'french (new)', 'vegetarian', 'and 212/614-9345 asian',
        'spanish', 'hot dogs', 'delis', 'peanut butter']
for doc in docs:
    Levenshtein_search.add_string(index_key, doc)
# will hang at some point
for i, doc in enumerate(docs):
    Levenshtein_search.remove_string(index_key, doc)
    print(i)
This is on OS X 10.13.6.
Hello!
I'm using this library and am trying to avoid compiling the wheel myself. I'm currently on 64-bit Windows. If anyone has a wheel for Python 3.9 that I can download and install, I'd appreciate it. Thanks!
Hi @mattandahalfew, could you release manylinux1 wheels to PyPI? Thanks!
Hi @mattandahalfew,
I'm getting a segmentation fault with your code:
(.env) fgregg@forest-tmkf:~/public/dedupe$ gdb python
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.
(gdb) run tests/canonical_matching.py
Starting program: /home/fgregg/public/dedupe/.env/bin/python tests/canonical_matching.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff2ff2700 (LWP 7859)]
[New Thread 0x7ffff07f1700 (LWP 7860)]
[New Thread 0x7fffedff0700 (LWP 7861)]
INFO:root:Generating grammar tables from /usr/lib/python3.5/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.5/lib2to3/PatternGrammar.txt
number of known duplicate pairs 112
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
compare_letters (curr_letter=0x10a0b90, d_x=d_x@entry=0, q_x=q_x@entry=0,
c_dist=c_dist@entry=0, maxdist=maxdist@entry=2,
query_word=query_word@entry=0x7fffe5d12720 "sushisay", qwordlength=8,
letterssofar=0x0, wordlist=0x108d970) at Lev_search.c:535
535 Lev_search.c: No such file or directory.
To reproduce,
git clone https://github.com/datamade/dedupe.git
cd dedupe
pip install -e .
python tests/canonical_matching.py
(then rerun)
python tests/canonical_matching.py
Version 1.4.5 of Levenshtein_search is no longer listed on PyPI.
Since other packages explicitly require this version, such as https://github.com/dedupeio/dedupe, would it be possible to re-upload it?
Hi Matt,
For some reason, there hasn't been an AppVeyor build since last year: https://ci.appveyor.com/project/mattandahalfew/levenshtein-search
Would you mind looking into that? It'd be great to get Python 3.7 builds for Windows too.
Hi @mattandahalfew,
For dedupe, I'd like all the dependencies to have manylinux1 wheels available. Would you be interested in a PR for this for Levenshtein_search?
Here's an example of how I've been using Travis to build manylinux wheels:
Hi, we have a small automated tool for verifying the licenses of installed dependencies (to make sure we can use them in our project), and Levenshtein_search does not provide license metadata (official docs), meaning we have to use the GitHub API to retrieve it.
Could you please specify the branch against which we should open a PR with the change (there are currently two in this project, with develop being the fresher one)?
Thank you for your time.
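For context, this is roughly the kind of check such a tool performs. It is a sketch using only the standard library's importlib.metadata. declared_license is a hypothetical helper, and the distribution name "Levenshtein-search" is assumed from the pip commands quoted elsewhere in these issues.

```python
from importlib import metadata


def declared_license(dist_name):
    """Return the license info an installed distribution declares.

    Returns None if the distribution is not installed. An empty
    "license" field together with an empty "classifiers" list means
    the package ships no license metadata at all.
    """
    try:
        md = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return None
    classifiers = md.get_all("Classifier") or []
    return {
        "license": md.get("License"),
        "classifiers": [c for c in classifiers if c.startswith("License ::")],
    }


# The distribution name is an assumption; the result depends on what is installed.
info = declared_license("Levenshtein-search")
print(info)
```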
On aarch64, ‘pip install Levenshtein-search’ builds the wheel from source and then installs it. This requires the user to have a development environment installed on their system. Building the wheel also takes longer than downloading and extracting a prebuilt one from PyPI.
On aarch64, ‘pip install Levenshtein-search’ should download the wheels from PyPI.
@mattandahalfew Please let me know if you are interested in releasing aarch64 wheels; I can help with this. Is there any plan to move to travis-ci.com? If not, could you please describe the steps or CI you currently use to release wheels on PyPI?
Hi,
Thanks for this great tool!
I have one question: is it possible to also return the index in the wordset of each word that matches a query? For example, using the example in the README, can I do the following:
import Levenshtein_search
excerpt1 = ["We","went","to","the","fire","Mother","said","Is","he","cold","Versh","Nome","Versh","said","Take","his","overcoat","and","overshoes","off","Mother","said","How","many","times","do","I","have","to","tell","you","not","to","bring","him","into","the","house","with","his","overshoes","on"]
first_wordset = Levenshtein_search.populate_wordset(-1,excerpt1)
q = "overshoes"
maxdist = 4
results1 = Levenshtein_search.lookup(first_wordset,q,maxdist)
And somehow get the indices of all matches (without using the results1 output and iterating through the excerpt1 list again)? Iterating again through the excerpt1 list would be very slow for my large applications of this.
Thanks!
Keshav
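One way to avoid rescanning the list for every query is to build a word-to-positions map once, then look up each matched word in it. The sketch below is a caller-side workaround, not a feature of Levenshtein_search. build_position_index is a hypothetical helper, and the shape of the lookup results (matched word as the first element of each result) is an assumption, so a stand-in list is used in place of a real lookup call.

```python
from collections import defaultdict


def build_position_index(words):
    """Map each distinct word to the list of positions where it occurs."""
    positions = defaultdict(list)
    for i, word in enumerate(words):
        positions[word].append(i)
    return dict(positions)


excerpt = ["his", "overcoat", "and", "overshoes", "off",
           "with", "his", "overshoes", "on"]
positions = build_position_index(excerpt)  # one pass over the list, done once

# Stand-in for the output of Levenshtein_search.lookup(...). The result
# format (matched word first) is an assumption made for illustration.
results = [("overshoes", 0), ("overcoat", 4)]

# Each lookup is now a dictionary access instead of a scan of the list.
matches = {r[0]: positions.get(r[0], []) for r in results}
print(matches)
```

Building the map costs one pass over the word list. After that, resolving the positions of a match does not depend on the length of the list.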
Until March 9, 2021, the master branch was MIT-licensed, so the resulting pip package for Python_levenshtein was MIT-licensed too.
With the merge of the develop branch into master, the license of the master branch silently changed to GPL 3, which is much stricter than MIT: #21 (comment) and ed8c0f7
This change is not announced in the README and was not communicated to the pip packages that depend on Levenshtein_search, nor is there a how-to for pinning to an older package version with the permissive license. Since the original master branch also lacked the pip license metadata, this change may cause trouble downstream.
Could you please reconsider using the original license (MIT) for the current codebase?
Or could you please at least communicate the change in a way that doesn't silently break downstream projects? Ideally, that would mean reverting the merge of develop into master, releasing a pip package with fixed license metadata (still MIT-licensed), and adding an import-time deprecation warning that the pip package is now unmaintained, plus the same warning in the README so that it is visible on PyPI. The code under the new license would then be merged to master and released under its own pip package name with a bumped major version, so that downstream projects cannot accidentally run into licensing issues via pip install --upgrade.