levenshtein_search's People

Contributors

fgregg, fried, hhhhhhhhhn, mattandahalfew, odidev

levenshtein_search's Issues

cibuildwheel

Hi @mattandahalfew,

We’ve been using pypa’s cibuildwheel project to make it a lot easier to build binary wheels for all the variations of operating systems and machine platforms.

Would you like a pull request that adds this?

SIGSEGV when using Unicode strings

import Levenshtein_search
Levenshtein_search.populate_wordset(-1, [u'ä'])

>>> Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
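
Until the crash is fixed, a caller might screen input before it reaches the extension. The sketch below is only a guess at a workaround: it assumes the crash is tied to non-ASCII characters in general, which this report (a single 'ä') does not establish, and `split_ascii` is a hypothetical helper, not part of Levenshtein_search.

```python
def split_ascii(words):
    """Partition words into (ascii_only, non_ascii) lists, preserving order."""
    ascii_only, non_ascii = [], []
    for word in words:
        # str.isascii() is available from Python 3.7 onwards.
        if word.isascii():
            ascii_only.append(word)
        else:
            non_ascii.append(word)
    return ascii_only, non_ascii


# Hypothetical usage (requires the Levenshtein_search package):
# safe, skipped = split_ascii([u'ä', 'abc'])
# wordset = Levenshtein_search.populate_wordset(-1, safe)
```

The skipped words are returned rather than dropped so the caller can decide how to handle them.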

Passing a set to Levenshtein_search.populate_wordset results in SystemError

I ran into an issue trying out this package and passing in a set to populate_wordset.

wordset = Levenshtein_search.populate_wordset(-1, {'abc', 'abg'})

Somewhere later in the code, it failed:

Traceback (most recent call last):
  File "tst.py", line 38, in <module>
    for i, (dat, doc) in enumerate(zip(dats, docs)):
SystemError: ../Objects/listobject.c:169: bad argument to internal function

It would be great if populate_wordset either failed immediately or converted the input to a list.
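
In the meantime, a caller-side guard could do the conversion. This is a minimal sketch, not the library's behaviour: `as_word_list` is a hypothetical helper that turns any iterable of strings into a list and raises `TypeError` early for anything else.

```python
from collections.abc import Iterable


def as_word_list(words):
    """Return `words` as a list of str, or raise TypeError straight away."""
    # A bare string is iterable, but almost certainly a caller mistake here.
    if isinstance(words, (str, bytes)) or not isinstance(words, Iterable):
        raise TypeError(
            "expected an iterable of strings, got %s" % type(words).__name__
        )
    word_list = list(words)
    for word in word_list:
        if not isinstance(word, str):
            raise TypeError("expected str items, got %s" % type(word).__name__)
    return word_list


# Hypothetical usage (requires the Levenshtein_search package):
# wordset = Levenshtein_search.populate_wordset(-1, as_word_list({'abc', 'abg'}))
```

Note that converting a set gives a list in arbitrary order, which should not matter for a word set.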

remove multiple items causes hang

Hi @mattandahalfew, I just uncovered a bug in remove_string.

If you remove multiple items, the code hangs indefinitely.

import Levenshtein_search
index_key = Levenshtein_search.populate_wordset(-1, [])

docs = ['russian/german', 'mexican', 'italian', 'southern',
        'french (new)', 'vegetarian', 'and 212/614-9345 asian', 
        'spanish', 'hot dogs', 'delis', 'peanut butter']

for doc in docs:
    Levenshtein_search.add_string(index_key, doc)

# will hang at some point
for i, doc in enumerate(docs):
    Levenshtein_search.remove_string(index_key, doc)
    print(i)

This is on OS X 10.13.6.
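
A reproduction like the one above can block a whole test run. One option is to run it in a child interpreter with a timeout, so a hang (or a crash in the extension) is reported instead of freezing the parent process. The sketch below uses only the standard library; `run_isolated` is a hypothetical helper name.

```python
import subprocess
import sys


def run_isolated(source, timeout):
    """Run Python `source` in a child interpreter.

    Returns "ok" if it exits with status 0, "timeout" if it is still
    running after `timeout` seconds, and "failed" for any other exit
    (including death by a signal such as SIGSEGV).
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", source],
            timeout=timeout,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "timeout"
    return "ok" if completed.returncode == 0 else "failed"
```

Passing the script from this issue as `source` with a timeout of a few seconds would be expected to return "timeout" on an affected build.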

python 3.9 wheels

Hello!
I'm using this library and am trying to avoid compiling the wheel myself. I am currently using it on 64-bit Windows. If anyone has a wheel for Python 3.9 that I can download and install, I'd appreciate it. Thanks!

Segmentation fault

Hi @mattandahalfew,

I'm getting a segmentation fault with your code:

(.env) fgregg@forest-tmkf:~/public/dedupe$ gdb python
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.
(gdb) run tests/canonical_matching.py 
Starting program: /home/fgregg/public/dedupe/.env/bin/python tests/canonical_matching.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff2ff2700 (LWP 7859)]
[New Thread 0x7ffff07f1700 (LWP 7860)]
[New Thread 0x7fffedff0700 (LWP 7861)]
INFO:root:Generating grammar tables from /usr/lib/python3.5/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.5/lib2to3/PatternGrammar.txt
number of known duplicate pairs 112

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
compare_letters (curr_letter=0x10a0b90, d_x=d_x@entry=0, q_x=q_x@entry=0, 
    c_dist=c_dist@entry=0, maxdist=maxdist@entry=2, 
    query_word=query_word@entry=0x7fffe5d12720 "sushisay", qwordlength=8, 
    letterssofar=0x0, wordlist=0x108d970) at Lev_search.c:535
535	Lev_search.c: No such file or directory.

To reproduce,

git clone https://github.com/datamade/dedupe.git
cd dedupe
pip install -e .
python tests/canonical_matching.py 
(then rerun)
python tests/canonical_matching.py

Add license package metadata

Hi, we have a small automated tool for verifying the licenses of installed dependencies (to make sure we can use them in our project), and Levenshtein_search does not provide license metadata (official docs), meaning we have to use the GitHub API to retrieve it.

Could you please specify the branch against which we should open a PR with the change? There are currently two in this project, with develop being the more recent.

Thank you for your time.
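
For context, a license check of this kind can be sketched with the standard library's `importlib.metadata`. This is an illustration only, not the tool mentioned above; `declared_license` is a hypothetical helper that reads the `License` field and any `License ::` trove classifiers of an installed distribution.

```python
from importlib import metadata


def declared_license(dist_name):
    """Return the license metadata an installed distribution declares.

    Returns None if the distribution is not installed. The "license" value
    is None when the package does not fill in the License field.
    """
    try:
        meta = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return None
    classifiers = meta.get_all("Classifier") or []
    return {
        "license": meta.get("License"),
        "license_classifiers": [
            c for c in classifiers if c.startswith("License ::")
        ],
    }
```

A package that ships no license metadata would yield a `None` license and an empty classifier list, which is the situation described in this issue.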

Add support to release aarch64 wheels

Problem

On aarch64, ‘pip install Levenshtein-search’ builds the wheel from source and then installs it. This requires the user to have a development environment installed on their system, and building the wheel takes longer than downloading and extracting a prebuilt wheel from PyPI.

Resolution

On aarch64, ‘pip install Levenshtein-search’ should download the wheels from PyPI.

@mattandahalfew Please let me know whether you are interested in releasing aarch64 wheels; I can help with this. Is there any plan to move to travis-ci.com? If not, could you please describe the steps or CI you currently use to release the wheels on PyPI?

Returning index of each matched string

Hi,

Thanks for this great tool!

I have one question: is it possible to also return the index in the wordset that a particular query word matches? For example, using the example in the README, can I do the following:

import Levenshtein_search

excerpt1 = ["We","went","to","the","fire","Mother","said","Is","he","cold","Versh","Nome","Versh","said","Take","his","overcoat","and","overshoes","off","Mother","said","How","many","times","do","I","have","to","tell","you","not","to","bring","him","into","the","house","with","his","overshoes","on"]

first_wordset = Levenshtein_search.populate_wordset(-1,excerpt1)

q = "overshoes"
maxdist = 4
results1 = Levenshtein_search.lookup(first_wordset,q,maxdist)

And somehow get the indices of all matches (without using the results1 output and iterating through the excerpt1 list again)? Iterating again through the excerpt1 list would be very slow for my large applications of this.

Thanks!

Keshav
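
One caller-side approach, independent of the library, is to build a word-to-positions table once and then look up each matched word in it, which avoids rescanning the list for every query. The sketch below assumes the matched words can be extracted from the lookup results (their exact shape is not shown in this issue); both helper names are hypothetical.

```python
from collections import defaultdict


def build_position_index(words):
    """Map each distinct word to the list of positions where it occurs."""
    positions = defaultdict(list)
    for i, word in enumerate(words):
        positions[word].append(i)
    return dict(positions)


def positions_of(matched_words, position_index):
    """Return {word: [positions]} for each matched word (empty if unseen)."""
    return {word: position_index.get(word, []) for word in matched_words}


# Hypothetical usage with the excerpt above:
# position_index = build_position_index(excerpt1)   # one pass, done once
# matched = [...]  # matched words pulled out of results1
# print(positions_of(matched, position_index))
```

Building the table is a single pass over the list, and each subsequent lookup is a dictionary access.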

License of the project silently changed

Until March 9, 2021, the master branch was MIT-licensed, so the resulting PIP package for Python_levenshtein was MIT-licensed too.

With the merge of the develop branch into master, the license of the master branch silently changed to GPL 3, which is much stricter than MIT: #21 (comment) and ed8c0f7

This change is not announced in the README and was not communicated to PIP packages depending on Levenshtein_search, nor is there a HOWTO for pinning to an older package version with the permissive license. Especially since the original master branch lacked the PIP metadata for the license, this change may cause trouble downstream.

Could you please reconsider using the original license (MIT) for the current codebase?

Or could you please at least communicate the change in a way that doesn't silently break downstream users? Ideally by reverting the merge of develop to master, releasing a PIP package with fixed license metadata (still MIT-licensed) and adding an import-time deprecation warning that the PIP package is now unmaintained, plus the same warning in the README so it will be visible on PyPI. The code with the new license would then be merged to master and released under its own PIP package name with a bumped major version, so downstream users cannot accidentally run into licensing issues via pip install --upgrade.
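
An import-time warning of the kind suggested here could look roughly like the sketch below. It is illustrative only: `warn_relicensed`, the message wording, and the package names passed to it are all hypothetical, and nothing like it exists in the project today.

```python
import warnings


def warn_relicensed(old_name, new_name):
    """Emit a deprecation warning pointing users at the renamed package."""
    warnings.warn(
        f"{old_name} is no longer maintained under this name; "
        f"later releases are published as {new_name} under a different license.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the importing code
    )


# Hypothetical placement: called once at the top level of the package's
# __init__ module so that it fires on import.
```

Python's default filters only show `DeprecationWarning` when it is attributed to code in `__main__`, so a maintainer who wants every end user to see the message might prefer `FutureWarning`.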
