
robertgr991 / fastdameraulevenshtein


Cython implementation of true Damerau-Levenshtein algorithm.

License: MIT License

Python 100.00%
cython damerau-levenshtein damerau-levenshtein-distance edit-distance-algorithm

fastdameraulevenshtein's People

Contributors

robertgr991

Forkers

vlesu legale

fastdameraulevenshtein's Issues

Fails to build

After failing to install via pip3, I tried to build directly from source, and I believe I hit the same problem. This is what python3 setup.py install produces:

running install
running bdist_egg
running egg_info
creating fastDamerauLevenshtein.egg-info
writing fastDamerauLevenshtein.egg-info/PKG-INFO
writing dependency_links to fastDamerauLevenshtein.egg-info/dependency_links.txt
writing top-level names to fastDamerauLevenshtein.egg-info/top_level.txt
writing manifest file 'fastDamerauLevenshtein.egg-info/SOURCES.txt'
reading manifest file 'fastDamerauLevenshtein.egg-info/SOURCES.txt'
writing manifest file 'fastDamerauLevenshtein.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.15-x86_64/egg
running install_lib
running build_ext
building 'fastDamerauLevenshtein' extension
creating build
creating build/temp.macosx-10.15-x86_64-3.8
creating build/temp.macosx-10.15-x86_64-3.8/fastDamerauLevenshtein
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk -I/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk/usr/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I/usr/local/include -I/usr/local/opt/[email protected]/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/include/python3.8 -c fastDamerauLevenshtein/fastDamerauLevenshtein.c -o build/temp.macosx-10.15-x86_64-3.8/fastDamerauLevenshtein/fastDamerauLevenshtein.o
fastDamerauLevenshtein/fastDamerauLevenshtein.c:4246:9: error: too many arguments to function call, expected 15, have
      16
        __pyx_empty_bytes  /*PyObject *lnotab*/
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fastDamerauLevenshtein/fastDamerauLevenshtein.c:331:82: note: expanded from macro '__Pyx_PyCode_New'
          PyCode_New(a, 0, k, l, s, f, code, c, n, v, fv, cell, fn, name, fline, lnos)
          ~~~~~~~~~~                                                             ^~~~
/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/include/python3.8/code.h:122:12: note:
      'PyCode_New' declared here
PyAPI_FUNC(PyCodeObject *) PyCode_New(
           ^
1 error generated.
error: command 'clang' failed with exit status 1

The macro at line 326 appears to assume that, starting with Python 3.8, PyCode_New takes 16 parameters rather than 15, but the Python documentation seems to state otherwise.

Fails to build on Python 3.8 virtual environment

Running setup.py install for fastDamerauLevenshtein did not run successfully.
│ exit code: 1
╰─> [12 lines of output]
/home/ec2-user/emr_cluster/lib64/python3.8/site-packages/setuptools/dist.py:642: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
warnings.warn(
running install
running build
running build_ext
building 'fastDamerauLevenshtein' extension
creating build
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/fastDamerauLevenshtein
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/home/ec2-user/emr_cluster/include -I/usr/include/python3.8 -c fastDamerauLevenshtein/fastDamerauLevenshtein.c -o build/temp.linux-x86_64-3.8/fastDamerauLevenshtein/fastDamerauLevenshtein.o
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> fastDamerauLevenshtein

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
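
The log above stops at "unable to execute 'gcc'", which suggests that no C compiler was on the PATH of that environment. A minimal pre-flight check, assuming the build only needs one of the usual compiler names (the candidate list and the helper name find_c_compiler are illustrative, not part of the package):

```python
import shutil


def find_c_compiler(candidates=("gcc", "cc", "clang")):
    """Return the path of the first candidate compiler found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)  # None when the executable is not on PATH
        if path:
            return path
    return None


compiler = find_c_compiler()
print(compiler or "no C compiler found on PATH")
```

If this prints no compiler, installing one through the system package manager before retrying pip is the likely next step.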

Incorrect score for similarity=True

Great package, but I just noticed a bug with the score in certain situations. If I run
damerauLevenshtein('some string', 'another one but longer', deleteWeight=1, insertWeight=3, replaceWeight=6, swapWeight=6, similarity=True)
I get a score of 0.03636... but if I run
damerauLevenshtein('some string', 'another one but longer and longer', deleteWeight=1, insertWeight=3, replaceWeight=6, swapWeight=6, similarity=True)
I get a score of 1.0 implying the two strings are identical.

From what I could see, it looks like the issue stems from the line of code
maxDist = min(len1, len2) * min(replaceWeight, deleteWeight + insertWeight) + (max(len1, len2) - min(len1, len2)) * min(deleteWeight, insertWeight)
which is (assuming I've understood your code) supposed to calculate the maximum distance as the cost of swapping out letters in the shorter word plus the cost of adding or removing any excess letters.

But for my example strings, I believe it should use insertWeight at the end rather than min(deleteWeight, insertWeight): there is no way to get from string1 to string2 by deletion, so it definitely needs insertion. I think the min() needs to be replaced with an if that checks whether insertions or deletions are required to get from string1 to string2.
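
To make the arithmetic concrete, here is a small sketch that evaluates the quoted maxDist formula next to the variant proposed above. The function names are hypothetical, and the second function is only one reading of the suggested fix, not the package's code:

```python
def max_dist_current(len1, len2, delete_w, insert_w, replace_w):
    """maxDist exactly as in the line quoted from the package."""
    shorter, longer = min(len1, len2), max(len1, len2)
    return (shorter * min(replace_w, delete_w + insert_w)
            + (longer - shorter) * min(delete_w, insert_w))


def max_dist_proposed(len1, len2, delete_w, insert_w, replace_w):
    """Variant: pick the weight by direction (string1 -> string2)."""
    shorter, longer = min(len1, len2), max(len1, len2)
    # Growing string1 into a longer string2 needs insertions, shrinking needs deletions.
    excess_w = insert_w if len2 > len1 else delete_w
    return (shorter * min(replace_w, delete_w + insert_w)
            + (longer - shorter) * excess_w)


s1 = "some string"
for s2 in ("another one but longer", "another one but longer and longer"):
    args = (len(s1), len(s2), 1, 3, 6)  # deleteWeight=1, insertWeight=3, replaceWeight=6
    print(len(s2), max_dist_current(*args), max_dist_proposed(*args))
# prints:
# 22 55 77
# 33 66 110
```

In the second case the 22 extra characters alone cost 22 * 3 = 66 in insertions, and the strings cannot match on the remainder (the target contains no "s"), so the real distance exceeds the current maximum of 66. The reported 0.03636 for the first pair equals 2/55, which would fit a normalisation of the form 1 - distance/maxDist, though that form is an assumption here.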

I'm running Python 3.7.3 and fastDamerauLevenshtein v1.0.7.

Fails to build on Python 3.11 - longintrepr.h: No such file or directory

Similar issue as here

To reproduce

pip install fastdameraulevenshtein

Logs

Collecting fastdameraulevenshtein
  Using cached fastDamerauLevenshtein-1.0.7.tar.gz (36 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: fastdameraulevenshtein
  Building wheel for fastdameraulevenshtein (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for fastdameraulevenshtein (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [28 lines of output]
      /tmp/pip-build-env-3upkms0f/overlay/lib/python3.11/site-packages/setuptools/dist.py:745: SetuptoolsDeprecationWarning: Invalid dash-separated options
      !!
      
              ********************************************************************************
              Usage of dash-separated 'description-file' will not be supported in future
              versions. Please use the underscore name 'description_file' instead.
      
              By 2023-Sep-26, you need to update your project and remove deprecated calls
              or your builds will no longer be supported.
      
              See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
              ********************************************************************************
      
      !!
        opt = self.warn_dash_deprecation(opt, section)
      running bdist_wheel
      running build
      running build_ext
      building 'fastDamerauLevenshtein' extension
      creating build
      creating build/temp.linux-x86_64-cpython-311
      creating build/temp.linux-x86_64-cpython-311/fastDamerauLevenshtein
      gcc -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/opt/asl-env/include -I/usr/local/include/python3.11 -c fastDamerauLevenshtein/fastDamerauLevenshtein.c -o build/temp.linux-x86_64-cpython-311/fastDamerauLevenshtein/fastDamerauLevenshtein.o
      fastDamerauLevenshtein/fastDamerauLevenshtein.c:209:12: fatal error: longintrepr.h: No such file or directory
        209 |   #include "longintrepr.h"
            |            ^~~~~~~~~~~~~~~
      compilation terminated.
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for fastdameraulevenshtein
Failed to build fastdameraulevenshtein

Python version

3.11.4
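
For context, Python 3.11 moved longintrepr.h into the cpython/ subdirectory of the include directory, so C sources that include it by its old top-level name no longer compile. The sketch below reports where the header lives for the running interpreter; find_header is a hypothetical helper, and it returns an empty list when the Python development headers are not installed:

```python
import sysconfig
from pathlib import Path


def find_header(name="longintrepr.h"):
    """Return the locations of `name` under this interpreter's include directory."""
    include_dir = Path(sysconfig.get_paths()["include"])
    # Old top-level location first, then the cpython/ subdirectory used by 3.11.
    candidates = [include_dir / name, include_dir / "cpython" / name]
    return [path for path in candidates if path.is_file()]


for path in find_header():
    print(path)
```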

Working with very large strings

Hello, and thank you for your project. I have one question and hope you can help.
I have two very large arrays of integers (values from 0 to 100), and I want to find out how similar they are. I thought I could convert these lists to lists of strings and then use your string similarity metric. However, the arrays are very large (about 16,000 elements), and there is not enough memory.
Is there a way to calculate this metric approximately so that it uses less memory, or perhaps to reduce the arrays somehow?
Thank you very much for any help.
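
One way to reduce memory, sketched here in plain Python rather than with this package: a full distance matrix for two sequences of 16,000 elements has 256 million cells, but the restricted variant known as optimal string alignment (OSA) only needs the last three rows. OSA is not the true Damerau-Levenshtein distance this package implements (the two can differ when a transposed pair is edited again), and the sketch assumes unit weights; osa_distance is a hypothetical name.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between two sequences, unit weights.

    Keeps three rows of the table, so memory is O(min(len(a), len(b))).
    """
    if len(a) < len(b):
        a, b = b, a  # unit weights are symmetric, so keep the shorter sequence as the row
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # match or replacement
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # adjacent transposition
        prev2, prev = prev, cur
    return prev[len(b)]


print(osa_distance([1, 2, 3], [1, 3, 2]))  # 1 (one adjacent swap)
print(osa_distance("ca", "abc"))           # 3 (true Damerau-Levenshtein gives 2)
```

It accepts lists of integers directly, so no conversion to strings is needed. The running time is still proportional to len(a) * len(b), which will be slow in pure Python at this size; the sketch only addresses memory.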
