zseder / hunmisc

miscellaneous tools/scripts for different NLP related tasks

License: GNU Lesser General Public License v3.0

Python 96.65% Shell 1.90% Perl 0.13% TeX 0.22% Awk 0.03% Jupyter Notebook 1.07%

hunmisc's People

Contributors

david-cliqz, davidnemeskey, gabor-recski, juditacs, makrai, pajkossy, recski, zseder, zseder-cliqz

hunmisc's Issues

Accidental push

@zseder I accidentally pushed my changes to master. Could you check them out and see if I have to change anything (HEAD vs HEAD~2)?

levenshtein(): set weights for operations

Enable the user to assign weights to the three basic operations (insertion, deletion and replacement) of the Levenshtein distance (LSD). The defaults must be the current, hard-coded weights, even though they differ from the original case, in which replacement had a weight of 2.
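A minimal sketch of what the weighted signature could look like. The parameter names `insert_cost`, `delete_cost` and `replace_cost` are hypothetical, and the defaults of 1 are an assumption, since the issue does not state the current hard-coded weights.

```python
def levenshtein(s1, s2, insert_cost=1, delete_cost=1, replace_cost=1):
    """Weighted Levenshtein distance from s1 to s2 (full-matrix version).

    The default weights of 1 are an assumption; the real defaults should be
    whatever the existing implementation hard-codes.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    dist = [[0] * cols for _ in range(rows)]

    # Marginals: turning a prefix into the empty string and vice versa.
    for i in range(1, rows):
        dist[i][0] = i * delete_cost
    for j in range(1, cols):
        dist[0][j] = j * insert_cost

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if s1[i - 1] == s2[j - 1] else replace_cost
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost,       # delete s1[i-1]
                dist[i][j - 1] + insert_cost,       # insert s2[j-1]
                dist[i - 1][j - 1] + substitution,  # replace or match
            )
    return dist[-1][-1]
```

Calling `levenshtein(a, b, replace_cost=2)` would then reproduce the variant in which a replacement costs as much as a deletion plus an insertion.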

Have a custom weighted levenshtein()

Enable the user to customize the LSD further by giving the function three maps:

  • insert_map and delete_map would contain character -> weight pairs: whenever we insert/delete the given character, the cost grows by the specified weight;
  • replace_map would store (character, character) -> weight pairs: the cost of replacing the first character with the second.

I would create a new function so that the original levenshtein() could remain as simple as possible. I would even go as far as to create a Levenshtein class that takes as parameters the above (and all parameters the original has) and that has a distance(s1, s2) method.

Actually, thinking about it now, the new functionality does not make the implementation that much more complex (only a few dict.get's), but would probably make it a bit slower. What do you think?
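To make the proposal concrete, here is a rough sketch of such a class. The constructor arguments beyond the three maps, the tuple keys of `replace_map`, and the fallback to the global weights for characters missing from a map are all assumptions, not something the issue specifies.

```python
class Levenshtein:
    """Levenshtein distance with optional per-character weights (sketch)."""

    def __init__(self, insert_cost=1, delete_cost=1, replace_cost=1,
                 insert_map=None, delete_map=None, replace_map=None):
        self.insert_cost = insert_cost
        self.delete_cost = delete_cost
        self.replace_cost = replace_cost
        self.insert_map = insert_map or {}    # char -> weight
        self.delete_map = delete_map or {}    # char -> weight
        self.replace_map = replace_map or {}  # (char, char) -> weight

    def _insert(self, c):
        return self.insert_map.get(c, self.insert_cost)

    def _delete(self, c):
        return self.delete_map.get(c, self.delete_cost)

    def _replace(self, c1, c2):
        if c1 == c2:
            return 0
        return self.replace_map.get((c1, c2), self.replace_cost)

    def distance(self, s1, s2):
        rows, cols = len(s1) + 1, len(s2) + 1
        dist = [[0] * cols for _ in range(rows)]
        # Marginals accumulate the per-character deletion/insertion weights.
        for i in range(1, rows):
            dist[i][0] = dist[i - 1][0] + self._delete(s1[i - 1])
        for j in range(1, cols):
            dist[0][j] = dist[0][j - 1] + self._insert(s2[j - 1])
        for i in range(1, rows):
            for j in range(1, cols):
                c1, c2 = s1[i - 1], s2[j - 1]
                dist[i][j] = min(
                    dist[i - 1][j] + self._delete(c1),
                    dist[i][j - 1] + self._insert(c2),
                    dist[i - 1][j - 1] + self._replace(c1, c2),
                )
        return dist[-1][-1]
```

For example, `Levenshtein(replace_map={("a", "b"): 0.5}).distance("a", "b")` charges 0.5 for that one substitution, while the reverse direction falls back to the global replacement weight. The extra cost per cell is indeed just one `dict.get` call per operation.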

Speed up levenshtein

These algorithms should not really be written in Python, and calling them many times can make the program rather slow, so any method that helps to speed them up is welcome.

  1. Max distance. Now for real :)
  2. Use only two lists (+ the marginals).
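For item 2, a possible shape of the two-list variant, assuming unit weights for all three operations. The function name `levenshtein_two_rows` is made up for this sketch; the marginals are the initial `previous` row and the first element of each `current` row.

```python
def levenshtein_two_rows(s1, s2):
    """Unit-cost Levenshtein distance keeping only two rows of the matrix."""
    # With unit costs the distance is symmetric, so the shorter string can
    # be used for the row length to keep the two lists small.
    if len(s1) < len(s2):
        s1, s2 = s2, s1

    previous = list(range(len(s2) + 1))  # top marginal: row for the empty prefix
    for i, c1 in enumerate(s1, 1):
        current = [i]                    # left marginal for this row
        for j, c2 in enumerate(s2, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # replacement or match
            ))
        previous = current
    return previous[-1]
```

This keeps memory proportional to the shorter string instead of the full matrix; the number of cell updates stays the same.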

dbpedia parser

The selection of the main category should be rewritten so that it doesn't depend on the order of the entity's lines in the dump.

7zip file open

Since pylzma doesn't seem to be working, we should create a wrapper around "7zr" the way we did with gzip in xzip.py
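A rough sketch of such a wrapper, assuming `7zr` is on the PATH and that streaming the archive contents to standard output via the `e -so` switches is acceptable. The function names are hypothetical and do not mirror whatever interface xzip.py actually exposes.

```python
import subprocess


def seven_zip_command(path, executable="7zr"):
    """Build the command that extracts an archive to standard output."""
    # 'e' extracts, '-so' redirects the extracted data to stdout.
    return [executable, "e", "-so", path]


def read_7z_lines(path, executable="7zr", encoding="utf-8"):
    """Yield decoded lines from a 7z archive by piping through 7zr."""
    with subprocess.Popen(seven_zip_command(path, executable),
                          stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL) as proc:
        for raw_line in proc.stdout:
            yield raw_line.decode(encoding)
    # Leaving the 'with' block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise IOError("%s exited with status %d" % (executable, proc.returncode))
```

If the archive holds several files, their contents arrive concatenated on the stream, which may or may not be what the callers expect.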

max_distance doesn't work, rolling back...

Since ALL cells in the matrix are computed (which might not actually be necessary, I have to check), an intermediate cell can reach max_distance even though the final result is lower than that. So max_distance now has a new meaning: if the final result is greater than max_distance, max_distance is returned instead.
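One way to get both the capped result and a safe early exit is sketched below, assuming unit weights. It is not the rolled-back implementation; the name `levenshtein_capped` is hypothetical. Rather than checking single cells, it checks the minimum of a whole row: every edit path has to pass through each row and costs never decrease along a path, so the row minimum is a lower bound on the final result.

```python
def levenshtein_capped(s1, s2, max_distance=None):
    """Unit-cost Levenshtein distance, capped at max_distance if given."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # replacement or match
            ))
        # A single large cell proves nothing, but if the whole row is above
        # the cap, the final result must be above it as well.
        if max_distance is not None and min(current) > max_distance:
            return max_distance
        previous = current

    result = previous[-1]
    if max_distance is not None and result > max_distance:
        return max_distance
    return result
```

With this meaning, a return value equal to `max_distance` reads as "at least this far apart", and anything below it is the exact distance.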
