
barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/

License: MIT License

Python 100.00%
levenshtein-distance python python-spell-checking spellcheck spellchecker spelling-checker

pyspellchecker's People

Contributors

barrust, blayzen-w, cangareijo, cast42, davido-brainlabs, ebuildy, grayjk, johnosbb, mrjamesriley, msalhab96, raivisdejus, stephencawood, sviperm, xezpeleta


pyspellchecker's Issues

Non-deterministic correction

When correcting a word, if the top candidates have the same probability, the chosen one appears to depend on the (unordered) dictionary iteration order, so the correction is non-deterministic in that case.

It would be great to add an arbitrary but constant order in that case.
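A possible workaround, sketched here outside the library, is to sort the candidates before taking the maximum, so that ties break alphabetically:

```python
def deterministic_best(candidates, probability):
    # sort first so that, among candidates with equal probability,
    # max() returns the alphabetically first one every time
    return max(sorted(candidates), key=probability)
```

With equal probabilities, the result no longer depends on dict/set iteration order.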

Windows encoding errors not fixed in version 0.5.1

On windows, spellchecker still has problems with encoding errors:

import spellchecker
spell = spellchecker.SpellChecker("de")
# This word is correctly spelled
spell.known(["beschäftigen"])  # output: set()
spell.correction("beschäftigen")  # output: 'beschã¤ftigen'
spellchecker.__version__  # output: '0.5.1'

User Interface

Is there a way to build a user interface for this code? What plugins can you suggest? Thank you!
Also, is there another scientific approach for correcting capitalization problems? Thanks!

Wrongly corrects all punctuation into 'a'

>>> from spellchecker import SpellChecker
>>> spell = SpellChecker()
>>> spell.correction('-')
'a'
>>> spell.correction(',')
'a'
>>> from string import punctuation
>>> [spell.correction(punc) for punc in punctuation]
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']
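Until this is fixed in the library, one workaround (a sketch, assuming you wrap the correction call yourself) is to skip tokens that contain no alphabetic characters:

```python
def correct_token(correct, token):
    # leave punctuation-only tokens untouched; only pass real words
    # through the spell checker's correction function
    if not any(ch.isalpha() for ch in token):
        return token
    return correct(token)
```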

Word Frequency Threshold

Per a comment by @ksingha5 in #12, it would be nice to have the ability to remove all words under a certain threshold.

something like:

from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.remove_by_threshold(10)  # remove all words that have 10 or fewer instances
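A minimal sketch of what such a method could do over a plain word→count mapping (the library's internal structure may differ):

```python
def remove_by_threshold(word_frequency, threshold):
    # drop every word whose count is at or below the threshold
    return {word: count for word, count in word_frequency.items() if count > threshold}
```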

Update dictionaries

Hello,
First thanks for this project it was really cool to discover it 🎉

I was playing around using the spellchecking functionalities, but then I noticed I started to get unknown words, which were quite "known to me", so I went to check the dictionaries, and it seems those were not there.

Checking the README, I noticed that you base the word list from FrequencyWords so I went there and I looked for these couple of examples of not found words, and they were there.

In particular, I'm interested in the Spanish dictionaries, so I downloaded es_full.txt and created a .gz file, but when I saw the sizes... I didn't want to submit a PR with a new version of the es.json.gz file 😨

472K    es.json.gz
6.9M    es_full.json.gz
300K    es50k.json.gz

I noticed the current es.json.gz includes more words than the 50k version but fewer than es_full, so I would like to know the criteria used to create the dictionaries; then maybe I can submit a PR with an updated file that isn't as large as the 6.9M full version.
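For reference, one way to build a smaller dictionary from a FrequencyWords text file is to apply a count cutoff while parsing. The actual criteria used for es.json.gz are unknown to me; the cutoff below is purely illustrative:

```python
def build_frequency_dict(lines, min_count=50):
    # FrequencyWords lines have the form "word count"; keep only words
    # whose count reaches the (illustrative) cutoff
    freq = {}
    for line in lines:
        word, count = line.rsplit(maxsplit=1)
        if int(count) >= min_count:
            freq[word] = int(count)
    return freq
```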

Thanks for your time 😄

Tiny mistake in how to make a new dictionary

from spellchecker import SpellChecker

spell = SpellChecker(language=None)  # turn off loading a built language dictionary

# if you have a dictionary...
spell.word_frequency.load_dictionary('./path-to-my-json-dictionary.json')

# or... if you have text
spell.word_frequency.load_text('./path-to-my-text-doc.txt')

# export it out for later use!
spell.export('my_custom_dictionary.gz', gzipped=True)

this should be

from spellchecker import SpellChecker

spell = SpellChecker(language=None)  # turn off loading a built language dictionary

# if you have a dictionary...
spell.word_frequency.load_dictionary('./path-to-my-json-dictionary.json')

# or... if you have text
spell.word_frequency.load_text_file('./path-to-my-text-doc.txt')

# export it out for later use!
spell.export('my_custom_dictionary.gz', gzipped=True)

Fix typos by introducing whitespaces

Is there a way (or workaround) to correct mistakes (typos) also by introducing whitespaces?

eg. "yes specificallyshows [...]" -> "yes specifically shows"

I guess it could be somewhat more complicated, since you have to match with two words, split accordingly, etc.
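As a rough sketch of the idea (assuming access to the set of known words), one could try every split point and keep the splits where both halves are dictionary words:

```python
def suggest_split(word, known):
    # try each split position; return the first split where both
    # halves are known words, else None
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in known and right in known:
            return left + " " + right
    return None
```

As the issue notes, a full solution would also need to rank splits against single-word corrections.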

add custom tokenizer

Hello,
Thanks for your great project!

I would like to know if it is possible to change the tokenizer used for computing frequencies.

For example I use the spaCy tokenizer for french.

for the following french sentence:

"l'attaque de non-concurrence: un rendez-vous pour cette nouvelle rentrée"
the spaCy tokenizer will output the following word sequence

["l'",
 'attaque ',
 'de ',
 'non-concurrence',
 ': ',
 'un ',
 'rendez-vous ',
 'pour ',
 'cette ',
 'nouvelle ',
 'démarches',
 '/',
 'plan']

As a consequence, pyspellchecker will always try to correct hyphenated words like rendez-vous, since they were tokenized differently (i.e. ["rendez", "vous"], which has a completely different meaning).
Would it be possible to accept a custom tokenizer when loading a custom frequency file? e.g.

spell.word_frequency.load_text_file("./my_txt_file", tokenizer=spacy_tokenizer)

I did not try but maybe simply using a custom word_frequency dict will do the job?
spell.word_frequency.load_dictionary('./path-to-my-word-frequency.json')

thanks for your help
Armand

Wrong Corrections, Ignores High Frequency Words

Hi,

I built a custom Arabic dictionary following FrequencyWords and the docs' instructions. There is a lot of variation in the word frequencies because my corpus is quite small. The issue is that the spell checker won't choose the word with the highest frequency; instead it picks words with a much lower frequency. I'm not sure why, but it always chooses the same corrections, even after I increased the frequency of the important words to widen the gap. These wrong corrections are at the same edit distance as the high-frequency words.

does not recognise English

from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled

misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

Result -----

File "C:\Users\xyz\Miniconda2\lib\site-packages\spellchecker\spellchecker.py", line 35, in init
raise ValueError(msg)

ValueError: The provided dictionary language (en) does not exist!

Is it possible to add other languages dictionaries?

Hello,

I'm trying to use this library to fix typos made when adding additional information about a problem, but it's not restricted to English and Spanish; we would also like Portuguese, etc. Is there a way for me to download a dict and add it to the library?

Automatic Spell Check?

Do I understand your examples correctly, that there is no way to point SpellChecker to a document and have it go through and auto-correct, like a word processor would do? I have to already know - or suspect - what words are misspelled and then ask SpellChecker to verify?

How to add new language

Will you please give me clear instructions or steps so that I can add the Urdu language? I'm not able to download the Urdu file from the link you mentioned.

Extreme memory usage for words near longest word length in the dictionary

It looks like a string that will never match a word but still fits under the maximum word length can cause extreme memory usage.

Take the example "57ef934a-dbb0-4978-8626d41c819274". The first set of candidates for distance 1 generates an array of 3,214 words. Since none of these will match, it will then try to generate the list of distance-2 candidates based on each distance-1 candidate, which expands the array to 10,380,316 words and eats up around 30%+ of the memory on our 4 GB server. After around 4 or 5 requests the server maxes out memory and freezes. If this package is used to spell-correct a user input field, a bot could easily overload the server.

It looks like it should be possible to limit the words generated when iterating over the distance-1 candidates and generating the distance-2 sets: only keep the distance-2 words that match the known set and ignore the ones that don't. This would limit the maximum number of candidate words to around the total distance * number of words for distance 1, so in this case around 6,428 instead of 10,380,316 words in memory at the same time. This should be possible by modifying the functionality of __edit_distance_alt and possibly edit_distance_2.

Example:
return [e2 for e1 in tmp for e2 in self.known(self.edit_distance_1(e1))]

I'll make a PR in a bit for this

Set tokenizer on init

Adding the tokenizer to use in the initialization script would make it easier for users to utilize instead of having to pass the tokenizer in for each loading of text or a file.

Wrong correction suggestion

I am trying to use the spellchecker, but it gives back the same word as the correction even when it is not a correct word.

spell_en = SpellChecker()
miss_en = spell_en.unknown(["rlnwatvwv", "reditanstalt", "uivrv", "candidates","gibaatwwxxx"])
for word in miss_en:
    print("misspelled word: ", word)
    print("suggested correction: ", spell_en.correction(word))
    print("candidates: ", spell_en.candidates(word))
    print("----------")

output:
misspelled word: uivrv
suggested correction: livre
candidates: {'livre'}

misspelled word: reditanstalt
suggested correction: reditanstalt
candidates: {'reditanstalt'}

misspelled word: gibaatwwxxx
suggested correction: gibaatwwxxx
candidates: {'gibaatwwxxx'}

misspelled word: rlnwatvwv
suggested correction: rlnwatvwv
candidates: {'rlnwatvwv'}

spellchecker in pypi

If you accidentally install spellchecker instead of pyspellchecker you get something installed and it's version 0.4.0 but it just doesn't work. Is it yours? Can you remove it so people don't accidentally install the wrong library? It's an easy mistake to make when the import command is for spellchecker.

An error with python2.7: ValueError: Invalid mode ('rtb')

I try the testing code:

from spellchecker import SpellChecker

spell = SpellChecker()

an get the error as below :

Traceback (most recent call last):
  File "1.py", line 4, in <module>
    spell = SpellChecker()
  File "C:\Python27\lib\site-packages\spellchecker\spellchecker.py", line 43, in __init__
    self._word_frequency.load_dictionary(full_filename)
  File "C:\Python27\lib\site-packages\spellchecker\spellchecker.py", line 327, in load_dictionary
    with load_file(filename, encoding) as data:
  File "C:\Python27\lib\contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "C:\Python27\lib\site-packages\spellchecker\utils.py", line 27, in load_file
    with gzip.open(filename, mode="rt") as fobj:
  File "C:\Python27\lib\gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "C:\Python27\lib\gzip.py", line 94, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
ValueError: Invalid mode ('rtb')

The pyspellchecker is installed by : pip install pyspellchecker

Correct a sentence

There is known([words]) and unknown([words]). It would be great to have correct([words]) or correct(sentence). It seems feasible with both, right?
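Such a function could be a thin wrapper over the existing correction() primitive; a minimal sketch (not the library's API):

```python
def correct_words(words, correction):
    # apply the single-word correction function to each token in turn
    return [correction(word) for word in words]
```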

Memory Issue for long spellings errors/strings

For testing purposes I passed in words of length 20 and 100; in both cases my server reports that it is unable to allocate memory, and my local machine, which has 8 GB of RAM, hangs.

pyspellchecker module and indexer

With version 0.5.5 it does not work, as there seems to be a problem with the pyspellchecker module right now.
The stable version at the moment is 0.5.4.

The spellchecker module still relies on the indexer library, which is no longer supported on Python 3. Is any update planned?
Thank you in advance.


spellchecker.py:422: DeprecationWarning

Hi,
I'm getting this warning:
....spellchecker.py:422: DeprecationWarning: 'encoding' is ignored and deprecated. It will be removed in Python 3.9

Are you planning an update?

Thank you in advance.

Unable to read dictionary

When I initialize the SpellChecker it throws an error:
spell = SpellChecker()
"ValueError: The provided dictionary language (en) does not exist!"

How to fix this?

Export word frequency dictionary

The ability to export a word frequency dictionary would allow us to better support users utilizing a custom dictionary.

Use case:

  1. User loads the English dictionary and then loads text about programming languages.
  2. Instead of having to do this for every instance, it would be better if the user could export or save the dictionary so that they can simply load it in the future.
  3. This would also allow spellchecking programs to offer easier "Add Word" type functionality.
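A sketch of such an export, assuming the dictionary is a word→count mapping and using the gzipped-JSON format the bundled dictionaries appear to use:

```python
import gzip
import json

def export_word_frequency(word_frequency, filepath):
    # write the word->count mapping as gzipped UTF-8 JSON
    with gzip.open(filepath, "wt", encoding="utf-8") as fobj:
        json.dump(word_frequency, fobj)

def import_word_frequency(filepath):
    # read a dictionary previously written by export_word_frequency
    with gzip.open(filepath, "rt", encoding="utf-8") as fobj:
        return json.load(fobj)
```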

How to add new words to the dictionary?

From the docs, as far as I understood you can add new words as follows:

from spellchecker import SpellChecker
spell = SpellChecker(distance=1)
spell.word_frequency.load_words(['words', 'to', 'be','added', 'to', 'the', 'system', 'Addis Abeba'])

However, I am trying to correct the word "Addis Abeba" as follows, and it doesn't work:

In:

misspelled = spell.unknown(['something', 'is', 'hapenning', 'here', 'Adis abebba'])
for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))

Out:

Adis abebba
{'Adis abebba'}
hapenning
{'hapenning'}

Thus, how can I add the word "Adis abebba" to my dictionary in order to check and correct words like "Addis abeba" or "Adiis abebbba"?
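Note that the checker compares single tokens against the dictionary, so a multi-word phrase passed to load_words is unlikely to behave as one unit. A sketch of adding a phrase word by word (a workaround over a plain word→count mapping, not the library's API):

```python
def add_phrase(word_frequency, phrase):
    # split a multi-word name into individual lowercase tokens, since
    # the checker looks up one token at a time
    for token in phrase.lower().split():
        word_frequency[token] = word_frequency.get(token, 0) + 1
    return word_frequency
```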

Function word_probability(word) returns 0.0

Hi, I am not sure how to use the word_probability(word) function.
I am currently using pyspellchecker to correct a list of misspelled words, but it gives me an output of 0.0, and what I need is a list of probabilities for each candidate in the word list printed before.
Here the code:

from spellchecker import SpellChecker

myListOfWords = ['medicin', 'increas', 'caus', 'daili', 'reduc', 'healthi', 'vaccin', 'diseas', 'intak', 'peopl', 'realli', 'diabet', 'exercis', 'possibl', 'pressur', 'bodi']

spell = SpellChecker()

for word in myListOfWords :

    print(spell.correction(word))  # gives the "best candidate" in theory
    print(spell.candidates(word))  # gives the candidates (I don't understand their order)
    print(spell.word_probability(word))  # here I need the probability of each candidate, to find the best and second-best candidates

Why am I doing this? In the code above you can see that the word 'diabet' returns 'diet' instead of 'diabetes'.
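For context, word_probability appears to return the relative frequency of the word you pass in, not of its candidates, so an out-of-vocabulary word yields 0.0. The underlying computation is roughly:

```python
def word_probability(word, frequencies):
    # relative frequency of the word itself; a word missing from the
    # dictionary gives 0.0, which is why querying a misspelling returns 0
    total = sum(frequencies.values())
    return frequencies.get(word, 0) / total
```

To rank candidates, call it on each candidate rather than on the misspelling.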

I would like to find an accurate correction related to my topic. As far as I know my options are :

  1. Passing "distance =1" argument in the 'correction' function-> does not correct the problem with 'diabet' word.

  2. Providing a text file dictionary with all the words of my interest, as you suggested here (load_text_file). Question: what is the expected format for this txt file? Could you share an example?

  3. Adding a new function to adjust the algorithm based on the terminology I am using (health-related terminology), by means of adding a new argument (topic = "Health") and thereby biasing the spell corrections toward all the terminology related to that topic. Are you already developing anything like that in the module?

Please could you give me a clue about how to do this (2 & 3 questions)?
Many thanks!

UPDATE
I am using the txt file of medical terms provided by @glutanimate & @dgreuel here.
I think it partially solved the issue for my purposes.

Word frequency/dictionary

Hi. We want to add some words to the dictionary but we were not able to find and access it. How can we access the word frequency dictionary? Also, can we limit the words suggested by pyspellchecker? In some cases it suggests a lot of words after spell checking.

Please help us out. Thank you!

Ensure tokenizer uses lowercase

Related to #34 - ensure that the provided tokenizer uses lowercase like the default tokenizer.

We should also add some tests for the tokenizer.
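A test could feed mixed-case text through the tokenizer and assert every token comes back lowercase; sketched here with a hypothetical regex tokenizer standing in for the default one:

```python
import re

def lowercase_tokenizer(text):
    # hypothetical tokenizer: extract word tokens and force lowercase,
    # matching what the default tokenizer is expected to do
    return [match.group().lower() for match in re.finditer(r"[a-zA-Z]+", text)]
```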

Problem to set 'pt' language

SpellChecker is working for any supported languages, except for Portuguese ('pt').
When I try using spell = SpellChecker('pt'), an error message appears saying:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 587200: character maps to <undefined>

I've also tried to load the dictionary the other way around:

spell = SpellChecker(language=None)
spell.word_frequency.load_dictionary('path/to/pt.json.gz', encoding=u'utf-8')

The same error occurs.

I'm using Python 3.6.8 in Windows 10.

Use Trie for spelling check

The spelling checker takes a lot of time to check words. Especially if a word is not in the dictionary it takes even longer, which can make checking several words a really long process. Besides, the current approach only allows searching up to a Levenshtein distance of 2. Thus it would be convenient to:

  1. Use dynamic programming to calculate edit distances in general, to allow any distance (distances greater than 2 will only make sense when using a trie approach, see point 2).
  2. Use a trie as a data structure for the dictionary to enable faster lookups (instead of computing all words within 1 or 2 edit distances).
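For point 1, the standard dynamic-programming Levenshtein distance handles arbitrary distances; a self-contained sketch:

```python
def levenshtein(a, b):
    # classic O(len(a) * len(b)) edit distance using a rolling row
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]
```

Point 2 (a trie) would then prune the search so this is not computed against every dictionary word.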

Higher Levenshtein Distance than 2

It is not possible to correct words that have a Levenshtein distance higher than 2 (at least in German).

A parameter to change this would be much appreciated.

Multiple Dictionary Support

Adding a standardized way to store and load different dictionaries will allow the package to be used in different languages.

A few options include:

  • Pickle word_frequency lists
  • JSON formatted word_frequency lists
  • others?

spell.correction is taking way too long time for each word

So, I'm using the spell.correction(word) method to correct a batch of text files in Spanish.

After running some tests I noticed that some texts take more than 60 seconds to correct; the misspelled words take 5 to 15 seconds each to correct through the spell.correction(word) method.

As you can imagine, for a batch of many texts this bottleneck in the preprocessing takes several hours instead of minutes or seconds.

I haven't inspected the code of that method, but I imagine this has something to do with ranking the Levenshtein distance of the misspelled word against the rest of the dictionary, which is done to choose the nearest neighbour.

Maybe there could be a way to use an approximate KNN, or to put a time threshold on the correction logic.
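Short of changing the algorithm, memoizing corrections helps batches where the same misspellings recur; a workaround sketch, not a library feature:

```python
from functools import lru_cache

def make_cached_corrector(correction):
    # wrap the (slow) correction function so each distinct word is
    # only corrected once across the whole batch
    @lru_cache(maxsize=None)
    def cached(word):
        return correction(word)
    return cached
```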

You can replicate this issue using spell.correction(word) for each word of this text:

"no estar de acuerdo con la forma de militarizar la Araucaní­a, la manera presora con la cual estan atacando al pueblo mapuche.el delincuencia no ha disminuido."

readthedocs support

Getting better documentation of the package will make using and extending it easier for everyone.

  • Add sphinx autodocs
  • Convert README.md to README.rst
  • Improve the docstrings

Version 0.2.0 is broken

Hello again,
This time I'm here because version 0.2.0 is broken for Python 3. It needs the indexer library, and this library is only available for Python 2.
Version 0.1.5 has no errors.
Thanks

Spellchecker for Bahasa (id)

Hi,

Thank you for such a great work, it helps me a lot.
Would it be possible for you to update the resources with Bahasa Indonesia (id)?
I have downloaded the text file from here, https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/nl/id_full.txt, and converted it to json.

Then I followed the steps below:
from spellchecker import SpellChecker
spell = SpellChecker(language=None)
spell.word_frequency.load_dictionary('id_full.json')

Then I created a list of some misspelled and correctly spelled words in Bahasa Indonesia.
Here 'makn' is a misspelling of 'makan' and 'sakt' --> 'sakit'; these are common words in Bahasa.
misspelled = spell.unknown(['makn', 'minum', 'sakt', 'kemana', 'gila'])

Then I run the loop below and it prints nothing:

for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))

Is there something wrong with any of the steps above?

English spellchecking

Hello Team!
I am new to the Project and I have a question.

I use python 3.7 and run into problem with this test program:

from spellchecker import SpellChecker
spell = SpellChecker()                         
split_words = spell.split_words
spell_unknown = spell.unknown

words = split_words("That's how t and s don't fit.")
print(words)
misspelled = spell_unknown(words)
print(misspelled)

With pyspellchecker ver 0.5.4 the printout is:

['that', 's', 'how', 't', 'and', 's', 'don', 't', 'fit']
set()

So free-standing 't' and 's' are not marked as errors, and neither are contractions.

If I change the phrase to:

words = split_words("That is how that's and don't do not fit.")

and use pyspellchecker ver 0.5.6 the printout is:

['that', 'is', 'how', 'that', 's', 'and', 'don', 't', 'do', 'not', 'fit']
{'t', 's'}

So contractions are marked as mistakes again.
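One way around both behaviours is a tokenizer that keeps apostrophes inside words, so contractions survive as single tokens; a sketch, assuming a custom tokenizer can be swapped in:

```python
import re

def split_words_keep_contractions(text):
    # keep internal apostrophes so "don't" stays one token instead of
    # being split into "don" and "t"
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
```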

(I read barrust's comment on Oct 22, 2019.)

Please, assist.

What is your definition of "long"?

"If the words that you wish to check are long, it is recommended to reduce the distance to 1". How many characters do you recommend to use for "long"?

Preserve case of letters after correction

The case of letters changes after the correction by spellchecker.unknown() function.
In my use case I need to point out the line numbers of the words that were mistaken.
I use spellchecker.unknown() to find words that were mistyped and search line numbers of these words but since

spellchecker.unknown(['Thankk']) 

will return thankk (lowercasing the first letter), it is difficult to point out the line number.
Would it be feasible to preserve the Case of letters?
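A workaround, sketched under the assumption that unknown() takes lowercase tokens, is to keep the original tokens and map the lowercased results back to them for line reporting:

```python
def find_misspelled_positions(lines, unknown):
    # report (line_number, original_cased_word) for every misspelling,
    # even though unknown() returns lowercased words
    results = []
    for lineno, line in enumerate(lines, start=1):
        tokens = line.split()
        bad = unknown([t.lower() for t in tokens])
        results.extend((lineno, t) for t in tokens if t.lower() in bad)
    return results
```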
