
barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/

License: MIT License

Python 100.00%
levenshtein-distance python python-spell-checking spellcheck spellchecker spelling-checker

pyspellchecker's People

Contributors

barrust, blayzen-w, cangareijo, cast42, davido-brainlabs, ebuildy, grayjk, johnosbb, mrjamesriley, msalhab96, raivisdejus, stephencawood, sviperm, xezpeleta


pyspellchecker's Issues

Non-deterministic correction

When correcting a word, if the top candidates have the same probability, the chosen one appears to depend on the (unordered) dictionary iteration order, so the correction is non-deterministic in that case.

It would be great to add an arbitrary but constant order in that case.
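A possible workaround, sketched here outside the library, is to sort the candidates before taking the maximum, so that ties break alphabetically:

```python
def deterministic_best(candidates, probability):
    # sort first so that, among candidates with equal probability,
    # max() returns the alphabetically first one every time
    return max(sorted(candidates), key=probability)
```

With equal probabilities, the result no longer depends on dict/set iteration order.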

Windows encoding errors not fixed in version 0.5.1

On windows, spellchecker still has problems with encoding errors:

import spellchecker
spell = spellchecker.SpellChecker("de")
# This word is correctly spelled
spell.known(["beschäftigen"])  # output: set()
spell.correction("beschäftigen")  # output: 'beschã¤ftigen'
spellchecker.__version__  # output: '0.5.1'

User Interface

Is there a way to build a user interface for this code? What plugins can you suggest? Thank you!
Also, is there another scientific approach for correcting capitalization problems? Thanks!

Wrongly corrects all punctuation into 'a'

>>> from spellchecker import SpellChecker
>>> spell = SpellChecker()
>>> spell.correction('-')
'a'
>>> spell.correction(',')
'a'
>>> from string import punctuation
>>> [spell.correction(punc) for punc in punctuation]
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']
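Until this is fixed in the library, one workaround (a sketch, assuming you wrap the correction call yourself) is to skip tokens that contain no alphabetic characters:

```python
def correct_token(correct, token):
    # leave punctuation-only tokens untouched; only pass real words
    # through the spell checker's correction function
    if not any(ch.isalpha() for ch in token):
        return token
    return correct(token)
```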

Word Frequency Threshold

Per a comment by @ksingha5 in #12, it would be nice to have the ability to remove all words under a certain threshold.

something like:

from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.remove_by_threshold(10)  # remove all words that have 10 or fewer instances
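A minimal sketch of what such a method could do over a plain word→count mapping (the library's internal structure may differ):

```python
def remove_by_threshold(word_frequency, threshold):
    # drop every word whose count is at or below the threshold
    return {word: count for word, count in word_frequency.items() if count > threshold}
```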

Update dictionaries

Hello,
First thanks for this project it was really cool to discover it 🎉

I was playing around using the spellchecking functionalities, but then I noticed I started to get unknown words, which were quite "known to me", so I went to check the dictionaries, and it seems those were not there.

Checking the README, I noticed that you base the word list from FrequencyWords so I went there and I looked for these couple of examples of not found words, and they were there.

In particular, I'm interested in the Spanish dictionaries, so I downloaded es_full.txt and created a .gz file, but when I saw the sizes... I didn't want to submit a PR with a new version of the es.json.gz file 😨

472K    es.json.gz
6.9M    es_full.json.gz
300K    es50k.json.gz

I noticed the current es.json.gz includes more words than the 50k version but fewer than es_full, so I would like to know the criteria used to create the dictionaries; then maybe I can submit a PR with an updated file that isn't as large as the 6.9M full version.
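For reference, one way to build a smaller dictionary from a FrequencyWords text file is to apply a count cutoff while parsing. The actual criteria used for es.json.gz are unknown to me; the cutoff below is purely illustrative:

```python
def build_frequency_dict(lines, min_count=50):
    # FrequencyWords lines have the form "word count"; keep only words
    # whose count reaches the (illustrative) cutoff
    freq = {}
    for line in lines:
        word, count = line.rsplit(maxsplit=1)
        if int(count) >= min_count:
            freq[word] = int(count)
    return freq
```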

Thanks for your time 😄

Tiny mistake in how to make a new dictionary

from spellchecker import SpellChecker

spell = SpellChecker(language=None)  # turn off loading a built language dictionary

# if you have a dictionary...
spell.word_frequency.load_dictionary('./path-to-my-json-dictionary.json')

# or... if you have text
spell.word_frequency.load_text('./path-to-my-text-doc.txt')

# export it out for later use!
spell.export('my_custom_dictionary.gz', gzipped=True)

this should be

from spellchecker import SpellChecker

spell = SpellChecker(language=None)  # turn off loading a built language dictionary

# if you have a dictionary...
spell.word_frequency.load_dictionary('./path-to-my-json-dictionary.json')

# or... if you have text
spell.word_frequency.load_text_file('./path-to-my-text-doc.txt')

# export it out for later use!
spell.export('my_custom_dictionary.gz', gzipped=True)

Fix typos by introducing whitespaces

Is there a way (or workaround) to correct mistakes (typos) also by introducing whitespaces?

eg. "yes specificallyshows [...]" -> "yes specifically shows"

I guess it could be somewhat more complicated, since you have to match with two words, split accordingly, etc.
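As a rough sketch of the idea (assuming access to the set of known words), one could try every split point and keep the splits where both halves are dictionary words:

```python
def suggest_split(word, known):
    # try each split position; return the first split where both
    # halves are known words, else None
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in known and right in known:
            return left + " " + right
    return None
```

As the issue notes, a full solution would also need to rank splits against single-word corrections.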

add custom tokenizer

Hello,
Thanks for your great project!

I would like to know if it is possible to change the tokenizer used for computing frequencies.

For example I use the spaCy tokenizer for french.

for the following french sentence:

"l'attaque de non-concurrence: un rendez-vous pour cette nouvelle rentrée"
the spaCy tokenizer will output the following word sequence

["l'",
 'attaque ',
 'de ',
 'non-concurrence',
 ': ',
 'un ',
 'rendez-vous ',
 'pour ',
 'cette ',
 'nouvelle ',
 'démarches',
 '/',
 'plan']

As a consequence, pyspellchecker will always try to correct hyphenated words like rendez-vous, since they were tokenized differently (i.e. ["rendez", "vous"], which has a completely different meaning).
Would it be possible to accept a custom tokenizer when loading a custom frequency file? e.g.

spell.word_frequency.load_text_file("./my_txt_file", tokenizer=spacy_tokenizer)

I did not try but maybe simply using a custom word_frequency dict will do the job?
spell.word_frequency.load_dictionary('./path-to-my-word-frequency.json')

thanks for your help
Armand

Wrong Corrections, Ignores High Frequency Words

Hi,

I built a custom Arabic dictionary following FrequencyWords and the docs' instructions. There is a lot of variation in the word frequencies because my corpus is quite small. The issue is that the spell checker won't choose the word with the highest frequency; instead it picks words with a much lower frequency. I'm not sure why, but it always chooses the same corrections, even after I increased the frequency of the important words to widen the gap. These wrong corrections are at the same edit distance as the high-frequency words.

does not recognise English

from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled

misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

Result -----

File "C:\Users\xyz\Miniconda2\lib\site-packages\spellchecker\spellchecker.py", line 35, in init
raise ValueError(msg)

ValueError: The provided dictionary language (en) does not exist!

Is it possible to add other languages dictionaries?

Hello,

I'm trying to use this library to fix typos made when adding additional information about a problem, but it's not restricted to English and Spanish; we would also like Portuguese, etc. Is there a way for me to download a dict and add it to the library?

Automatic Spell Check?

Do I understand your examples correctly, that there is no way to point SpellChecker to a document and have it go through and auto-correct, like a word processor would do? I have to already know - or suspect - what words are misspelled and then ask SpellChecker to verify?

How to add new language

Will you please give me clear instructions or steps so that I can add the Urdu language? I'm not able to download the Urdu file from the link you mentioned.

Extreme memory usage for words near longest word length in the dictionary

It looks like a string that will never match a word but still fits under the maximum word length can cause extreme memory usage.

Take the example "57ef934a-dbb0-4978-8626d41c819274". The first set of candidates for distance 1 generates an array of 3,214 words. Since none of these will match, it will then try to generate the list of distance-2 candidates based on each distance-1 candidate, which expands the array to 10,380,316 words and eats up around 30%+ of the memory on our 4 GB server. After around 4 or 5 requests the server maxes out memory and freezes. If this package is used to spell-correct a user input field, a bot could easily overload the server.

It looks like it should be possible to limit the words generated when iterating over the distance-1 candidates and generating the distance-2 sets: only keep the distance-2 words that match the known set and ignore the ones that don't. This would limit the maximum number of candidate words to around the total distance * number of words for distance 1, so in this case around 6,428 instead of 10,380,316 words in memory at the same time. This should be possible by modifying the functionality of __edit_distance_alt and possibly edit_distance_2.

Example:
return [e2 for e1 in tmp for e2 in self.known(self.edit_distance_1(e1))]

I'll make a PR in a bit for this

Set tokenizer on init

Adding the tokenizer to use in the initialization script would make it easier for users to utilize instead of having to pass the tokenizer in for each loading of text or a file.

Wrong correction suggestion

I am trying to use the spellchecker, but it gives back the same word as the correction even when it is not a correct word.

spell_en = SpellChecker()
miss_en = spell_en.unknown(["rlnwatvwv", "reditanstalt", "uivrv", "candidates","gibaatwwxxx"])
for word in miss_en:
    print("misspelled word: ", word)
    print("suggested correction: ", spell_en.correction(word))
    print("candidates: ", spell_en.candidates(word))
    print("----------")

output:
misspelled word: uivrv
suggested correction: livre
candidates: {'livre'}

misspelled word: reditanstalt
suggested correction: reditanstalt
candidates: {'reditanstalt'}

misspelled word: gibaatwwxxx
suggested correction: gibaatwwxxx
candidates: {'gibaatwwxxx'}

misspelled word: rlnwatvwv
suggested correction: rlnwatvwv
candidates: {'rlnwatvwv'}

spellchecker in pypi

If you accidentally install spellchecker instead of pyspellchecker you get something installed and it's version 0.4.0 but it just doesn't work. Is it yours? Can you remove it so people don't accidentally install the wrong library? It's an easy mistake to make when the import command is for spellchecker.

An error with python2.7: ValueError: Invalid mode ('rtb')

I try the testing code:

from spellchecker import SpellChecker

spell = SpellChecker()

an get the error as below :

Traceback (most recent call last):
  File "1.py", line 4, in <module>
    spell = SpellChecker()
  File "C:\Python27\lib\site-packages\spellchecker\spellchecker.py", line 43, in __init__
    self._word_frequency.load_dictionary(full_filename)
  File "C:\Python27\lib\site-packages\spellchecker\spellchecker.py", line 327, in load_dictionary
    with load_file(filename, encoding) as data:
  File "C:\Python27\lib\contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "C:\Python27\lib\site-packages\spellchecker\utils.py", line 27, in load_file
    with gzip.open(filename, mode="rt") as fobj:
  File "C:\Python27\lib\gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "C:\Python27\lib\gzip.py", line 94, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
ValueError: Invalid mode ('rtb')

The pyspellchecker is installed by : pip install pyspellchecker

Correct a sentence

There is known([words]) and unknown([words]). It would be great to have correct([words]) or correct(sentence). It seems feasible with both, right?
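Such a function could be a thin wrapper over the existing correction() primitive; a minimal sketch (not the library's API):

```python
def correct_words(words, correction):
    # apply the single-word correction function to each token in turn
    return [correction(word) for word in words]
```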

Memory Issue for long spellings errors/strings

For testing purposes I passed in words of length 20 and 100; in both cases my server reports that it is unable to allocate memory, and my local machine, which has 8 GB of RAM, hangs.

pyspellchecker module and indexer

With version 0.5.5 it does not work, as there seems to be a problem with the pyspellchecker module right now.
The stable version at the moment is 0.5.4.

The spellchecker module still relies on the indexer library, which is no longer supported on Python 3. Is any update planned?
Thank you in advance.


spellchecker.py:422: DeprecationWarning

Hi,
I'm getting this warning:
....spellchecker.py:422: DeprecationWarning: 'encoding' is ignored and deprecated. It will be removed in Python 3.9

Are you planning an update?

Thank you in advance.

Unable to read dictionary

When I initialize the SpellChecker it throws an error:
spell = SpellChecker()
"ValueError: The provided dictionary language (en) does not exist!"

How to fix this?

Export word frequency dictionary

The ability to export a word frequency dictionary would allow us to better support users utilizing a custom dictionary.

Use case:

  1. User loads the English dictionary and then loads text about programming languages.
  2. Instead of having to do this for every instance, it would be better if the user could export or save the dictionary so that they can simply load it in the future.
  3. This would also allow spellchecking programs to offer easier "Add Word" type functionality.
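A sketch of such an export, assuming the dictionary is a word→count mapping and using the gzipped-JSON format the bundled dictionaries appear to use:

```python
import gzip
import json

def export_word_frequency(word_frequency, filepath):
    # write the word->count mapping as gzipped UTF-8 JSON
    with gzip.open(filepath, "wt", encoding="utf-8") as fobj:
        json.dump(word_frequency, fobj)

def import_word_frequency(filepath):
    # read a dictionary previously written by export_word_frequency
    with gzip.open(filepath, "rt", encoding="utf-8") as fobj:
        return json.load(fobj)
```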

How to add new words to the dictionary?

From the docs, as far as I understood you can add new words as follows:

from spellchecker import SpellChecker
spell = SpellChecker(distance=1)
spell.word_frequency.load_words(['words', 'to', 'be','added', 'to', 'the', 'system', 'Addis Abeba'])

However, I am trying to correct the word "Addis Abeba" as follows, and it doesn't work:

In:

misspelled = spell.unknown(['something', 'is', 'hapenning', 'here', 'Adis abebba'])
for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))

Out:

Adis abebba
{'Adis abebba'}
hapenning
{'hapenning'}

Thus, how can I add the word "Adis abebba" to my dictionary in order to check and correct words like "Addis abeba" or "Adiis abebbba"?
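Note that the checker compares single tokens against the dictionary, so a multi-word phrase passed to load_words is unlikely to behave as one unit. A sketch of adding a phrase word by word (a workaround over a plain word→count mapping, not the library's API):

```python
def add_phrase(word_frequency, phrase):
    # split a multi-word name into individual lowercase tokens, since
    # the checker looks up one token at a time
    for token in phrase.lower().split():
        word_frequency[token] = word_frequency.get(token, 0) + 1
    return word_frequency
```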

Function word_probability(word) returns 0.0

Hi, I am not sure how to use the word_probability(word) function.
I am currently using pyspellchecker to correct a list of misspelled words, but it gives me an output of 0.0, and what I need is a list of probabilities for each candidate in the word list printed before.
Here the code:

from spellchecker import SpellChecker

myListOfWords = ['medicin', 'increas', 'caus', 'daili', 'reduc', 'healthi', 'vaccin', 'diseas', 'intak', 'peopl', 'realli', 'diabet', 'exercis', 'possibl', 'pressur', 'bodi']

spell = SpellChecker()

for word in myListOfWords :

    print(spell.correction(word))  # gives the "best candidate" in theory
    print(spell.candidates(word))  # gives the candidates (I don't understand their order)
    print(spell.word_probability(word))  # here I need the probability of each candidate, to find the best and second-best candidates

Why am I doing this? In the code above you can see that the word 'diabet' returns 'diet' instead of 'diabetes'.
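For context, word_probability appears to return the relative frequency of the word you pass in, not of its candidates, so an out-of-vocabulary word yields 0.0. The underlying computation is roughly:

```python
def word_probability(word, frequencies):
    # relative frequency of the word itself; a word missing from the
    # dictionary gives 0.0, which is why querying a misspelling returns 0
    total = sum(frequencies.values())
    return frequencies.get(word, 0) / total
```

To rank candidates, call it on each candidate rather than on the misspelling.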

I would like to find an accurate correction related to my topic. As far as I know my options are :

  1. Passing "distance =1" argument in the 'correction' function-> does not correct the problem with 'diabet' word.

  2. Providing a text file dictionary with all the words of my interest, as you suggested here (load_text_file). Question: what is the expected format for this txt file? Could you share an example?

  3. Adding a new function to adjust the algorithm based on the terminology I am using (health-related terminology), by means of adding a new argument (topic = "Health") and thereby biasing the spell corrections toward all the terminology related to that topic. Are you already developing anything like that in the module?

Please could you give me a clue about how to do this (2 & 3 questions)?
Many thanks!

UPDATE
I am using the txt file of medical terms provided by @glutanimate & @dgreuel here.
I think it partially solved the issue for my purposes.

Word frequency/dictionary

Hi. We want to add some words to the dictionary but we were not able to find and access it. How can we access the word frequency dictionary? Also, can we limit the words suggested by pyspellchecker? In some cases it suggests a lot of words after spell checking.

Please help us out. Thank you!

Ensure tokenizer uses lowercase

Related to #34 - ensure that the provided tokenizer uses lowercase like the default tokenizer.

We should also add some tests for the tokenizer.
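A test could feed mixed-case text through the tokenizer and assert every token comes back lowercase; sketched here with a hypothetical regex tokenizer standing in for the default one:

```python
import re

def lowercase_tokenizer(text):
    # hypothetical tokenizer: extract word tokens and force lowercase,
    # matching what the default tokenizer is expected to do
    return [match.group().lower() for match in re.finditer(r"[a-zA-Z]+", text)]
```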

Problem to set 'pt' language

SpellChecker is working for any supported languages, except for Portuguese ('pt').
When I try using spell = SpellChecker('pt'), an error message appears saying:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 587200: character maps to <undefined>

I've also tried to load the dictionary the other way around:

spell = SpellChecker(language=None)
spell.word_frequency.load_dictionary('path/to/pt.json.gz', encoding=u'utf-8')

The same error occurs.

I'm using Python 3.6.8 in Windows 10.

Use Trie for spelling check

The spelling checker takes a lot of time to check words. Especially if a word is not in the dictionary it takes even longer, which can make checking several words a really long process. Besides, the current approach only allows searching up to a Levenshtein distance of 2. Thus it would be convenient to:

  1. Use dynamic programming to calculate edit distances in general, to allow any distance (distances greater than 2 will only make sense when using a trie approach, see point 2).
  2. Use a trie as a data structure for the dictionary to enable faster lookups (instead of computing all words within 1 or 2 edit distances).
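For point 1, the standard dynamic-programming Levenshtein distance handles arbitrary distances; a self-contained sketch:

```python
def levenshtein(a, b):
    # classic O(len(a) * len(b)) edit distance using a rolling row
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]
```

Point 2 (a trie) would then prune the search so this is not computed against every dictionary word.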

Higher Levenshtein Distance than 2

It is not possible to correct words that have a Levenshtein distance higher than 2 (at least in German).

A parameter to change this would be much appreciated.

Multiple Dictionary Support

Adding a standardized way to store and load different dictionaries will allow the package to be used in different languages.

A few options include:

  • Pickle word_frequency lists
  • JSON formatted word_frequency lists
  • others?

spell.correction is taking way too long time for each word

So, I'm using the spell.correction(word) method to correct a batch of text files in Spanish.

After running some tests I noticed that some texts take more than 60 seconds to correct; the misspelled words take 5 to 15 seconds each to correct through the spell.correction(word) method.

As you can imagine, for a batch of many texts this bottleneck in the preprocessing takes several hours instead of minutes or seconds.

I haven't inspected the code of that method, but I imagine this has something to do with ranking the Levenshtein distance of the misspelled word against the rest of the dictionary, which is done to choose the nearest neighbour.

Maybe there could be a way to use an approximate KNN, or to put a time threshold on the correction logic.
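Short of changing the algorithm, memoizing corrections helps batches where the same misspellings recur; a workaround sketch, not a library feature:

```python
from functools import lru_cache

def make_cached_corrector(correction):
    # wrap the (slow) correction function so each distinct word is
    # only corrected once across the whole batch
    @lru_cache(maxsize=None)
    def cached(word):
        return correction(word)
    return cached
```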

You can replicate this issue using spell.correction(word) for each word of this text:

"no estar de acuerdo con la forma de militarizar la Araucaní­a, la manera presora con la cual estan atacando al pueblo mapuche.el delincuencia no ha disminuido."

readthedocs support

Getting better documentation of the package will make using and extending it easier for everyone.

  • Add sphinx autodocs
  • Convert README.md to README.rst
  • Improve the docstrings

Version 0.2.0 is broken

Hello again,
This time I'm here because version 0.2.0 is broken for Python 3. It needs the indexer library, and this library is only available for Python 2.
Version 0.1.5 has no errors.
Thanks

Spellchecker for Bahasa (id)

Hi,

Thank you for such a great work, it helps me a lot.
Would it be possible for you to update the resources with Bahasa Indonesia (id)?
I have downloaded the text file from here, https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/nl/id_full.txt, and converted it to json.

Then I followed the steps below:
from spellchecker import SpellChecker
spell = SpellChecker(language=None)
spell.word_frequency.load_dictionary('id_full.json')

Then I created a list of some misspelled and correctly spelled words in Bahasa Indonesia.
Here 'makn' is a misspelling of 'makan' and 'sakt' --> 'sakit'; these are common words in Bahasa.
misspelled = spell.unknown(['makn', 'minum', 'sakt', 'kemana', 'gila'])

Then I run the loop below and it prints nothing:

for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))

Is there something wrong with any of the steps above?

English spellchecking

Hello Team!
I am new to the Project and I have a question.

I use python 3.7 and run into problem with this test program:

from spellchecker import SpellChecker
spell = SpellChecker()                         
split_words = spell.split_words
spell_unknown = spell.unknown

words = split_words("That's how t and s don't fit.")
print(words)
misspelled = spell_unknown(words)
print(misspelled)

With pyspellchecker ver 0.5.4 the printout is:

['that', 's', 'how', 't', 'and', 's', 'don', 't', 'fit']
set()

So free-standing 't' and 's' are not marked as errors, and neither are contractions.

If I change the phrase to:

words = split_words("That is how that's and don't do not fit.")

and use pyspellchecker ver 0.5.6 the printout is:

['that', 'is', 'how', 'that', 's', 'and', 'don', 't', 'do', 'not', 'fit']
{'t', 's'}

So contractions are marked as mistakes again.
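One way around both behaviours is a tokenizer that keeps apostrophes inside words, so contractions survive as single tokens; a sketch, assuming a custom tokenizer can be swapped in:

```python
import re

def split_words_keep_contractions(text):
    # keep internal apostrophes so "don't" stays one token instead of
    # being split into "don" and "t"
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
```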

(I read barrust's comment on Oct 22, 2019.)

Please, assist.

What is your definition of "long"?

"If the words that you wish to check are long, it is recommended to reduce the distance to 1". How many characters do you recommend to use for "long"?

Preserve case of letters after correction

The case of letters changes after the correction by spellchecker.unknown() function.
In my use case I need to point out the line numbers of the words that were mistaken.
I use spellchecker.unknown() to find words that were mistyped and search line numbers of these words but since

spellchecker.unknown(['Thankk']) 

will return thankk (lowercasing the first letter), it is difficult to point out the line number.
Would it be feasible to preserve the Case of letters?
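A workaround, sketched under the assumption that unknown() takes lowercase tokens, is to keep the original tokens and map the lowercased results back to them for line reporting:

```python
def find_misspelled_positions(lines, unknown):
    # report (line_number, original_cased_word) for every misspelling,
    # even though unknown() returns lowercased words
    results = []
    for lineno, line in enumerate(lines, start=1):
        tokens = line.split()
        bad = unknown([t.lower() for t in tokens])
        results.extend((lineno, t) for t in tokens if t.lower() in bad)
    return results
```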
