Comments (15)
Since a word may be written using combining characters, I believe these characters should be seen as part of the word. If you claim that combining characters are not "word characters", then you shouldn't complain about Python splitting noël in half on such a "non-word character". In any case, there are open Python and Rust issues on the very subject of combining characters that should be categorised as "word characters", and it looks like these are acknowledged as Python and Rust bugs.
from codespell.
Unicode support is also important to avoid false positives in documents that mix English with other languages. In my case, the document was written in English but contained a few French words (for things specific to France, which should not be translated) and, I think, some comments in French. In particular, I got this false positive:
$ /usr/bin/printf "agre\u0301gation" | codespell -
1: agrégation
agre ==> agree
while I expected no suggestions for "agrégation".
from codespell.
This bug is also present in a rust create
I don't know if this is a dig at #2281 @LilithHafner or just marvelously ironic, but thanks for making me smile!
You've put in the same issue both times, @DimitriPapadopoulos.
from codespell.
Not really a bug: noël contains a character that is not a word character:
>>> "noël".encode('unicode_escape')
b'noe\\u0308l'
>>>
Hence, the default regex splits it into two words separated by that character:
codespell/codespell_lib/_codespell.py, line 31 in 6c32940
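For anyone who wants to check which character is involved, here is a minimal sketch, assuming the word is stored in its decomposed form (the name `report` is only for illustration). It lists each code point with its Unicode general category and whether `\w` matches it.

```python
import re
import unicodedata

word = "noe\u0308l"  # decomposed spelling: 'e' followed by U+0308 COMBINING DIAERESIS

# One row per character: (code point, general category, does \w match it?)
report = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), re.fullmatch(r"\w", ch) is not None)
    for ch in word
]

for row in report:
    print(*row)
```

The combining diaeresis is in category Mn (nonspacing mark) and is the only character of the word that `\w` does not match.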
from codespell.
I disagree. IMHO, codespell should include the combining characters in the regexp.
from codespell.
We split words on \W, which "matches any character which is not a word character". It seems to me that \W wrongly matches combining characters. Therefore, I feel this should be addressed in Python; I think python/cpython#56940 is related. Do you agree this is primarily a Python issue?
>>> import re
>>> re.split(r'\W+', 'noe\u0308l')  # decomposed: 'e' followed by U+0308
['noe', 'l']
>>> re.split(r'\W+', 'no\u00ebl')  # precomposed: single code point U+00EB
['noël']
In the meantime, we could also work around this issue by:
- expanding the default regex to include combining characters, but that seems complex and fragile,
- preprocessing the input text as suggested in Unicode HOWTO | Comparing Strings, but that adds a dependency on module unicodedata and might impact performance for the sake of a few corner cases.
>>> import re, unicodedata
>>> s = 'noe\u0308l'  # decomposed form: 'e' followed by U+0308
>>> re.split(r'\W+', s)
['noe', 'l']
>>> re.split(r'\W+', unicodedata.normalize('NFC', s))
['noël']
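The two workarounds above could be sketched roughly as follows. This is only an illustration: `BASE_WORD_RE` is a simplified stand-in that I am assuming for the default word regex, and `MARKS_WORD_RE` and `words_normalized` are hypothetical names. The first variant only covers the Combining Diacritical Marks block (U+0300 to U+036F), which is part of why it feels fragile; the second skips normalization for pure-ASCII text to limit the performance cost (`str.isascii` needs Python 3.7+).

```python
import re
import unicodedata

# Assumption: simplified stand-in for the default word regex.
BASE_WORD_RE = re.compile(r"[\w\-'’]+")

# Workaround 1: also accept the Combining Diacritical Marks block (U+0300-U+036F).
MARKS_WORD_RE = re.compile(r"[\w\u0300-\u036f\-'’]+")


def words_normalized(text):
    """Workaround 2: compose to NFC before matching (hypothetical helper)."""
    if not text.isascii():  # pure-ASCII input cannot contain combining marks
        text = unicodedata.normalize("NFC", text)
    return BASE_WORD_RE.findall(text)


decomposed = "agre\u0301gation noe\u0308l"

print(BASE_WORD_RE.findall(decomposed))   # words are cut at the combining marks
print(MARKS_WORD_RE.findall(decomposed))  # marks stay attached to their base letters
print(words_normalized(decomposed))       # words come back in precomposed form
```

Note that NFC only helps when a precomposed form of the character exists; a base letter with no precomposed counterpart would still be split.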
from codespell.
Do you agree this is primarily a Python issue?
This bug is also present in a rust create (issue linked in OP) that I doubt has any python deps, so I believe it is a more widespread issue.
from codespell.
This depends on the definition of word characters. Combining characters are rather special characters, which modify word characters rather than really being word characters, IMHO. The GNU grep utility defines \w as a synonym for [_[:alnum:]] (this is a GNU extension, not part of POSIX), and it seems that Python also does, so that its use is incorrect as you can see:
$ printf "noe_" | codespell -
$ printf "noe~" | codespell -
1: noe~
noe ==> not, no, node, know, now
The underscore and the tilde should have been treated in the same way.
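A minimal sketch of what presumably happens here, using plain `\w+` rather than the real default regex: the underscore is a word character, so "noe_" stays a single token (which is not in the dictionary), while the tilde is a separator. The class `[^\W_]` is one way to get "word characters minus the underscore".

```python
import re

# \w matches letters, digits and the underscore, so "noe_" is one token.
with_underscore = re.findall(r"\w+", "noe_")

# The tilde is not a word character, so only "noe" is extracted.
with_tilde = re.findall(r"\w+", "noe~")

# [^\W_] is \w without the underscore: both inputs now give the same token.
strict_underscore = re.findall(r"[^\W_]+", "noe_")
strict_tilde = re.findall(r"[^\W_]+", "noe~")

print(with_underscore, with_tilde, strict_underscore, strict_tilde)
```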
Moreover, in GNU grep, combining characters are not regarded as being part of a word, so that one gets
$ /usr/bin/printf "noe\u0308l\n" | grep 'noe\b'
noël
One may wonder whether this is a bug. However, grep is a low-level utility that probably predates the combining characters, contrary to codespell.
from codespell.
According to https://debbugs.gnu.org/cgi/bugreport.cgi?bug=27681 (for a similar issue in GNU grep), the combining characters are supposed to be regarded as alphabetic characters (thus word characters), which is currently not the case with the GNU C Library. So I've reported a glibc bug (there has already been a bug report in the Debian BTS since 2017, but with no activity). A fix in glibc should solve the issue in codespell and others as a consequence.
from codespell.
Python or Rust most probably don't use the GNU C library. We'll have to wait for a fix in Python itself.
from codespell.
Note that the rust crate is not using regexes for word splitting but doing it by hand.
from codespell.
Neither does the Python str.split.
Currently, codespell requires re.split because it splits on [\w\-'’ ] by default, and this does not seem possible using the simpler str.split.
Is split the Rust splitting function you are referring to? It seems equivalent to Python's str.split. I doubt we can use it for our purposes either, unless you can express "all non-word characters and - and ' and ’" as a slice of chars or closure.
from codespell.
For the Rust spell checker, typos, we have a hand-written implementation that lets us capture the original string slice and classify which case the string is in, all in a single pass (so we can case-change the correction to match the original style). This needs to be extended to handle UTF-8.
from codespell.
If I understand correctly, you classify as WordMode::Boundary anything not is_lowercase, is_uppercase or is_ascii_digit.
This limits spelling to ASCII characters, which might be good enough for English in most cases, which is the focus of codespell. However, we currently handle gracefully accented characters that are part of imported English words such as soupçon; it would be a pity to lose that.
from codespell.
Additionally, it would be great to be able to extend to other languages at some point, which means you cannot rely on ASCII or Latin-1 characters only.
With that said, I like the classify method; it could help split CamelCase words (split a word when is_lowercase precedes is_uppercase). Unfortunately, handling characters individually will most certainly result in a performance hit in Python, unlike Rust.
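To make the idea concrete, here is a rough Python sketch of such a per-character classifier. It is not the typos implementation; `split_words` is a hypothetical helper, and treating every character in a Unicode mark category (M*) as part of the current word is my own assumption about how combining characters could be handled.

```python
import unicodedata


def split_words(text):
    """Split text into words one character at a time (hypothetical helper)."""
    words = []
    current = []
    prev_lower = False
    for ch in text:
        # Combining marks only extend a word that has already started.
        is_mark = unicodedata.category(ch).startswith("M") and bool(current)
        if ch.isalpha() or ch.isdigit() or is_mark:
            # A lowercase -> uppercase transition starts a new word (CamelCase).
            if ch.isupper() and prev_lower:
                words.append("".join(current))
                current = []
            current.append(ch)
            if not is_mark:
                prev_lower = ch.islower()
        else:
            # Anything else is a boundary: close the current word, if any.
            if current:
                words.append("".join(current))
                current = []
            prev_lower = False
    if current:
        words.append("".join(current))
    return words


print(split_words("splitCamelCase noe\u0308l soupçon"))
```

This keeps accented and decomposed words whole, but the explicit loop is exactly the kind of per-character work that is likely to be slow in Python.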
from codespell.