Word splitting sometimes fails with accents about codespell HOT 15 CLOSED

LilithHafner commented on June 29, 2024

Word splitting sometimes fails with accents

from codespell.

Comments (15)

DimitriPapadopoulos commented on June 29, 2024

Since a word may be written using combining characters, I believe these characters should be seen as part of the word. If you claim that combining characters are not "word characters", then you shouldn't complain about Python splitting noël in half on such a "non-word character". In any case, there are open Python and Rust issues on the very subject of combining characters that should be categorised as "word characters", and it looks like these are acknowledged as Python and Rust bugs.


vinc17fr commented on June 29, 2024

Unicode support is also important to avoid some false positives in documents that use both English and other languages. In my case, the document was written in English but contained a few French words (for things specific to France, which should not be translated) and, I think, some comments in French. In particular, I got this false positive:

$ /usr/bin/printf "agre\u0301gation" | codespell -
1: agrégation
        agre ==> agree

while I expected no suggestions for "agrégation".


peternewman commented on June 29, 2024

This bug is also present in a rust create

I don't know if this is a dig at #2281 @LilithHafner or just marvelously ironic, but thanks for making me smile!

Python and Rust issues

You've put the same issue in both times, @DimitriPapadopoulos.


DimitriPapadopoulos commented on June 29, 2024

Not really a bug: noël contains a character that is not a word character:

>>> "noël".encode('unicode_escape')
b'noe\\u0308l'
>>>

Hence, the default regex splits it into two words separated by that character:

codespell/codespell_lib/_codespell.py

Line 31 in 6c32940

word_regex_def = r"[\w\-'’]+"
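
A minimal sketch of the effect described above, assuming words are extracted by applying the quoted default pattern with `re.findall` (the names `word_regex`, `decomposed` and `precomposed` are illustrative and not taken from codespell):

```python
import re

# Default pattern quoted above from codespell_lib/_codespell.py.
word_regex_def = r"[\w\-'’]+"
word_regex = re.compile(word_regex_def)

decomposed = "noe\u0308l"   # 'e' followed by U+0308 COMBINING DIAERESIS
precomposed = "no\u00ebl"   # single code point U+00EB

# U+0308 is a combining mark (category Mn), which \w does not match,
# so the decomposed spelling is cut into two pieces.
print(word_regex.findall(decomposed))
print(word_regex.findall(precomposed))
```

The two strings render identically, which is why the difference is easy to miss when reading a terminal or a diff.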


vinc17fr commented on June 29, 2024

I disagree. IMHO, codespell should include the combining characters in the regexp.


DimitriPapadopoulos commented on June 29, 2024

We split words on \W, which "matches any character which is not a word character". It seems to me that \W wrongly matches combining characters. Therefore, I feel this should be addressed in Python - I think python/cpython#56940 is related. Do you agree this is primarily a Python issue?

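>>> import re
>>> re.split(r'\W+', 'noe\u0308l')  # decomposed: 'e' + U+0308 COMBINING DIAERESIS
['noe', 'l']
>>> re.split(r'\W+', 'no\u00ebl')  # precomposed: single code point U+00EB
['noël']
>>>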

In the meantime, we could also work around this issue by:

expanding the default regex to include combining characters, but that seems complex and fragile,
preprocessing the input text as suggested in Unicode HOWTO | Comparing Strings, but that adds a dependency on module unicodedata and might impact performance for the sake of a few corner cases.

>>> import re, unicodedata
>>> s = 'noe\u0308l'  # decomposed form: 'e' + U+0308
>>>
>>> re.split(r'\W+', s)
['noe', 'l']
>>>
>>> re.split(r'\W+', unicodedata.normalize('NFC', s))
['noël']
>>>
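
The first workaround, expanding the default regex, could look roughly like the sketch below. It assumes that adding only the Combining Diacritical Marks block (U+0300 to U+036F) is enough, which is not true in general, and the name `extended_word_regex` is hypothetical; this is an illustration of the idea and of why it is fragile, not a proposed patch.

```python
import re

# Assumption: only the Combining Diacritical Marks block (U+0300-U+036F) is
# added to the default character class. Other scripts keep their combining
# marks in other blocks, so this pattern is deliberately incomplete.
extended_word_regex = re.compile(r"[\w\-'’\u0300-\u036f]+")

# Both inputs use decomposed accents (base letter + combining mark).
print(extended_word_regex.findall("noe\u0308l"))
print(extended_word_regex.findall("agre\u0301gation"))
```

Each input now comes back as a single word, but every additional block of combining marks would have to be listed by hand.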


LilithHafner commented on June 29, 2024

Do you agree this is primarily a Python issue?

This bug is also present in a rust create (issue linked in OP) that I doubt has any python deps, so I believe it is a more widespread issue.


vinc17fr commented on June 29, 2024

This depends on the definition of word characters. Combining characters are rather special characters, which modify word characters rather than really being word characters, IMHO. The GNU grep utility defines \w as a synonym for [_[:alnum:]] (this is a GNU extension, not part of POSIX), and it seems that Python does much the same, so using \w to delimit words is incorrect, as you can see:

$ printf "noe_" | codespell -
$ printf "noe~" | codespell -
1: noe~
        noe ==> not, no, node, know, now

The underscore and the tilde should have been treated in the same way.
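
The asymmetry can be reproduced directly in Python. This is a small sketch that assumes the default pattern `[\w\-'’]+` quoted earlier in the thread is what extracts the words; `word_regex` is an illustrative name:

```python
import re

# Assumption: same character class as the default pattern quoted in the thread.
word_regex = re.compile(r"[\w\-'’]+")

# The underscore belongs to \w, so it stays attached to the word.
print(word_regex.findall("noe_"))

# The tilde does not belong to \w, so it is dropped and "noe" stands alone.
print(word_regex.findall("noe~"))
```

With the underscore attached, the token is "noe_", which has no dictionary entry, so nothing is reported; with the tilde, the token is "noe", which does.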

Moreover, in GNU grep, combining characters are not regarded as being part of a word, so that one gets

$ /usr/bin/printf "noe\u0308l\n" | grep 'noe\b'
noël

One may wonder whether this is a bug. However, grep is a low-level utility that probably predates the combining characters, contrary to codespell.


vinc17fr commented on June 29, 2024

According to https://debbugs.gnu.org/cgi/bugreport.cgi?bug=27681 (for a similar issue in GNU grep), the combining characters are supposed to be regarded as alphabetic characters (thus word characters), which is currently not the case with the GNU C Library. So I've reported a glibc bug (there has already been a bug report in the Debian BTS since 2017, but with no activity). A fix in the glibc should solve the issue in codespell and others as a consequence.


DimitriPapadopoulos commented on June 29, 2024

Python or Rust most probably don't use the GNU C library. We'll have to wait for a fix in Python itself.


epage commented on June 29, 2024

Note that the rust crate is not using regexes for word splitting but doing it by hand.


DimitriPapadopoulos commented on June 29, 2024

Neither does the Python str.split.

Currently, codespell requires re.split because it splits on [\w\-'’ ] by default, and this does not seem possible using the simpler str.split.

Is split the Rust splitting function you are referring to? It seems equivalent to Python's str.split. I doubt we can use it for our purposes either, unless you can express "all non-word characters and - and ' and ’" as a slice of chars or closure.


epage commented on June 29, 2024

For the Rust spell checker, typos, we have a hand-written implementation that allows us to capture the original string slice and classify what case the string is in, all in a single pass (so we can case-change the correction to match the original style). This needs to be extended to handle UTF-8.


DimitriPapadopoulos commented on June 29, 2024

If I understand correctly, you classify as WordMode::Boundary anything not is_lowercase, is_uppercase or is_ascii_digit.

This limits spelling to ASCII characters, which might be good enough for English in most cases, which is the focus of codespell. However, we currently handle accented characters gracefully when they are part of imported English words such as soupçon; it would be a pity to lose that.
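
To illustrate the concern, here is a hypothetical Python port of an ASCII-only splitter. It is not the code used by typos; it only assumes, as described above, that any character other than an ASCII letter or digit is treated as a boundary. The function name `split_ascii_words` is made up for this sketch.

```python
import string

# Assumption: only ASCII letters and digits count as word characters.
ASCII_WORD_CHARS = set(string.ascii_letters + string.digits)


def split_ascii_words(text):
    """Split text on every character that is not an ASCII letter or digit."""
    words = []
    current = []
    for ch in text:
        if ch in ASCII_WORD_CHARS:
            current.append(ch)
        elif current:
            # A boundary character closes the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# U+00E7 (c with cedilla) is not ASCII, so it acts as a boundary here.
print(split_ascii_words("a soup\u00e7on of doubt"))
```

Under that assumption "soupçon" is split into "soup" and "on", which is the regression the comment warns about.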


DimitriPapadopoulos commented on June 29, 2024

Additionally, it would be great to be able to extend to other languages at some point, which means you cannot rely on ASCII or Latin-1 characters only.

With that said, I like the classify method: it could help split CamelCase words (split the word wherever is_lowercase precedes is_uppercase). Unfortunately, handling characters individually will most certainly result in a performance hit in Python, unlike Rust.
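
The CamelCase rule mentioned above (split where a lowercase character is followed by an uppercase one) can be sketched in a few lines of Python. The function name `split_camel_case` is hypothetical, and the sketch uses `str.islower` and `str.isupper`, which are Unicode-aware, rather than any codespell or typos API.

```python
def split_camel_case(word):
    """Split a word wherever a lowercase character precedes an uppercase one."""
    if not word:
        return []
    parts = []
    start = 0
    for i in range(1, len(word)):
        # Boundary rule from the comment above: lowercase followed by uppercase.
        if word[i - 1].islower() and word[i].isupper():
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts


print(split_camel_case("CamelCase"))
print(split_camel_case("HTTPServer"))
```

Note that this simple rule leaves runs of capitals such as "HTTPServer" in one piece, and that the character-by-character loop is exactly the kind of code expected to be slow in Python.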

