
Comments (15)

DimitriPapadopoulos avatar DimitriPapadopoulos commented on June 29, 2024 2

Since a word may be written using combining characters, I believe these characters should be seen as part of the word. If you claim that combining characters are not "word characters", then you shouldn't complain about Python splitting noël in half on such a "non-word character". In any case, there are open Python and Rust issues on the very subject of combining characters that should be categorised as "word characters", and it looks like these are acknowledged as Python and Rust bugs.

from codespell.

vinc17fr avatar vinc17fr commented on June 29, 2024 2

Unicode support is also important to avoid some false positives in documents that use both English and other languages. In my case, this was a document written in English with a few French words (for things specific to France, which should not be translated) and also, I think, some comments in French. In particular, I got a false positive corresponding to

$ /usr/bin/printf "agre\u0301gation" | codespell -
1: agrégation
        agre ==> agree

while I expected no suggestions for "agrégation".


peternewman avatar peternewman commented on June 29, 2024 1

This bug is also present in a rust create

I don't know if this is a dig at #2281 @LilithHafner or just marvelously ironic, but thanks for making me smile!

Python and Rust issues

You've put the same issue in both times @DimitriPapadopoulos .


DimitriPapadopoulos avatar DimitriPapadopoulos commented on June 29, 2024

Not really a bug: noël contains a character that is not a word character:

>>> "noël".encode('unicode_escape')
b'noe\\u0308l'
>>> 

Hence, the default regex splits it into two words separated by that character:

word_regex_def = r"[\w\-'’]+"
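For illustration, here is a minimal sketch of that effect. It assumes the pattern is applied with re.findall (the call site is not shown in this thread), and the variable names decomposed and precomposed are only for the example:

```python
import re

# Default word pattern quoted above.
word_regex_def = r"[\w\-'’]+"

decomposed = "noe\u0308l"   # 'e' followed by U+0308 COMBINING DIAERESIS
precomposed = "no\u00ebl"   # single code point U+00EB

# The combining mark is not matched by \w, so the decomposed form breaks in two.
print(re.findall(word_regex_def, decomposed))   # ['noe', 'l']
print(re.findall(word_regex_def, precomposed))  # ['noël']
```

Both strings render identically, which is why the split looks surprising in a terminal.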


vinc17fr avatar vinc17fr commented on June 29, 2024

I disagree. IMHO, codespell should include the combining characters in the regexp.
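As a rough sketch of what that could look like: the pattern below is a hypothetical variant of the default, not something codespell ships. It only adds the Combining Diacritical Marks block (U+0300 to U+036F), so marks from other blocks would still split words.

```python
import re

# Hypothetical variant of the default pattern: also accepts code points in
# the Combining Diacritical Marks block (U+0300-U+036F). Python's re module
# understands \uXXXX escapes inside a pattern, so a raw string works here.
word_regex_with_marks = r"[\w\-'’\u0300-\u036f]+"

print(re.findall(word_regex_with_marks, "agre\u0301gation") == ["agre\u0301gation"])  # True
print(re.findall(word_regex_with_marks, "noe\u0308l") == ["noe\u0308l"])              # True
```

With this pattern the decomposed "agrégation" stays one token, so "agre" would no longer be reported on its own.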


DimitriPapadopoulos avatar DimitriPapadopoulos commented on June 29, 2024

We split words on \W, which "matches any character which is not a word character". It seems to me that \W wrongly matches combining characters. Therefore, I feel this should be addressed in Python - I think python/cpython#56940 is related. Do you agree this is primarily a Python issue?

>>> import re
>>> re.split(r'\W+', 'noe\u0308l')  # decomposed: e + U+0308
['noe', 'l']
>>> re.split(r'\W+', 'no\u00ebl')  # precomposed: U+00EB
['noël']

In the meantime, we could also work around this issue by:

  • expanding the default regex to include combining characters, but that seems complex and fragile,
  • preprocessing the input text as suggested in Unicode HOWTO | Comparing Strings, but that adds a dependency on module unicodedata and might impact performance for the sake of a few corner cases.
>>> import re, unicodedata
>>> s = 'noe\u0308l'  # decomposed: e + U+0308
>>> re.split(r'\W+', s)
['noe', 'l']
>>> re.split(r'\W+', unicodedata.normalize('NFC', s))
['noël']
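One caveat with the normalization workaround, sketched below: NFC can only compose sequences that have a precomposed code point. The helper name find_words_nfc is hypothetical, and the example assumes the default pattern is applied with re.findall.

```python
import re
import unicodedata

word_regex_def = r"[\w\-'’]+"

def find_words_nfc(text):
    """Hypothetical helper: compose to NFC, then apply the default pattern."""
    return re.findall(word_regex_def, unicodedata.normalize("NFC", text))

# e + U+0308 composes to U+00EB, so the word stays whole.
print(find_words_nfc("noe\u0308l") == ["no\u00ebl"])  # True

# q + U+0303 (combining tilde) has no precomposed form, so NFC leaves the
# combining mark in place and the pattern still splits there.
print(find_words_nfc("q\u0303a"))  # ['q', 'a']
```

So normalization would cover common Western European accents, but not every base letter and mark combination.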


LilithHafner avatar LilithHafner commented on June 29, 2024

Do you agree this is primarily a Python issue?

This bug is also present in a rust create (issue linked in OP) that I doubt has any python deps, so I believe it is a more widespread issue.


vinc17fr avatar vinc17fr commented on June 29, 2024

This depends on the definition of word characters. Combining characters are rather special characters, which modify word characters rather than really being word characters, IMHO. The GNU grep utility defines \w as a synonym for [_[:alnum:]] (this is a GNU extension, not part of POSIX), and it seems that Python also does, so using \w here is incorrect, as you can see:

$ printf "noe_" | codespell -
$ printf "noe~" | codespell -
1: noe~
        noe ==> not, no, node, know, now

The underscore and the tilde should have been treated in the same way.
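A small sketch of why the two inputs behave differently, assuming the default pattern from earlier in the thread is applied with re.findall: \w includes the underscore, so it becomes part of the token, while the tilde ends the token.

```python
import re

# Default word pattern quoted earlier in the thread.
word_regex_def = r"[\w\-'’]+"

# The underscore is matched by \w, so the token is "noe_" and the
# misspelling "noe" is never looked up on its own.
print(re.findall(word_regex_def, "noe_"))  # ['noe_']

# The tilde is not matched, so the token is "noe", which gets flagged.
print(re.findall(word_regex_def, "noe~"))  # ['noe']
```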

Moreover, in GNU grep, combining characters are not regarded as being part of a word, so that one gets

$ /usr/bin/printf "noe\u0308l\n" | grep 'noe\b'
noël

One may wonder whether this is a bug. However, grep is a low-level utility that probably predates the combining characters, contrary to codespell.


vinc17fr avatar vinc17fr commented on June 29, 2024

According to https://debbugs.gnu.org/cgi/bugreport.cgi?bug=27681 (for a similar issue in GNU grep), the combining characters are supposed to be regarded as alphabetic characters (thus word characters), which is currently not the case with the GNU C Library. So I've reported a glibc bug (there has already been a bug report in the Debian BTS since 2017, but with no activity). A fix in the glibc should solve the issue in codespell and others as a consequence.


DimitriPapadopoulos avatar DimitriPapadopoulos commented on June 29, 2024

Python or Rust most probably don't use the GNU C library. We'll have to wait for a fix in Python itself.


epage avatar epage commented on June 29, 2024

Note that the rust crate is not using regexes for word splitting but doing it by hand.


DimitriPapadopoulos avatar DimitriPapadopoulos commented on June 29, 2024

Neither does the Python str.split.

Currently, codespell requires re.split because it splits on [\w\-'’ ] by default, and this does not seem possible using the simpler str.split.

Is split the Rust splitting function you are referring to? It seems equivalent to Python's str.split. I doubt we can use it for our purposes either, unless you can express "all non-word characters and - and ' and " as a slice of chars or closure.


epage avatar epage commented on June 29, 2024

For the Rust spell checker, typos, we have a hand-written implementation that allows us to capture the original string slice and classify what case the string is in a single pass (so we can case-change the correction to match the original style). This needs to be extended to handle UTF-8.


DimitriPapadopoulos avatar DimitriPapadopoulos commented on June 29, 2024

If I understand correctly, you classify as WordMode::Boundary anything not is_lowercase, is_uppercase or is_ascii_digit.

This limits spelling to ASCII characters, which might be good enough for English in most cases, and English is the focus of codespell. However, we currently gracefully handle accented characters that are part of imported English words such as soupçon; it would be a pity to lose that.
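To make the trade-off concrete, here is a Python sketch of a hand-written splitter that classifies by Unicode general category instead of ASCII ranges. It is not the typos implementation; is_word_char and split_words are hypothetical names, and the hyphen and apostrophe handling of the default regex is deliberately left out.

```python
import unicodedata

def is_word_char(ch):
    """Hypothetical classifier: letters, digits, and combining marks."""
    # General categories: L* = letters, N* = numbers, M* = combining marks.
    return unicodedata.category(ch)[0] in "LNM"

def split_words(text):
    """Split text into runs of word characters, one character at a time."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# Precomposed ç (U+00E7) and decomposed ë (e + U+0308) both stay in their words.
words = split_words("un soup\u00e7on de noe\u0308l")
print(words == ["un", "soup\u00e7on", "de", "noe\u0308l"])  # True
```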


DimitriPapadopoulos avatar DimitriPapadopoulos commented on June 29, 2024

Additionally, it would be great to be able to extend to other languages at some point, which means you cannot rely on ASCII or Latin-1 characters only.

With that said, I like the classify method; it could help split CamelCase words (splitting a word wherever is_lowercase precedes is_uppercase). Unfortunately, handling characters individually will most certainly result in a performance hit in Python, unlike in Rust.
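A minimal Python sketch of that CamelCase idea, using str.islower and str.isupper as stand-ins for the Rust predicates. The helper name split_camel_case is hypothetical, and only the lowercase-to-uppercase transition is treated as a boundary.

```python
def split_camel_case(word):
    """Hypothetical helper: break where a lowercase char precedes an uppercase one."""
    parts, start = [], 0
    for i in range(1, len(word)):
        if word[i - 1].islower() and word[i].isupper():
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts

print(split_camel_case("CamelCase"))          # ['Camel', 'Case']
print(split_camel_case("parseHTTPResponse"))  # ['parse', 'HTTPResponse']
```

Runs of capitals such as "HTTP" are not separated from the following word by this rule alone; that would need an additional uppercase-then-lowercase check.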

