What version of regex are you using? 1.10.2 (latest) <h4 dir="

`regex-syntax` error messages highlight incorrect characters/not handling graphemes correctly about regex HOT 4 OPEN

V0ldek commented on August 25, 2024

`regex-syntax` error messages highlight incorrect characters/not handling graphemes correctly

from regex.

Comments (4)

BurntSushi commented on August 25, 2024

Yeah indeed, this is an unfortunate artifact of the current implementation. I realized this when I wrote it and made exactly this trade off. Basically, I perceive three ways to fix this issue:

Get rid of nice error messages and just emit what happened without actually pointing at the syntax via carets.
Use a library like unicode-width that attempts to determine the displayed width of &str and char. Although it's not clear to me that it correctly supports graphemes. Maybe it does.
Roll our own version of unicode-width with out own Unicode tables.

My philosophy on whether to bring in dependencies for this crate effectively rules out (2). I try to keep the dependency tree very small.

We could go with (1), which means this sort of confusing case won't happen. But it makes error messages substantially worse (IMO) in the common case.

Which I think leaves option (3). I'm open to going that route, but it will need to satisfy the following things:

It will need to be behind a feature that can be disabled. e.g., unicode-error or something. It can be enabled by default.
The implementation needs to be manageable. That is, I'd expect something even simpler than unicode-width. Just a simple table in a similar format as the other Unicode tables, generated by ucd-generate.
The actual code to integrate it into error message formatting is reasonable.
There should not need to be an implementation of parsing grapheme clusters. I don't mean to say that I know this to be true, but rather, if getting this right means segmenting grahpemes, then I think it's going to eclipse my complexity threshold tolerance for fixing an issue like this.

I don't have any plans to work on this myself any time soon, so PRs are welcome. I would suggest posting a straw man implementation path before a PR so that we can have a meeting of the minds on implementation strategy. One can submit a PR first, but posting a more detailed proposal might help avoid wasting time.

from regex.

V0ldek commented on August 25, 2024

What about my last suggestion?

In absence of that, it would help to include the index of the character that caused the error in textual format. In the first example it'd say "character 7" or at least "byte index 10".

It would at least allow the user to diagnose the syntax error manually. This doesn't rule out working on a more robust solution like what you describe, but would be a big help with (I think) minimal effort.

from regex.

BurntSushi commented on August 25, 2024

I would be open to a PR adding the "byte offset" phrasing. That's a decent idea because it's at least more precise.

The "character" phrasing would, I believe, require grapheme segmentation to get correct.

from regex.

pascalkuthe commented on August 25, 2024

One thing I would like to add as somebody who spend way too much time on geapheme width: There is no such thing as a universally agreed upon width for graphene clusters. Unicode essentially shrugs and tells you to ask the font (really the notion of monospace width just doesn't make sense for some more exotic scripts where graphemes are important).

There is a definition which is commonly used for terminals which the unicode-width crate implements. The problem is this is a per codepoint width definition not per-graphem.

Terminal emulators disagree on the exact width (some do grapheme segmentation and break backwards compatability others don't, different emulators support different unicode versions, ...).

unicode-width is essentially a pretty well optimized lookup table. I implemented something like this myself in the past. The only thing you can do to simplify it is strip out the legacy cjk handling. Their lookup table is actually pretty well optimized (they compress their LUT, there is a lot of redundancy in the table otherwise which blows up binary size).

I am not sure what you policy is with dependencies exactly but it's worth noting that most downstream users that do care about rendering diagnostics probably pull in unicode-width for their own rendering anyway. To avoid having two different lut in the binary using the dependency could be preferable (behind an optional off by default feature flag since I imagine most users do not care).

from regex.

`regex-syntax` error messages highlight incorrect characters/not handling graphemes correctly about regex HOT 4 OPEN

Comments (4)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent