Comments (4)
Yeah indeed, this is an unfortunate artifact of the current implementation. I realized this when I wrote it and made exactly this trade-off. Basically, I perceive three ways to fix this issue:
1. Get rid of nice error messages and just emit what happened without actually pointing at the syntax via carets.
2. Use a library like `unicode-width` that attempts to determine the displayed width of `&str` and `char`. Although it's not clear to me that it correctly supports graphemes. Maybe it does.
3. Roll our own version of `unicode-width` with our own Unicode tables.
My philosophy on whether to bring in dependencies for this crate effectively rules out (2). I try to keep the dependency tree very small.
We could go with (1), which means this sort of confusing case won't happen. But it makes error messages substantially worse (IMO) in the common case.
Which I think leaves option (3). I'm open to going that route, but it will need to satisfy the following things:
- It will need to be behind a feature that can be disabled, e.g., `unicode-error` or something. It can be enabled by default.
- The implementation needs to be manageable. That is, I'd expect something even simpler than `unicode-width`. Just a simple table in a similar format as the other Unicode tables, generated by `ucd-generate`.
- The actual code to integrate it into error message formatting is reasonable.
- There should not need to be an implementation of parsing grapheme clusters. I don't mean to say that I know this to be true, but rather, if getting this right means segmenting graphemes, then I think it's going to eclipse my complexity threshold tolerance for fixing an issue like this.
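To make the shape of option (3) concrete, here is a rough sketch of a per-codepoint width table with a binary search lookup, used to position a caret. Everything in it is an assumption for illustration: the names `WIDTH_RANGES`, `char_width`, `display_width` and `caret_line` are hypothetical, and the table is a tiny hand-picked subset rather than anything generated by `ucd-generate`. It does no grapheme segmentation.

```rust
use std::cmp::Ordering;

/// Hypothetical, hand-picked subset of ranges for illustration only.
/// A real table would be generated from Unicode data and be far larger.
/// Ranges are sorted and non-overlapping: (start, end, width).
const WIDTH_RANGES: &[(char, char, u8)] = &[
    ('\u{0300}', '\u{036F}', 0), // combining diacritical marks
    ('\u{4E00}', '\u{9FFF}', 2), // CJK unified ideographs
    ('\u{AC00}', '\u{D7A3}', 2), // Hangul syllables
    ('\u{FF01}', '\u{FF60}', 2), // fullwidth forms
];

/// Width of a single codepoint; anything not in the table counts as 1.
fn char_width(c: char) -> usize {
    WIDTH_RANGES
        .binary_search_by(|&(start, end, _)| {
            if end < c {
                Ordering::Less
            } else if start > c {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .map(|i| WIDTH_RANGES[i].2 as usize)
        .unwrap_or(1)
}

/// Sum of per-codepoint widths (not per-grapheme).
fn display_width(s: &str) -> usize {
    s.chars().map(char_width).sum()
}

/// Builds the caret line for an error at `byte_offset`.
/// Panics if `byte_offset` is not on a char boundary.
fn caret_line(pattern: &str, byte_offset: usize) -> String {
    let col = display_width(&pattern[..byte_offset]);
    format!("{}^", " ".repeat(col))
}
```

With this sketch, `caret_line("日本(", 6)` indents the caret by four columns instead of six, since each of the two ideographs counts as two columns rather than three bytes.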
I don't have any plans to work on this myself any time soon, so PRs are welcome. I would suggest posting a straw man implementation path before a PR so that we can have a meeting of the minds on implementation strategy. One can submit a PR first, but posting a more detailed proposal might help avoid wasting time.
from regex.
What about my last suggestion?
In the absence of that, it would help to include the index of the character that caused the error in textual format. In the first example it'd say "character 7" or at least "byte index 10".
It would at least allow the user to diagnose the syntax error manually. This doesn't rule out working on a more robust solution like what you describe, but would be a big help with (I think) minimal effort.
from regex.
I would be open to a PR adding the "byte offset" phrasing. That's a decent idea because it's at least more precise.
The "character" phrasing would, I believe, require grapheme segmentation to get correct.
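As a minimal sketch of the difference, assuming hypothetical helper names (`format_offset_error` and `code_points_before` are not part of the crate), the byte offset can be reported directly, while counting codepoints already diverges from what a user would call a "character" as soon as combining marks are involved:

```rust
/// Hypothetical helper: reports the error position as a byte offset,
/// which is unambiguous regardless of how the pattern is rendered.
fn format_offset_error(byte_offset: usize, message: &str) -> String {
    format!("regex parse error at byte offset {}: {}", byte_offset, message)
}

/// Counts codepoints before `byte_offset`. This is NOT the same as the
/// number of user-perceived characters, which needs grapheme segmentation.
/// Panics if `byte_offset` is not on a char boundary.
fn code_points_before(pattern: &str, byte_offset: usize) -> usize {
    pattern[..byte_offset].chars().count()
}
```

For the pattern `"e\u{301}("` (an `e` followed by a combining acute accent, then `(`), the `(` sits at byte offset 3 and is preceded by two codepoints, yet a user sees only one character before it.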
from regex.
One thing I would like to add, as somebody who spent way too much time on grapheme width: there is no such thing as a universally agreed upon width for grapheme clusters. Unicode essentially shrugs and tells you to ask the font (really, the notion of monospace width just doesn't make sense for some more exotic scripts where graphemes are important).
There is a definition commonly used for terminals, which the `unicode-width` crate implements. The problem is that it is a per-codepoint width definition, not a per-grapheme one.
Terminal emulators disagree on the exact width (some do grapheme segmentation and break backwards compatibility, others don't; different emulators support different Unicode versions; ...).
`unicode-width` is essentially a pretty well optimized lookup table. I implemented something like this myself in the past. The only thing you can do to simplify it is strip out the legacy CJK handling. Their lookup table is actually pretty well optimized (they compress their LUT; otherwise there is a lot of redundancy in the table, which blows up binary size).
I am not sure what your policy on dependencies is exactly, but it's worth noting that most downstream users that do care about rendering diagnostics probably pull in `unicode-width` for their own rendering anyway. To avoid having two different LUTs in the binary, using the dependency could be preferable (behind an optional, off-by-default feature flag, since I imagine most users do not care).
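A sketch of what that gating could look like, assuming a hypothetical Cargo feature named `unicode-width` that enables the optional dependency of the same name (the feature name and the `display_width` helper are assumptions, not existing parts of the crate):

```rust
/// With the (hypothetical) feature enabled, defer to the `unicode-width`
/// crate so the binary shares one lookup table with downstream users.
#[cfg(feature = "unicode-width")]
fn display_width(s: &str) -> usize {
    use unicode_width::UnicodeWidthStr;
    s.width()
}

/// Without the feature, fall back to counting codepoints, which is only
/// right when every codepoint occupies exactly one column.
#[cfg(not(feature = "unicode-width"))]
fn display_width(s: &str) -> usize {
    s.chars().count()
}
```

Callers only ever see `display_width`, so the error formatting code would not need to know which variant was compiled in.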
from regex.