Giter Site home page Giter Site logo

Comments (4)

BurntSushi avatar BurntSushi commented on August 25, 2024

Yeah indeed, this is an unfortunate artifact of the current implementation. I realized this when I wrote it and made exactly this trade off. Basically, I perceive three ways to fix this issue:

  1. Get rid of nice error messages and just emit what happened without actually pointing at the syntax via carets.
  2. Use a library like unicode-width that attempts to determine the displayed width of &str and char. Although it's not clear to me that it correctly supports graphemes. Maybe it does.
  3. Roll our own version of unicode-width with out own Unicode tables.

My philosophy on whether to bring in dependencies for this crate effectively rules out (2). I try to keep the dependency tree very small.

We could go with (1), which means this sort of confusing case won't happen. But it makes error messages substantially worse (IMO) in the common case.

Which I think leaves option (3). I'm open to going that route, but it will need to satisfy the following things:

  1. It will need to be behind a feature that can be disabled. e.g., unicode-error or something. It can be enabled by default.
  2. The implementation needs to be manageable. That is, I'd expect something even simpler than unicode-width. Just a simple table in a similar format as the other Unicode tables, generated by ucd-generate.
  3. The actual code to integrate it into error message formatting is reasonable.
  4. There should not need to be an implementation of parsing grapheme clusters. I don't mean to say that I know this to be true, but rather, if getting this right means segmenting grahpemes, then I think it's going to eclipse my complexity threshold tolerance for fixing an issue like this.

I don't have any plans to work on this myself any time soon, so PRs are welcome. I would suggest posting a straw man implementation path before a PR so that we can have a meeting of the minds on implementation strategy. One can submit a PR first, but posting a more detailed proposal might help avoid wasting time.

from regex.

V0ldek avatar V0ldek commented on August 25, 2024

What about my last suggestion?

In absence of that, it would help to include the index of the character that caused the error in textual format. In the first example it'd say "character 7" or at least "byte index 10".

It would at least allow the user to diagnose the syntax error manually. This doesn't rule out working on a more robust solution like what you describe, but would be a big help with (I think) minimal effort.

from regex.

BurntSushi avatar BurntSushi commented on August 25, 2024

I would be open to a PR adding the "byte offset" phrasing. That's a decent idea because it's at least more precise.

The "character" phrasing would, I believe, require grapheme segmentation to get correct.

from regex.

pascalkuthe avatar pascalkuthe commented on August 25, 2024

One thing I would like to add as somebody who spend way too much time on geapheme width: There is no such thing as a universally agreed upon width for graphene clusters. Unicode essentially shrugs and tells you to ask the font (really the notion of monospace width just doesn't make sense for some more exotic scripts where graphemes are important).

There is a definition which is commonly used for terminals which the unicode-width crate implements. The problem is this is a per codepoint width definition not per-graphem.

Terminal emulators disagree on the exact width (some do grapheme segmentation and break backwards compatability others don't, different emulators support different unicode versions, ...).

unicode-width is essentially a pretty well optimized lookup table. I implemented something like this myself in the past. The only thing you can do to simplify it is strip out the legacy cjk handling. Their lookup table is actually pretty well optimized (they compress their LUT, there is a lot of redundancy in the table otherwise which blows up binary size).

I am not sure what you policy is with dependencies exactly but it's worth noting that most downstream users that do care about rendering diagnostics probably pull in unicode-width for their own rendering anyway. To avoid having two different lut in the binary using the dependency could be preferable (behind an optional off by default feature flag since I imagine most users do not care).

from regex.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.