Comments (8)

ianh avatar ianh commented on August 18, 2024

For more info about custom tokens, check out the user-defined tokens section of generated-parser.md in the docs.

The full details of how lexing works are in src/x-tokenize.h. Here's where whitespace is handled in the default lexing loop, for example. There has been some discussion of customizing whitespace handling (see #4), but I'd like to see more examples of what people want to do with it before committing to a design. Are you looking to implement Python-style indentation-based syntax or something else?

from owl.

ianh avatar ianh commented on August 18, 2024

I just realized that the documentation doesn't mention the info parameter explicitly. I added some more detail in beef920.

TheThirdOne avatar TheThirdOne commented on August 18, 2024

I hadn't seen that section on user defined tokens in the docs.

> The full details of how lexing works are in src/x-tokenize.h. Here's where whitespace is handled in the default lexing loop,

I definitely could read over the code carefully to learn what I need to, but having it explained in the docs is nice.

> There has been some discussion of customizing whitespace handling (see #4)

I had seen that and at this point I could write a tokenizer that can do any of the whitespace related things that had popped into my head.

> Are you looking to implement Python-style indentation-based syntax or something else?

I have a language in mind that occasionally needs explicit newlines for parsing, and I was curious about how comments were implemented without a newline token (I now know that it is hardcoded).

It seems that the default tokenizer is pretty different from what I would need to do the type of parsing I want to do. I guess the one question left on my mind is how to completely overwrite it so that only a custom tokenizer is active.

Do lines like this get added by simply not referencing the token in the grammar?

#define IF_NUMBER_TOKEN(...) if (0) { /* no number tokens */  }

Is it even possible for a custom tokenizer to prevent the whitespace slurping here?

ianh avatar ianh commented on August 18, 2024

> Is it even possible for a custom tokenizer to prevent the whitespace slurping here?

No, not at the moment. As you can maybe guess from its name, I originally wanted to allow overriding owl_default_tokenizer_advance entirely, but I couldn't find a good way to expose the information you'd need to do this cleanly.

Would it be enough to specify the whitespace characters explicitly? Something like this:

# only treat tabs and spaces as whitespace characters
.whitespace '\t' ' '

Then you could do whatever you want with the newline characters in the tokenizer function.
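As a rough sketch of how that split could look (every name below is hypothetical and stands in for the real thing; none of it is Owl's actual API): the default loop skips only tabs and spaces, and a custom tokenizer hook claims the newline characters as a token of its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds for this sketch; these are not Owl's real types. */
enum sketch_token_type { SKETCH_NO_MATCH, SKETCH_NEWLINE };

struct sketch_token {
    enum sketch_token_type type;
    size_t length; /* number of bytes consumed by the token */
};

/* Mirrors the proposed `.whitespace '\t' ' '`: only tabs and spaces are skipped.
   Returns the number of leading whitespace bytes. */
static size_t skip_whitespace(const char *text) {
    assert(text != NULL);
    size_t n = 0;
    while (text[n] == '\t' || text[n] == ' ')
        n++;
    return n;
}

/* Custom tokenizer hook: claims "\n" and "\r\n" as a newline token
   and reports no match for anything else. */
static struct sketch_token tokenize_newline(const char *text) {
    assert(text != NULL);
    if (text[0] == '\n')
        return (struct sketch_token){ SKETCH_NEWLINE, 1 };
    if (text[0] == '\r' && text[1] == '\n')
        return (struct sketch_token){ SKETCH_NEWLINE, 2 };
    return (struct sketch_token){ SKETCH_NO_MATCH, 0 };
}
```

Because the newline is no longer skipped as whitespace, it reaches the hook and the grammar can refer to it like any other user-defined token.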

TheThirdOne avatar TheThirdOne commented on August 18, 2024

> No, not at the moment. As you can maybe guess from its name, I originally wanted to allow overriding owl_default_tokenizer_advance entirely, but I couldn't find a good way to expose the information you'd need to do this cleanly.

Maybe just have another function that can be used instead (selected by a boolean option), with whitespace handling and all other default tokenization removed. Perhaps with a few additional things the custom tokenizer can return to update the offset and such without making a token.

Something like specifying the whitespace characters could work. I think in non-extreme cases, that would work pretty well if whitespace actually has meaning. If nothing were specified, would that allow the tokenizer to specify all whitespace as special and then just handle it in the grammar?

ianh avatar ianh commented on August 18, 2024

Yeah, though lacking whitespace could make ambiguity reporting unreliable. For example, a grammar like:

x = y y
y = 'a' | 'a' 'a' | 'aa'

Would be reported as ambiguous ('a' 'a' 'a' -> "aaa"), but due to longest-match tokenization, "aaa" is not a real ambiguity (it's always parsed as "aa" "a"). With whitespace, Owl could report the ambiguity as "a a a", which is legitimately ambiguous. It's possible to check whether this can happen, but I'm leaning toward just giving a warning when reporting ambiguities without any whitespace specified.
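To make the longest-match point concrete, here is a minimal standalone sketch (not Owl's tokenizer; the function names and the fixed token table are assumptions for illustration) that splits a string by always taking the longest token that matches at the current position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Token spellings from the example grammar: 'a' and 'aa'. */
static const char *const TOKENS[] = { "a", "aa" };
static const size_t TOKEN_COUNT = sizeof TOKENS / sizeof TOKENS[0];

/* Returns the length of the longest token that prefixes `text`, or 0 if none does. */
static size_t longest_match(const char *text) {
    assert(text != NULL);
    size_t best = 0;
    for (size_t i = 0; i < TOKEN_COUNT; i++) {
        size_t len = strlen(TOKENS[i]);
        if (len > best && strncmp(text, TOKENS[i], len) == 0)
            best = len;
    }
    return best;
}

/* Splits `text` by repeated longest match, writing each token length to `out`.
   Returns the number of tokens; stops at the first unmatched byte or when `out` is full. */
static size_t split_lengths(const char *text, size_t *out, size_t max) {
    assert(text != NULL && out != NULL);
    size_t count = 0;
    while (*text != '\0' && count < max) {
        size_t len = longest_match(text);
        if (len == 0)
            break;
        out[count++] = len;
        text += len;
    }
    return count;
}
```

Calling `split_lengths("aaa", lens, 4)` yields two tokens of lengths 2 and 1, i.e. "aa" "a". The input can never come out as three separate 'a' tokens, which is why the reported ambiguity is not a real one when no whitespace separates them.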

ianh avatar ianh commented on August 18, 2024

I just pushed a bunch of changes to make .whitespace do what we discussed here. Custom tokenizer functions can also now return tokens with type OWL_WHITESPACE to treat a length of text as whitespace. The documentation should be up-to-date with these changes. Let me know if you have any feedback!
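For illustration only, the idea of reporting a run of text as whitespace might look like the sketch below. The types and names are hypothetical stand-ins so that the snippet compiles on its own; the real token struct and the OWL_WHITESPACE constant come from the generated parser, and the `//` comment syntax is just an assumed example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for this sketch; the generated parser defines its own types. */
enum sketch_type { SKETCH_NONE, SKETCH_WHITESPACE };

struct sketch_result {
    enum sketch_type type;
    size_t length; /* bytes covered by the result */
};

/* Treats a `//` comment as whitespace, up to but not including the newline,
   so the newline itself stays available to be tokenized separately. */
static struct sketch_result tokenize_comment(const char *text) {
    assert(text != NULL);
    if (text[0] != '/' || text[1] != '/')
        return (struct sketch_result){ SKETCH_NONE, 0 };
    size_t n = 2;
    while (text[n] != '\0' && text[n] != '\n')
        n++;
    return (struct sketch_result){ SKETCH_WHITESPACE, n };
}
```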

ianh avatar ianh commented on August 18, 2024

Closing this issue as resolved. Feel free to open another issue for any problems or questions.
