I can't find any resources on how to define a custom tokenizer other than the [...]

Documentation on Custom Tokenization (owl issue, closed)

TheThirdOne commented on August 18, 2024

Documentation on Custom Tokenization

from owl.

Comments (8)

ianh commented on August 18, 2024

For more info about custom tokens, check out the user-defined tokens section of generated-parser.md in the docs.

The full details of how lexing works are in src/x-tokenize.h. Here's where whitespace is handled in the default lexing loop, for example. There has been some discussion of customizing whitespace handling (see #4), but I'd like to see more examples of what people want to do with it before committing to a design. Are you looking to implement Python-style indentation-based syntax or something else?

ianh commented on August 18, 2024

I just realized that the documentation doesn't mention the info parameter explicitly. I added some more detail in beef920.

TheThirdOne commented on August 18, 2024

I hadn't seen that section on user-defined tokens in the docs.

> The full details of how lexing works are in src/x-tokenize.h. Here's where whitespace is handled in the default lexing loop,

I definitely could read over the code carefully to learn what I need to, but having it explained in the docs is nice.

> There has been some discussion of customizing whitespace handling (see #4)

I had seen that, and at this point I could write a tokenizer that does any of the whitespace-related things that had popped into my head.

> Are you looking to implement Python-style indentation-based syntax or something else?

I have a language in mind that occasionally needs explicit newlines for parsing, and I was curious how comments were handled without a newline token (I now know that is hardcoded).

It seems that the default tokenizer is quite different from what I would need for the kind of parsing I want to do. The one question left on my mind is how to override it completely so that only a custom tokenizer is active.

Do lines like this get added by simply not referencing the token in the grammar?

#define IF_NUMBER_TOKEN(...) if (0) { /* no number tokens */  }

Is it even possible for a custom tokenizer to prevent the whitespace slurping here?

ianh commented on August 18, 2024

> Is it even possible for a custom tokenizer to prevent the whitespace slurping here?

No, not at the moment. As you can maybe guess from its name, I originally wanted to allow overriding owl_default_tokenizer_advance entirely, but I couldn't find a good way to expose the information you'd need to do this cleanly.

Would it be enough to specify the whitespace characters explicitly? Something like this:

# only treat tabs and spaces as whitespace characters
.whitespace '\t' ' '

Then you could do whatever you want with the newline characters in the tokenizer function.
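To make the idea concrete, here is a minimal, self-contained sketch of a tokenizer callback that turns line breaks into tokens once newlines are no longer skipped as whitespace. The `sketch_token` struct, the `SKETCH_*` constants, and `newline_tokenizer` are hypothetical stand-ins written for this example. They are not Owl's actual definitions, which come from the generated parser and are described in generated-parser.md.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the token types a generated parser would define. */
enum sketch_token_type { SKETCH_NO_MATCH, SKETCH_NEWLINE };

struct sketch_token {
    enum sketch_token_type type;
    size_t length; /* number of bytes consumed from the input */
};

/* Called at the current input position. Matches one line break ("\n" or
   "\r\n") and reports no match for anything else, leaving that text to
   the default tokenization. `info` is an opaque user pointer, unused here. */
struct sketch_token newline_tokenizer(const char *text, void *info)
{
    struct sketch_token tok = { SKETCH_NO_MATCH, 0 };
    (void)info;
    if (text[0] == '\n') {
        tok.type = SKETCH_NEWLINE;
        tok.length = 1;
    } else if (text[0] == '\r' && text[1] == '\n') {
        tok.type = SKETCH_NEWLINE;
        tok.length = 2;
    }
    return tok;
}
```

With tabs and spaces still skipped as whitespace, a grammar could then refer to the newline token only in the places where a line break matters.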

TheThirdOne commented on August 18, 2024

> No, not at the moment. As you can maybe guess from its name, I originally wanted to allow overriding owl_default_tokenizer_advance entirely, but I couldn't find a good way to expose the information you'd need to do this cleanly.

Maybe just provide another function that can be used instead (selected by a boolean option), with whitespace handling and all other default tokenization removed. Perhaps with a few additional things the custom tokenizer can return to update the offset and such without making a token.

Something like specifying the whitespace characters could work. I think in non-extreme cases, that would work pretty well if whitespace actually has meaning. If nothing were specified, would that allow the tokenizer to specify all whitespace as special and then just handle it in the grammar?

ianh commented on August 18, 2024

Yeah, though lacking whitespace could make ambiguity reporting unreliable. For example, a grammar like:

x = y y
y = 'a' | 'a' 'a' | 'aa'

Would be reported as ambiguous ('a' 'a' 'a' -> "aaa"), but due to longest-match tokenization, "aaa" is not a real ambiguity (it's always parsed as "aa" "a"). With whitespace, Owl could report the ambiguity as "a a a", which is legitimately ambiguous. It's possible to check whether this can happen, but I'm leaning toward just giving a warning when reporting ambiguities without any whitespace specified.
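The longest-match behaviour described here can be illustrated with a small standalone sketch. It is not Owl's lexer: `longest_match` and `tokenize_all` are hypothetical helpers written for this example, and the keyword list simply mirrors the 'a' and 'aa' tokens of the grammar above.

```c
#include <stddef.h>
#include <string.h>

/* Keywords from the example grammar: 'a' and 'aa'. */
static const char *const keywords[] = { "a", "aa" };
#define KEYWORD_COUNT (sizeof keywords / sizeof keywords[0])

/* Returns the length of the longest keyword that is a prefix of `text`,
   or 0 if no keyword matches. */
size_t longest_match(const char *text)
{
    size_t best = 0;
    for (size_t i = 0; i < KEYWORD_COUNT; ++i) {
        size_t len = strlen(keywords[i]);
        if (len > best && strncmp(text, keywords[i], len) == 0)
            best = len;
    }
    return best;
}

/* Splits `text` greedily, writing token lengths into `lengths` (at most
   `max` entries). Returns the number of tokens found and stops at the
   first byte that no keyword matches. */
size_t tokenize_all(const char *text, size_t *lengths, size_t max)
{
    size_t count = 0;
    while (*text != '\0' && count < max) {
        size_t len = longest_match(text);
        if (len == 0)
            break;
        lengths[count++] = len;
        text += len;
    }
    return count;
}
```

Running `tokenize_all` on "aaa" yields two tokens of lengths 2 and 1, i.e. "aa" followed by "a", which is why the reported ambiguity cannot actually occur for that input.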

ianh commented on August 18, 2024

I just pushed a bunch of changes to make .whitespace do what we discussed here. Custom tokenizer functions can also now return tokens with type OWL_WHITESPACE to treat a length of text as whitespace. The documentation should be up-to-date with these changes. Let me know if you have any feedback!
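As a rough illustration of returning a whitespace-typed token, the sketch below treats a `//` comment as whitespace up to the end of the line. Everything in it is a hypothetical stand-in written for this example: `DEMO_WHITESPACE` plays the role that `OWL_WHITESPACE` plays in the comment above, and the struct layout, function name, and comment syntax are assumptions. The Owl documentation has the real struct and callback signature.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; not the definitions from Owl's generated parser. */
enum demo_token_type { DEMO_NO_MATCH, DEMO_WHITESPACE };

struct demo_token {
    enum demo_token_type type;
    size_t length; /* number of bytes the token covers */
};

/* Treats a "//" comment as whitespace, up to but not including the line
   break, so a newline-sensitive grammar still sees the newline itself.
   `info` is an opaque user pointer, unused here. */
struct demo_token comment_tokenizer(const char *text, void *info)
{
    struct demo_token tok = { DEMO_NO_MATCH, 0 };
    (void)info;
    if (strncmp(text, "//", 2) == 0) {
        tok.type = DEMO_WHITESPACE;
        tok.length = strcspn(text, "\n"); /* bytes before the first '\n' */
    }
    return tok;
}
```

Stopping before the line break is a design choice for this sketch: it lets a separate newline token still be produced after a trailing comment.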

ianh commented on August 18, 2024

Closing this issue as resolved. Feel free to open another issue for any problems or questions.
