
Efficient parsing of UTF-8 (passerine, open issue)

vrtbl commented on May 10, 2024
Efficient parsing of UTF-8

from passerine.

Comments (3)

slightknack commented on May 10, 2024

Right now, the lexer is very inefficient, as it doesn't use a state machine. I think we should refactor it; this should be a viable optimization.
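To make the state-machine idea concrete, here is a minimal sketch. It is not Passerine's lexer: `Tok`, `State`, and `lex` are hypothetical names, and it only handles identifiers, unsigned integers, and whitespace. Each character is visited exactly once, and the current state decides whether the token in progress continues or ends.

```rust
/// Hypothetical token type for this sketch (not Passerine's `Token`).
#[derive(Debug, PartialEq)]
enum Tok {
    Iden(String),
    Int(u64),
}

/// The states of the machine: which kind of token we are currently inside.
#[derive(Clone, Copy, PartialEq)]
enum State {
    Start,
    Iden,
    Int,
}

/// Single-pass lexer: one state transition per character, no backtracking.
fn lex(source: &str) -> Result<Vec<Tok>, String> {
    let mut tokens = Vec::new();
    let mut state = State::Start;
    let mut start = 0; // byte offset where the current token began

    // A trailing space acts as a sentinel so the last token gets flushed.
    let sentinel = std::iter::once((source.len(), ' '));
    for (i, c) in source.char_indices().chain(sentinel) {
        // Does `c` continue the token we are currently inside?
        let continues = match state {
            State::Start => false,
            State::Iden => c.is_alphanumeric() || c == '_',
            State::Int => c.is_ascii_digit(),
        };
        if continues {
            continue;
        }
        // The token (if any) ends just before `c`, so emit it.
        let text = &source[start..i];
        match state {
            State::Start => {}
            State::Iden => tokens.push(Tok::Iden(text.to_string())),
            State::Int => {
                let n = text.parse::<u64>().map_err(|e| e.to_string())?;
                tokens.push(Tok::Int(n));
            }
        }
        // Choose the next state from `c` alone.
        start = i;
        state = if c.is_whitespace() {
            State::Start
        } else if c.is_alphabetic() || c == '_' {
            State::Iden
        } else if c.is_ascii_digit() {
            State::Int
        } else {
            return Err(format!("unrecognized character `{}` at byte {}", c, i));
        };
    }
    Ok(tokens)
}
```

Calling `lex("foo 42 _bar9")` yields an identifier, an integer, and another identifier; any character outside the three classes above produces an `Err`.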

As for pulling a crate like https://docs.rs/memchr, Passerine's core depends only on Rust's standard library and has no external dependencies. For this reason, if we were to do vectorization, I suggest we extract the core concept of the library to vectorize as needed.
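As a rough illustration of what extracting the core concept could look like with only the standard library, here is a sketch of a word-at-a-time (SWAR) byte search. The names `find_byte` and `has_zero_byte` are made up for this example, and it is not the memchr crate's implementation, which also uses platform SIMD. Searching for an ASCII byte this way is safe on UTF-8 input, because ASCII bytes never occur inside a multi-byte sequence.

```rust
use std::convert::TryInto;

const LO: u64 = 0x0101_0101_0101_0101; // 0x01 in every byte
const HI: u64 = 0x8080_8080_8080_8080; // 0x80 in every byte

/// True if any of the eight bytes packed into `word` is zero.
fn has_zero_byte(word: u64) -> bool {
    (word.wrapping_sub(LO) & !word & HI) != 0
}

/// Hypothetical std-only search: index of the first `needle` in `haystack`.
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    let pattern = LO * needle as u64; // `needle` repeated in every byte
    let mut chunks = haystack.chunks_exact(8);
    let mut offset = 0;
    for chunk in &mut chunks {
        let word = u64::from_ne_bytes(chunk.try_into().unwrap());
        // XOR zeroes exactly those bytes that equal `needle`.
        if has_zero_byte(word ^ pattern) {
            // A match is somewhere in these 8 bytes: locate it with a short scan.
            return chunk.iter().position(|&b| b == needle).map(|i| offset + i);
        }
        offset += 8;
    }
    // Fewer than 8 bytes left: fall back to a plain scan.
    chunks
        .remainder()
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}
```

Eight bytes are tested per iteration instead of one; whether that actually beats a plain `iter().position(..)` on lexer-sized inputs would need measuring.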

Thanks for pointing this out! Much appreciated :D


slightknack commented on May 10, 2024

@Plecra I recently completely redid the lexer in big-refactor. Would you mind taking a look at it and letting me know what style/performance improvements can be made? I suspect there's something with the way we're using peekable iterators that could be improved:

    /// Parses the next token.
    /// Expects all whitespace and comments to be stripped.
    fn next_token(&mut self) -> Result<Spanned<Token>, Syntax> {
        let mut remaining = self.remaining().peekable();
        let (token, len) = match remaining.next().unwrap() {
            // separator
            c @ ('\n' | ';') => self.take_while(
                &mut once(c).chain(remaining).peekable(),
                |_| Token::Sep,
                |n| n.is_whitespace() || n == ';'
            ),
            // the unit type, `()`
            '(' if Some(')') == remaining.next() => {
                (Token::Lit(Lit::Unit), 2)
            },
            // Grouping
            '(' => (Token::Open(Delim::Paren), 1),
            '{' => (Token::Open(Delim::Curly), 1),
            '[' => (Token::Open(Delim::Square), 1),
            ')' => (Token::Close(Delim::Paren), 1),
            '}' => (Token::Close(Delim::Curly), 1),
            ']' => (Token::Close(Delim::Square), 1),
            // Label
            c if c.is_alphabetic() && c.is_uppercase() => {
                self.take_while(
                    &mut once(c).chain(remaining).peekable(),
                    |s| match s {
                        // TODO: In the future, booleans in prelude as ADTs
                        "True" => Token::Lit(Lit::Boolean(true)),
                        "False" => Token::Lit(Lit::Boolean(false)),
                        _ => Token::Label(s.to_string()),
                    },
                    |n| n.is_alphanumeric() || n == '_'
                )
            },
            // Iden
            c if c.is_alphabetic() || c == '_' => {
                self.take_while(
                    &mut once(c).chain(remaining).peekable(),
                    |s| Token::Iden(s.to_string()),
                    |n| n.is_alphanumeric() || n == '_'
                )
            },
            // Number literal:
            // Integer: 28173908, etc.
            // Radix: 0b1011001011, 0xFF, etc.
            // Float: 420.69, 0.0, etc.
            c @ '0'..='9' => {
                if c == '0' {
                    if let Some(n) = remaining.next() {
                        // Potentially integers in other radixes
                        self.radix_literal(n, remaining)?
                    } else {
                        // End of source, must be just `0`
                        (Token::Lit(Lit::Integer(0)), 1)
                    }
                } else {
                    // parse decimal literal
                    // this could be an integer
                    // but also a floating point number
                    self.decimal_literal(once(c).chain(remaining).peekable())?
                }
            }
            // String
            '"' => self.string(remaining)?,
            // TODO: choose characters for operator set
            // don't have both a list and `is_ascii_punctuation`
            // Op
            c if OP_CHARS.contains(c) => {
                self.take_while(
                    &mut once(c).chain(remaining).peekable(),
                    |s| Token::Op(s.to_string()),
                    |n| OP_CHARS.contains(n),
                )
            },
            // Unrecognized char
            unknown => return Err(Syntax::error(
                &format!(
                    "Hmm... The character `{}` is not recognized in this context - check for encoding issues or typos",
                    unknown,
                ),
                &Span::point(&self.source, self.index),
            )),
        };
        let spanned =
            Spanned::new(token, Span::new(&self.source, self.index, len));
        self.index += len;
        Ok(spanned)
    }


Plecra commented on May 10, 2024

It's awfully late, but I just saw this in my email, and I like it 😆. The token representation and parsing logic all flow very nicely. If parsing ever becomes a bottleneck, passerine could revisit the once/chain/peekable pattern, since rustc will chug on it a bit.
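If the once/chain/peekable pattern ever does need replacing, one possible direction is to work on the remaining `&str` directly and measure the token with `str::find`. The sketch below is an assumption-laden illustration: this free function `take_while` is hypothetical and only mirrors the argument order of the method called in the lexer above, whose real signature isn't shown in this thread, and it assumes token lengths are counted in bytes.

```rust
/// Hypothetical slice-based analogue of the lexer's `take_while`:
/// returns the built token together with its length in bytes.
fn take_while<T>(
    remaining: &str,
    wrap: impl Fn(&str) -> T,
    pred: impl Fn(char) -> bool,
) -> (T, usize) {
    // `find` gives the byte offset of the first char that fails `pred`;
    // if every char passes, the token runs to the end of the source.
    let len = remaining
        .find(|c: char| !pred(c))
        .unwrap_or(remaining.len());
    (wrap(&remaining[..len]), len)
}
```

Unlike the iterator version, the predicate here is also applied to the first character, so a caller that has already matched on the first character would either pass a predicate that accepts it or slice past it and add its length back. The token text is borrowed straight from the source, with no intermediate iterator adapters for the compiler to see through.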
