
embedded-ecmascript's Introduction

This is embedded-ecmascript version 0.1.0-dev

A Rust library to embed ECMAScript into desktop and automotive programs.

It conforms to:

  • ECMAScript 2023 (ECMA-262, 14th edition) in progress
  • ECMAScript 2023 internationalization API (ECMA-402, 10th edition) todo
  • The JSON data interchange syntax (ECMA-404, 2nd edition) todo
  • Unicode 15.1 in progress

Copyright and License Information

Copyright © 2024 Oleg Iarygin [email protected].

embedded-ecmascript's People

Contributors

arhadthedev


embedded-ecmascript's Issues

Merge lexical and syntax grammar into a single PEG tree

I propose to alter gh-9 by no longer separating the lexer and the parser, in order to:

  • avoid the get-a-token-and-the-unparsed-rest-per-call lexer
  • merge keywords, currently implemented as out-of-spec tokens, into the rules in a way native to Pest, the parser generator
  • give Pest more information for proper error messages and line counting, without our glue code between a lexer and a parser
  • skip the InputElementDiv / InputElementRegExp / InputElementRegExpOrTemplateTail / InputElementTemplateTail / InputElementHashbangOrRegExp lexical goal selection, because the parent grammar rule already determines the subset of expected symbols

Add ES2023 numeric literal tokenization

Add CLI wrapper

Planned directory structure:

  • embedded_ecmascript
    • parser/
    • runner/
    • cli_wrapper.cpp
    • python_wrapper.cpp

Add ES2023 parsing

Programming interface ideas:

use std::fmt::{self, Display, Formatter};

impl Display for AnnotatedTextRange {
    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
        write!(
            f,
            "line {}, column {}: {}",
            self.location.line, self.location.column, self.message,
        )
    }
}

impl Display for Error {
    fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
        // Do not confuse users with internal spec terminology.
        // Everything is an error for practical script developers.
        write!(f, "error: ")?;
        match *self {
            Error::ParseError(ref info) => info.fmt(f),
            Error::EarlyError(ref info) => info.fmt(f),
        }
    }
}

Or replace the Display trait with a function taking an offset into the source text and returning line and column numbers.
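
A minimal sketch of that alternative, assuming a byte offset into the source text and 1-based line and column numbers (the function name is a placeholder):

fn line_and_column(source: &str, offset: usize) -> (usize, usize) {
    // Count line breaks before the offset; the column is the distance from
    // the last line break (or from the start of the text).
    let before = &source[..offset];
    let line = before.matches('\n').count() + 1;
    let column = before.rfind('\n').map_or(offset, |last| offset - last - 1) + 1;
    (line, column)
}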

Protect `main` branch

To enforce pull requests and their checks we need to add a branch protection rule with the following settings:

  • Branch name pattern: main
  • Require a pull request before merging: true
  • Require status checks to pass before merging: true edit: GitHub Action workflows do not count as status checks
  • Lock branch: true edit: it prevents merging pull requests into the branch; Require a pull request before merging already covers our needs
  • Do not allow bypassing the above settings: true

Note: this issue cannot be resolved until the status checks are added in gh-1.

Possible plans for a Python library

Example use case if we decide not to follow the Node.js path and instead become embeddable into other scripts:

from browsername.scripting.native_modules import Window, Io
from browsername.scripting.native_modules.io import HttpTransport, InMemoryReceiver, Ip4Address, NetworkHost

server = NetworkHost(Ip4Address(127, 0, 0, 1))
serverRequest = HttpTransport()
serverRequest.connectSource(server)
response = InMemoryReceiver()
response.connectSource(serverRequest)

In such a paradigm, a test can be expressed as follows, using tests/__init__.py as an example:

"""
Helper functions and classes used by actual tests (test_*.py files).

Copyright © 2024 Oleg Iarygin <[email protected]>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
"""

from pathlib import Path
from site import addsitedir
from subprocess import check_output

# addsitedir requires a directory; the repository root is assumed here.
addsitedir(str(Path(__file__).parent.parent))


def interpret(script: Path) -> bytes:
    return check_output([str(script)])

Also, we should add a directory called docs\ecma262_engine_contributor_manual and put the following python-debug-binaries-musthave.png picture inside:

(picture: python-debug-binaries-musthave.png)

Fix `clippy::large_stack_frames` in Clippy report

The per-PR and per-commit GitHub Actions Style task reports "this function allocates a large amount of stack space" without pointing to the function in question:

Run cargo clippy --all-targets --all-features  -- -W clippy::pedantic -W clippy::nursery -W clippy::cargo
    Updating crates.io index
 Downloading crates ...
  Downloaded cfg-if v1.0.0
[snip]
    Checking embedded-ecmascript v0.1.0 (/home/runner/work/embedded-ecmascript/embedded-ecmascript)
error: this function allocates a large amount of stack space
  |
  = note: allocating large amounts of stack space can overflow the stack
  = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#large_stack_frames
  = note: `-D clippy::large-stack-frames` implied by `-D warnings`
  = help: to override `-D warnings` add `#[allow(clippy::large_stack_frames)]`

error: could not compile `embedded-ecmascript` (test "conformance_ecma262compilers") due to 1 previous error
Error: Process completed with exit code 101.

Such a message fails the whole style check. As a result, we cannot tell whether the corresponding PR is otherwise correct without opening the full report and scrolling through it. This will certainly lead to the failing status being ignored altogether, given the hassle of browsing hundreds of lines after each and every commit to each and every PR.

So we need to either address the issue if it is caused by our code, or silence the warning if it is caused by lots of local variables inside the #[rstest] and #[values(...)] macro expansion, as sketched below.
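
A minimal sketch of the silencing option; the attribute comes straight from the Clippy help message above, while the test and parameter names are made up for illustration:

use rstest::rstest;

#[rstest]
#[allow(clippy::large_stack_frames)]
fn conformance_case(#[values("pass", "pass-explicit")] directory: &str) {
    assert!(!directory.is_empty());
}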

Add ECMAScript standard library

Unlike the runtime extensions from #64, standard library modules export not integrate(targetGlobalThis) but the much lower-level integrate(globalEnvironmentRecord), which is inaccessible to the extensions.

Example of the function for ECMA-402 (internationalization API specification):

export function integrate(globalEnvironmentRecord) {
  const Intl = {
    Collator: function(locales = undefined, options = undefined) {
      // ...
    }
  };
  // Bind through the global environment record rather than through globalThis.
  globalEnvironmentRecord.Intl = Intl;
}

Address `clippy::must_use_candidate` for ECMA-262 static semantics

While implementing gh-120, I got the following recommendations from Clippy for lines not affected by the PR:

error: this method could have a `#[must_use]` attribute
   --> src/lexical_grammar.rs:105:5
    |
105 |     pub fn string_value(&self) -> String {
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: add the attribute: `#[must_use] pub fn string_value(&self) -> String`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#must_use_candidate
    = note: `-D clippy::must-use-candidate` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::must_use_candidate)]`

error: this method could have a `#[must_use]` attribute
   --> src/lexical_grammar.rs:121:5
    |
121 |     pub fn string_value(&self) -> String {
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: add the attribute: `#[must_use] pub fn string_value(&self) -> String`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#must_use_candidate

error: this method could have a `#[must_use]` attribute
   --> src/lexical_grammar.rs:653:5
    |
653 |     pub fn string_value(&self) -> &str {
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: add the attribute: `#[must_use] pub fn string_value(&self) -> &str`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#must_use_candidate

Since CI is configured to treat Clippy warnings as errors, we need to address them before proceeding with the original gh-120.
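
A minimal sketch of the fix Clippy suggests; the surrounding type and field are invented for illustration and are not the actual items from src/lexical_grammar.rs:

pub struct WhiteSpace {
    raw: String,
}

impl WhiteSpace {
    /// The computed value is the whole point of the call, so callers should
    /// not silently discard it.
    #[must_use]
    pub fn string_value(&self) -> String {
        self.raw.clone()
    }
}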

Replace flattened UnpackedToken with ECMA-262 static semantics via the original tree in Token

From #113 (comment):

Since the whole purpose of flattening is token precalculation, we can replace it with our custom ECMA-262-style static semantics (recurrent impl functions).

The grouping of grammar and ORM gave an idea to develop the separation further, since the tree classes no longer get in the way, so we can open and forget them.

Edit: an extra reason is to expose the full lexer grammar tree to users.

The current implementation flattens the grammar tree into an enum of leaf tokens. However, it requires a tokenizer user to constantly perform a mental conversion between the two-tree merge defined in the specification and the flat list.

Since the whole purpose of flattening is token precalculation, we can replace it with our custom ECMA-262-style static semantics (recurrent impl functions), as in the sketch below.
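
A minimal sketch of such static semantics, with invented node types standing in for the actual grammar tree:

pub enum IdentifierPart {
    UnicodeIdChar(char),
    Escaped(char),
}

pub struct IdentifierName {
    parts: Vec<IdentifierPart>,
}

impl IdentifierPart {
    /// Static semantics: the code point contributed by this node.
    fn string_value(&self) -> char {
        match self {
            Self::UnicodeIdChar(c) | Self::Escaped(c) => *c,
        }
    }
}

impl IdentifierName {
    /// Static semantics computed by recurring into child nodes on demand
    /// instead of being precalculated during flattening.
    pub fn string_value(&self) -> String {
        self.parts.iter().map(IdentifierPart::string_value).collect()
    }
}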

Add ES2023 tokenization

For starters, it can be a Tokenizer struct right inside the crate.

The struct should recognize the lexical grammar described in https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar.

The struct should provide methods to switch between goal symbols on the fly. The spec section linked above gives the following reasoning for such a feature:

There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements. This requires multiple goal symbols for the lexical grammar. The InputElementRegExpOrTemplateTail goal is used in syntactic grammar contexts where a RegularExpressionLiteral, a TemplateMiddle, or a TemplateTail is permitted.

The syntactic grammar context is determined by parser state, so it is the parser that needs to switch the current lexical grammar goal symbol to adjust tokenization.

The user entry point is the Tokenizer object with its get_next_symbol method:

/// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
///
/// > There are several situations where the identification of lexical input
/// > elements is sensitive to the syntactic grammar context that is consuming
/// > the input elements. This requires multiple goal symbols for the lexical
/// > grammar.
#[derive(Debug)]
pub enum GoalSymbols {
    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > Used at the start of a Script or Module
    InputElementHashbangOrRegExp,

    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > Used in syntactic grammar contexts where a `RegularExpressionLiteral`,
    /// > a `TemplateMiddle`, or a `TemplateTail` is permitted.
    InputElementRegExpOrTemplateTail,

    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > Used in all syntactic grammar contexts where a
    /// > `RegularExpressionLiteral` is permitted but neither a
    /// > `TemplateMiddle`, nor a `TemplateTail` is permitted
    InputElementRegExp,

    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > Used in all syntactic grammar contexts where a `TemplateMiddle` or a
    /// > `TemplateTail` is permitted but a `RegularExpressionLiteral` is
    /// > not permitted
    InputElementTemplateTail,

    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > In all other contexts, [...] used as the lexical goal symbol.
    InputElementDiv
}

#[derive(Debug)]
pub struct Tokenizer {
    /// The currently selected lexical grammar goal symbol.
    current_goal: GoalSymbols,
}
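
A possible shape of the remaining entry point, assuming it borrows the rest of the input and returns the recognized input element together with the unparsed tail; Token is a placeholder and the per-goal bodies are intentionally left out:

pub struct Token;

impl Tokenizer {
    pub fn new(initial_goal: GoalSymbols) -> Self {
        Self { current_goal: initial_goal }
    }

    /// Called by the parser whenever its state requires another lexical goal.
    pub fn set_goal(&mut self, goal: GoalSymbols) {
        self.current_goal = goal;
    }

    /// Recognizes the next input element at the start of `text` and returns
    /// it together with the remaining, still unparsed text.
    pub fn get_next_symbol<'a>(&self, text: &'a str) -> Option<(Token, &'a str)> {
        match self.current_goal {
            GoalSymbols::InputElementHashbangOrRegExp => todo!(),
            GoalSymbols::InputElementRegExpOrTemplateTail => todo!(),
            GoalSymbols::InputElementRegExp => todo!(),
            GoalSymbols::InputElementTemplateTail => todo!(),
            GoalSymbols::InputElementDiv => todo!(),
        }
    }
}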

Grammar parameter implementation

For notation like Production[Param1, Param2] add the following comment into the code:

Notes on how specification features are mapped into YACC features:

  • grammatical parameters (like Nonterminal[Param1, Param2]) are implemented as special FecerBrowser_...KeywordUsage static semantics returning a boolean, one for each pair of a left-side parameter and its right-side usage
  • lookahead restriction
  • [No Line Terminator Here]
  • automatic semicolon insertion

Cover grammar

Also add a note on what a cover grammar is.

Todo list:

Replace pest with faster-pest

Faster-pest promises to be a faster drop-in replacement for pest:

Welcome to faster-pest, a high-performance code generator for Parsing Expression Grammars. faster-pest is an unofficial proc-macro providing next-level implementations of Pest parsers. It uses low-level optimization tricks under the hood to generate highly optimized code which minimizes the overhead of the AST recognition process, resulting in much faster parsing.

[...]

The parsing approach used under the hood has nothing in common with the original pest code. To be honest, I never looked at the pest codebase, because it was easier to start from scratch. There is still one thing that was not reimplemented: the parsing of the actual pest grammar. However, this might not last. I need to extend the grammar to enable more advanced tricks, like making it possible to define complex rules with Rust code and import them in a pest grammar.

[...]

Only a week after its creation, faster-pest already parses Json at 705% the speed of Pest and 137% the speed of Nom. This places faster-pest on par with serde_json. faster-pest allows you to approach limits that only SIMD-powered parsers can overcome.

Limitations:

  • Limited syntax support (Missing: stack, insens, pospred)
  • The tokens API of Pest is not supported (you probably didn't use that)
  • Error printing is made for Linux
  • Errors can be obscure when a repetition ends prematurely
  • Not everything has been tested and there could be incorrect parsing behavior

Attention: it is distributed under GPL-3.0, which would force us to abandon the MIT License. Probably, we can provide a faster-pest feature flag that replaces pest and makes the built binary GPL-only (see the sketch after the Cargo.toml diff below). Also, we should benchmark both variants.

To use faster-pest we need to change Cargo.toml as follows:

-pest = "2.7.10"
-pest_derive = "2.7.10"
+faster-pest = "0.1.4"
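
If we go with the optional feature flag instead, a minimal Cargo.toml sketch could look like this (feature names are placeholders):

[dependencies]
pest = { version = "2.7.10", optional = true }
pest_derive = { version = "2.7.10", optional = true }
faster-pest = { version = "0.1.4", optional = true }

[features]
default = ["pest-backend"]
pest-backend = ["dep:pest", "dep:pest_derive"]
# Enabling this backend makes the built binary GPL-3.0.
faster-pest-backend = ["dep:faster-pest"]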

This issue is a follow-up to gh-76.

Generalize tests by accepting expected tokens

Currently crate::tests::with_term assumes that all match_* functions return Option<((), &str)>. However, this is correct only for ignored characters, like spaces and comments. All other lexer brick functions return a struct instead of ().

So we need to prepare crate::tests::with_term for this by passing expected results via a parameter, as in the sketch below.
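
A minimal sketch of the generalized helper; the signature is an assumption rather than the current crate::tests::with_term:

fn with_term<T: std::fmt::Debug + PartialEq>(
    matcher: impl Fn(&str) -> Option<(T, &str)>,
    input: &str,
    expected: T,
) {
    let (token, rest) = matcher(input).expect("the term must match");
    assert_eq!(token, expected);
    assert_eq!(rest, "");
}

The existing () cases keep working unchanged because () already implements Debug and PartialEq.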

Parent issue: gh-16

Separate lexing/parsing and execution with serialized tree hints

While doing gh-9 and gh-113, I got the idea that pest-ast does not work well with saving parse results for caching. In rules with repetition it creates in-heap Vec objects that we would like to avoid for the sake of performance. What is the point of creating an array only to destroy it immediately after serialization?

As a result, I propose to replace pest-ast object tree generation with manual recurrent token parsing and in-place serialization.

Replace PEG with LALR

I found a way to simplify a manually written LALR parser to the level of the current PEG grammar file; a PR is pending along with some preparations.

This replaces both grammar.pest and lexical_grammar.pest.

Supersedes gh-131.

Add ability to integrate the runtime with the environment

Introduce runtime extensions: plugin modules that safely interface between outside C libraries and the engine.

Each module library provides a same-named .mjs file with an exported integrate(targetGlobalThis) function that is called for each created runtime. The function adds a class to globalThis.{engine}.runtime_extensions. For example, window.mjs for window.dll should add globalThis.{engine}.runtime_extensions.Window. The {engine} part is up for consideration.

We use an extra .mjs file because the extension library uses primitive C types and a few structures without taking ownership over them. Otherwise, there would be plenty of inconsistencies and leaks (like in CPython modules, where they sometimes surface and get fixed, especially leaks). So the .mjs wraps the primitive types as ECMAScript objects under the supervision of the runtime, using the same tested machinery as for built-in types.
To reach the library itself, the .mjs calls globalThis.{engine}.call_native(exported_id, args...).

Primitive types planned for native libraries: integers, reals, and UTF-8 strings. No Null, Nothing, objects, or callbacks; the .mjs interface creates and maintains them.

Successfully pass test262-parser-tests

ECMA Technical Committee 39 has a compiler-only conformance test suite, https://github.com/tc39/test262-parser-tests.

We need to successfully pass it before proceeding to script execution. For starters, we stick to tc39/test262-parser-tests@0e808c7.

Turn-on order and the corresponding parser features to be implemented:

  • pass directory (postpone error detection until the next stage)
  • pass-explicit (implement automatic semicolon insertion here)
  • fail (implement syntax error pinpointing)
  • early
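
A minimal sketch of the first stage (the pass directory only); the file layout and the parse entry point are assumptions about the future API:

use std::fs;

#[test]
fn passes_test262_pass_directory() {
    for entry in fs::read_dir("test262-parser-tests/pass").expect("suite checked out") {
        let path = entry.expect("readable directory entry").path();
        let source = fs::read_to_string(&path).expect("readable test file");
        // `parse` is a hypothetical top-level entry point. Error detection is
        // postponed to later stages, so every file here must simply parse.
        assert!(embedded_ecmascript::parse(&source).is_ok(), "{}", path.display());
    }
}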

Convert manual lexer bits to pest

Currently we implement each and every grammar production in https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar as an independent function that tries all right-side options one by one until one succeeds. For each right-side nonterminal it calls the corresponding function, implemented in the same fashion as the caller.

Examples from https://github.com/arhadthedev/embedded-ecmascript/blob/f99668a01053ef5828aaf34935a794d9337e022b/src/_tokenizer/space.rs:

pub fn match_line_terminator_sequence(text: &str) -> Option<((), &str)> {
    match_lf(text)
        .or_else(|| match_crlf(text)) // Try greedy match_crlf before match_cr
        .or_else(|| match_cr(text))
        .or_else(|| match_ls(text))
        .or_else(|| match_ps(text))
}

fn match_crlf(text: &str) -> Option<((), &str)> {
    text.strip_prefix("\u{000D}\u{000A}").map(|tail| ((), tail))
}

pub fn match_cr(text: &str) -> Option<((), &str)> {
    text.strip_prefix('\u{000D}').map(|tail| ((), tail))
}

However, such an approach calls str::strip_prefix repeatedly, as many times as it attempts to match the rules. This slowdown can be avoided with a dedicated parser generator like https://github.com/pest-parser/pest, as in the sketch below.
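
A minimal sketch of the same production expressed through pest; the grammar is inlined here only for illustration, and the real rules would live in a .pest file:

use pest::Parser;
use pest_derive::Parser;

#[derive(Parser)]
#[grammar_inline = r#"
LineTerminatorSequence = {
    "\u{000D}\u{000A}" | "\u{000A}" | "\u{000D}" | "\u{2028}" | "\u{2029}"
}
"#]
struct LexerSketch;

pub fn match_line_terminator_sequence(text: &str) -> Option<((), &str)> {
    // pest tries the alternatives as ordered choice, so the greedy CR LF
    // branch is listed before the lone CR branch and every prefix is
    // inspected only once.
    LexerSketch::parse(Rule::LineTerminatorSequence, text)
        .ok()
        .and_then(|mut pairs| pairs.next())
        .map(|pair| ((), &text[pair.as_str().len()..]))
}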

This issue is a supplement to gh-9, aimed at the already implemented grammar processing.
