parsica-php / parsica

Parsica - PHP Parser Combinators - The easiest way to build robust parsers.

Home Page: https://parsica-php.github.io/

License: MIT License

PHP 100.00%
parser parser-combinators php

parsica's People

Contributors

jawn, jeroenherczeg, mallardduck, mathiasverraes, matthiasnoback, sebastianbergmann, staabm, turanct

parsica's Issues

Add phpbench as dev-dependency

At the moment one needs to install it globally, because the Composer scripts expect it to be available. It is not declared as a requirement, though.

Beginner question - extending expression parser

Curious, I am trying to extend your expression example, by allowing variable names with "." separators:

        $identifier = atLeastOne(alphaChar());
        $identifierDot = many($identifier->optional()->append(char('.')));

This isn't working, so I'm clearly not understanding how this library works. What am I missing? I'm stuck trying to think of the problem in terms of a regex solution.

How would I parse / match a variable with arbitrary number of '.' separators? And have that returned to the AST as a complete string, not further parsed.

Any help much appreciated :)

Where to look for potential JSON parser improvements

1. whitespace parser

zeroOrMore(satisfy(isCharCode([0x20, 0x0A, 0x0D, 0x09])))

It's potentially faster to do takeWhile on the stream with the same predicate, skipping both the zeroOrMore combinator and satisfy.

https://github.com/mathiasverraes/parsica/blob/main/src/JSON/JSON.php#L154-L158
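
To illustrate the idea of consuming a whole run of whitespace in one step rather than character by character, here is a minimal sketch in plain PHP. It does not use Parsica's Stream API: skipJsonWhitespace is a hypothetical helper, and it leans on PHP's built-in strspn as a stand-in for a single takeWhile call. Whether the equivalent change is measurably faster inside Parsica would still need benchmarking.

```php
<?php
// Hypothetical helper, not part of Parsica: returns the offset of the first
// non-whitespace byte at or after $offset.
function skipJsonWhitespace(string $input, int $offset = 0): int
{
    // The four JSON whitespace characters: space, LF, CR, tab.
    return $offset + strspn($input, "\x20\x0A\x0D\x09", $offset);
}

$input = " \n\t {\"a\": 1}";
$start = skipJsonWhitespace($input);
echo substr($input, $start), PHP_EOL; // prints {"a": 1}
```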

2. between

We rely heavily on between, which is based on sequence and bind. If we could find even the tiniest speed improvement in those, the parser would likely become a lot faster.

https://github.com/mathiasverraes/parsica/blob/main/src/JSON/JSON.php#L96-L103

3. sepBy

We rely on this function often in the JSON parser too. It is built in terms of sepBy1, which is written in a "readable" way but not a particularly efficient one:

function sepBy1(Parser $separator, Parser $parser): Parser
{
    $prepend = fn($x) => fn(array $xs): array => array_merge([$x], $xs);
    $label = $parser->getLabel() . ", separated by " . $separator->getLabel();
    return pure($prepend)->apply($parser)->apply(many($separator->sequence($parser)))->label($label);
}
  • The prepend uses array_merge to prepend a single element; array_unshift could probably be faster.
  • Although the applicative is really readable here, it's also a complex operation under the hood. As with sequence and bind, the tiniest improvement here would probably make a big difference.

https://github.com/mathiasverraes/parsica/blob/main/src/JSON/JSON.php#L99-L102
https://github.com/mathiasverraes/parsica/blob/main/src/combinators.php#L513-L519
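
As a small sanity check for the first bullet above, the sketch below compares the existing array_merge prepend with an array_unshift variant in plain PHP. The names $prependMerge and $prependUnshift are made up for this example, and the sketch only shows that both produce the same list; it makes no claim about which is faster, which would have to be measured (for instance with phpbench).

```php
<?php
// The prepend helper as written in sepBy1: builds a one-element array and merges.
$prependMerge = fn($x) => fn(array $xs): array => array_merge([$x], $xs);

// Candidate alternative: array_unshift operates on the function's local copy of $xs.
$prependUnshift = fn($x) => function (array $xs) use ($x): array {
    array_unshift($xs, $x);
    return $xs;
};

// Both variants yield the same list for list-style (0-indexed) arrays.
var_dump($prependMerge('a')(['b', 'c']) === $prependUnshift('a')(['b', 'c'])); // bool(true)
```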

How to parse annotations

Hi! ✋ This library looks great and I would like to take a closer look.

Let's say I would like to practice on annotations.

/**
* @foo Simple line
* @bar Advanced line with !@#$%^&*()
* @baz Multi line
*            with extra line
*/

Could you please point me how to start with it? :-)

Thank you, Felix

AST validation [Question]

Hi there. Thanks for great project!

I have a question. I need to validate the AST that I build inside map(...) functions, but I don't know how to keep track of the current position while generating AST nodes so that I can print meaningful messages. Is it possible to pass the current position to the map() function so it can be saved into the AST node?

User-space Stream implementations and TakeResult

The main point of this Issue is to seek understanding of the overall intentions of the Stream interface.
I started toying with the library recently and thought it would be cool to implement a TextFileStream.
After looking over the StringStream it seemed pretty easy to get worked out using fopen and PHP's stream functions.

However, I ran into a few odd issues and wanted to understand whether I was doing it wrong, or whether this is an area of improvement for the library. Those issues were:

  1. When defining my own Stream I found TakeResult was marked as internal.
  2. Using my own defined stream seems to yield issues with EOF.
  3. In my file based stream characters get eaten and lost.

Copy of TextFileParser: https://gist.github.com/mallardduck/dd2dab36d0713e5373583e74f2156381

user defined Streams and TakeResult

While I was able to get it working, I did notice along the way that TakeResult is type hinted as a return type.
However, it is also marked as @internal, so PhpStorm shows it crossed out, indicating it shouldn't be used.

So that got me wondering two things: a) should this just be safely ignored or maybe the @internal tag changed, and b) should a consumer of this library expect to be able to define their own streams at all? Or maybe just not yet?


UPDATE: After some further testing and examining the Stream I wrote, I found my flaw.
The root issue turned out to be twofold: one part was caused by me, and the other by overlooking how the file handle works.
The one I caused was that I was overwriting the Position property, then setting that on the new stream.

So something like:

$this->position = $this->position->advance($chunk);
...
new TextFileStream($this->filePath, $this->position)

That was fixed by leaving the property alone, assigning to the temp variable like StringStream does and then using that.

The second issue was that I was unknowingly letting the file handle's active position drift.
To fix that, I started explicitly setting the position on the file resource right before taking anything from it.
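
The two fixes described above can be sketched in plain PHP as follows. This is an illustration under assumptions, not the TextFileStream from the gist: SeekingReader is a hypothetical class that does not implement Parsica's Stream interface, and it tracks a plain byte offset instead of a Position object. It leaves its own state untouched when taking input, and it seeks the shared file handle to its own offset before every read.

```php
<?php
// Hypothetical class for illustration; it does not implement Parsica's Stream interface.
final class SeekingReader
{
    /** @var resource */
    private $handle;
    private int $offset;

    /** @param resource $handle a seekable file handle */
    public function __construct($handle, int $offset = 0)
    {
        $this->handle = $handle;
        $this->offset = $offset;
    }

    /** @return array{0: string, 1: SeekingReader} the chunk and the advanced reader */
    public function take(int $n): array
    {
        // The handle's cursor is shared state, so reset it to our own offset first.
        fseek($this->handle, $this->offset);
        $chunk = $n > 0 ? (string) fread($this->handle, $n) : '';

        // Leave $this untouched; the new offset lives only in the returned reader.
        return [$chunk, new self($this->handle, $this->offset + strlen($chunk))];
    }
}

$handle = fopen('php://memory', 'w+');
fwrite($handle, 'abcdef');

$start = new SeekingReader($handle);
[$first, $next] = $start->take(2);
[$second] = $next->take(2);
[$again] = $start->take(3); // re-reading from the original reader still works

echo "$first $second $again", PHP_EOL; // prints "ab cd abc"
```

Because every reader seeks before it reads, a parser that backtracks to an earlier reader sees the same input again, which is the behaviour the drifting file position was breaking.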

Library maintenance

I am sorry to read about the passing of the maintainer of this library.

I just recently learned about this library and think it's definitely worth keeping alive in one form or another. mathiasverraes (intentionally not yet pinging you), you mentioned there is no maintainer; what kind of maintenance are you looking for exactly?
Are you able / willing to review PRs, or just able to hand over appropriate permissions to a new maintainer or new maintainer(s)?

From what I can see the library is well tested, and merging at least some of the PRs should be pretty safe. While I don't have a lot of time to maintain the library, I am interested in helping out. The first things I would do:

  • Update CI to enforce code style (ecs)
  • Update CI to enforce commit messages
  • Update CI to automate releases based on commit messages
  • Set up git hooks (captainhook) to automate the above for contributors

These steps should make it relatively painless to accept small iterative improvements.

If people want to (help) in maintaining this library please respond here, if mathias doesn't respond in a few days I'll consider pinging him and hopefully he can guide us forward a bit.

Ternary operator

Wanted to move the discussion from Twitter to here. I looked at the source to evaluate making a PR, and processed through the docs and was struggling with a clear path for multiple reasons.

Firstly, ternary is almost always a ? b : c, so do you implement with the expectation of those specific tokens? This is a minor question, and the answer is probably, no, don't expect specific tokens, the user provides those.

Secondly, all the current operators work with one symbol, not two, and a ternary needs two or more symbols, so all the Verraes\Parsica\Expression\*Assoc classes would probably need to change? Or would there instead be a new ExpressionType?

Feature idea: collect a map instead of a list

Since I use collect() to capture values that are to be fed into an AST node, it would be very helpful to use a map instead of a list. Similarly to how you can capture named groups with regular expressions. Here's an example where I parse a Markdown heading:

return collect(
            keepFirst(atLeastOne(char('#')), skipSpace1()),
            atLeastOne(satisfy(fn (string $char) => ! in_array($char, ["\n"], true))),
            self::newLineOrEof()
        )->map(fn (array $output) => new Heading(strlen($output[1]), $output[2], $output[0]));

Suggested improvement (last line)

return collect(
            keepFirst(atLeastOne(char('#')), skipSpace1()),
            atLeastOne(satisfy(fn (string $char) => ! in_array($char, ["\n"], true))),
            self::newLineOrEof()
        )->map(fn (array $output) => new Heading(strlen($output['level']), $output['title']));

One way could be to use label() for each collected parser and use the label as the array key passed to map():

return collect(
            keepFirst(atLeastOne(char('#')), skipSpace1())->label('level'),
            atLeastOne(satisfy(fn (string $char) => ! in_array($char, ["\n"], true)))->label('title'),
            self::newLineOrEof()
        )->map(fn (array $output) => new Heading(strlen($output['level']), $output['title']));

The advantage being that label() is already there and used for a similar purpose. However, this might lead to values being overwritten if you collect two parsers with the same name. Another option is to provide the map upfront:

return collect([
            'level' => keepFirst(atLeastOne(char('#')), skipSpace1()),
            'title' => atLeastOne(satisfy(fn (string $char) => ! in_array($char, ["\n"], true))),
            'eol' => self::newLineOrEof()
        ])->map(fn (array $output) => new Heading(strlen($output['level']), $output['title']));

It might be useful to have separate collectList() and collectMap() functions by the way.
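
Until something like collectMap() exists, a similar effect may be achievable in user code by naming the positional outputs before building the AST node. The sketch below is plain PHP and runs without Parsica: named is a hypothetical helper, not part of the library, and the $output array is a stand-in for the list that collect() would pass to map().

```php
<?php
// Hypothetical helper: turns a positional output list into a keyed array.
function named(string ...$keys): callable
{
    return fn(array $output): array => array_combine($keys, $output);
}

// Stand-in for what collect() would hand to map() for the input "## Intro\n".
$output = ['##', 'Intro', "\n"];
$fields = named('level', 'title', 'eol')($output);

echo strlen($fields['level']), ' ', $fields['title'], PHP_EOL; // prints "2 Intro"
```

With a helper like this, the callback given to map() could read $fields['level'] and $fields['title'] instead of numeric indexes, at the cost of keeping the key list in sync with the collected parsers by hand.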

Unique keys in JSON-object

The current JSON parser in Parsica has the same behaviour as json_decode regarding non-unique keys in a JSON object: it simply overwrites the value with the value of the last occurrence of that key:

$JSON = '{"key1":"value1","key2":"value2a","key2":"value2b","key3":"value3"}';
$object = json_decode($JSON);
var_dump($object);

gives:

object(stdClass)[1]
  public 'key1' => string 'value1' (length=6)
  public 'key2' => string 'value2b' (length=7)
  public 'key3' => string 'value3' (length=6)

PHP uses RFC 7159 for its JSON-definition, which states in section 4: "The names within an object SHOULD be unique" ("names" == "keys"). If you want to use the parser to check the validity of the JSON-input, then it should give a warning or error. PHP's json_decode doesn't do that; it just overwrites the value.

In the original JSON-definition RFC 4627 object-keys don't need to be unique (although that doesn't make much sense to me). Checking for unique keys makes the JSON-definition context-sensitive. See this posting on Stackoverflow.

In Parsica the key-value-pairs are sequentially written in an associative array, which is then cast in an object.

Parsing context-free languages is generally simpler than context-sensitive languages. Context-sensitivity cannot be expressed in EBNF / ABNF grammars. But context-sensitivity is a necessary property for a Turing-complete language.
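
One way to keep the grammar itself context-free is to run the uniqueness check as a separate pass over the parsed key-value pairs, before they are folded into an associative array. The sketch below is plain PHP under that assumption: duplicateKeys is a hypothetical helper, and it does not describe how Parsica's JSON parser is actually implemented.

```php
<?php
// Hypothetical helper: $pairs is a list of [key, value] pairs in source order,
// i.e. the members of one JSON object before they are folded into an array.
function duplicateKeys(array $pairs): array
{
    $seen = [];
    $duplicates = [];
    foreach ($pairs as [$key, $value]) {
        // Report each repeated key once, the first time it repeats.
        if (isset($seen[$key]) && !in_array($key, $duplicates, true)) {
            $duplicates[] = $key;
        }
        $seen[$key] = true;
    }
    return $duplicates;
}

$pairs = [
    ['key1', 'value1'],
    ['key2', 'value2a'],
    ['key2', 'value2b'],
    ['key3', 'value3'],
];
echo implode(', ', duplicateKeys($pairs)), PHP_EOL; // prints "key2"
```

A caller could then decide whether a non-empty result should be a warning or a hard parse failure.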

How to fail expression parser if no operator in input

I have a simple math expression defined like this:

$arithmeticExpression->recurse(expression(
    $parens($arithmeticExpression)->or($token($term)),
    [
        leftAssoc(
            binaryOperator($token(string("-")), fn($l, $r) => new MinusOperator($l, $r))
        ),
        leftAssoc(
            binaryOperator($token(string("+")), fn($l, $r) => new PlusOperator($l, $r))
        ),
    ]
));

But this parser also succeeds when only a term is provided. For example, it parses inputs like
123 + 321
and just
123

How can I prevent it from succeeding when the input contains no operator?

Efficiently parse ordered file paths to build a tree

Reported on Twitter https://twitter.com/Wasquen/status/1276906414833758208?s=20

Awesome! Good job!
I'm working on a small library using Parsica I'd like to open source.

I'm facing some performance issues.

How would you efficiently parse ordered file paths to build a tree?

[
  "/a/b/c/file1",
  "/a/b/c/file2",
  "/a/b/c/file3",
  "/a/b/file4"
]

I'd like to avoid parsing "/a/b/c/" many times.

It's hard to do anything dodgy because I don't have access to remainder/input.
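
One possible direction, sketched here in plain PHP rather than with Parsica combinators, is to remember the tree node of the previous path's directory and only walk the directory segments again when the directory changes. buildTree is a hypothetical function, and the sketch assumes absolute paths that start with "/" and arrive sorted, as in the example above. It caches only the most recent directory, so a path such as "/a/b/file4" still triggers one fresh walk from the root.

```php
<?php
// Hypothetical sketch: builds a nested array from sorted absolute paths.
// Directories become arrays; files become keys with a null value.
function buildTree(array $sortedPaths): array
{
    $tree = [];
    $lastDir = null;
    $dirNode = null;

    foreach ($sortedPaths as $path) {
        $cut = strrpos($path, '/');
        $dir = substr($path, 0, $cut);
        $file = substr($path, $cut + 1);

        // Walk the directory segments only when the directory differs
        // from the one used by the previous path.
        if ($dir !== $lastDir) {
            unset($dirNode);
            $dirNode = &$tree;
            $segments = $dir === '' ? [] : explode('/', ltrim($dir, '/'));
            foreach ($segments as $segment) {
                if (!isset($dirNode[$segment])) {
                    $dirNode[$segment] = [];
                }
                $dirNode = &$dirNode[$segment];
            }
            $lastDir = $dir;
        }

        $dirNode[$file] = null;
    }

    unset($dirNode);
    return $tree;
}

print_r(buildTree(['/a/b/c/file1', '/a/b/c/file2', '/a/b/c/file3', '/a/b/file4']));
```

For the four example paths, the segments of "/a/b/c" are split once and reused for file2 and file3.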

Learning Parsica

In response to @mathiasverraes's tweet:

I think the documentation is very useful and developer-friendly already. I thought of a couple of things that might be improved:

  • For most developers I think the closest thing to parsing that they already know and understand is regular expressions. At times, as a user of this library I felt like "if I could just use a regex, I'd be done already". Still, I wanted to do the right thing, and I kept looking for similar concepts and found them, like atLeastOne, zeroOrMore, between. In terms of documentation, I think it would be really helpful if there would be a page that shows some simple regexes and then shows and explains the alternative using parsers.
  • I think it would be interesting to have a chapter that shows alternatives for different parsers that achieve the same thing, or how some parsers are shortcuts for others.
  • The parts of the documentation that say "TODO" should be removed because they aren't useful for the reader, only for the writer/maintainer.
  • The JSON parser is a great source for learning material, and I think it may deserve a full explanation in the documentation.
  • It's very confusing that some parsers are marked as @deprecated; they are not in fact deprecated, it just seems they are not tested. As a user of the library it makes you feel somewhat bad about using these parsers, even though there's no alternative.

Great work on the type annotations by the way, those are really helpful, and I think it's amazing that this is now possible in PHP today.

Thanks a lot for this awesome library! It was a nice thing to dive in and get to know.
