parsica-php / parsica
Parsica - PHP Parser Combinators - The easiest way to build robust parsers.
Home Page: https://parsica-php.github.io/
License: MIT License
I am sorry to read about the passing of the maintainer of this library.
I just recently learned about this library and think it's definitely worth keeping alive in one form or another. mathiasverraes (intentionally not yet pinging you), you mentioned there is no maintainer; what kind of maintenance are you looking for exactly?
Are you able / willing to review PRs, or just able to hand over appropriate permissions to one or more new maintainers?
From what I can see the library is well tested and merging in at least some of the PRs should be pretty safe. While I don't have a lot of time to maintain the library, I am interested in helping out. The first things I would do:
These steps should make it relatively painless to accept small iterative improvements.
If people want to help maintain this library, please respond here. If Mathias doesn't respond in a few days, I'll consider pinging him, and hopefully he can guide us forward a bit.
The current JSON parser in Parsica has the same behaviour as json_decode regarding non-unique keys in a JSON object: it just overwrites the value with the value of the last occurrence of that key:
$JSON = '{"key1":"value1","key2":"value2a","key2":"value2b","key3":"value3"}';
$object = json_decode($JSON);
var_dump($object);
gives:
object(stdClass)[1]
public 'key1' => string 'value1' (length=6)
public 'key2' => string 'value2b' (length=7)
public 'key3' => string 'value3' (length=6)
PHP uses RFC 7159 for its JSON-definition, which states in section 4: "The names within an object SHOULD be unique" ("names" == "keys"). If you want to use the parser to check the validity of the JSON-input, then it should give a warning or error. PHP's json_decode doesn't do that; it just overwrites the value.
In the original JSON definition, RFC 4627, object keys don't need to be unique (although that doesn't make much sense to me). Checking for unique keys makes the JSON definition context-sensitive. See this posting on Stack Overflow.
In Parsica the key-value pairs are written sequentially into an associative array, which is then cast to an object.
Parsing context-free languages is generally simpler than context-sensitive languages. Context-sensitivity cannot be expressed in EBNF / ABNF grammars. But context-sensitivity is a necessary property for a Turing-complete language.
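The overwrite happens at the step where pairs are written into the associative array, so that is one place a uniqueness check could live. The following is a minimal sketch in plain PHP, assuming the parser has already produced a list of [key, value] pairs; `pairsToObject` is a hypothetical helper for illustration, not a function in Parsica.

```php
<?php
/**
 * Hypothetical helper (not part of Parsica): turn a list of [key, value]
 * pairs into an object, rejecting duplicate keys instead of overwriting.
 *
 * @param array $pairs list of two-element arrays: [key, value]
 */
function pairsToObject(array $pairs): stdClass
{
    $members = [];
    foreach ($pairs as [$key, $value]) {
        // array_key_exists also catches keys whose earlier value was null.
        if (array_key_exists($key, $members)) {
            throw new InvalidArgumentException("Duplicate key: $key");
        }
        $members[$key] = $value;
    }
    return (object) $members;
}

// The example from above, as pairs: the second "key2" is rejected.
try {
    pairsToObject([
        ['key1', 'value1'],
        ['key2', 'value2a'],
        ['key2', 'value2b'],
        ['key3', 'value3'],
    ]);
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

Whether this should be an error, a warning, or an opt-in strict mode is a design choice; RFC 7159 only says SHOULD, so rejecting by default would be stricter than json_decode.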
Taken from The Zen of Python https://peps.python.org/pep-0020/ ;)
I think there are a few too many aliases ^^
In response to @mathiasverraes' tweet here:
I think the documentation is very useful and developer-friendly already. I thought of a couple of things that might be improved:
atLeastOne, zeroOrMore, between. In terms of documentation, I think it would be really helpful if there were a page that shows some simple regexes and then shows and explains the alternative using parsers.
@deprecated; they are not in fact deprecated, it just seems they are not tested. As a user of the library it makes you feel somewhat bad about using these parsers, even though there's no alternative.
Great work on the type annotations by the way, those are really helpful, and I think it's amazing that this is now possible in PHP today.
Thanks a lot for this awesome library! It was a nice thing to dive in and get to know.
I have a simple math expression defined like this:
$arithmeticExpression->recurse(expression(
    $parens($arithmeticExpression)->or($token($term)),
    [
        leftAssoc(
            binaryOperator($token(string("-")), fn($l, $r) => new MinusOperator($l, $r))
        ),
        leftAssoc(
            binaryOperator($token(string("+")), fn($l, $r) => new PlusOperator($l, $r))
        ),
    ]
));
But this parser also succeeds when only a term is provided. For example, it parses inputs like
123 + 321
and just
123
How can I prevent it from succeeding when there is no operator?
Since I use collect() to capture values that are to be fed into an AST node, it would be very helpful to use a map instead of a list, similar to how you can capture named groups with regular expressions. Here's an example where I parse a Markdown heading:
return collect(
keepFirst(atLeastOne(char('#')), skipSpace1()),
atLeastOne(satisfy(fn (string $char) => ! in_array($char, ["\n"], true))),
self::newLineOrEof()
)->map(fn (array $output) => new Heading(strlen($output[1]), $output[2], $output[0]));
Suggested improvement (last line)
return collect(
keepFirst(atLeastOne(char('#')), skipSpace1()),
atLeastOne(satisfy(fn (string $char) => ! in_array($char, ["\n"], true))),
self::newLineOrEof()
)->map(fn (array $output) => new Heading(strlen($output['level']), $output['title']));
One way could be to use label() for each collected parser and use the label as the array key passed to map():
return collect(
keepFirst(atLeastOne(char('#')), skipSpace1())->label('level'),
atLeastOne(satisfy(fn (string $char) => ! in_array($char, ["\n"], true)))->label('title'),
self::newLineOrEof()
)->map(fn (array $output) => new Heading(strlen($output['level']), $output['title']));
The advantage being that label() is already there and used for a similar purpose. However, this might lead to values being overwritten if you collect two parsers with the same name. Another option is to provide the map upfront:
return collect([
'level' => keepFirst(atLeastOne(char('#')), skipSpace1()),
'title' => atLeastOne(satisfy(fn (string $char) => ! in_array($char, ["\n"], true))),
'eol' => self::newLineOrEof()
])->map(fn (array $output) => new Heading(strlen($output['level']), $output['title']));
It might be useful to have separate collectList() and collectMap() functions, by the way.
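Until something like that exists, the names can be attached after the fact inside map(). This is a minimal sketch in plain PHP; `nameOutputs` is a hypothetical helper, not part of Parsica, and `$output` is a stand-in for the positional array that collect() would pass to map().

```php
<?php
/**
 * Hypothetical helper (not part of Parsica): give names to the positional
 * output of a collect()-style parser.
 *
 * @param array $names  one name per collected parser, in order
 * @param array $values the positional output, in the same order
 */
function nameOutputs(array $names, array $values): array
{
    if (count($names) !== count($values)) {
        throw new InvalidArgumentException('Expected exactly one name per collected value.');
    }
    // Note: a repeated name silently keeps only the last value.
    return array_combine($names, $values);
}

// Stand-in for the heading example: level markers, title, end of line.
$output = ['##', 'Some title', "\n"];
$named = nameOutputs(['level', 'title', 'eol'], $output);

echo strlen($named['level']), ' ', $named['title'], "\n";
```

This keeps the same overwrite caveat as the label() idea: two parsers with the same name collapse into one entry.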
Hi there. Thanks for a great project!
I have a question. I need to validate the AST which I build inside map(...) functions, but I don't know how to keep track of the current position while generating AST nodes, so that I can print meaningful messages. Is it possible to pass the current position to the map() function so it can be saved into the AST node?
Wanted to move the discussion from Twitter to here. I looked at the source to evaluate making a PR, went through the docs, and was struggling to find a clear path for multiple reasons.
Firstly, ternary is almost always a ? b : c, so do you implement with the expectation of those specific tokens? This is a minor question, and the answer is probably: no, don't expect specific tokens, the user provides those.
Secondly, all the current operators work with one symbol, not two, and you need two or more symbols, so all the Verraes\Parsica\Expression\*Assoc classes would probably need to change? Or would there instead be a new ExpressionType?
Hi! This library looks great and I would like to take a closer look.
Let's say I would like to practice on annotations.
/**
* @foo Simple line
* @bar Advanced line with !@#$%^&*()
* @baz Multi line
* with extra line
*/
Could you please point me to how to start with it? :-)
Thank you, Felix
Curious, I am trying to extend your expression example, by allowing variable names with "." separators:
$identifier = atLeastOne(alphaChar());
$identifierDot = many($identifier->optional()->append(char('.')));
This isn't working, so I'm clearly not understanding how this library works. What am I missing? I'm stuck trying to think of the problem in terms of a regex solution.
How would I parse / match a variable with arbitrary number of '.' separators? And have that returned to the AST as a complete string, not further parsed.
Any help much appreciated :)
zeroOrMore(satisfy(isCharCode([0x20, 0x0A, 0x0D, 0x09])))
It's potentially faster to do takeWhile on the stream with the same predicate, skipping the zeroOrMore combinator and the satisfy.
https://github.com/mathiasverraes/parsica/blob/main/src/JSON/JSON.php#L154-L158
We rely heavily on between, which is based on sequence and bind. If we could find the tiniest speed improvement in those, we would make the parser a lot faster.
https://github.com/mathiasverraes/parsica/blob/main/src/JSON/JSON.php#L96-L103
We rely on this function often in the JSON parser too. It is built in terms of sepBy1, which is written in a "readable" way, but not a really efficient way:
function sepBy1(Parser $separator, Parser $parser): Parser
{
    $prepend = fn($x) => fn(array $xs): array => array_merge([$x], $xs);
    $label = $parser->getLabel() . ", separated by " . $separator->getLabel();
    return pure($prepend)->apply($parser)->apply(many($separator->sequence($parser)))->label($label);
}
array_merge to prepend a single element could probably be faster with array_unshift
sequence and bind: the tiniest improvement here would probably make a big difference
https://github.com/mathiasverraes/parsica/blob/main/src/JSON/JSON.php#L99-L102
https://github.com/mathiasverraes/parsica/blob/main/src/combinators.php#L513-L519
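Whether array_unshift actually beats array_merge here would need measuring. Below is a small sketch in plain PHP (no Parsica involved) that compares three equivalent ways to prepend one element to a list; the function names and the `timeIt` helper are made up for this illustration, and the timings it prints will differ per PHP version and machine.

```php
<?php
// Three ways to put one element in front of a list.
function prependWithMerge($x, array $xs): array
{
    return array_merge([$x], $xs);
}

function prependWithUnshift($x, array $xs): array
{
    // $xs is received by value, so the caller's array is left untouched.
    array_unshift($xs, $x);
    return $xs;
}

function prependWithSpread($x, array $xs): array
{
    // Spread in array expressions needs PHP 7.4+ (list-style arrays only before 8.1).
    return [$x, ...$xs];
}

// Rough timing helper: runs the given prepend function many times
// on a 100-element list and returns elapsed milliseconds.
function timeIt(callable $prepend, int $iterations = 10000): float
{
    $xs = range(1, 100);
    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $prepend(0, $xs);
    }
    return (hrtime(true) - $start) / 1e6;
}

foreach (['prependWithMerge', 'prependWithUnshift', 'prependWithSpread'] as $name) {
    printf("%s: %.2f ms\n", $name, timeIt($name));
}
```

All three copy the list, so any gain is a constant factor at best; the sequence and bind overhead mentioned above may well dominate.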
Reported on Twitter https://twitter.com/Wasquen/status/1276906414833758208?s=20
Awesome! Good job!
I'm working on a small library using Parsica that I'd like to open source. I'm facing some performance issues.
How would you efficiently parse ordered file paths to build a tree?
[
"/a/b/c/file1",
"/a/b/c/file2",
"/a/b/c/file3",
"/a/b/file4"
]
I'd like to avoid parsing "/a/b/c/" many times.
It's hard to do anything dodgy because I don't have access to remainder/input.
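One option that sidesteps the parser for the shared prefix: because the paths are ordered, the directory part only needs to be split again when it differs from the previous path's directory. This is a rough sketch in plain PHP under that assumption; `buildTree` is a hypothetical function, not something Parsica provides, and it may not fit how the library in question is structured.

```php
<?php
/**
 * Hypothetical sketch: build a nested tree from ordered path strings,
 * reusing the node of the previous directory when it is unchanged.
 * Directories become arrays, files become null entries.
 *
 * @param array $paths ordered paths such as "/a/b/c/file1"
 */
function buildTree(array $paths): array
{
    $tree = [];
    $lastDir = null; // directory part of the previous path
    $node = null;    // will reference the tree node for $lastDir

    foreach ($paths as $path) {
        $slash = strrpos($path, '/');
        $dir = $slash === false ? '' : substr($path, 0, $slash);
        $file = $slash === false ? $path : substr($path, $slash + 1);

        if ($dir !== $lastDir) {
            // Only split and walk the directory when it changed.
            $node = &$tree;
            foreach (explode('/', $dir) as $segment) {
                if ($segment === '') {
                    continue; // skip the empty piece before the leading slash
                }
                if (!isset($node[$segment])) {
                    $node[$segment] = [];
                }
                $node = &$node[$segment];
            }
            $lastDir = $dir;
        }

        $node[$file] = null;
    }

    return $tree;
}

print_r(buildTree([
    '/a/b/c/file1',
    '/a/b/c/file2',
    '/a/b/c/file3',
    '/a/b/file4',
]));
```

With the four example paths, "/a/b/c" is split once and "/a/b" once; the remaining two files reuse the cached node.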
At the moment one needs to install it globally, as the composer scripts expect it to be available. It's not declared as required, though.
The main point of this Issue is to seek understanding of the overall intentions of the Stream interface.
I started toying with the library recently and thought it would be cool to implement a TextFileStream.
After looking over the StringStream, it seemed pretty easy to work out using fopen and PHP's stream functions.
However, I ran into a few odd issues and wanted to understand if I was doing it wrong, or if this was an area of improvement for the library. Those issues were:
TakeResult was marked as internal.
Copy of TextFileParser: https://gist.github.com/mallardduck/dd2dab36d0713e5373583e74f2156381
While I was able to get it working, I did notice along the way that TakeResult is type-hinted as a return type.
However, it's also marked as @internal, so in PHPStorm it shows as crossed out, indicating it shouldn't be used.
So that got me wondering two things: a) should this just be safely ignored, or maybe the @internal tag changed, and b) should a consumer of this library expect to be able to define their own streams at all? Or maybe just not yet?
UPDATE: After some further testing and examining the Stream I wrote, I found my flaw.
The root issues were twofold in the end: one caused by me, and one by overlooking how the file socket works.
The one I caused was that I was overwriting the Position, then setting that on the new one.
So something like:
$this->position = $this->position->advance($chunk);
...
new TextFileStream($this->filePath, $this->position)
That was fixed by leaving the property alone, assigning to a temp variable like StringStream does, and then using that.
The second issue was that I was unknowingly letting the file socket's active position drift.
To fix that, I started properly setting the position on the file resource right before taking anything from it.
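The second fix can be sketched in plain PHP. The idea is that every read seeks to an explicit offset first, so the result never depends on where an earlier read left the shared file pointer. `readAt` is a hypothetical helper, and tracking a byte offset alongside the stream's position is an assumption made for this illustration, not something taken from Parsica's Stream interface.

```php
<?php
/**
 * Hypothetical helper: read $length bytes starting at an explicit byte
 * offset, without relying on the handle's current pointer.
 *
 * @param resource $handle a seekable stream opened for reading
 */
function readAt($handle, int $offset, int $length): string
{
    if ($length === 0) {
        return '';
    }
    // Reset the pointer before every read so earlier reads cannot cause drift.
    if (fseek($handle, $offset) !== 0) {
        throw new RuntimeException("Cannot seek to offset $offset");
    }
    $chunk = fread($handle, $length);
    if ($chunk === false) {
        throw new RuntimeException("Cannot read at offset $offset");
    }
    return $chunk;
}

// An in-memory stream stands in for the text file.
$handle = fopen('php://memory', 'w+');
fwrite($handle, 'hello world');

echo readAt($handle, 6, 5), "\n"; // read from the middle first
echo readAt($handle, 0, 5), "\n"; // then from the start: the order does not matter
```

Several immutable stream objects can then share one file resource, since each one carries its own offset instead of trusting the pointer.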