ruby / prism Goto Github PK

Prism Ruby parser

Home Page: https://ruby.github.io/prism/

License: MIT License

Ruby 31.51% C 61.52% Makefile 0.27% Java 1.71% Dockerfile 0.05% Shell 0.12% Rust 3.20% HTML 0.53% JavaScript 1.06% C++ 0.02%

prism's Introduction

Prism Ruby parser

This is a parser for the Ruby programming language. It is designed to be portable, error tolerant, and maintainable. It is written in C99 and has no dependencies.

Overview

The repository contains the infrastructure for both a shared library (libprism) and a native CRuby extension. The shared library has no bindings to CRuby itself, and so can be used by other projects. The native CRuby extension links against ruby.h, and so is suitable in the context of CRuby.

.
├── Makefile              configuration to compile the shared library and native tests
├── Rakefile              configuration to compile the native extension and run the Ruby tests
├── bin
│   ├── lex               runs the lexer on a file or string, prints the tokens, and compares to ripper
│   ├── parse             runs the parse on a file or string and prints the AST
│   └── prism             a CLI for development and debugging
├── config.yml            specification for tokens and nodes in the tree
├── doc                   documentation website
├── docs                  markdown documentation about the project
├── ext
│   └── prism
│       ├── extconf.rb    configuration to generate the Makefile for the native extension
│       └── extension.c   the native extension that interacts with libprism
├── fuzz                  files related to fuzz testing
├── gemfiles              gemfiles used by different Ruby versions in CI
├── include
│   ├── prism             header files for the shared library
│   └── prism.h           main header file for the shared library
├── java                  Java bindings for the shared library
├── java-wasm             Java WASM bindings for the shared library
├── javascript            JavaScript WASM bindings for the shared library
├── lib
│   ├── prism             Ruby library files
│   └── prism.rb          main entrypoint for the Ruby library
├── rakelib               various Rake tasks for the project
├── rbi                   RBI type signatures for the Ruby library
├── rust
│   ├── ruby-prism        Rustified crate for the shared library
│   └── ruby-prism-sys    FFI binding for Rust
├── sig                   RBS type signatures for the Ruby library
├── src
│   ├── util              various utility files
│   └── prism.c           main entrypoint for the shared library
├── templates             contains ERB templates generated by templates/template.rb
│   └── template.rb       generates code from the nodes and tokens configured by config.yml
└── test
    └── prism
        ├── fixtures      Ruby code used for testing
        └── snapshots     snapshots of generated syntax trees corresponding to fixtures

Getting started

To compile the shared library, you will need:

C99 compiler
GNU make
Ruby 2.7.0 or later

Once you have these dependencies, run:

bundle install

to fetch the Ruby dependencies. Finally, run:

bundle exec rake compile

to compile the shared library. It will be built in the build directory. To test that everything is working, run:

bin/parse -e "1 + 2"

to see the syntax tree for the expression 1 + 2.

Contributing

See the CONTRIBUTING.md file for more information. We additionally have documentation about the overall design of the project as well as various subtopics.

Examples

Prism has been integrated into the majority of Ruby runtimes, many libraries, and some applications. Below is a list of some of the projects that use Prism:

Runtimes

CRuby (via C)
Garnet (via WASM)
JRuby (via Java)
Natalie (via C++ and Ruby)
Opal (via Ruby and WASM)
TruffleRuby (via Java)

Libraries

Applications

gem.sh

prism's People

Contributors

Stargazers

Watchers

Forkers

clownsw geetfun michaeltaras drbragg hahmed fncontroloption ccocchi dmitrytsepelev hparker garyhtou-forks composerinteralia haldun max-leopold qpc-github quantum-platinum-cloud orhantoy mrzasa enebo eregon andrykonchin nevans iq-scm iancanderson flavorjones msp-greg s4na tenderlove ojab jemmaissroff stevenjohnstone k0kubun turboladen ianks juancrg90 y9v nobu giovannism20 vinistock cxz andyw8 froydnj marcoroth shaver morriar burdettelamar clown-team gazayas olleolleolle thomasmarshall eileencodes maumagnaguagno yasulab seven1m alitaso345 makenowjust mateusdeap jcoyne nevinera ryanbrushett wildmaples pcai kcdragon faraazahmad rgarver martinemde davidwessman derekcmoore lynne-ashminov hexdevs heyogrady camertron muhammadibtisam mattboldt dhavalsingh casperisfine speak2erase josacar wassson ngroman joshuay03 k0va1 tompng cafonso riseshia matheusrich nixme amirrajan r7kamura astrabacus kjtsanaktsidis hasumikin sanjaii mutecipher matthew-healy djensenius nikhilbhatt dgollahon st0012 m-nakamura145 andreatp

prism's Issues

Pattern matching

This is a whole ticket for supporting the pattern matching syntax. There are a bunch of subtasks on this, and it shouldn't be attempted until #137 is closed. For documentation, check https://github.com/ruby/ruby/blob/master/doc/syntax/pattern_matching.rdoc.

In clauses (case foo; in bar; end)
Array patterns (case foo; in [bar]; end)
Find patterns (case foo; in [*, :bar, *]; end)
#655
Pinned expressions (^(foo))
Pinned locals (^foo)
Rightward assignment (foo => bar)

If you want to grab a task out of this, convert it into its own issue first.

Deconstructing required parameters

We need to handle the syntax for deconstructing required parameters. This looks like:

def foo((bar, baz)); end

This should leverage the code we have for MultiTargetNode, since it's the same kind of parsing. We should wait until #134 is done to take advantage of that work.

Hash literals

We need to parse hashes. Right now we're not parsing them at all. This includes the addition of three nodes:

HashNode - a node that is either made with an explicit hash literal or an implicit hash literal in an argument list
AssocNode - a node that contains a key and a value in a hash, in addition to an optional operator between them (the hash rocket). If a label is used, then this field will be marked as not provided
AssocSplat - a node that represents splatting an object into a hash, as in **foo

If you want to grab a task, convert it into its own issue first.

String concatenation

We need a node to represent string concatenation nodes. This is for circumstances like

"foo" "bar"

"foo" \
  "bar"

It should have a left node, a right node, and an optional \ operator token. For documentation, see https://github.com/ruby/ruby/blob/master/doc/syntax/literals.rdoc#string-literals-.

Dynamic symbols

We need to support dynamic symbols like :"foo" and %s[foo]. This should support interpolation. This will be very similar to the way we have implemented strings. We need this both for dynamic symbols but also %I lists.

Embedded variables

Right now we handle interpolated expressions, but don't handle parsing interpolated variables. For example:

"#@@foo"

This is for instance, class, and global variables. We need to parse this for strings, regular expressions, xstrings, etc. This issue will require updating the lexer as that piece was left unfinished.

Design document is editable

Hi,
I have opened the Google doc with design description via the link in the README and I have editing rights for it, so I guess it's open for anyone, you might want to limit the editing rights to avoid it getting messed up

String escapes

Currently, we're only handling string escapes with a single character. This includes both string literals and character literals. For example:

?\C-a
# => "\u0001"

"\C-a"
# => "\u0001"

In order to support all of the consumers, the string nodes should have a string field on them that is the escaped version of the string to make it easier to use. For a full list of escapes, check https://github.com/ruby/ruby/blob/master/doc/syntax/literals.rdoc#label-String+Literals.

Parsing a broken argument gets stuck

I found that it gets stuck to parse a method call with keyword arguments.

$ ruby -Ilib -Itmp/x86_64-linux/stage/lib/yarp/ -ryarp -e 'YARP.parse("foo(k: 1)")'
^C^C^C

As far as I read the code, I think yarp does not support keyword arguments yet. That's fine, but I think an inifinite loop is problematic. For example, parsing the following code fragment also causes an infinite loop.

YARP.parse("foo(@)") # gets stuck too

I believe the following while (true) causes this issue, but I am curious how we should fix it.

https://github.com/Shopify/yarp/blob/937b35b71b5844127495ad9d22160ebfd6d2c654/src/yarp.c#L1578

`&.` safe navigation operator

We need to handle method calls with &.. It should be treated with the same precedence as regular method calls with . or ::.

Top-level constants

Right now we parse constants, but don't handle if they're prefixed with ::. We should parse top-level constant lookups, like:

::Foo

This will involve adding a clause to the parse_expression_prefix function. When this is done, we will also need to support assigning to top-level constants, as in:

::Foo = 1

That will involve updating our handling of the = operator in the parse_expression_infix function.

Singleton method definitions

We need to handle singleton method definitions. This will involve extending the DefNode node to have an optional receiver and operator. For example:

def self.foo = bar
def Object.foo = bar

`MultiTargetNode` enhancements

Right now we can parse multiple targets in a list, but don't handle some other cases.

#428

(foo, bar), baz = value

The second is the ability to use splats, as in:

foo, * = value

The third is the ability to use a trailing comma, as in:

foo, = value

If you want to grab one of these tasks, convert it into its own issue first.

Command calls

We need to properly handle method calls without parentheses. This is much more involved than it may look on the outside. The current Ruby parser maintains a stack of arguments to know where to associate blocks and arguments. At the end of the day, it should parse into nodes that we already have defined like CallNode.

`alias` keyword for global variables

We should be able to use the alias keyword with global variables. This looks like

alias $a $g

This involves extending our parsing of alias arguments to support global variables.

Method parameter order enforcement

At the moment we allow any method parameter declarations to be in any position in the list. So for example def foo(&bar, *baz) is permitted by our parser but shouldn't be.

To maintain flexibility, we want to continue to allow the parameters to be parsed in any order, but we want to add errors if they are found to be incorrect. This allows the user to type them incorrectly and to have a nice experience, as opposed to just crashing.

Prevent regex / division confusion

Today expressions like 1/2 are confused for, IntegerLiteral(INTEGER("1")) followed by an incomplete regular expression because the / is not followed by a space.

We can check that a recent symbol was one of the kind of things that more likely to come before division then a regular expression literal, but I don't feel confident that is a reasonable way to prevent this confusion.

What information can we keep track of in the parser to make thins kind of confusion impossible?

Argument forwarding in method calls

We need to support the ... operator in method calls. This will involve adding some special handling to our parse expression to check if we're inside of an argument list.

Primitives

I think we need an abstraction of Vec<T>:

There's a node_list_t type that stores an array of node pointers. Is it intentional? For better locality it should be an array of nodes (i.e. node * instead of node **).
There's a stack of string literals that has fixed size, so this is another place where we need Vec
We will need a stack of sets of local variables
We will need a stack of "current attribute"s to distinguish errors like in def foo(bar = bar) from def foo(bar = def foo(x = bar); end); end
We need a stack of nested contexts (module in class in method in defined etc) to know where we are during parsing.
We need a stack of numparams to handle nested procs/lambdas with numbered parameters
We need 2 stacks for pattern matching: one to track uniq items in a list, one to track uniq keys in a hash pattern

Also I noticed that nodes encapsulate tokens, is it intentional? There are cases when node needs a value that is based on token, but is slightly different:

in foo.bar = baz method name should be "bar="
in foo.(bar) method name is "call"
in -foo method name is "-@" (same with "+")
unary operators on numeric literals can include sign in their value for simplicity (i.e. "-42" should be Int("-42"))
... and maybe more.

I think it makes sense to extract a type string_t:

typedef struct {
  char *start;
  char *end;
} string_t;

WDYT?

`**nil`

When a method declaration uses **nil, it specifies that it accepts no more keyword arguments. To support this, we will need to update our parse_parameters function.

`rescue` modifier

We need to handle the rescue modifier keyword, as in:

foo rescue nil

This should be treated as an infix operator, and parsed using the parse_expression_infix function.

Multiple assignment

There is a lot that goes into multiple assignment. See all of the documentation here: https://github.com/ruby/ruby/blob/master/doc/syntax/assignment.rdoc#multiple-assignment-. We'll need a MultiTargetNode node on the left hand side, and an arguments node on the right hand side.

Argument forwarding in method declarations

We need to support the ... operator in our method declarations. This will involve updating our parse_parameters function.

Serialization

We want a binary serialization of the syntax tree such that it can be deserialized and read in any other language. This issue is to begin looking at how that would work and what the structure would be. I'm imagining something to the effect of the below design.

First, the header:

# bytes	field
4	YARP
1	major version
1	minor version
1	patch version
8	comments offset into file

Next, the tree. For each node, it'll walk the tree starting from the top. It will dump:

# bytes	field
4	type
8	offset of next node in tree*
8	start offset in source
8	end offset in source

Then for each child node from this node it will recurse. Each child node type will have to have its own kind of serialization. I'll discuss those next.

*I think this is a good idea to include so that folks can skip past this node if it's trivial for the implementation. For example, if it's a Self node and you don't care about offset in the source then you can just skip directly to the next node in the tree.

For a child node that is itself a node, it will use the schema defined above. For a child node that is a string, it will use:

# bytes	field
8	start offset
8	end offset

For a child that is a node list, it will use:

# bytes	field
4	number of elements in the list

and then for each node it will recurse.

Finally, at the end it will have a section for comments, which will look like:

# bytes	field
4	number of comments in the file

then for each comment:

# bytes	field
8	start offset
8	end offset

I believe this is enough information to get started. Certainly we're going to iterate on this.

Lexer state

This is probably the biggest remaining open issue. CRuby maintains a bunch of state in the lexer to determine the next node to return. We're tracking some of the same state, but need to track all of it if we're going to properly match semantics.

Specifically this means mirroring the behavior of parser_params.lex.state and parser_params.command_start. There will probably be others.

Note that this doesn't necessarily have to be the same structure or even names of variables. The idea here is that we need to mirror the behavior but not necessarily the code.

Dynamic symbols with %s

Currently we have support for dynamic symbols, but don't allow them to be created with %s. We need to add handling for that.

Blocks with keywords

We have blocks in place with braces, and they are being parsed correctly. However we do not have support for blocks with keywords (do..end).

There are quite a few different semantics we need to account for. Blocks with keywords can have rescue/else/ensure attached. They also have different precedence. For example:

foo bar {}

the bar gets the block, whereas

foo bar do
end

the foo gets the block.

AST format

I propose to take some in inspiration from this config. There are two things that are different from whitequark/parser and right now I think it was my mistake to introduce them: IfMod and IfTernary should go away, If node is enough to cover all cases.

Apart from this it makes a lot of sense to me. Thoughts on current config:

Assignment - in static analysis from my experience people never look for "an abstract assignment", there's always a need to find a certain type of assignment, like lvar assignment (to track local variables) or ivar assignment. I think it makes sense to introduce separate nodes for all types of assignment like it's done in whitequark/parser, ruby_parser and all parsers derived from them.
Binary - I think it's too generic and complex at the same time. 2 + 2 is a Binary, right? I think it should have the same type as foo.bar, because it is a method call. On the other side all types of "real" binary operators need their own node types because they have special behaviour.
CharacterLiteral - it should be just a String, there's no different between them.
FloatLiteral - let's keep it.
Identifier - this node has no value on its own and it always requires tracking current context to understand the meaning.
IfModifier - like I described above a single If node is enough. If there's a need to track a modifier version source maps can be used, but there's no different in how "normal if" works vs "if modifier". AST should represent meaning, not layout.
ImaginaryLiteral - let's keep it
IntegerLiteral - keep
OperatorAssignment - keep, there's an equivalent in whitequark/parser
Program - ehh, we can keep it, both whitequark/parser, ruby_parser (and IIRC all parsers except parser.y/Ripper) use begin node for it, because there's difference in how top-level block of code is treated comparing to let's say block body or begin..end.
RationalLiteral - keep
Range - I think it makes sense to split it to .. and ... ranges
Redo / Retry - keep
Statements - should be just begin I guess
Ternary - redundant, can be replaced with If
UnlessModifier - redundant, can be if with "inverted" branches
UntilModifier - redundant, there can be a single Until node
VariableReference - sounds a bit verbose, lvar or localvariable should be enough
WhileModifier - redundant, there can be a single While node

@kddnewton PTAL at the document that I linked above and let me know what you think

Magic comment syntax

We need to better handle magic comments. At the moment we only support very limited checking for frozen_string_literal: true. There is a formula for handling this written in CRuby that we should mimic - it needs to additionally handle the emacs format.

For known magic comments, we should store booleans so that consumers don't have to reparse the magic comments.

`FILE`, `LINE,` and `ENCODING` nodes

We should have nodes for all of these. The __LINE__ node should have a number attached to it that is the actual line in the source. The __FILE__ node should have a string attached to it that is the file name that is being parsed. The filename should be accepted to the overall parse function.

Statement sets

When parsing a class, module, method, or do block definition, we need to handle the case of rescue/ensure/else clauses. For example:

def foo
  1
rescue
  2
end

In this case the body isn't a statements node, it's actually a BeginNode. We should allow either one to be present in the method, depending on whether or not the additional keywords were used.

Note that we shouldn't attempt this until #141 and #135 are implemented.

`case` keyword

We need to parse case statements. This ticket is specifically for case and when, not for pattern matching, which will come later. Important to remember that the case's predicate is not required.

`return` keyword

We need a node to handle the return keyword. It should be called ReturnNode, and should have both a keyword and an optional argument node as a child. This looks like:

return
return foo
return foo, bar, baz

For documentation see https://github.com/ruby/ruby/blob/master/doc/syntax/methods.rdoc#return-values-.

`not` keyword

We need to handle the not keyword. It has two forms: with parentheses and without. It is very similar to a unary operator. It looks like:

not foo
not(foo)

Lambda literals

Currently we don't handle lambda literals at all, but we need to. This involves both the brace and do..end forms. It should parse parameters, but also parse any potential block locals, as in -> (; foo) {}.

Strings with %, %q, and %Q

We currently do not support string literals starting with %, %q, and %Q.

Parentheses

We don't currently handle parentheses, but we should. For example:

1 * (2 + 3)

This should continue parsing but lower the binding power down to the beginning and then expect a closing parenthesis.

Operator arguments.

Currently we're parsing arguments that are expressions, but not ones that start with operators. This includes:

Splat arguments that start with *.
Double splat arguments that start with **.
Block arguments that start with &.

This will involve updating the parse_arguments function.

Heredocs

We haven't even begun implementing heredocs, and they're going to be quite a thing. We need to keep a stack of them in the lexer and flush their content when we hit the next newline. We'll need to handle the interpolation or the fact that they can be xstrings depending on how they are declared.

Regular expressions

We need to finish parsing regular expressions.

At the moment we handle regex but only if there's no interpolation. We need to also handle interpolation.
We also need to handle regular expressions with named captures creating local variables. We have a parser here that can handle that, but we haven't hooked it up yet. That looks like:

/aaa (?<foo>bbb) ccc/ =~ "..."
foo

We also need to handle %r regular expressions. At the moment we're only handling regular expressions that start with /.

For documentation, check https://github.com/ruby/ruby/blob/master/doc/syntax/literals.rdoc#label-Regexp+Literals and https://github.com/ruby/ruby/blob/master/doc/syntax/literals.rdoc#r-regexp-literals-.

Parsing `foo(1) = 2` does not cause an error

foo(1) = 2 is parsed as foo(1).send("=", 2). It should cause a syntax error.

pp YARP.parse("foo(1) = 2")'
#=> #<YARP::ParseResult:0x00007f38b82ca018
#    @comments=[],
#    @errors=[],
#    @node=
#     Program(
#       Scope([]),
#       Statements(
#         [CallNode(
#            CallNode(
#              nil,
#              nil,
#              IDENTIFIER("foo"),
#              PARENTHESIS_LEFT("("),
#              ArgumentsNode(
#                [IntegerLiteral(INTEGER("1"))]
#              ),
#              PARENTHESIS_RIGHT(")"),
#              "foo"
#            ),
#            nil,
#            EQUAL("="),
#            nil,
#            ArgumentsNode(
#              [IntegerLiteral(INTEGER("2"))]
#            ),
#            nil,
#            "foo(1)="
#          )]
#       )
#     )>

#[] and #[]= calls

We need to handle syntax that calls #[] and #[]=. This looks like:

foo[bar]

and

foo[bar] = baz

These should both be call nodes with a synthesized method name.

Create new location param type

When we template out our AST and related code, we have a couple of different kinds of parameters that can be added onto nodes. Those include:

node - another node
node? - an optional node
node[] - a list of nodes
token - a token
token? - an optional token
token[] - a list of tokens
string - a string of characters

For the most part, this information is enough. However, there are times when it is redundant. For example, the DefinedNode node has a token for its keyword on it that is always going to be defined?.

Instead, we should create a location parameter that only stores the location in source. That way we'll save a bit on the overall memory that both the AST and serialization require. The location parameter should be its own struct that just stores two offsets into the source. Once we have that parameter in place, we should replace parameters in the existing tree that are constant to take advantage of this work. Finally, we should reorder the nodes so that the location parameters are always at the end, so that consumers of the serialization don't have to worry about them and can skip past them.

Create a new location parameter type
Replace tokens that are constant in the tree with the new parameter type
Move those tokens to the end of the list of child nodes

Whitespace on tokens

Right now, we skip whitespace before the tokens start. We skip past all of the content before marking the start of the token. The tokens themselves have two pointers: start and end. We should have an additional pointer on all of the tokens called full_start that includes whitespace.

We need this to handle the warn_indent directive and also will need it eventually for error recovery to better handle the end keyword.

Symbols for other tokens

Currently we handle symbols for identifiers and dynamic strings, but there are other kinds of symbols in Ruby. Those include:

We need to handle all of these cases. This will involve updating our parse_symbol function.

`rescue` keyword

Right now we have begin keyword handling but don't parse rescue clauses. These should be modeled like a linked list like the elsif clauses are now. The root node should live as an optional node on the begin node.

XString literals

We need two nodes to handle xstrings. We should have one version for if there's interpolation and one version for when there's not. It should basically be modeled after StringNode and InterpolatedStringNode. They look like:

`foo`
`foo #{bar}`
%x[foo]
%x[foo #{bar}]

For documentation, check https://github.com/ruby/ruby/blob/master/doc/syntax/literals.rdoc#label-25x-3A+Backtick+Literals.

`foo() @foo` does not cause an error

foo() @foo is parsed as foo(). In other words, @foo is just ignored.

pp YARP.parse("foo() @foo")
#=> #<YARP::ParseResult:0x00007ff7bb3aa4e8
#    @comments=[],
#    @errors=[],
#    @node=
#     Program(
#       Scope([]),
#       Statements(
#         [CallNode(
#            nil,
#            nil,
#            IDENTIFIER("foo"),
#            PARENTHESIS_LEFT("("),
#            nil,
#            PARENTHESIS_RIGHT(")"),
#            "foo"
#          )]
#       )
#     )>

I think parse_statements should record an error if the token is neither YP_TOKEN_NEWLINE or YP_TOKEN_SEMICOLON after a statement.

https://github.com/Shopify/yarp/blob/937b35b71b5844127495ad9d22160ebfd6d2c654/src/yarp.c#L1558

Symbol lists with interpolation

We need to support %I lists, which create arrays of symbols with interpolation. This looks like

%I[foo bar baz#{1}]

For documentation, see https://github.com/ruby/ruby/blob/master/doc/syntax/literals.rdoc#label-25i+and+-25I-3A+Symbol-Array+Literals. This should be an ArrayNode node with SymbolNode or InterpolatedSymbolNode children. We already handle this for strings, so it should be extremely similar parsing.

`ensure` keyword

We need to extend our begin keyword handling to additionally support the ensure keyword. It should be modeled as its own field on the BeginNode. This looks like:

begin
  foo
ensure
  bar
end

ruby / prism Goto Github PK

prism's Introduction

Prism Ruby parser

Overview

Getting started

Contributing

Examples

Runtimes

Libraries

Applications

prism's People

Contributors

Stargazers

Watchers

Forkers

prism's Issues

Recommend Projects

Recommend Topics

Recommend Org