I'd like to open up the discussion of what our various options are for preserving so called "Concrete Syntax" elements in the parse tree format.
What are Concrete Syntax elements?
A simplified definition of "concrete syntax" is any stuff that could appear in a source code program which, when it's parsed, would be otherwise discarded information that an AST (as it stands now) does not represent.
That specifically means that things which are reliably inferable from an AST's nodes are not concrete syntax. For example:
var a = 2,
b = (a + 2) * 3;
Here, the ( )
around a + 2
is not represented in the AST, because the structure of the tree, combined with operator precedence rules, absolutely implies that it must exist, and moreover a + 2 * 3
would have been a different tree structure. Thus, the ( )
in the original program can be reconstructed from the AST reliably, so it need not be stored separately. It is not concrete syntax.
However:
var a = 2,
b = a + (2 * 3);
This program includes a pair of ( )
that would already be implied by the tree structure and operator precedence, and thus would also not be stored.
But, critically, they also would not be re-generated when the tree was reconstituted. Why? Because it's impossible to know from the tree alone if the ( )
was really there, or just implied. And code generation takes the conservative path and doesn't make up ( )
where it's not sure they were there, so it leaves them out.
Herein we see that the original ( )
are not abstract syntax, but concrete, and to preserve them is going to require some other solution.
Examples of concrete syntax:
- extraneous whitespace (the non-significant kind)
- comments
- extraneous
( )
(used primarily for readability more than functionality)
- ... other?
The Rabbit Hole
Just how deep does this rabbit hole go? In the following snippet, every single /*x*/
comment represents a location where whitespace and/or comments can optionally appear as concrete syntax, in addition of course to the fact that some of those locations require significant whitespace, such as /*4*/
:
/*1*/class /*2*/ Foo /*3*/ extends /*4*/ Bar /*5*/ {
/*6*/constructor /*7*/ (/*8*/x /*9*/ = /*10*/ "hello" /*11*/) /*12*/ {
// ..
}
/*13*/ static /*14*/ bam /*15*/ ( .. ) {
// ..
}
}
var /*16*/{ /*17*/ x /*18*/: /*19*/ { /*20*/ y /*21*/: /*22*/ z /*23*/ = /*24*/ 2 /*25*/ } } =
/*26*/new /*27*/ Foo /*28*/ ( (/*29*/) /*30*/ => /*31*/ { .. } /*32*/ ) /*33*/;
Your imagination can probably take it from there. There's a whole slew of complex ES6 syntax forms which implies a deep rabbit hole of nooks and crannies where we need to be able to preserve information (in some way) that our normal approach to AST doesn't currently preserve.
Consider the tree structure for the arrow function expression, for example... how and where could we represent /*29*/
and /*30*/
? /*31*/
and /*32*/
are a little clearer.
Why Concrete Syntax?
Why would we want to preserve all these pieces of concrete syntax? There's several use-cases:
-
Any tool which performs localized transformations on a source code file, which doesn't want to change everything, but only a targeted specific thing. For example, a tool that does nothing but rewrite all variable names to uppercase versions (for whatever silly reason).
The goal of this tool is not to affect anything else about the program, such as formatting and comments, as those might still be important to the author of the file.
-
Fully configurable automated code formatting: not just rule based, like "always put a space between an ( )
on a call" or stuff like that, but more fine-grained rules, where a tool might parse a source program and produce a tree structure with very specific information in it about how the resultant code should be re-generated.
For example, it may automatically insert comments for each parameter in a function declaration with some sort of annotations about how and where the param is lexically used, etc. Or a code-style painter may "repaint" a file with spaces for indentation vs tabs, or may insert spaces for alignment and indentation with tabs, etc etc.
I could go on speculating, but I'll just leave it that there are definitely cases for tools which want to be able to preserve concrete syntax wherever it appears. Since concrete syntax by definition cannot be inferred, the parser and data structures that come out of it must be able to do so. Moreover, this information must be something a code generator can receive and use.
How?
Here's the part where all the bikeshedding is going to happen.
To date, conversations around this topic have happened many times that I've been privy to, and there's never been any kind of consensus on how to approach solving it. I have my ideas, but I'm only going to suggest them as a possible starting point proposal, not that it has to be this way.
CST-as-AST Proposal
I believe we should have one unified tree structure, which has optional -- what I call "extras" -- annotations (and in a few limited cases, nodes) in it which represent the necessary hooks for preserving these concrete syntax elements. In other words, I propose that there be no difference between an AST and a CST (concrete syntax tree), other than the absence or presence of CS elements in the tree.
Any tool which currently produces ASTs is a tool that's producing CSTs by default, but which just happens to not actually be keeping any of the CS elements yet. These tools can start keeping the CS elements, but still have the same style of tree they're producing, just with extra info in them.
Any tool which consumes ASTs is a tool that's already consuming CSTs by default, but which is just ignoring any CS elements which may be present. These tools can start using the CS elements they find.
It turns out that most of the places where we need to preserve CS elements can be added as additional properties (again, I call it "extras", with extra sub-names like "before", "inside", and "after" for positioning), which means that there would be zero impact to the existing tools that use such a tree format.
Tree producing tools (parsers, transpilers, etc) could just not produce these annotations, but things still work fine. Tree consuming tools continue to consume the trees as they currently do, and just ignore the these extras as they currently do.
I believe this has the most minimal impact to the existing tool ecosystem, and thus the easiest path to wider adoption by more tools.
Downside
There will be some minor places where the node structure will have to be slightly different to accommodate some of the trickier cases of CS positioning.
For example, anonymous function expressions that have an id
of null
means that we don't have an object value in that node to attach any extras
annotations to in the function/*1*/()
position.
If we could represent an anonymous name entry instead of id: null
as an object like id: { extras: .. }
, this will mean we have a hook to annotate those extras.
This does represent a slight breaking change to the format, but it's not a major sweeping new tree format, and will require on the whole just a small bit of extra handling care. These necessary node structure breaking changes are minor and very few for the pre-ES6 tree structure (aka SpiderMonkey).
The new ES6 forms definitely add more places where we should consider tree structure from the beginning which are amenable to attaching these annotations. Since there's not already an established standardized ES6 tree format, I think it's not too late for us to consider these concerns as we specify the ES6 node forms.