literalizer's Introduction

literalizer

Specialized JS lexer which applies heuristic rules to identify the complex literals first. These literals cause frustration/complication in any JS "parsing" task, as they are a primary source of context-sensitive grammar.

By applying various heuristic rules during lexing, however, these literals can be identified in a first-pass, leaving everything else alone. This allows subsequent parsing to be significantly less complex, possibly even context-free (regular expressions!), or it allows you to easily find the complex literals and target them for special processing.

Some Use Cases

  1. Syntax highlighting becomes a far simpler task with regular expressions if these complex literals are already pre-identified and cannot cause false matches.

  2. Easily search for special meta-commands contained in code comments, such as // @sourceURL=... (see the sketch after this list)

  3. Find all regular expression literals and pass them through an optimization engine and then replace them with their optimized equivalents.

  4. Implement macros or other code pragmas which have to be processed before normal JS parsing can proceed.

  5. Parse out certain code patterns for things like dependency injection.
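
For instance, use case 2 might look like this sketch (assuming comment segments carry type 1 and their text in val, matching the lex(..) output examples in the issues further below):

var LIT = require("literalizer");

// code: your JS source, as a string
LIT.lex(code).forEach(function(seg){
    // type 1 = comment segment (see the lex(..) examples below)
    if (seg.type === 1 && seg.val.indexOf("@sourceURL=") !== -1) {
        console.log("meta-command found:", seg.val.trim());
    }
});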

Relaxed

Another key feature of literalizer is that it's a "relaxed" lexer: it can run against code which is not strictly valid JS and still make a best-effort attempt. Most of the heuristics are based on fundamental language grammar, such as ASI and where and how statements and expressions can appear.

However, as long as your code variations don't change the rules for statements and expressions, many syntax/grammar errors, non-standard keywords/constructs, and other invalid code will still just pass through successfully.

A relaxed lexer is also crucial for tasks like on-the-fly syntax highlighting, which must be able to adjust to not-yet-completely-valid code.

Identified Literals

The (complex) literals that will always be identified are:

  • strings (" or ' delimited)
  • comments (single-line or multi-line)
  • regular expressions
  • ES6 template strings (` delimited)

Optional Literals

There are also configuration options that control identification of:

  • (default: on) HTML-style comment markers (<!-- and -->) as single-line comment delimiters
  • (default: off) number literals, including integer, decimal, octal (ES5/ES6), hex (ES5/ES6), and binary (ES6 only)
  • (default: off) simple literals (null, Infinity, etc)

Options

literalizer can be configured to control which of the optional literals are identified explicitly.

  • LIT.opts.overlook_html_comment_markers (boolean; default: false) - If set to true, will overlook (that is, refuse to recognize) the <!-- and --> HTML-style comment markers as single-line comment delimiters, leaving them instead to be treated as standard JS operator sequences

  • LIT.opts.identify_number_literals (boolean; default: false) - If set to true, will explicitly identify number literals

  • LIT.opts.identify_simple_literals (boolean; default: false) - If set to true, will explicitly identify simple literals
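
For example, a minimal configuration sketch using the options above (sourceCode is assumed to hold your JS source as a string):

var LIT = require("literalizer");

LIT.opts.identify_number_literals = true;      // opt in to number literals
LIT.opts.identify_simple_literals = true;      // opt in to null, Infinity, etc.
LIT.opts.overlook_html_comment_markers = true; // treat <!-- and --> as operators

var segments = LIT.lex(sourceCode);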

API

literalizer's API includes:

  • LIT.lex(..) takes a string of code and returns an array of segments, each of which has a type property identifying the segment type, according to which literal (or general text) it represents.

  • LIT.generate(..) takes an array of segments (as produced by lex(..)) and re-generates the source code. This might be useful if you wanted to modify (or add/remove) segments after literalizer analysis and then re-compile the code.

  • LIT.reset() resets the warnings list from previous runs of lex(..).

  • LIT.warnings is an array of any warnings encountered while lexing.
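
Putting those together, a round trip might look like this sketch (the segment shape, type and val, matches the lex(..) output examples in the issues below):

var LIT = require("literalizer");

var segments = LIT.lex(sourceCode);

// ... inspect or modify segments here ...

var regenerated = LIT.generate(segments);   // re-compile back to source

if (LIT.warnings.length > 0) {
    console.warn(LIT.warnings);             // anything suspicious seen while lexing
}

LIT.reset();                                // clear warnings before the next lex(..)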

License

The code and all the documentation are released under the MIT license.

http://getify.mit-license.org/

literalizer's People

Contributors

getify, ljharb

literalizer's Issues

Handle streaming lexing

[edit: renamed to change the scope/intent of this bug per this comment: https://github.com//issues/20#issuecomment-23993775]

fs.readFile (and fs.readFileSync) provides a Buffer object. Passing this into literalizer.lex returns very unexpected data.

It would be great if it could read from the buffer directly, but if it requires string input, could you pass the input of "lex" through String()?
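
In the meantime, coercing the Buffer yourself sketches the requested behavior:

var fs = require("fs");
var LIT = require("literalizer");

var buf = fs.readFileSync("file.js");    // a Buffer, not a string
var segments = LIT.lex(String(buf));     // coerce to a string before lexing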

add config option(s) for which extra literals to identify

Literalizer was initially intended just to identify the complex literals (comments, strings, regexes, and back-ticks).

I added number-literal detection recently since the regex pattern for that is rather complex and you do have to look at the preceding context to make sure it's not part of an identifier (also a complex pattern).

Now I'm thinking, while we're at it, why not go ahead and identify the simple literals null, undefined, true, false, Infinity, and NaN?

Side Note: JS lexes Infinity and NaN as "Identifiers" and then later treats them as numbers during execution. This seems strange to me, like they should have just been lexed as number-literals. But I decided not to diverge from the way JS lexes. Same goes for not including the - in the number literal like with -42. JS doesn't treat the - as part of the number, so neither do I (which turns out to be much simpler/more performant for literalizer!).

But then, I realize that identifying numbers and these other simple literals costs perf and may or may not be useful to someone using literalizer.

So, perhaps we should put in two config options, like:

LIT.opts.identify_numbers = true;
LIT.opts.identify_simple = true;

For performance reasons, we might default both to false, but let people opt into those extra literals if they want them.

I also considered identifying keywords (and even operators), but I don't think those should qualify as literals. But I'm open to suggestions on that.

Just want to keep in mind, the goal here is a heuristic lexer that identifies whatever we can, as fast as we can, without needing to do full grammar parsing (which is much slower). So we have to balance usefulness against keeping literalizer "fast". If it's as slow as a normal full-grammar parser but does far less, there's little point to its existence (except of course its "relaxed" mode).


The internal complication basically boils down to compiling a different general-state regex depending on the state of these configs. I don't think any of the rest of the code would need to consult the opts.

It would just consult the opts at the beginning of lex(), compile the appropriate general-state regex pattern, and proceed as normal. This might add an initial fixed warm-up cost, but it should allow the case where you don't need those extra non-complex literals to run a bit faster, by not bogging down the general-state regex pattern (especially on bigger, more complex files, like jQuery).
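
A hypothetical sketch of that approach (not literalizer's actual internals; the patterns are simplified placeholders):

function compileGeneralState(opts) {
    // the complex literals are always matched
    var alts = [ "\\/\\/", "\\/\\*", "[\"'`]", "\\/" ];

    if (opts.identify_number_literals) {
        alts.push("0[xX][0-9a-fA-F]+|\\d+(?:\\.\\d*)?");
    }
    if (opts.identify_simple_literals) {
        alts.push("\\b(?:null|undefined|true|false|Infinity|NaN)\\b");
    }

    return new RegExp(alts.join("|"), "g");
}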

no release tags

Every time something's published to npm (and the version number is bumped), that SHA should be tagged with the version, like v1.0.0. That complies both with npm module convention and with GitHub's "releases" feature.

Unfortunately I can't PR tags, or I'd do it for you :-/

tests now need to permute the config options

Now that the options can be configured, the tests need to permute those options across the test results, to ensure everything works not only with an option on (right now, the tests assume all optional literals are identified) but also with it off, and with the various combinations thereof.

fix how template literal is processed

There are quite a few problems with the existing implementation regarding template literals. Off the top of my head (a snippet exercising these follows the list):

  1. The way it looks for the ending after starting a template literal is wrong. Instead of just looking for the next backtick, it should track nesting with a stack, much as matching { } or ( ) pairs would be done.
  2. Template literals can have other template literals inside their expressions.
  3. Each ${ inside the literal should break out and just create elements of text (code), but must track { } pairs so it can identify the ending }, and then go back into template-literal mode, resuming its search for either ${ or the closing backtick.
  4. The tag identifier before a template literal, if present, should be included as part of the first template-literal element in the stream (heuristic: preceding whitespace, then an identifier).
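
For example, a snippet exercising points 1-4 (tag and x assumed defined elsewhere):

var msg = tag`outer ${ `inner ${ x } text` } and ${ { a: 1 }.a } done`;

The lexer must pair backticks across nesting, track { } inside each ${ } so the correct } closes the expression, and attach the tag identifier to the template-literal segment.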

false-negatives on regex identification, need better rules

[ Taken from the great pseudo-code and descriptions here: https://github.com/mozilla/sweet.js/wiki/design#give-lookbehind-to-the-reader ]

Examples currently broken:

if (true) /foo/g;
while (true) /foo/g;
for (;;) /foo/g;
with (foo) /foo/g;

and

function(){} /foo/g;
() => {} /foo/g;

Update: there's more to fix than the above-linked algorithm suggests:

sweet-js/sweet-core#82 (comment)

var a, b = 2, i = 3;

{x=4}/b/i;             // `/` starts a regex literal
{x:4}/b/i;             // `/` starts a regex literal
{y:5}{x:4}/b/i;        // `/` starts a regex literal
return
{x:4}/b/i;             // because of ASI, `/` starts a regex literal
throw
{x:4}/b/i;             // because of ASI, `/` starts a regex literal

a={x:4}/b/i;           // `/` is a divide
foo({x:4}/b/i);        // `/` is a divide
a={y:{x:4}/b/i};       // `/` is a divide
return{x:4}/b/i;       // `/` is a divide
throw{x:4}/b/i;        // `/` is a divide

block-allowed state context needs to be a stack, not a flip-state variable

for-loops... yay.

for( ; {a:2}/a/g ; ){}   // <-- `block-allowed`, and thus a { } after a ; operator,
                         // needs to be `false` during ( ) of a for-loop, except...

for( ; {a:/a/g} ; ){}  // here, inside the { } object literal, of course it's a regex

for( ; function(){ /a/g; } /a/g; ){}  // here, function gives us block-allowed context

I think the first case basically boils down to this: inside a ( ) set, a block is not allowed, unless we get a function, which can then create a new block context. This means the simple binary flipping of the block-allowed state is not sufficient; we actually need to refactor to handle a "stack" of block-allowed states.

Another way to examine it: usually a ; would signal that block-allowed goes back to true, but that's not the case if we happen to be immediately inside a ( ) pair. For-loops are, I think, the only way a ( ) can have ; operators in it, and a function is the only construct inside ( ) which can create block-allowed context, thus we need a stack of block-allowed states. :/
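
A hypothetical sketch of that refactor (illustrative only, not literalizer's code):

var blockAllowed = [ true ];               // a stack, not a single flag

function openParen()    { blockAllowed.push(false); }  // inside ( ): no blocks
function openFunction() { blockAllowed.push(true);  }  // function body: blocks again
function closeContext() { blockAllowed.pop(); }

function isBlockAllowed() {
    return blockAllowed[blockAllowed.length - 1];
}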

Parse HTML comments

JS supports HTML comments, which is insane, but there it is.

Specifically, currently:

var code = 'foo();\n// bar\n/*baz\n\n\nquux*/\n<!-- html comment\n\n--> another comment\n\nbaz()';
lex(code) => [
  { type: 0, val: 'foo();\n' },
  { type: 1, val: '// bar' },
  { type: 0, val: '\n' },
  { type: 1, val: '/*baz\n\n\nquux*/' },
  { type: 0, val: '\n<!-- html comment\n\n--> another comment\n\nbaz()' }
]

however, it should give:

var code = 'foo();\n// bar\n/*baz\n\n\nquux*/\n<!-- html comment\n\n--> another comment\n\nbaz()';
lex(code) => [
  { type: 0, val: 'foo();\n' },
  { type: 1, val: '// bar' },
  { type: 0, val: '\n' },
  { type: 1, val: '/*baz\n\n\nquux*/' },
  { type: 0, val: '\n' },
  { type: 1, val: '<!-- html comment' },
  { type: 0, val: '\n\n' },
  { type: 1, val: '--> another comment' },
  { type: 0, val: '\n\nbaz()' }
]

Various ES6 short-hand function syntax variations not yet handled

ES6 adds a few function syntax variations which drop the "function" keyword and have short-hand grammar instead.

For example:

var a = { foo() { /* I'm a short-hand function in an object property */ } };

var a = foo => { /* I'm an arrow-function */ };

Notably, arrow functions don't require a { } pair for the function body, so that peculiarity needs to be handled.
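
For example, the braceless arrow body means the lexer can't rely on a { } pair to find the end of the function:

var half = x => x / 2;            // `/` here is a divide
var digits = s => /\d+/.exec(s);  // `/` here starts a regex literal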

Multi-line string literals "lose" their multi-line'ness

file.js:

var a = "this is \
a multi-line string";

and then

LIT.lex(file_js_contents);

results in:

[
    {
        "type": 0,
        "val": "var a = "
    },
    {
        "type": 2,
        "val": "\"this is a multi-line string\""
    },
    {
        "type": 0,
        "val": ";"
    }
]

You can see that the string-literal value has no new-line character in it. This matches how the JS engine would interpret the same code: the engine doesn't see an escaped new-line at the end of the line, but rather treats it as a line continuation.

It appears that esprima and acorn account for this (and possibly other "information loss") by also tracking a raw value on nodes. literalizer should do the same, for the string-literal segment at least.
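
That could look like an extra raw property on the segment (a hypothetical shape mirroring esprima/acorn, not current output):

{
    "type": 2,
    "val": "\"this is a multi-line string\"",
    "raw": "\"this is \\\na multi-line string\""
}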

"unlex" option?

I know it's probably trivial to write, but it'd be awesome if literalizer had an "unlex" method that took the results of "lex", and reproduced the original file.

A major benefit here would be, I could modify values, and use literalizer to reconstruct a modified file.

identify number literals

It would appear number literals are complex enough to need heuristic lexing (they can't be identified with regexes alone).
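
For instance, the same digits may or may not be a number literal depending on the preceding context:

var a1 = 42;     // the 1 in a1 is part of an identifier; 42 is a number literal
var b = 0x2a;    // hex number literal (ES5/ES6)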
