tree-sitter / tree-sitter

An incremental parsing system for programming tools

Home Page: https://tree-sitter.github.io

License: MIT License

C 29.22% C++ 0.63% Python 0.11% Shell 0.96% Batchfile 0.07% Rust 62.01% JavaScript 6.31% HTML 0.19% Makefile 0.31% Swift 0.13% Zig 0.02% Go 0.03% Dockerfile 0.01%
incremental parsing c tree-sitter rust parser wasm

tree-sitter's Introduction

tree-sitter


Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited. Tree-sitter aims to be:

  • General enough to parse any programming language
  • Fast enough to parse on every keystroke in a text editor
  • Robust enough to provide useful results even in the presence of syntax errors
  • Dependency-free so that the runtime library (which is written in pure C) can be embedded in any application
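The "fast enough to parse on every keystroke" goal rests on incremental reuse. A toy sketch of the idea follows (an illustration only, not tree-sitter's actual algorithm, which reuses whole subtrees of the previous syntax tree; `parse_line` and the line-level cache are inventions for this sketch):

```python
# Toy illustration of incremental parsing: cache a parse result per line
# and re-parse only lines whose text changed since the last edit.
def parse_line(line, stats):
    stats["parsed"] += 1          # count how many lines were actually parsed
    return ("line", line.split())

def incremental_parse(lines, cache, stats):
    tree = []
    for line in lines:
        if line not in cache:
            cache[line] = parse_line(line, stats)
        tree.append(cache[line])
    return ("program", tree)
```

After an edit to one line, only that line costs parsing work; everything else is served from the cache, which is the property that makes per-keystroke parsing affordable.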


tree-sitter's People

Contributors

ahelwer, ahlinc, amaanq, aminya, bfredl, cybershadow, daumantas-kavolis-sensmetry, dcreager, dependabot[bot], dundargoc, hendrikvanantwerpen, ikatyang, j3rn, joshvera, mattmassicotte, maxbrunsfeld, mkvoya, nhasabni, observeroftime, patrickt, philipturnbull, razzeee, rhysd, robrix, skalt, smoelius, tclem, the-mikedavis, ubolonton, wingrunr21


tree-sitter's Issues

Failed to load the grammar package

I'm working on a tree-sitter grammar for Agda, but I came across this error when trying to load the grammar package under development.

The module '/Users/banacorn/node/tree-sitter-agda/build/Release/tree_sitter_agda_binding.node'
was compiled against a different Node.js version using
NODE_MODULE_VERSION 57. This version of Node.js requires
NODE_MODULE_VERSION 54. Please try re-compiling or re-installing
the module (for instance, using `npm rebuild` or `npm install`). in /Users/banacorn/.atom/dev/packages/language-agda/grammars/tree-sitter-xagda.cson

I've followed the instructions in this thread and spent hours fiddling with stuff, but ended up with nothing :(

Here's the version of Atom I'm running:

Atom    : 1.25.0-beta3
Electron: 1.7.11
Chrome  : 58.0.3029.110
Node    : 7.9.0 

Language introspection

Nodes have a count and a named count, but it'd be useful to be able to tell this a priori about the production for a given TSSymbol (if I'm understanding these pieces correctly). That is, it'd be nice to be able to discover:

  • the productions in a given language
  • whether a production is terminal or nonterminal
  • more precisely, whether a production can produce named children or not

Tokens that match the empty string can result in infinite loops during error recovery

This feels like a bug in tree-sitter. If I remove the content after pkg demo = then it at least stops crashing, but I think a bug like that isn't good to have around.
It actually seems to happen in a lot of scenarios. Obviously my grammar is wrong, but there should be error output then, not a stuck loop.

use bio
use std
use "test.use"

pkg demo = const sayhello : (-> void)
        ;;

{
   "name":"arithmetic",
   "extras":[
      {
         "type":"PATTERN",
         "value":"\\s"
      }
   ],
   "rules":{
      "translation_unit":{
         "type":"REPEAT",
         "content":{
            "type":"CHOICE",
            "members":[
               {
                  "type":"SYMBOL",
                  "name":"use"
               },
               {
                  "type":"SYMBOL",
                  "name":"pkg"
               }
            ]
         }
      },
      "pkg":{
         "type":"SEQ",
         "members":[
            {
               "type":"STRING",
               "value":"pkg"
            },
            {
               "type":"SYMBOL",
               "name":"system_lib_string"
            },
            {
               "type":"STRING",
               "value":"="
            },
            {
               "type":"SYMBOL",
               "name":"string_literal"
            }
         ]
      },
      "use":{
         "type":"SEQ",
         "members":[
            {
               "type":"STRING",
               "value":"use"
            },
            {
               "type":"CHOICE",
               "members":[
                  {
                     "type":"SYMBOL",
                     "name":"string_literal"
                  },
                  {
                     "type":"SYMBOL",
                     "name":"system_lib_string"
                  }
               ]
            }
         ]
      },
      "string_literal":{
         "type":"TOKEN",
         "content":{
            "type":"SEQ",
            "members":[
               {
                  "type":"STRING",
                  "value":"\""
               },
               {
                  "type":"REPEAT",
                  "content":{
                     "type":"CHOICE",
                     "members":[
                        {
                           "type":"PATTERN",
                           "value":"[^\"]"
                        },
                        {
                           "type":"STRING",
                           "value":"\\\""
                        }
                     ]
                  }
               },
               {
                  "type":"STRING",
                  "value":"\""
               }
            ]
         }
      },
      "system_lib_string":{
         "type":"TOKEN",
         "content":{
            "type":"SEQ",
            "members":[
               {
                  "type":"REPEAT",
                  "content":{
                     "type":"PATTERN",
                     "value":"\\w+"
                  }
               }
            ]
         }
      }
   }
}
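Note that the grammar above defines `system_lib_string` as a `REPEAT` over `\w+`, and `REPEAT` permits zero repetitions, so that token can succeed without consuming any input. A minimal sketch (not tree-sitter's recovery code; `scan` is a hypothetical stand-in) of why a zero-width token hangs a recovery loop: the scanner never advances, so the same position is retried forever unless forward progress is enforced.

```python
import re

def scan(src, token_pattern):
    """Scan src with token_pattern, skipping unmatchable characters."""
    pos, tokens = 0, []
    while pos < len(src):
        m = re.match(token_pattern, src[pos:])
        if m and m.end() == 0:
            # Zero-width match: without this guard the loop would retry
            # the same offset forever, exactly the stuck-loop symptom above.
            raise ValueError("token matched the empty string at offset %d" % pos)
        if m:
            tokens.append(m.group())
            pos += m.end()
        else:
            pos += 1  # crude error recovery: skip one character
    return tokens
```

With a pattern like `\w+` the loop terminates; change it to `\w*` (analogous to the `REPEAT` above) and the guard trips on the first non-word character.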

\d (for example) in regexps is not escaped in the C parser

A production like /\d/ is not escaped in the symbol table, leading to compiler warnings:

../src/parser.c:136:133: warning: unknown escape sequence '\d' [-Wunknown-escape-sequence]
    [aux_sym_SLASH_LBRACK_BSLASHd_RBRACK_PLUS_LPAREN_BSLASH_DOT_LBRACK_BSLASHd_RBRACK_PLUS_RPAREN_LBRACE0_COMMA2_RBRACE_SLASH] = "/[\d]+(\.[\d]+){0,2}/",
                                                                                                                                    ^~
../src/parser.c:136:138: warning: unknown escape sequence '\.' [-Wunknown-escape-sequence]
    [aux_sym_SLASH_LBRACK_BSLASHd_RBRACK_PLUS_LPAREN_BSLASH_DOT_LBRACK_BSLASHd_RBRACK_PLUS_RPAREN_LBRACE0_COMMA2_RBRACE_SLASH] = "/[\d]+(\.[\d]+){0,2}/",
                                                                                                                                         ^~
../src/parser.c:136:141: warning: unknown escape sequence '\d' [-Wunknown-escape-sequence]
    [aux_sym_SLASH_LBRACK_BSLASHd_RBRACK_PLUS_LPAREN_BSLASH_DOT_LBRACK_BSLASHd_RBRACK_PLUS_RPAREN_LBRACE0_COMMA2_RBRACE_SLASH] = "/[\d]+(\.[\d]+){0,2}/",
                                                                                                                                            ^~

Presumably this won't match digits correctly either.
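A likely shape of the fix, sketched here in Python (`c_string_literal` is a hypothetical helper, not tree-sitter's generator code): when emitting a pattern into the generated C source, double each backslash so the C compiler reproduces the original bytes instead of interpreting `\d` as an unknown escape.

```python
def c_string_literal(text):
    # Double each backslash and escape embedded quotes so that the C
    # compiler reconstructs the original pattern byte-for-byte.
    return '"' + text.replace('\\', '\\\\').replace('"', '\\"') + '"'
```

For the pattern `\d`, this emits the C literal `"\\d"`, which compiles without warnings and preserves the digit class.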

Parsing HTML templates in JavaScript as HTML

Hello All!

Was super excited to download 1.25 and switch on the tree-sitter parser flag. So far so good! 👍 👍

I'm using a neat new library called lit-html to build HTML templates in JavaScript. A typical lit-html template might look like this:

const breadcrumbTemplate = ({name}) => html`
  <ul id="breadcrumb" aria-label="Breadcrumb">
    <li><a href="/">Home</a></li>
    <li aria-current="page">${name}</li>
  </ul>
`;

or like this (if using the lit-extended variety)

const restaurantView = (restaurant = {}) => html`
  <restaurant-view restaurant="${restaurant}" on-review-submitted="${event => updateReviews(event)}">
    ${restaurantId ? fetchReviews(restaurantId).then(reviewsList) : Promise.resolve([])}
  </restaurant-view>
`;

Maybe you can see right here in this issue how GitHub Flavored Markdown parses the contents of the html-tagged template string as HTML code. That's basically my request here:

How can I hook it up so that tree-sitter parses the html in the template as html, but still shows the js expressions inside the ${}s as JS?
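One rough starting point (a sketch only, unrelated to Atom's actual injection machinery) is locating the `${...}` spans, so the surrounding text could be handed to an HTML grammar while those spans go to the JavaScript grammar:

```python
import re

def interpolation_spans(template):
    # Return (start, end) index pairs of ${...} expressions inside a
    # template string. A sketch only: no handling of nested braces.
    return [(m.start(), m.end()) for m in re.finditer(r"\$\{[^}]*\}", template)]
```

Everything outside the returned spans would be highlighted as HTML; everything inside them as JavaScript.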

Here's how it looks on my screen right now (screenshot from 2018-03-18 omitted).

A follow-on from this specific (syntax highlighting) request would be to fold HTML within the template, since the current behavior folds the entire template (even if it's many paragraphs long) whenever I fold anywhere in the template string.

I'm happy to contribute, but would appreciate a little guidance on where to start looking, since I'm relatively unfamiliar with the atom codebase.

Thanks for considering this issue and keep up the great work!

Tests should give a helpful error message if grammar repos are not present

I tried running script/test -b, following what the Travis build does. I got a segmentation fault:

$ script/test -b
[... compilation output ...]
Random seed: 1509256178
Executed 117 tests.script/test: line 104: 27775 Segmentation fault      $cmd "${args[@]}"

real	0m0.008s
user	0m0.004s
sys	0m0.000s

Same thing when I then tried the tests binary directly:

$ out/Test/tests
Random seed: 1509256222
.....................................................................................................................Segmentation fault

Taking advantage of the handy valgrind support in the test script, I get these details:

$ script/test -g
[... compilation output ...]
==30044== Memcheck, a memory error detector
==30044== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==30044== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==30044== Command: out/Test/tests --reporter=singleline
==30044== 
Random seed: 1509256320
Executed 117 tests.==30044== Invalid read of size 4
==30044==    at 0x23A1C0: ts_document_set_language (document.c:43)
==30044==    by 0x1CE675: {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#3}::operator()() const (fuzzing-examples.cc:36)
==30044==    by 0x1CF12A: std::_Function_handler<void (), {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#3}>::_M_invoke(std::_Any_data const&) (functional:1731)
==30044==    by 0x11CCA7: std::function<void ()>::operator()() const (functional:2127)
==30044==    by 0x11C09A: bandit::it(char const*, std::function<void ()>, bandit::detail::listener&, std::deque<bandit::detail::context*, std::allocator<bandit::detail::context*> >&, bandit::adapters::assertion_adapter&, bandit::detail::run_policy&)::{lambda()#3}::operator()() const (grammar.h:126)
==30044==    by 0x11DC51: std::_Function_handler<void (), bandit::it(char const*, std::function<void ()>, bandit::detail::listener&, std::deque<bandit::detail::context*, std::allocator<bandit::detail::context*> >&, bandit::adapters::assertion_adapter&, bandit::detail::run_policy&)::{lambda()#3}>::_M_invoke(std::_Any_data const&) (functional:1731)
==30044==    by 0x11CCA7: std::function<void ()>::operator()() const (functional:2127)
==30044==    by 0x11B5F4: bandit::adapters::snowhouse_adapter::adapt_exceptions(std::function<void ()>) (snowhouse.h:12)
==30044==    by 0x11C29B: bandit::it(char const*, std::function<void ()>, bandit::detail::listener&, std::deque<bandit::detail::context*, std::allocator<bandit::detail::context*> >&, bandit::adapters::assertion_adapter&, bandit::detail::run_policy&) (grammar.h:128)
==30044==    by 0x11C8AB: bandit::it(char const*, std::function<void ()>) (grammar.h:179)
==30044==    by 0x1CE8E7: {lambda()#1}::operator()() const::{lambda()#1}::operator()() const (fuzzing-examples.cc:59)
==30044==    by 0x1CF249: std::_Function_handler<void (), {lambda()#1}::operator()() const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (functional:1731)
==30044==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==30044== 
==30044== 
==30044== Process terminating with default action of signal 11 (SIGSEGV)
==30044==  Access not within mapped region at address 0x0
==30044==    at 0x23A1C0: ts_document_set_language (document.c:43)
==30044==    by 0x1CE675: {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#3}::operator()() const (fuzzing-examples.cc:36)
==30044==    by 0x1CF12A: std::_Function_handler<void (), {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#3}>::_M_invoke(std::_Any_data const&) (functional:1731)
==30044==    by 0x11CCA7: std::function<void ()>::operator()() const (functional:2127)
==30044==    by 0x11C09A: bandit::it(char const*, std::function<void ()>, bandit::detail::listener&, std::deque<bandit::detail::context*, std::allocator<bandit::detail::context*> >&, bandit::adapters::assertion_adapter&, bandit::detail::run_policy&)::{lambda()#3}::operator()() const (grammar.h:126)
==30044==    by 0x11DC51: std::_Function_handler<void (), bandit::it(char const*, std::function<void ()>, bandit::detail::listener&, std::deque<bandit::detail::context*, std::allocator<bandit::detail::context*> >&, bandit::adapters::assertion_adapter&, bandit::detail::run_policy&)::{lambda()#3}>::_M_invoke(std::_Any_data const&) (functional:1731)
==30044==    by 0x11CCA7: std::function<void ()>::operator()() const (functional:2127)
==30044==    by 0x11B5F4: bandit::adapters::snowhouse_adapter::adapt_exceptions(std::function<void ()>) (snowhouse.h:12)
==30044==    by 0x11C29B: bandit::it(char const*, std::function<void ()>, bandit::detail::listener&, std::deque<bandit::detail::context*, std::allocator<bandit::detail::context*> >&, bandit::adapters::assertion_adapter&, bandit::detail::run_policy&) (grammar.h:128)
==30044==    by 0x11C8AB: bandit::it(char const*, std::function<void ()>) (grammar.h:179)
==30044==    by 0x1CE8E7: {lambda()#1}::operator()() const::{lambda()#1}::operator()() const (fuzzing-examples.cc:59)
==30044==    by 0x1CF249: std::_Function_handler<void (), {lambda()#1}::operator()() const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (functional:1731)
==30044==  If you believe this happened as a result of a stack
==30044==  overflow in your program's main thread (unlikely but
==30044==  possible), you can try to increase the size of the
==30044==  main thread stack using the --main-stacksize= flag.
==30044==  The main thread stack size used in this run was 8388608.
==30044== 
==30044== HEAP SUMMARY:
==30044==     in use at exit: 11,019 bytes in 93 blocks
==30044==   total heap usage: 12,319 allocs, 12,226 frees, 944,670 bytes allocated
==30044== 
==30044== For a detailed leak analysis, rerun with: --leak-check=full
==30044== 
==30044== For counts of detected and suppressed errors, rerun with: -v
==30044== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

From a quick look at the stack trace, we attempted a read from 0; and it looks like we're in the middle of some fuzzing. So probably the fuzzing has done its job. :-)

This is 100% reproducible for me so far (4/4) -- happy to provide whatever further details would be useful for debugging. I'm on Debian 9.1 stretch (aka stable), on x86_64, if that helps you reproduce it yourself.
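A minimal sketch of the requested behavior (the function name and message wording are hypothetical, not the test script's actual code): check for the grammar fixture directories up front and exit with a readable message instead of letting the tests dereference a null language and segfault.

```python
import os
import sys

def require_grammar_fixtures(paths):
    # Fail fast with a clear error message instead of a segmentation fault
    # when one of the grammar repos has not been cloned.
    missing = [p for p in paths if not os.path.isdir(p)]
    if missing:
        sys.exit("missing grammar fixtures: %s (clone the grammar repos first)"
                 % ", ".join(missing))
```

Run before the test binary, this turns the mysterious crash above into an actionable one-line message.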

Can't use slashes in test names

While working on tree-sitter-ruby I tried naming a test if/elsif/else and later noticed that it wasn't being run at all. It looks like the runner picked up later tests in the file alright, though.

Incompatible language version

I'm getting errors like this when loading the grammar.

RangeError: Incompatible language version. Expected 5. Got 6

After some digging, I found that there's a line generated in src/parser.c:

#define LANGUAGE_VERSION 6

I was able to remedy the problem by overwriting the LANGUAGE_VERSION from 6 to 5 by hand every time I rebuild the package.

However, the version has bumped to 7 at the time of writing.
https://github.com/tree-sitter/tree-sitter/blame/84b15d2c782331352952a3e67adbb815c99b9d11/include/tree_sitter/runtime.h#L12

Should I be pinning a specific version of tree-sitter-cli (and hence tree-sitter) when developing my tree-sitter grammar to avoid this conflict?

The complete stack:

/Applications/Atom.app/Contents/Resources/app.asar/src/package.js:587 Failed to load grammar: /Users/banacorn/.atom/dev/packages/language-agda/grammars/tree-sitter-agda.cson RangeError: Incompatible language version. Expected 5. Got 6
    at new TreeSitterLanguageMode (/Applications/Atom.app/Contents/Resources/app.asar/src/tree-sitter-language-mode.js:18:19)
    at GrammarRegistry.languageModeForGrammarAndBuffer (/Applications/Atom.app/Contents/Resources/app.asar/src/grammar-registry.js:165:14)
    at grammarScoresByBuffer.forEach (/Applications/Atom.app/Contents/Resources/app.asar/src/grammar-registry.js:354:39)
    at Map.forEach (native)
    at GrammarRegistry.grammarAddedOrUpdated (/Applications/Atom.app/Contents/Resources/app.asar/src/grammar-registry.js:336:32)
    at GrammarRegistry.addGrammar (/Applications/Atom.app/Contents/Resources/app.asar/src/grammar-registry.js:406:12)
    at TreeSitterGrammar.activate (/Applications/Atom.app/Contents/Resources/app.asar/src/tree-sitter-grammar.js:66:39)
    at Package.loadGrammarsSync (/Applications/Atom.app/Contents/Resources/app.asar/src/package.js:585:17)
    at Workspace.deserialize (/Applications/Atom.app/Contents/Resources/app.asar/src/workspace.js:359:13)
    at AtomEnvironment.deserialize (/Applications/Atom.app/Contents/Resources/app.asar/src/atom-environment.js:1241:41)
    at <anonymous>

The version of Atom I'm using:

Atom    : 1.25.0
Electron: 1.7.11
Chrome  : 58.0.3029.110
Node    : 7.9.0
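Until the version mismatch is handled automatically, one workaround sketch is to pin the exact tree-sitter-cli release that generates a parser with the expected LANGUAGE_VERSION in the grammar's package.json (the version number below is only a placeholder):

```json
{
  "dependencies": {
    "tree-sitter-cli": "0.10.0"
  }
}
```

An exact version (no `^` or `~` range) prevents a routine `npm install` from silently pulling in a CLI that emits a newer, incompatible LANGUAGE_VERSION.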

Memory leak in `parser__do_all_potential_reductions`

I've only seen this leak trigger with the tree-sitter-ruby grammar but it seems like it is caused by libruntime.a. This can be reproduced with the instructions in 943dfba:

    /home/philipturnbull/src/tree-sitter/out/ruby_fuzzer: Running 1 inputs 1 time(s) each.
    Running: leak-1

    =================================================================
    ==4380==ERROR: LeakSanitizer: detected memory leaks

    Direct leak of 232 byte(s) in 1 object(s) allocated from:
        0 0x4c1623 in __interceptor_malloc /b/build/slave/linux_upload_clang/build/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cc:88:3
        1 0xb0af9e in ts_malloc (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xb0af9e)
        2 0xadb073 in stack_node_new (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xadb073)
        3 0xae71da in ts_stack_push (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xae71da)
        4 0xa613a3 in parser__reduce (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa613a3)
        5 0xa901af in parser__do_all_potential_reductions (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa901af)
        6 0xa6dc3d in parser__handle_error (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa6dc3d)
        7 0xa4e217 in parser__advance (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa4e217)
        8 0xa457a0 in parser_parse (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa457a0)
        9 0xa1461c in ts_document_parse_with_options (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa1461c)
        10 0x4f135a in LLVMFuzzerTestOneInput (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0x4f135a)
        11 0xb42d72 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /home/philipturnbull/src/compiler-rt/lib/fuzzer/./FuzzerLoop.cpp:517:13
        12 0xb35f4a in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /home/philipturnbull/src/compiler-rt/lib/fuzzer/./FuzzerDriver.cpp:280:3
        13 0xb3a708 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /home/philipturnbull/src/compiler-rt/lib/fuzzer/./FuzzerDriver.cpp:703:9
        14 0xb35ca0 in main /home/philipturnbull/src/compiler-rt/lib/fuzzer/./FuzzerMain.cpp:20:10
        15 0x7f2a19d8782f in __libc_start_main /build/glibc-bfm8X4/glibc-2.23/csu/../csu/libc-start.c:291

    ...

    SUMMARY: AddressSanitizer: 2760 byte(s) leaked in 20 allocation(s).

Should tree-sitter-cli really be added as one of the dependencies of a parser?

Since Atom 1.25.0 has been released a few hours ago, I started learning tree-sitter and read the documentation. In the Installing the tools section, I found the following statement.

Add tree-sitter-cli to the dependencies section of package.json

In my understanding, however, tree-sitter-cli serves as a kind of build tool and is not of much use to end users (e.g., authors of Atom's language packages). Therefore, I suppose it should be listed under devDependencies, just like the libraries in this organization do.

I could prepare a PR regarding this point, but before that I want to know if I'm right or not.
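For reference, the proposed change would look like this in a grammar's package.json (a sketch; the version range is a placeholder):

```json
{
  "devDependencies": {
    "tree-sitter-cli": "^0.10.0"
  }
}
```

With the CLI under devDependencies, `npm install` in a consuming package skips it, while grammar authors still get it when installing inside the grammar repo itself.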

S-expression whitespace syntax

In the docs, it says

The exact placement of whitespace in the S-expression doesn't matter, but ideally the syntax tree should be legible.

In my tests, though, a parse producing the tree (program (text_mode)) succeeds when the following expected tree is used

(program
  (text_mode))

but not when this is used

(program
  (text_mode)
)

Is this by design, or a bug? I assumed "placement of whitespace ... doesn't matter" applied to line breaks as well, so I was confused when it told me the tests failed.
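If line breaks are really meant to be insignificant, the expected trees could be compared token-by-token rather than textually; a minimal sketch of that comparison (not the test runner's actual code):

```python
import re

def sexp_tokens(text):
    # Split an S-expression into parens and atoms; all whitespace,
    # including a newline before a closing paren, is insignificant.
    return re.findall(r"[()]|[^\s()]+", text)
```

Under this comparison, both expected trees above tokenize identically, so either layout would pass.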

clang "uninitialized" warnings on `error_start_position` etc. in `parser__lex`

When building tree-sitter, I get a 75-line block of clang warnings that look like this:

In file included from ../vendor/tree-sitter/src/runtime/length.h:6:0,
                 from ../vendor/tree-sitter/src/runtime/tree.h:11,
                 from ../vendor/tree-sitter/src/runtime/stack.h:9,
                 from ../vendor/tree-sitter/src/runtime/parser.h:8,
                 from ../vendor/tree-sitter/src/runtime/parser.c:1:
../vendor/tree-sitter/src/runtime/parser.c: In function 'parser__get_lookahead':
../vendor/tree-sitter/src/runtime/point.h:22:12: warning: 'error_end_position.extent.column' may be used uninitialized in this function [-Wmaybe-uninitialized]
     return point__new(0, a.column - b.column);
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../vendor/tree-sitter/src/runtime/parser.c:322:32: note: 'error_end_position.extent.column' was declared here
   Length error_start_position, error_end_position;
                                ^~~~~~~~~~~~~~~~~~
[...]

where the rest are the same thing for different fields of error_end_position and its friend error_start_position.

Looking closer, it takes only a minute to see in the code that these definitely (unless I'm missing something clang is seeing!) can't actually be used uninitialized. The only place these fields can be used is here:

  if (skipped_error) {
    Length padding = length_sub(error_start_position, start_position);
    Length size = length_sub(error_end_position, error_start_position);
    [...]

and the only place skipped_error gets set to true is immediately followed by initializing these variables.

But this sure is some spew from the compiler -- which makes the build look less solid than it is, and also if clang were to start warning about something else, this could mask that. So it'd be great to work out how to satisfy the uninitialized-checker, or to tell it to ignore this particular spot.

Fortran grammar for tree sitter

Hi, I took notice of the pull request to use this type of parser for Atom syntax highlighting. Fortran is a language that would greatly benefit from this type of parsing, for many of the same reasons as other lower-level languages like C and C++. I am not experienced with parsing a language based on a CFG, but I have been doing a lot of reading and digging into the existing code base to get a handle on things.

I have setup a repository to start working on the grammar, modeling it after the ones already under this group (at the time of this post only the basic dot files and the like are in place).

I was going to base it on the Waite/Cordy grammar provided in the Grammar Zoo, since it seemed the easiest to work with. I noticed the C grammar was based on content from the same website, so I thought it would be a good starting point. I'll cross-reference this with the syntax highlighting grammar defined in the language-fortran package to reduce the odds of missing anything from newer standards. The main thing I am not sure how to handle is the difference between free-form and fixed-form Fortran.

If you have any tips beyond the example in the README on how best to proceed with this process they would be greatly appreciated. Or if there is somewhere better to pull an existing grammar from I will gladly use as the starting point instead.

Cheers!
Matt

Memory leak when rebalancing repetition trees

Fuzzing with AddressSanitizer uncovered a memory leak which can be triggered by the Bash, C, and Python grammars:

=================================================================
==15==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x4dd168 in malloc /src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x568d9b in ts_malloc /src/octofuzz/src/runtime/alloc.h:49:18
    #2 0x56aa4b in ts_tree_pool_allocate /src/octofuzz/src/runtime/tree.c:151:12
    #3 0x56abf8 in ts_tree_make_leaf /src/octofuzz/src/runtime/tree.c:167:18
    #4 0x56dd04 in ts_tree_make_node /src/octofuzz/src/runtime/tree.c:378:18
    #5 0x55fbd0 in parser__reduce /src/octofuzz/src/runtime/parser.c:665:20
    #6 0x55bf87 in parser__advance /src/octofuzz/src/runtime/parser.c:1123:38
    #7 0x55adca in parser_parse /src/octofuzz/src/runtime/parser.c:1234:9
    #8 0x55352f in ts_document_parse_with_options /src/octofuzz/src/runtime/document.c:137:16
    #9 0x5182eb in LLVMFuzzerTestOneInput /src/octofuzz/test/fuzz/fuzzer.cc:21:3
    #10 0x5a470e in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/libfuzzer/FuzzerLoop.cpp:463:13
    #11 0x584145 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/libfuzzer/FuzzerDriver.cpp:273:6
    #12 0x58f3df in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/libfuzzer/FuzzerDriver.cpp:689:9
    #13 0x5837e8 in main /src/libfuzzer/FuzzerMain.cpp:20:10
    #14 0x7f184626682f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
Indirect leak of 512 byte(s) in 4 object(s) allocated from:
    #0 0x4dd168 in malloc /src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x568d9b in ts_malloc /src/octofuzz/src/runtime/alloc.h:49:18
    #2 0x56aa4b in ts_tree_pool_allocate /src/octofuzz/src/runtime/tree.c:151:12
    #3 0x56abf8 in ts_tree_make_leaf /src/octofuzz/src/runtime/tree.c:167:18
    #4 0x564823 in parser__lex /src/octofuzz/src/runtime/parser.c:427:14
    #5 0x55e701 in parser__get_lookahead /src/octofuzz/src/runtime/parser.c:556:12
    #6 0x55bc07 in parser__advance /src/octofuzz/src/runtime/parser.c:1090:21
    #7 0x55adca in parser_parse /src/octofuzz/src/runtime/parser.c:1234:9
    #8 0x55352f in ts_document_parse_with_options /src/octofuzz/src/runtime/document.c:137:16
    #9 0x5182eb in LLVMFuzzerTestOneInput /src/octofuzz/test/fuzz/fuzzer.cc:21:3
    #10 0x5a470e in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/libfuzzer/FuzzerLoop.cpp:463:13
    #11 0x584145 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/libfuzzer/FuzzerDriver.cpp:273:6
    #12 0x58f3df in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/libfuzzer/FuzzerDriver.cpp:689:9
    #13 0x5837e8 in main /src/libfuzzer/FuzzerMain.cpp:20:10
    #14 0x7f184626682f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Indirect leak of 384 byte(s) in 3 object(s) allocated from:
    #0 0x4dd168 in malloc /src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x568d9b in ts_malloc /src/octofuzz/src/runtime/alloc.h:49:18
    #2 0x56aa4b in ts_tree_pool_allocate /src/octofuzz/src/runtime/tree.c:151:12
    #3 0x56abf8 in ts_tree_make_leaf /src/octofuzz/src/runtime/tree.c:167:18
    #4 0x56dd04 in ts_tree_make_node /src/octofuzz/src/runtime/tree.c:378:18
    #5 0x55fbd0 in parser__reduce /src/octofuzz/src/runtime/parser.c:665:20
    #6 0x567fa1 in parser__do_all_potential_reductions /src/octofuzz/src/runtime/parser.c:837:7
    #7 0x562a6f in parser__handle_error /src/octofuzz/src/runtime/parser.c:915:15
    #8 0x55c2c0 in parser__advance /src/octofuzz/src/runtime/parser.c:1162:7
    #9 0x55adca in parser_parse /src/octofuzz/src/runtime/parser.c:1234:9
    #10 0x55352f in ts_document_parse_with_options /src/octofuzz/src/runtime/document.c:137:16
    #11 0x5182eb in LLVMFuzzerTestOneInput /src/octofuzz/test/fuzz/fuzzer.cc:21:3
    #12 0x5a470e in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/libfuzzer/FuzzerLoop.cpp:463:13
    #13 0x584145 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/libfuzzer/FuzzerDriver.cpp:273:6
    #14 0x58f3df in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/libfuzzer/FuzzerDriver.cpp:689:9
    #15 0x5837e8 in main /src/libfuzzer/FuzzerMain.cpp:20:10
    #16 0x7f184626682f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Indirect leak of 192 byte(s) in 3 object(s) allocated from:
    #0 0x4dd380 in calloc /src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:97
    #1 0x5737d4 in ts_calloc /src/octofuzz/src/runtime/alloc.h:58:18
    #2 0x573934 in array__grow /src/octofuzz/src/runtime/array.h:93:24
    #3 0x57732b in stack__iter /src/octofuzz/src/runtime/stack.c:299:13
    #4 0x57732b in ts_stack_pop_count /src/octofuzz/src/runtime/stack.c:427
    #5 0x55f96c in parser__reduce /src/octofuzz/src/runtime/parser.c:652:24
    #6 0x567fa1 in parser__do_all_potential_reductions /src/octofuzz/src/runtime/parser.c:837:7
    #7 0x562a6f in parser__handle_error /src/octofuzz/src/runtime/parser.c:915:15
    #8 0x55c2c0 in parser__advance /src/octofuzz/src/runtime/parser.c:1162:7
    #9 0x55adca in parser_parse /src/octofuzz/src/runtime/parser.c:1234:9
    #10 0x55352f in ts_document_parse_with_options /src/octofuzz/src/runtime/document.c:137:16
    #11 0x5182eb in LLVMFuzzerTestOneInput /src/octofuzz/test/fuzz/fuzzer.cc:21:3
    #12 0x5a470e in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/libfuzzer/FuzzerLoop.cpp:463:13
    #13 0x584145 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/libfuzzer/FuzzerDriver.cpp:273:6
    #14 0x58f3df in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/libfuzzer/FuzzerDriver.cpp:689:9
    #15 0x5837e8 in main /src/libfuzzer/FuzzerMain.cpp:20:10
    #16 0x7f184626682f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Indirect leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x4dd168 in malloc /src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x568d9b in ts_malloc /src/octofuzz/src/runtime/alloc.h:49:18
    #2 0x56aa4b in ts_tree_pool_allocate /src/octofuzz/src/runtime/tree.c:151:12
    #3 0x56abf8 in ts_tree_make_leaf /src/octofuzz/src/runtime/tree.c:167:18
    #4 0x56e335 in ts_tree_make_missing_leaf /src/octofuzz/src/runtime/tree.c:408:18
    #5 0x562a1f in parser__handle_error /src/octofuzz/src/runtime/parser.c:907:32
    #6 0x55c2c0 in parser__advance /src/octofuzz/src/runtime/parser.c:1162:7
    #7 0x55adca in parser_parse /src/octofuzz/src/runtime/parser.c:1234:9
    #8 0x55352f in ts_document_parse_with_options /src/octofuzz/src/runtime/document.c:137:16
    #9 0x5182eb in LLVMFuzzerTestOneInput /src/octofuzz/test/fuzz/fuzzer.cc:21:3
    #10 0x5a470e in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/libfuzzer/FuzzerLoop.cpp:463:13
    #11 0x584145 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/libfuzzer/FuzzerDriver.cpp:273:6
    #12 0x58f3df in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/libfuzzer/FuzzerDriver.cpp:689:9
    #13 0x5837e8 in main /src/libfuzzer/FuzzerMain.cpp:20:10
    #14 0x7f184626682f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Indirect leak of 128 byte(s) in 2 object(s) allocated from:
    #0 0x4dd380 in calloc /src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:97
    #1 0x5737d4 in ts_calloc /src/octofuzz/src/runtime/alloc.h:58:18
    #2 0x573934 in array__grow /src/octofuzz/src/runtime/array.h:93:24
    #3 0x57732b in stack__iter /src/octofuzz/src/runtime/stack.c:299:13
    #4 0x57732b in ts_stack_pop_count /src/octofuzz/src/runtime/stack.c:427
    #5 0x55f96c in parser__reduce /src/octofuzz/src/runtime/parser.c:652:24
    #6 0x55bf87 in parser__advance /src/octofuzz/src/runtime/parser.c:1123:38
    #7 0x55adca in parser_parse /src/octofuzz/src/runtime/parser.c:1234:9
    #8 0x55352f in ts_document_parse_with_options /src/octofuzz/src/runtime/document.c:137:16
    #9 0x5182eb in LLVMFuzzerTestOneInput /src/octofuzz/test/fuzz/fuzzer.cc:21:3
    #10 0x5a470e in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/libfuzzer/FuzzerLoop.cpp:463:13
    #11 0x584145 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/libfuzzer/FuzzerDriver.cpp:273:6
    #12 0x58f3df in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/libfuzzer/FuzzerDriver.cpp:689:9
    #13 0x5837e8 in main /src/libfuzzer/FuzzerMain.cpp:20:10
    #14 0x7f184626682f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Indirect leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x4dd168 in malloc /src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x568d9b in ts_malloc /src/octofuzz/src/runtime/alloc.h:49:18
    #2 0x56aa4b in ts_tree_pool_allocate /src/octofuzz/src/runtime/tree.c:151:12
    #3 0x56abf8 in ts_tree_make_leaf /src/octofuzz/src/runtime/tree.c:167:18
    #4 0x56dd04 in ts_tree_make_node /src/octofuzz/src/runtime/tree.c:378:18
    #5 0x55fbd0 in parser__reduce /src/octofuzz/src/runtime/parser.c:665:20
    #6 0x55bf87 in parser__advance /src/octofuzz/src/runtime/parser.c:1123:38
    #7 0x55adca in parser_parse /src/octofuzz/src/runtime/parser.c:1234:9
    #8 0x55352f in ts_document_parse_with_options /src/octofuzz/src/runtime/document.c:137:16
    #9 0x5182eb in LLVMFuzzerTestOneInput /src/octofuzz/test/fuzz/fuzzer.cc:21:3
    #10 0x5a470e in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/libfuzzer/FuzzerLoop.cpp:463:13
    #11 0x584145 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/libfuzzer/FuzzerDriver.cpp:273:6
    #12 0x58f3df in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/libfuzzer/FuzzerDriver.cpp:689:9
    #13 0x5837e8 in main /src/libfuzzer/FuzzerMain.cpp:20:10
    #14 0x7f184626682f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Indirect leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x4dd168 in malloc /src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x568d9b in ts_malloc /src/octofuzz/src/runtime/alloc.h:49:18
    #2 0x56aa4b in ts_tree_pool_allocate /src/octofuzz/src/runtime/tree.c:151:12
    #3 0x56abf8 in ts_tree_make_leaf /src/octofuzz/src/runtime/tree.c:167:18
    #4 0x56dd04 in ts_tree_make_node /src/octofuzz/src/runtime/tree.c:378:18
    #5 0x56e098 in ts_tree_make_error_node /src/octofuzz/src/runtime/tree.c:396:18
    #6 0x567112 in parser__recover_to_state /src/octofuzz/src/runtime/parser.c:998:21
    #7 0x561133 in parser__recover /src/octofuzz/src/runtime/parser.c:1039:11
    #8 0x55c6c6 in parser__advance /src/octofuzz/src/runtime/parser.c:1144:11
    #9 0x55adca in parser_parse /src/octofuzz/src/runtime/parser.c:1234:9
    #10 0x55352f in ts_document_parse_with_options /src/octofuzz/src/runtime/document.c:137:16
    #11 0x5182eb in LLVMFuzzerTestOneInput /src/octofuzz/test/fuzz/fuzzer.cc:21:3
    #12 0x5a470e in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/libfuzzer/FuzzerLoop.cpp:463:13
    #13 0x584145 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/libfuzzer/FuzzerDriver.cpp:273:6
    #14 0x58f3df in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/libfuzzer/FuzzerDriver.cpp:689:9
    #15 0x5837e8 in main /src/libfuzzer/FuzzerMain.cpp:20:10
    #16 0x7f184626682f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

Indirect leak of 128 byte(s) in 2 object(s) allocated from:
    #0 0x4dd380 in calloc /src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:97
    #1 0x5737d4 in ts_calloc /src/octofuzz/src/runtime/alloc.h:58:18
    #2 0x573934 in array__grow /src/octofuzz/src/runtime/array.h:93:24
    #3 0x57732b in stack__iter /src/octofuzz/src/runtime/stack.c:299:13
    #4 0x57732b in ts_stack_pop_count /src/octofuzz/src/runtime/stack.c:427
    #5 0x55f96c in parser__reduce /src/octofuzz/src/runtime/parser.c:652:24
    #6 0x567fa1 in parser__do_all_potential_reductions /src/octofuzz/src/runtime/parser.c:837:7
    #7 0x562808 in parser__handle_error /src/octofuzz/src/runtime/parser.c:884:3
    #8 0x55c2c0 in parser__advance /src/octofuzz/src/runtime/parser.c:1162:7
    #9 0x55adca in parser_parse /src/octofuzz/src/runtime/parser.c:1234:9
    #10 0x55352f in ts_document_parse_with_options /src/octofuzz/src/runtime/document.c:137:16
    #11 0x5182eb in LLVMFuzzerTestOneInput /src/octofuzz/test/fuzz/fuzzer.cc:21:3
    #12 0x5a470e in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/libfuzzer/FuzzerLoop.cpp:463:13
    #13 0x584145 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /src/libfuzzer/FuzzerDriver.cpp:273:6
    #14 0x58f3df in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/libfuzzer/FuzzerDriver.cpp:689:9
    #15 0x5837e8 in main /src/libfuzzer/FuzzerMain.cpp:20:10
    #16 0x7f184626682f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

SUMMARY: AddressSanitizer: 1856 byte(s) leaked in 18 allocation(s).

Testcases which trigger the leak for bash (hexdumped):

00000000  3b 2a 3b 74 3b 2a 3b 74  3b 79 3b 2a 2a 3b 64     |;*;t;*;t;y;**;d|

and

00000000  26 3c 3c 3c 2d 3b 8f 26  3b 01 26 1a 0a 0a 3b     |&<<<-;.&;.&...;|

Bisecting the recent history, the leak seems to have been introduced in 134c455 as part of #128

Question about files that contain multiple languages

Hey there!

I am late to the party; however, I like what you're doing here @maxbrunsfeld 👍.

A question that crossed my mind is how tree-sitter could be useful for files that contain multiple languages. A typical example could be a foo.html file that contains HTML, JavaScript, and CSS; or a bar.js file that contains JavaScript mixed with JSX.

I really like the idea of incremental parsing, and I also realize that maybe “language detection” isn't a concern here; but I wonder how, for instance, GitHub would handle syntax highlighting using tree-sitter and detect when to “switch” to a different grammar for various parts of a file.

Any comments?

Thank you. 🙏

Precedence works in mysterious ways

Refs tree-sitter/tree-sitter-ruby#16 (comment)

  1. The static conflict resolution algorithm could use some refinement. The code that checks for expected conflicts and displays the unexpected conflicts to the user actually does a better job of analyzing the conflict than the resolution algorithm.
  2. Currently, since precedence only affects how tightly adjacent elements of a sequence bind to each other, precedence has no effect on singleton productions like lhs: $ => $.variable. I explicitly made it this way, but it turns out to be dumb.

Creating an alias that matches an existing rule name creates a duplicate symbol

It's common to alias nodes such that they take on the name of another existing rule:

rule_a: $ => /[a-z]+/,

rule_b: $ => seq(
  $.rule_a,
  alias($._another_rule, $.rule_a),
)

When you do this, the aliased node (internally _another_rule, appearing as rule_a) will have a different value for ts_node_symbol than the other rule_a node, even though they both have the same value for ts_node_type.

Hang when parsing some Go code with an error

Atom is hanging when editing this code:

package http

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "strings"

    "github.com/pivotalservices/ignition/cloudfoundry"
    "github.com/pivotalservices/ignition/http/session"
    "github.com/pivotalservices/ignition/user"
)

func userFromContext(ctx context.Context) (userID string, accountName string, err) {
    profile, err := user.ProfileFromContext(req.Context())
    if err != nil {profile, e
        rr := user.ProfileFromContext(req.Context())
    if err != nil {
        
    }
}

func organizationHandler(appsURL string, orgPrefix string, quotaID string, q cloudfoundry.OrganizationQuerier) http.Handler {
    fn := func(w http.ResponseWriter, req *http.Request) {
        w.Header().Set("Content-Type", "application/json")
        profile, err := user.ProfileFromContext(req.Context())
        if err != nil {
            log.Println(err)
            w.WriteHeader(http.StatusNotFound)
            return
        }
        userID, err := session.UserIDFromContext(req.Context())
        if err != nil || strings.TrimSpace(userID) == "" {
            log.Println(err)
            w.WriteHeader(http.StatusNotFound)
            return
        }
        o, err := cloudfoundry.OrgsForUserID(userID, appsURL, q)
        if err != nil {
            log.Println(err)
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }

        if len(o) == 0 {
            w.WriteHeader(http.StatusNotFound)
            return
        }

        expected := orgName(orgPrefix, profile.AccountName)
        var quotaMatches []cloudfoundry.Organization
        for i := range o {
            if strings.EqualFold(quotaID, o[i].QuotaDefinitionGUID) {
                quotaMatches = append(quotaMatches, o[i])
            }
            if strings.EqualFold(expected, o[i].Name) {
                w.WriteHeader(http.StatusOK)
                json.NewEncoder(w).Encode(o[i])
                return
            }
        }

        if len(quotaMatches) == 0 {
            w.WriteHeader(http.StatusNotFound)
            return
        }

        w.WriteHeader(http.StatusOK)
        json.NewEncoder(w).Encode(quotaMatches[0])
    }
    return http.HandlerFunc(fn)
}

func orgName(orgPrefix string, accountName string) string {
    orgPrefix = strings.ToLower(orgPrefix)
    accountName = strings.ToLower(accountName)
    if strings.Contains(accountName, "@") {
        components := strings.Split(accountName, "@")
        return fmt.Sprintf("%s-%s", orgPrefix, components[0])
    }

    if strings.Contains(accountName, "\\") {
        components := strings.Split(accountName, "\\")
        return fmt.Sprintf("%s-%s", orgPrefix, components[1])
    }

    return fmt.Sprintf("%s-%s", orgPrefix, accountName)
}

Reported by @joefitzgerald

Provide callbacks for node construction, skipping ts_node APIs

It would be convenient for some purposes to be able to provide one or more callbacks which construct a parse tree, rather than waiting for TSNodes to be constructed and then mapping them into some other parse tree.

This would allow more immediate results, plus lower resource consumption, at the cost of losing the editing features etc. of the ts_node_* APIs.

Expressing permutations

Tree-sitter currently thinks that

const const const const const const int a = 0;

is valid C, and that

urbuuuubr"Hello, world!"

is a Python string literal. The root cause for both issues is the difficulty of expressing a permutation of rules, which is why the rule matching a string literal in the Python grammar starts with

string: $ => token(seq(
  repeat(choice('u', 'r', 'b')),
  ...

Qualifiers, access modifiers and prefixes are tokens that programming languages commonly allow to occur in any order but without repetition. It would be nice to be able to express such constructs concisely in tree-sitter grammars, perhaps as

string: $ => token(seq(
  permutation(optional('u'), optional('r'), optional('b')),
  ...

This is something most parser generators struggle with, owing to a theoretical limitation of CFGs that requires combinatorial expansion to describe permutations. However, tree-sitter already supports (non-CFG) externally parsed rules and I wonder if it might not be possible to track the required state directly in the parser library to support PERMUTATION as a new rule primitive.

If that is not feasible, I suggest the addition of a permutation rule to the DSL that simply expands

permutation(a, b, c)

into

choice(
  seq(a, b, c),
  seq(a, c, b),
  seq(b, a, c),
  seq(c, a, b),
  seq(b, c, a),
  seq(c, b, a)
)

When the number of arguments is fewer than five, the size of the expanded rule tree will still be acceptable in most cases, and at the level of the DSL the expression is both compact and correct.
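If it were added to the DSL, the expansion could be sketched as a plain function over rule objects. Here `choice` and `seq` are minimal stand-ins for the grammar DSL functions of the same names, and `permutation` and `orderings` are hypothetical helpers written for this sketch:

```javascript
// Minimal stand-ins for the tree-sitter DSL's `choice` and `seq`.
function choice(...members) { return {type: 'CHOICE', members}; }
function seq(...members) { return {type: 'SEQ', members}; }

// All orderings of `items`, built recursively: pick each element as the
// head and permute the rest.
function orderings(items) {
  if (items.length <= 1) return [items];
  return items.flatMap((item, i) =>
    orderings([...items.slice(0, i), ...items.slice(i + 1)])
      .map(rest => [item, ...rest])
  );
}

// The proposed rule: permutation(a, b, c) expands to a choice over
// seq(a, b, c), seq(a, c, b), and so on.
function permutation(...rules) {
  return choice(...orderings(rules).map(o => seq(...o)));
}
```

Note that the expansion is factorial in the number of arguments: three rules yield 6 alternatives, four yield 24.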


This issue was found using an experimental fuzzer called tree-saw. Permutation rules being expressed as repeats is one of the most common reasons why programs generated from tree-sitter grammars are syntactically invalid.

tree-saw finds many more similar and unrelated issues in all current grammars, but I won't spam you with those yet, because I don't know whether "type II" correctness is even a goal you are interested in. (Personally, I think it would be fantastic if tree-sitter could reliably indicate syntax errors while typing, or if tree-sitter grammars could double as compiler fuzzers, augmenting the incomplete afl dictionaries that are common now.)

Improve randomized test coverage

  • In a given test, perform a sequence of edits instead of just one.
  • After each edit, compare the incrementally-computed tree to a freshly-computed one.
  • Reseed the random number generator based on the current time for each test.
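The first two bullets could be sketched as a loop like the following; every name here (`makeRandomEdit`, `parseFresh`, `parseIncremental`, the document shape) is a hypothetical stand-in for the real test harness, not an existing API:

```javascript
// Apply a sequence of random edits, and after each one check that the
// incrementally-updated tree matches a tree parsed from scratch.
function runRandomizedEditTest(document, makeRandomEdit, parseFresh,
                               parseIncremental, seed, editCount = 10) {
  let tree = parseFresh(document.text);
  for (let i = 0; i < editCount; i++) {
    const edit = makeRandomEdit(document, seed + i);
    document.apply(edit);
    tree = parseIncremental(tree, edit, document.text);
    const fresh = parseFresh(document.text);
    // Structural comparison via serialization is crude but sufficient
    // for a sketch; a real harness would compare trees node by node.
    if (JSON.stringify(tree) !== JSON.stringify(fresh)) {
      throw new Error(`Trees diverged after edit ${i}`);
    }
  }
}
```

Reseeding from the current time (the third bullet) would happen in the harness that chooses `seed` before calling this.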

OOM in `parser__do_all_potential_reductions`

I'm not sure if this is related to #132. It can be reproduced with 4212ea5. Again, I've only seen this OOM using the tree-sitter-ruby grammar:

    /home/philipturnbull/src/tree-sitter/out/ruby_fuzzer: Running 1 inputs 1 time(s) each.
    Running: ./oom-1
    ==4257== ERROR: libFuzzer: out-of-memory (used: 1051Mb; limit: 1024Mb)
       To change the out-of-memory limit use -rss_limit_mb=<N>

    Live Heap Allocations: 1086637843 bytes in 13638 chunks; quarantined: 669298 bytes in 104 chunks; 7256 other chunks; total chunks: 20998; showing top 95% (at most 8 unique contexts)
    1055741504 byte(s) (97%) in 13495 allocation(s)
        0 0x4c180a in calloc /b/build/slave/linux_upload_clang/build/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cc:97:3
        1 0xa952fe in ts_calloc (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa952fe)
        2 0xa9492e in ts_tree_array_copy (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa9492e)
        3 0xaf1479 in ts_stack_pop_count (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xaf1479)
        4 0xa5edbe in parser__reduce (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa5edbe)
        5 0xa901af in parser__do_all_potential_reductions (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa901af)
        6 0xa6dc3d in parser__handle_error (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa6dc3d)
        7 0xa4e217 in parser__advance (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa4e217)
        8 0xa457a0 in parser_parse (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa457a0)
        9 0xa1461c in ts_document_parse_with_options (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0xa1461c)
        10 0x4f135a in LLVMFuzzerTestOneInput (/home/philipturnbull/src/tree-sitter/out/ruby_fuzzer+0x4f135a)
        11 0xb42d72 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /home/philipturnbull/src/compiler-rt/lib/fuzzer/./FuzzerLoop.cpp:517:13
        12 0xb35f4a in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /home/philipturnbull/src/compiler-rt/lib/fuzzer/./FuzzerDriver.cpp:280:3
        13 0xb3a708 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /home/philipturnbull/src/compiler-rt/lib/fuzzer/./FuzzerDriver.cpp:703:9
        14 0xb35ca0 in main /home/philipturnbull/src/compiler-rt/lib/fuzzer/./FuzzerMain.cpp:20:10
        15 0x7f092214f82f in __libc_start_main /build/glibc-bfm8X4/glibc-2.23/csu/../csu/libc-start.c:291

    SUMMARY: libFuzzer: out-of-memory

Looks like an infinite loop or exponential blow-up from handling the parse error. Maybe we need some more MAX_ITERATOR_COUNT guards in stack__iter?

Grammar construct to match EOF / end of file / document

In our project, we have a couple of grammar productions that basically say "do this until newline or EOF". It's a mostly line-based grammar, and we don't want to report an error on the last line if it lacks a trailing newline.

Does that exist right now?

Crash during error recovery

Parsing this file with the JavaScript grammar:

/* eslint-disable github/no-flow-weak */
/* @flow weak */

import cast from '../typecast'

function EventSource() { }
function CanvasView(canvas) { }
function Flamechart(canvas, data, dataRange, info) { }
function MainFlamechart(canvas, data, dataRange, info) { }
function OverviewFlamechart(container, viewportOverlay, data, dataRange, info) { }

OverviewFlamechart.prototype.handleDragGesture = function(e) {
  rect.width = Math.max(self.viewport.width / 1000, (maxX - minX) / self.width * self.viewport.width)
  rect.y = Math.max(self.viewport.y, Math.min(self.viewport.height - self.viewport.y, currentY / self.height * self.viewport.height + self.viewport.y - rect.height / 2))
}

OverviewFlamechart.prototype.onOverlayMouseDown = function(e) { }

OverviewFlamechart.prototype.handleOverlayDragGesture = function(e) {
  const deltaX = (e.clientX - self.overlayDragInfo.mouse.x) / self.width * self.viewport.width
  const deltaY = (e.clientY - self.overlayDragInfo.mouse.y) / self.height * self.viewport.height
}

[tracking] Bugs found with -fsanitize=memory

Tracking issue for crashes found with oss-fuzz, using libFuzzer and -fsanitize=memory. The bugs are found by fuzzing different languages, but the crashes are typically in the core tree-sitter runtime.

tree-sitter-javascript:

fixed in issue # Type halt_on_error? minimized base64 reproducer
#101 timeout no bGVha2NyX292ZXI8PDw8azy8PHMPPDoyMHV1dXV1bm59bm5+fn5SfmVijX5+fn5+fn5+djItLS0tLS0tLS0tLT09PT09Njf5LWVl
#101 crash no Nm4KOzQoZCE0
#101 crash no On19fX0zNy0tLS0tLS0tLS0tLTMtLS0tLS0tLS0tLS0t
#101 crash no fXt7e3t7D319fX04fWY=
#101 crash no aChf//////+VaJWVlZWVaGxoKF9oymhoIWhoaGg7aGiVYjc0Mls3NDJkMzA0MtE3NWI0MCBlMTAyOTh9MjKVlZU3LTKVLDQyNjU3MjFmlWKhaGhoaGg7aGhoaApobCkoX2ho6WhoIWhoaGg7aGiVYjc0Mls3NDJkMzA0MtE3NWI0MAoKMjYKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoK
#101 crash no Yjc2OWI0ZbNmYWRhNjMyOTFjMzRidGoKbDJrbWVudWVucilpbWVudWV4cGWqaXJpO7CwsLCwsLAzsP9v////eHBlcmk/ZWopAmJzeHBlZfhwZXJpP2VudApqb2JzcmltZW51bnQKamV4ZWpvYnN4//////////////9pO5uRiztyZX5+fn5+fn5+fn5yaT9lZQ==

tree-sitter-ruby:

fixed in issue # Type halt_on_error? minimized base64 reproducer
#101 crash yes JCQkJCQkJCQkMjEze2U6ZjYkN2U6NGYNDQ04 CmQKZWMoZTU6dChzITRkCjgA KCQ2ZDk6MGYwMDs1YzJmM2I7YjQhZDM1OmYyOTZhYik XyhfO18yKDY5Ywox
#101 timeout yes IWNbOyF0CX50PXI/dD1yP3RjfnR0Zml4PXI/dHR0PTI1LT8/LT8/Pz8/ djghZTg6NHQ9IT9pbSFlcyF1IXYsbSFtIXQ9MjUsIXUhdixtIXQ9MjUsLCE=

/cc @maxbrunsfeld

QUESTION: Which libcompiler.a should I use?

Hi,

I'm getting started with tree-sitter and I've been running on a couple issues while trying to build bindings for Go.

In order to build the library, I created a Docker image with the following Dockerfile:

FROM ubuntu

RUN apt-get update && apt-get install -y git python make build-essential
ADD . /usr/local/tree-sitter
WORKDIR /usr/local/tree-sitter
RUN script/configure && make

Once I have this, following the instructions in README.md to create a parser fails because there's more than one copy of the library under out:

$ find out/Release -name libruntime.a
out/Release/libruntime.a
out/Release/obj.target/libruntime.a

Which one is the correct one? Is this expected?

Thanks!

Tree-sitter and semantic highlighting

As far as I understand, tree-sitter currently produces an AST that for highlighting purposes will be recursively collapsed into scope lists which in an editor like Atom will then become class lists to which styling can be applied. In other words, the resulting markup as a basis for highlighting is conceptually identical to that generated by TextMate grammars, while the processing is hopefully more accurate, detailed (more fine-grained scope differentiation), and performant.

Unfortunately, a flattened scope tree is missing information required for semantic highlighting, a form of highlighting that differentiates identifiers of the same type, such as two variables declared in succession. The problem is that tokens of the same type at the same level in the scope hierarchy are assigned the same class and thus cannot be highlighted differently.

I have tried in the past to work around this problem by intercepting the output of an external parser and generating ad-hoc scopes based on the token text. This was always a hack, and I have since given up work on that package because it relies on a private Atom API that has repeatedly changed in the past and broken the package as a result.

But now tree-sitter is coming, and this could be a golden opportunity to enable semantic highlighting for just about any language.


The idea is as follows: From a program like

var one = 1
var two = 2

tree-sitter currently generates an AST resembling

program
  var_declaration
    identifier
    number
  var_declaration
    identifier
    number

The editor translates the AST into token annotations which result in something like

<div class="program">
  <div class="var_declaration">
    var <div class="identifier">one</div> = <div class="number">1</div>
  </div>
  <div class="var_declaration">
    var <div class="identifier">two</div> = <div class="number">2</div>
  </div>
</div>

(I'm aware that's not how Atom's DOM actually looks, but the concept is the same).

We're now stuck, because both one and two have the same DOM path .program .var_declaration .identifier and thus any stylesheet will highlight them the same, which is the opposite of what we want in semantic highlighting.

The change needed for semantic highlighting happiness is that the component taking source + AST and producing annotated tokens (which I guess is tree-sitter-syntax) maintains, at each level of the scope hierarchy, an indexed list of identifier names and appends index-[INDEX] to the class list, turning the above into

<div class="program">
  <div class="var_declaration">
    var <div class="identifier index-1">one</div> = <div class="number">1</div>
  </div>
  <div class="var_declaration">
    var <div class="identifier index-2">two</div> = <div class="number">2</div>
  </div>
</div>

Each identifier can now be assigned a color that is unique within the scope in which the identifier is valid. There are of course corner cases such as imported identifiers or languages having a global construct of sorts that allows breaking out of a syntax scope but in the overwhelming majority of cases this should provide accurate semantic highlighting in the sense that "same color in same scope = semantically identical identifier". The idea could be extended further by indexing scopes as well, effectively recreating some of the original parsing information in the class hierarchy.
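The per-scope indexing step described above can be sketched as follows. The token shape and the function name are assumptions for illustration, not part of tree-sitter or tree-sitter-syntax:

```javascript
// Given flat tokens of the form {scopePath, name}, assign each distinct
// identifier name a 1-based index that is stable within its scope, and
// append it to the class list as `index-N`.
function assignIdentifierIndices(tokens) {
  const indicesByScope = new Map();
  return tokens.map(({scopePath, name}) => {
    const key = scopePath.join(' ');
    if (!indicesByScope.has(key)) indicesByScope.set(key, new Map());
    const names = indicesByScope.get(key);
    if (!names.has(name)) names.set(name, names.size + 1);
    return {scopePath, name, className: `identifier index-${names.get(name)}`};
  });
}
```

With this, `one` and `two` in the example receive `index-1` and `index-2` respectively, and a repeated occurrence of `one` in the same scope gets `index-1` again, so equal colors within a scope imply the same identifier.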


I wasn't quite sure which repo to post this to. There might not be any changes needed to tree-sitter itself but I don't know the project well enough to know this for sure. Please feel free to move this issue somewhere else if you feel it is more appropriate there.

Thoughts on detecting invalidated ranges

I've been thinking all morning about the problem of detecting invalidated ranges. A key realization I've had is that the problem is very different from computing a diff because we don't need to describe the nature of the change, but just its range. That means we can walk the trees linearly and record invalidations whenever we encounter differences between the children of a given node, even if we don't know the cause of said differences. A slight wrinkle is that we only want to consider differences in the subset of nodes that affect syntax highlighting for purposes of invalidation. I think the following pseudocode might get us close:

recordInvalidatedRanges(oldNode, newNode, ranges)

Starting at the root of the old and new trees:

  • Are these nodes structurally equivalent (both nodes are non-null, have the same type and spatial footprint)?
    • Yes:
      • Does the node in the old tree contain a textual change?
        • No:
          • Terminate the recursion.
        • Yes:
          • Call recordInvalidatedRanges recursively on corresponding children of both nodes. If the number of children differs, pass null to represent the missing child from the shorter list.
    • No:
      • Do either of the nodes have a type that affects highlighting?
        • Yes:
          • Union the spatial footprint of both nodes and add it to the invalidation list, then terminate the recursion.
        • No:
          • Collect the descendants of both nodes that affect highlighting and call recordInvalidatedRanges recursively on the collected lists. If the lengths of these lists differ, pass null to represent the missing child from the shorter list.
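The steps above translate fairly directly into code. Below is a literal sketch of the pseudocode, assuming a node shape of `{type, startIndex, endIndex, children}` and two caller-supplied predicates; none of this is the actual tree-sitter API:

```javascript
// Walk the old and new trees in parallel, pushing invalidated
// {start, end} byte ranges into `ranges`.
function recordInvalidatedRanges(oldNode, newNode, ranges, opts) {
  const {affectsHighlighting, containsTextualChange} = opts;

  const structurallyEquivalent =
    oldNode && newNode &&
    oldNode.type === newNode.type &&
    oldNode.startIndex === newNode.startIndex &&
    oldNode.endIndex === newNode.endIndex;

  if (structurallyEquivalent) {
    // Equivalent subtree with no textual change: terminate the recursion.
    if (!containsTextualChange(oldNode)) return;

    // Recurse pairwise on children, padding the shorter list with null.
    const count = Math.max(oldNode.children.length, newNode.children.length);
    for (let i = 0; i < count; i++) {
      recordInvalidatedRanges(
        oldNode.children[i] || null,
        newNode.children[i] || null,
        ranges, opts
      );
    }
  } else if (
    (oldNode && affectsHighlighting(oldNode)) ||
    (newNode && affectsHighlighting(newNode))
  ) {
    // Union the spatial footprints of both nodes and record it.
    const present = [oldNode, newNode].filter(Boolean);
    ranges.push({
      start: Math.min(...present.map(n => n.startIndex)),
      end: Math.max(...present.map(n => n.endIndex)),
    });
  } else {
    // Neither node affects highlighting: compare their highlight-relevant
    // descendants pairwise instead.
    const collect = (node, out = []) => {
      if (!node) return out;
      for (const child of node.children) {
        if (affectsHighlighting(child)) out.push(child);
        else collect(child, out);
      }
      return out;
    };
    const oldKids = collect(oldNode);
    const newKids = collect(newNode);
    const count = Math.max(oldKids.length, newKids.length);
    for (let i = 0; i < count; i++) {
      recordInvalidatedRanges(oldKids[i] || null, newKids[i] || null, ranges, opts);
    }
  }
}
```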

tree-sitter tutorial

Dear GitHub team,
I want to produce a tree-sitter grammar for the R language. Could you give me some tips or links to a tutorial?

Best,

linker error "cannot find -lcompiler "

How to reproduce this issue:

  1. clone this repository, e.g. to ~/test/tree-sitter
  2. run script/configure and make
  3. create arithmetic_grammar.cc as instructed in README.md, e.g. at ~/test/arithmetic_grammar.cc
  4. run the command below from ~/test
clang++ -std=c++11 \
  -I tree-sitter/include \
  -L tree-sitter/out/Release \
  -l compiler \
  arithmetic_grammar.cc \
  -o arithmetic_grammar -v

output:

clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-linux-gnu/6.0.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/6.0.0
Found candidate GCC installation: /usr/lib/gcc/i686-linux-gnu/6.0.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/5.4.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/6.0.0
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0
Candidate multilib: .;@m64
Selected multilib: .;@m64
 "/usr/lib/llvm-3.8/bin/clang" -cc1 -triple x86_64-pc-linux-gnu -emit-obj -mrelax-all -disable-free -disable-llvm-verifier -main-file-name arithmetic_grammar.cc -mrelocation-model static -mthread-model posix -mdisable-fp-elim -fmath-errno -masm-verbose -mconstructor-aliases -munwind-tables -fuse-init-array -target-cpu x86-64 -v -dwarf-column-info -debugger-tuning=gdb -resource-dir /usr/lib/llvm-3.8/bin/../lib/clang/3.8.0 -I tree-sitter/include -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/x86_64-linux-gnu/c++/5.4.0 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/x86_64-linux-gnu/c++/5.4.0 -internal-isystem /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/backward -internal-isystem /usr/local/include -internal-isystem /usr/lib/llvm-3.8/bin/../lib/clang/3.8.0/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -std=c++11 -fdeprecated-macro -fdebug-compilation-dir /home/jzhu/go/src/github.com/codelingo/sandbox/test -ferror-limit 19 -fmessage-length 128 -fobjc-runtime=gcc -fcxx-exceptions -fexceptions -fdiagnostics-show-option -fcolor-diagnostics -o /tmp/arithmetic_grammar-a3ca1f.o -x c++ arithmetic_grammar.cc
clang -cc1 version 3.8.0 based upon LLVM 3.8.0 default target x86_64-pc-linux-gnu
ignoring nonexistent directory "/include"
ignoring duplicate directory "/usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/x86_64-linux-gnu/c++/5.4.0"
#include "..." search starts here:
#include <...> search starts here:
 tree-sitter/include
 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0
 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/x86_64-linux-gnu/c++/5.4.0
 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/backward
 /usr/local/include
 /usr/lib/llvm-3.8/bin/../lib/clang/3.8.0/include
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
 "/usr/bin/ld" -z relro --hash-style=gnu --build-id --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o arithmetic_grammar /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../x86_64-linux-gnu/crt1.o /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../x86_64-linux-gnu/crti.o /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/crtbegin.o -Ltree-sitter/out/Release -L/usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0 -L/usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../x86_64-linux-gnu -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../.. -L/usr/lib/llvm-3.8/bin/../lib -L/lib -L/usr/lib -lcompiler /tmp/arithmetic_grammar-a3ca1f.o -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/crtend.o /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../x86_64-linux-gnu/crtn.o
/usr/bin/ld: cannot find -lcompiler
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Flaky failure in two javascript tests

I ran across this by chance on a script/test run. Glad the test script prints out its random seed!

$ script/test -s 1509425801
make: Nothing to be done for 'tests'.
Random seed: 1509425801
Executed 32654 tests. 32653 succeeded. 2 failed.
There were failures!

the javascript language parses one invalid subtree right after the viable prefix: repairing an insertion of "eydcelnt(]{" at 31:
Found changed scope outside of any invalidated range;
Position: {2, 26}
Byte index: 52
Line:   x = function(a) { b; } function(c) { d; }
                                ^
Old scopes: (vector: program, if_statement, statement_block, expression_statement, ERROR, function, function, 'u')
New scopes: (vector: program, if_statement, statement_block, expression_statement, assignment_expression, function, function, 'u')
Invalidated ranges:
  {{2, 5}, {2, 25}}


the javascript language parses one invalid subtree right after the viable prefix: repairing an insertion of "eydcelnt(]{" at 31:
test/integration/real_grammars.cc:47: Expected: of length 0
Actual: [ 339 ]


Test run complete. 32654 tests run. 32653 succeeded. 2 failed.

real	0m21.685s
user	0m21.152s
sys	0m0.300s

Make it easy to define identifiers that allow unicode characters

Many languages (e.g. Ruby and Go) allow identifiers to include unicode characters, but right now there's no easy way to define these in a grammar. tree-sitter-go manually allows some Greek characters, and tree-sitter-ruby fails to parse valid code like the constant assignment C😀 = 1 because its regexes only match ASCII character classes.
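To illustrate the gap, here is a minimal sketch comparing an ASCII-only constant pattern with a unicode-aware one. The patterns are illustrative, not the actual rules from tree-sitter-ruby; the second one assumes Ruby's behavior of treating any non-ASCII code point as a legal identifier character.

```javascript
// ASCII-only constant pattern, similar in spirit to what the grammar
// matches today — it cannot accept `C😀`.
const asciiConst = /^[A-Z][a-zA-Z0-9_]*$/;

// Sketch (assumption): Ruby treats any non-ASCII code point as an
// identifier character, so extending the class past U+007F suffices.
const rubyConst = /^[A-Z][a-zA-Z0-9_\u{0080}-\u{10FFFF}]*$/u;

console.log(asciiConst.test('C😀')); // false — emoji rejected
console.log(rubyConst.test('C😀'));  // true
```

Note the `u` flag: without it, `\u{…}` ranges are not interpreted as code points and astral characters like 😀 are split into surrogate pairs.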

Investigate why the ruby parser is so large

For some reason, the ruby parser is much larger than any of the other languages' parsers. In particular, the lexer function is unusually large, which causes the parser to take a long time to compile.

I need to figure out what exactly it is about the ruby grammar that's causing this, and make whatever changes are necessary to tree-sitter itself to reduce the size of the generated parser.

Allow non-terminal extras

It should be possible to allow non-terminal symbols to be used as extras as long as those non-terminals begin with 'distinctive' tokens that don't appear anywhere else in the grammar.

This would greatly simplify the handling of heredocs in Ruby. Heredoc ends are non-terminals because they can contain interpolated expressions, and they can appear almost anywhere. Currently, we explicitly allow them in many places, which complicates the grammar.
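The proposal could look roughly like the following grammar fragment. This is an illustrative sketch, not working DSL code: `heredoc_body` is a hypothetical rule name, and today `extras` only accepts tokens, not non-terminals.

```js
module.exports = grammar({
  name: 'ruby_like',

  extras: $ => [
    /\s/,             // whitespace: a token, allowed today
    $.comment,        // a token rule, allowed today
    $.heredoc_body,   // a non-terminal — what this issue proposes;
                      // valid only if its first token ('<<~' etc.)
                      // is distinctive within the grammar
  ],

  rules: { /* … */ },
});
```

With this in place, heredoc bodies would no longer need to be spliced explicitly into every rule where they can appear.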

Refs #114
