Comments (9)
Are you sure this is a py-tree-sitter issue? What about node-tree-sitter?
from py-tree-sitter.
Are you sure this is a py-tree-sitter issue?
yepp, i suspect the conversion between native C API and python types is slow
What about node-tree-sitter?
not tested, maybe it's also slower than lezer-parser
i prefer lezer-parser because it's a pure javascript parser
so no need to mess with WASM
node-tree-sitter is not WASM. web-tree-sitter is WASM.
yeah
still, it's surprising that native code with script bindings
is 10x slower than pure scripting code
so i guess there is something wrong with the bindings
Building bytes objects using += is quadratic because bytes are immutable, so each += has to copy the entire accumulated bytes object. Are you sure that you're not measuring that?
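A minimal sketch of that effect, independent of tree-sitter: the chunk contents, the chunk count, and the helper names (concat_plus_equals, concat_bytesio, time_call) are made up for illustration, and the timings will vary by machine and interpreter.

```python
import io
import time


def concat_plus_equals(chunks):
    # Each += creates a new bytes object and copies everything accumulated so far.
    result = b""
    for chunk in chunks:
        result += chunk
    return result


def concat_bytesio(chunks):
    # BytesIO appends into a growable buffer instead of rebuilding the whole value.
    buffer = io.BytesIO()
    for chunk in chunks:
        buffer.write(chunk)
    return buffer.getvalue()


def time_call(func, chunks):
    # Return the function result together with the elapsed wall-clock seconds.
    start = time.perf_counter()
    result = func(chunks)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical workload: many small slices, like the per-node slices in the test script.
    chunks = [b"<p>hello</p>\n"] * 20_000
    slow_result, slow_seconds = time_call(concat_plus_equals, chunks)
    fast_result, fast_seconds = time_call(concat_bytesio, chunks)
    assert slow_result == fast_result
    print(f"+=      : {slow_seconds:.3f} seconds")
    print(f"BytesIO : {fast_seconds:.3f} seconds")
```

If the gap between the two timings grows as the number of chunks grows, the benchmark is mostly measuring the copying rather than the parser.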
aah! good catch. let me rewrite that with io.BytesIO
Accumulating into a list and then bytes.join-ing would also work.
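The list-and-join variant could look roughly like this. The function name rebuild_from_spans and the end_offsets argument are hypothetical stand-ins for the node end offsets that the test script gets from the tree walker.

```python
def rebuild_from_spans(data, end_offsets):
    # Collect the slices in a list; nothing is copied into one big value yet.
    parts = []
    last_end = 0
    for end in end_offsets:
        parts.append(data[last_end:end])
        last_end = end
    # Keep whatever follows the last span, such as a trailing newline.
    parts.append(data[last_end:])
    # A single join allocates the final bytes object once.
    return b"".join(parts)
```

For example, rebuild_from_spans(b"<p>hi</p>\n", [3, 5, 9]) returns the input unchanged, which is the same lossless round trip the test script checks.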
Could we use strings instead of bytes (or both)? Is there any chance we'll need to parse binary files?
html_parser.parse(input_html_bytes) requires a bytestring because tree-sitter does not know utf8
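If the input starts out as a str, one option is to encode it before parsing and to treat node offsets as byte offsets afterwards. This is only a sketch under the assumption that the text is UTF-8; to_parser_input and slice_text are hypothetical helpers, and no tree-sitter call is made here.

```python
def to_parser_input(text):
    # Encode the str once; the resulting bytes are what would be handed to parse().
    return text.encode("utf-8")


def slice_text(data, start_byte, end_byte):
    # Offsets such as node.range.end_byte index into the bytes, not into the str,
    # so slice the bytes first and decode afterwards.
    return data[start_byte:end_byte].decode("utf-8")


if __name__ == "__main__":
    text = "<p>caf\u00e9</p>"
    data = to_parser_input(text)
    # The accented character takes two bytes in UTF-8, so the lengths differ.
    print(len(text), len(data))
    print(slice_text(data, 3, 8))
```

Slicing the original str with byte offsets would drift by one position for every multi-byte character that precedes the slice.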
let me rewrite that with io.BytesIO
yepp. now py-tree-sitter is 2x faster than lezer-parser
fixed.py
#!/usr/bin/env python3

# usage:
# python3 test.py input.html

# https://github.com/tree-sitter/py-tree-sitter/issues/202
# py-tree-sitter is 10x slower than lezer-parser

import sys
import time
import io

import tree_sitter
import tree_sitter_languages

tree_sitter_html = tree_sitter_languages.get_parser("html")


# https://github.com/tree-sitter/py-tree-sitter/issues/33
def walk_html_tree(tree):
    # compound tags
    # these are ignored when serializing the tree
    compound_kind_id = (
        25,  # fragment
        26,  # doctype
        28,  # element
        29,  # script_element
        30,  # style_element
        31,  # start_tag
        34,  # self_closing_tag
        35,  # end_tag
        37,  # attribute
        38,  # quoted_attribute_value
    )
    cursor = tree.walk()
    reached_root = False
    while not reached_root:
        is_compound = cursor.node.kind_id in compound_kind_id
        if not is_compound:
            yield cursor.node
        if cursor.goto_first_child():
            continue
        if cursor.goto_next_sibling():
            continue
        # no child and no sibling: climb until a next sibling exists
        retracing = True
        while retracing:
            if not cursor.goto_parent():
                retracing = False
                reached_root = True
            if cursor.goto_next_sibling():
                retracing = False


def main():
    input_path = sys.argv[1]

    # tree-sitter expects binary string
    with open(input_path, 'rb') as f:
        input_html_bytes = f.read()

    html_parser = tree_sitter_html

    t1 = time.time()
    html_tree = html_parser.parse(input_html_bytes)
    root_node = html_tree.root_node

    # test the tree walker
    # this test run should return
    # the exact same string as the input string
    # = lossless noop
    print(f"testing walk_html_tree on {len(input_html_bytes)} bytes of html")

    # slow!
    #walk_html_tree_test_result = b""
    walk_html_tree_test_result = io.BytesIO()
    last_node_to = 0
    for node in walk_html_tree(root_node):
        # walk_html_tree_test_result += (
        #     input_html_bytes[last_node_to:node.range.end_byte]
        # )
        walk_html_tree_test_result.write(
            input_html_bytes[last_node_to:node.range.end_byte]
        )
        last_node_to = node.range.end_byte

    # copy whitespace after last node
    # fix: missing newline at end of file
    # walk_html_tree_test_result += (
    #     input_html_bytes[last_node_to:]
    # )
    walk_html_tree_test_result.write(
        input_html_bytes[last_node_to:]
    )

    t2 = time.time()
    print(f"testing walk_html_tree done in {t2 - t1} seconds")

    walk_html_tree_test_result = walk_html_tree_test_result.getvalue()
    assert walk_html_tree_test_result == input_html_bytes
    print('ok. the tree walker is lossless')


main()
sorry for the noise, thanks for the help : )