py-tree-sitter is 10x slower than lezer-parser (py-tree-sitter, CLOSED)

milahu commented on May 27, 2024

py-tree-sitter is 10x slower than lezer-parser

from py-tree-sitter.

Comments (9)

ObserverOfTime commented on May 27, 2024

Are you sure this is a py-tree-sitter issue? What about node-tree-sitter?


milahu commented on May 27, 2024

Are you sure this is a py-tree-sitter issue?

yepp, i suspect the conversion between the native C API and python types is slow

What about node-tree-sitter?

not tested, maybe it's also slower than lezer-parser

i prefer lezer-parser because it's a pure javascript parser,
so no need to mess with WASM


ObserverOfTime commented on May 27, 2024

node-tree-sitter is not WASM. web-tree-sitter is WASM.


milahu commented on May 27, 2024

yeah

still, it's surprising that native code with script bindings
is 10x slower than pure scripting code

so i guess there is something wrong with the bindings


narpfel commented on May 27, 2024

Building bytes objects using += is quadratic because bytes are immutable, so each += has to copy the entire accumulated bytes object. Are you sure that you’re not measuring that?
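To make the point about `+=` concrete, here is a minimal sketch comparing the two approaches. The helper names `concat_with_plus_equals` and `concat_with_bytesio` are made up for illustration and are not part of py-tree-sitter or of the original script.

```python
import io


def concat_with_plus_equals(chunks):
    # bytes is immutable, so every += allocates a new object and
    # copies everything accumulated so far: O(n^2) bytes copied in total.
    result = b""
    for chunk in chunks:
        result += chunk
    return result


def concat_with_bytesio(chunks):
    # BytesIO appends into a growable buffer, so each write only
    # copies the new chunk; getvalue() makes one final copy.
    buffer = io.BytesIO()
    for chunk in chunks:
        buffer.write(chunk)
    return buffer.getvalue()


# hypothetical sample data, standing in for the slices of the input file
chunks = [b"<p>", b"hello", b"</p>"] * 1000
assert concat_with_plus_equals(chunks) == concat_with_bytesio(chunks)
```

Both functions return the same bytes; only the amount of copying differs, which should matter mostly when there are many small chunks.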


milahu commented on May 27, 2024

aah! good catch. let me rewrite that with io.BytesIO


narpfel commented on May 27, 2024

Accumulating the chunks in a list and then joining them with bytes.join would also work.
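A minimal sketch of the list-and-join variant, assuming the chunks arrive one at a time as they do in a tree walk. The name `concat_with_join` is hypothetical.

```python
def concat_with_join(chunks):
    # Collect references to the chunks first (cheap appends),
    # then let bytes.join allocate the result once and copy each chunk once.
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    return b"".join(parts)


# hypothetical sample data
assert concat_with_join([b"<p>", b"hello", b"</p>"]) == b"<p>hello</p>"
```

This keeps all chunks alive until the final join, so it trades a little memory for a single allocation of the result.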


ObserverOfTime commented on May 27, 2024

Could we use strings instead of bytes (or both)? Is there any chance we'll need to parse binary files?


milahu commented on May 27, 2024

html_parser.parse(input_html_bytes) requires a bytestring
because tree-sitter works on encoded bytes and byte offsets, not on python str objects
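A small sketch of why byte offsets and str indices are not interchangeable once the input contains non-ASCII characters. It does not call tree-sitter at all; the offsets `3` and `9` are picked by hand to stand in for a node's `start_byte` and `end_byte`, and the sample text is made up.

```python
text = "<p>héllo</p>"
data = text.encode("utf-8")  # what would be passed to the parser

# "é" is one character but two bytes in UTF-8,
# so the byte length is one larger than the character count.
assert len(text) == 12
assert len(data) == 13

# Slice the bytes with byte offsets, then decode the slice.
# Using the same offsets on the str would cut at the wrong place.
start_byte, end_byte = 3, 9  # hand-picked, assumed to cover the text node
inner = data[start_byte:end_byte].decode("utf-8")
assert inner == "héllo"
assert text[start_byte:end_byte] != inner
```

So if the input starts out as a str, one option is to encode it once, slice the encoded bytes with the node offsets, and decode only the slices that are needed.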

let me rewrite that with io.BytesIO

yepp. now py-tree-sitter is 2x faster than lezer-parser

fixed.py

#!/usr/bin/env python3

# usage:
# python3 fixed.py input.html

# https://github.com/tree-sitter/py-tree-sitter/issues/202
# py-tree-sitter is 10x slower than lezer-parser

import sys
import time
import io

import tree_sitter
import tree_sitter_languages

tree_sitter_html = tree_sitter_languages.get_parser("html")



# https://github.com/tree-sitter/py-tree-sitter/issues/33
def walk_html_tree(tree):
    # compound tags
    # these are ignored when serializing the tree
    compound_kind_id = (
        25, # fragment
        26, # doctype
        28, # element
        29, # script_element
        30, # style_element
        31, # start_tag
        34, # self_closing_tag
        35, # end_tag
        37, # attribute
        38, # quoted_attribute_value
    )
    cursor = tree.walk()
    reached_root = False
    while not reached_root:
        is_compound = cursor.node.kind_id in compound_kind_id
        if not is_compound:
            yield cursor.node
        if cursor.goto_first_child():
            continue
        if cursor.goto_next_sibling():
            continue
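        # no child and no next sibling:
        # climb until some ancestor has a next sibling
        retracing = True
        while retracing:
            if not cursor.goto_parent():
                # back at the start node: the traversal is complete
                retracing = False
                reached_root = True
            elif cursor.goto_next_sibling():
                retracing = False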



def main():

    input_path = sys.argv[1]

    # tree-sitter expects binary string
    with open(input_path, 'rb') as f:
        input_html_bytes = f.read()

    html_parser = tree_sitter_html

    t1 = time.time()

    html_tree = html_parser.parse(input_html_bytes)

    root_node = html_tree.root_node

    # test the tree walker
    # this test run should return
    # the exact same string as the input string
    # = lossless noop

    print(f"testing walk_html_tree on {len(input_html_bytes)} bytes of html")

    # slow!
    #walk_html_tree_test_result = b""
    walk_html_tree_test_result = io.BytesIO()

    last_node_to = 0

    for node in walk_html_tree(root_node):
        # walk_html_tree_test_result += (
        #     input_html_bytes[last_node_to:node.range.end_byte]
        # )
        walk_html_tree_test_result.write(
            input_html_bytes[last_node_to:node.range.end_byte]
        )
        last_node_to = node.range.end_byte

    # copy whitespace after last node
    # fix: missing newline at end of file
    # walk_html_tree_test_result += (
    #     input_html_bytes[last_node_to:]
    # )
    walk_html_tree_test_result.write(
        input_html_bytes[last_node_to:]
    )

    t2 = time.time()
    print(f"testing walk_html_tree done in {t2 - t1} seconds")

    walk_html_tree_test_result = walk_html_tree_test_result.getvalue()

    assert walk_html_tree_test_result == input_html_bytes

    print('ok. the tree walker is lossless')

if __name__ == "__main__":
    main()

sorry for the noise, thanks for the help : )
