Comments (9)

ObserverOfTime commented on May 27, 2024

Are you sure this is a py-tree-sitter issue? What about node-tree-sitter?

from py-tree-sitter.

milahu commented on May 27, 2024

> Are you sure this is a py-tree-sitter issue?

yepp, i suspect the conversion between the native C API and python types is slow

> What about node-tree-sitter?

not tested, maybe it's also slower than lezer-parser

i prefer lezer-parser because it's a pure javascript parser
so no need to mess with WASM

ObserverOfTime commented on May 27, 2024

node-tree-sitter is not WASM. web-tree-sitter is WASM.

milahu commented on May 27, 2024

yeah

still, it's surprising that native code with script bindings
is 10x slower than pure scripting code

so i guess there is something wrong with the bindings

narpfel commented on May 27, 2024

Building bytes objects using += is quadratic because bytes are immutable, so each += has to copy the entire accumulated bytes object. Are you sure that you're not measuring that?
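To make the point about += concrete, here is a minimal sketch that builds the same output twice, once with repeated += and once with io.BytesIO, and times both. The chunk contents, the chunk count, and the helper names are made up for illustration; they are not taken from the benchmark in this issue, and actual timings will vary by machine and Python version.

```python
import io
import time


def concat_with_plus(chunks):
    # Each += allocates a new bytes object and copies everything
    # accumulated so far, so the total work grows quadratically.
    result = b""
    for chunk in chunks:
        result += chunk
    return result


def concat_with_bytesio(chunks):
    # BytesIO appends into a growable buffer, so the total work
    # stays roughly linear in the output size.
    buffer = io.BytesIO()
    for chunk in chunks:
        buffer.write(chunk)
    return buffer.getvalue()


def time_call(func, chunks):
    # Return the function result together with the elapsed wall time.
    start = time.perf_counter()
    result = func(chunks)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    chunks = [b"<div>x</div>\n"] * 20_000  # hypothetical workload
    slow, slow_seconds = time_call(concat_with_plus, chunks)
    fast, fast_seconds = time_call(concat_with_bytesio, chunks)
    assert slow == fast
    print(f"+=: {slow_seconds:.3f}s  BytesIO: {fast_seconds:.3f}s")
```

Both functions return identical bytes; only the cost of getting there differs, and the gap should widen as the number of chunks grows.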

milahu commented on May 27, 2024

aah! good catch. let me rewrite that with io.BytesIO

narpfel commented on May 27, 2024

Accumulating into a list and then using bytes.join would also work.
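A rough sketch of the list-and-join variant, assuming the slicing pattern of a tree walk that copies the input up to each node's end byte. The function name copy_spans and the plain list of end offsets are hypothetical stand-ins for values a real walker would take from the parsed nodes.

```python
def copy_spans(data: bytes, end_offsets) -> bytes:
    """Rebuild data by slicing up to each end offset and joining once."""
    parts = []
    last_end = 0
    for end in end_offsets:
        # Appending to a list is cheap; no accumulated bytes are copied here.
        parts.append(data[last_end:end])
        last_end = end
    # Keep whatever follows the last offset (e.g. a trailing newline).
    parts.append(data[last_end:])
    # A single join copies each byte into the result exactly once.
    return b"".join(parts)


if __name__ == "__main__":
    html = b"<p>hi</p>\n"
    assert copy_spans(html, [3, 5, 9]) == html
    print("ok")
```

Compared with io.BytesIO, this keeps all slices alive until the final join, which is usually fine for a one-off check like this.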

ObserverOfTime commented on May 27, 2024

Could we use strings instead of bytes (or both)? Is there any chance we'll need to parse binary files?

milahu commented on May 27, 2024

html_parser.parse(input_html_bytes) requires a bytestring
because tree-sitter does not know utf8
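One practical reason to keep the input as bytes: py-tree-sitter reports node positions as byte offsets, and those do not line up with str indices once the text contains non-ASCII characters. The sketch below does not call tree-sitter at all; the sample text and the offset 13 are hand-picked assumptions standing in for a node's end byte.

```python
def slice_by_byte_offsets(data: bytes, start: int, end: int) -> str:
    # Slice the encoded bytes first, then decode only the selected span.
    return data[start:end].decode("utf-8")


text = "<p>héllo</p><b>x</b>"
data = text.encode("utf-8")

# "é" is one character but two bytes in UTF-8, so the lengths differ.
assert len(text) == 20
assert len(data) == 21

# Assumed byte offset of the end of the first element.
end_byte = 13

# Slicing the bytes gives the intended element.
assert slice_by_byte_offsets(data, 0, end_byte) == "<p>héllo</p>"

# Using the same number as a str index overshoots by one character.
assert text[:end_byte] == "<p>héllo</p><"
```

So even if an API accepted str input, byte offsets would still need to be applied to the encoded bytes rather than to the decoded string.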

> let me rewrite that with io.BytesIO

yepp. now py-tree-sitter is 2x faster than lezer-parser

fixed.py
#!/usr/bin/env python3

# usage:
# python3 fixed.py input.html

# https://github.com/tree-sitter/py-tree-sitter/issues/202
# py-tree-sitter is 10x slower than lezer-parser

import sys
import time
import io

import tree_sitter
import tree_sitter_languages

tree_sitter_html = tree_sitter_languages.get_parser("html")



# https://github.com/tree-sitter/py-tree-sitter/issues/33
def walk_html_tree(tree):
    # compound tags
    # these are ignored when serializing the tree
    compound_kind_id = (
        25, # fragment
        26, # doctype
        28, # element
        29, # script_element
        30, # style_element
        31, # start_tag
        34, # self_closing_tag
        35, # end_tag
        37, # attribute
        38, # quoted_attribute_value
    )
    cursor = tree.walk()
    reached_root = False
    while not reached_root:
        is_compound = cursor.node.kind_id in compound_kind_id
        if not is_compound:
            yield cursor.node
        if cursor.goto_first_child():
            continue
        if cursor.goto_next_sibling():
            continue
        retracing = True
        while retracing:
            if not cursor.goto_parent():
                retracing = False
                reached_root = True
            if cursor.goto_next_sibling():
                retracing = False



def main():

    input_path = sys.argv[1]

    # py-tree-sitter's parse() expects a bytes object, so read in binary mode
    with open(input_path, 'rb') as f:
        input_html_bytes = f.read()

    html_parser = tree_sitter_html

    t1 = time.time()

    html_tree = html_parser.parse(input_html_bytes)

    root_node = html_tree.root_node

    # test the tree walker
    # this test run should return
    # the exact same string as the input string
    # = lossless noop

    print(f"testing walk_html_tree on {len(input_html_bytes)} bytes of html")

    # slow: += on bytes copies the whole accumulated result each time (quadratic)
    #walk_html_tree_test_result = b""
    walk_html_tree_test_result = io.BytesIO()

    last_node_to = 0

    for node in walk_html_tree(root_node):
        # walk_html_tree_test_result += (
        #     input_html_bytes[last_node_to:node.range.end_byte]
        # )
        walk_html_tree_test_result.write(
            input_html_bytes[last_node_to:node.range.end_byte]
        )
        last_node_to = node.range.end_byte

    # copy whitespace after last node
    # fix: missing newline at end of file
    # walk_html_tree_test_result += (
    #     input_html_bytes[last_node_to:]
    # )
    walk_html_tree_test_result.write(
        input_html_bytes[last_node_to:]
    )

    t2 = time.time()
    print(f"testing walk_html_tree done in {t2 - t1} seconds")

    walk_html_tree_test_result = walk_html_tree_test_result.getvalue()

    assert walk_html_tree_test_result == input_html_bytes

    print('ok. the tree walker is lossless')

if __name__ == "__main__":
    main()

sorry for the noise, thanks for the help : )