marvinmw / codetext-parser

This project forked from fsoft-ai4code/codetext-parser

⚒️ Tree-sitter custom toolkit for extracting function and class from raw source file

License: MIT License

codetext-parser's Introduction

Code-Text parser is a custom parser built on tree-sitter grammars that extracts classes and functions from raw source code. It supports 10 common programming languages:

  • Python
  • Java
  • JavaScript
  • PHP
  • Ruby
  • Rust
  • C
  • C++
  • C#
  • Go

Installation

The codetext package requires Python 3.7 or above and tree-sitter. To set up the environment and install the dependencies manually from source:

git clone https://github.com/FSoft-AI4Code/CodeText-parser.git; cd CodeText-parser
pip install -r requirement.txt
pip install -e .

Or install the package from PyPI:

pip install codetext

Getting started

codetext CLI Usage

codetext [options] [PATH or FILE] ...

For example, to analyze every Python file in the src/ folder:

codetext src/ --language Python

To store the extracted classes and functions, add the --json flag and pass a destination file path with --output_file:

codetext src/ --language Python --output_file ./python_report.json --json

Options

positional arguments:
  paths                 list of the filename/paths.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -l LANGUAGE, --language LANGUAGE
                        Target the programming languages you want to analyze.
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output file (e.g report.json).
  --json                Generate json output as a transform of the default
                        output
  --verbose             Print progress bar
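
The option list above can be reproduced with a small argparse definition. The following is a minimal sketch that mirrors the help text only; it is not taken from the codetext source, the function name build_arg_parser is hypothetical, and the version string is a placeholder.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the documented options (not the project's code)."""
    parser = argparse.ArgumentParser(prog="codetext")
    parser.add_argument("paths", nargs="*", help="list of the filename/paths.")
    # Placeholder version string: the real version is not stated in this document.
    parser.add_argument("--version", action="version", version="%(prog)s (version unknown)")
    parser.add_argument("-l", "--language",
                        help="Target the programming languages you want to analyze.")
    parser.add_argument("-o", "--output_file", help="Output file (e.g report.json).")
    parser.add_argument("--json", action="store_true",
                        help="Generate json output as a transform of the default output")
    parser.add_argument("--verbose", action="store_true", help="Print progress bar")
    return parser


# Parse the same arguments as the JSON example from the CLI usage section.
args = build_arg_parser().parse_args(
    ["src/", "--language", "Python", "--output_file", "./python_report.json", "--json"]
)
print(args.paths, args.language, args.output_file, args.json, args.verbose)
```

Boolean flags such as --json and --verbose default to False when omitted, which matches the way they are listed without a value in the help text.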

Example

File circle_linkedlist.py analyzed:
==================================================
Number of class    : 1
Number of function : 2
--------------------------------------------------

Class summary:
+-----+---------+-------------+
|   # | Class   | Arguments   |
+=====+=========+=============+
|   0 | Node    |             |
+-----+---------+-------------+

Class analyse: Node
+-----+---------------+-------------+--------+---------------+
| #   | Method name   | Paramters   | Type   | Return type   |
+=====+===============+=============+========+===============+
| 0   | __init__      | self        |        |               |
|     |               | data        |        |               |
+-----+---------------+-------------+--------+---------------+

Function analyse:
+-----+-----------------+-------------+--------+---------------+
| #   | Function name   | Paramters   | Type   | Return type   |
+=====+=================+=============+========+===============+
| 0   | push            | head_ref    |        | Node          |
|     |                 | data        | Any    | Node          |
| 1   | countNodes      | head        | Node   |               |
+-----+-----------------+-------------+--------+---------------+

Using codetext as Python module

Build your language

codetext needs a compiled tree-sitter language file (i.e. a .so file) to work properly. You can compile the language manually (see more) or build it automatically with our pre-defined function (the <language>.so file will be saved in a folder named /tree-sitter/):

from codetext.utils import build_language

language = 'rust'
build_language(language)

# INFO:utils:Not found tree-sitter-rust, attempt clone from github
# Cloning into 'tree-sitter-rust'...
# remote: Enumerating objects: 2835, done. ...
# INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so

Using Language Parser

Each supported programming language corresponds to a custom language_parser (e.g. Python uses PythonParser()). A language_parser takes raw source code as input and uses breadth-first search to traverse all syntax nodes. Classes, methods, and stand-alone functions are then collected:

from codetext.utils import parse_code

raw_code = """
    /**
    * Sum of 2 number
    * @param a int number
    * @param b int number
    */
    double sum2num(int a, int b) {
        return a + b;
    } 
"""

# Auto parse code into tree-sitter.Tree
root = parse_code(raw_code, 'cpp')
root_node = root.root_node

Get all function nodes inside a specific node:

from codetext.utils.parser import CppParser

function_list = CppParser.get_function_list(root_node)
print(function_list)

# [<Node type=function_definition, start_point=(6, 0), end_point=(8, 1)>]
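
The breadth-first collection described earlier can be illustrated without tree-sitter installed. This is a rough sketch on a toy tree, not the library's implementation: ToyNode and collect_nodes are hypothetical names, and ToyNode only imitates the type and children attributes that a tree-sitter node exposes.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ToyNode:
    """Hypothetical stand-in for a syntax node: only `type` and `children`."""
    type: str
    children: List["ToyNode"] = field(default_factory=list)


def collect_nodes(root: ToyNode, wanted_types: Set[str]) -> List[ToyNode]:
    """Walk the tree breadth-first and return every node whose type is wanted."""
    found = []
    queue = deque([root])
    while queue:
        node = queue.popleft()          # visit the oldest queued node first
        if node.type in wanted_types:
            found.append(node)
        queue.extend(node.children)     # children are visited after all siblings
    return found


# Illustrative tree: one top-level function and one method nested in a class.
tree = ToyNode("translation_unit", [
    ToyNode("comment"),
    ToyNode("function_definition", [ToyNode("compound_statement")]),
    ToyNode("class_specifier", [
        ToyNode("field_declaration_list", [ToyNode("function_definition")]),
    ]),
])

functions = collect_nodes(tree, {"function_definition"})
print(len(functions))  # 2
```

Because the walk is breadth-first, the top-level function is returned before the method nested inside the class.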

Get function metadata (e.g. the function's name, parameters, and optional return type):

function = function_list[0]

metadata = CppParser.get_function_metadata(function, raw_code)

# {'identifier': 'sum2num', 'parameters': {'a': 'int', 'b': 'int'}, 'type': 'double'}
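
As a usage illustration, the metadata dictionary can be turned back into a readable signature. This sketch assumes the dictionary has exactly the shape shown in the comment above (identifier, parameters, type); format_signature is a hypothetical helper and not part of codetext.

```python
def format_signature(metadata: dict) -> str:
    """Render a C-style signature from a metadata dict shaped like the example above."""
    params = ", ".join(
        f"{ptype or ''} {name}".strip()   # tolerate parameters without a type
        for name, ptype in metadata.get("parameters", {}).items()
    )
    return_type = metadata.get("type") or ""  # the return type is optional
    return f"{return_type} {metadata['identifier']}({params})".strip()


# Same dictionary as the documented get_function_metadata result.
metadata = {'identifier': 'sum2num', 'parameters': {'a': 'int', 'b': 'int'}, 'type': 'double'}
print(format_signature(metadata))  # double sum2num(int a, int b)
```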

Get docstring (documentation) of a function

docstring = CppParser.get_docstring(function, raw_code)

# ['Sum of 2 number \n@param a int number \n@param b int number']
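
The returned docstring is plain text, so any further structuring is left to the caller. Below is a minimal sketch, assuming the string format shown in the comment above, that separates the summary from the @param entries; split_docstring is a hypothetical helper, not a codetext function.

```python
from typing import Dict, Tuple


def split_docstring(docstring: str) -> Tuple[str, Dict[str, str]]:
    """Split a docstring into its summary text and a {param_name: description} map."""
    summary_lines = []
    params = {}
    for line in docstring.splitlines():
        line = line.strip()
        if line.startswith("@param"):
            # "@param a int number" -> ["@param", "a", "int number"]
            parts = line.split(maxsplit=2)
            if len(parts) >= 2:
                params[parts[1]] = parts[2] if len(parts) == 3 else ""
        elif line:
            # Everything else (including other tags) is kept as summary text.
            summary_lines.append(line)
    return " ".join(summary_lines), params


# First element of the list shown in the documented get_docstring result.
docstring = "Sum of 2 number \n@param a int number \n@param b int number"
summary, params = split_docstring(docstring)
print(summary)  # Sum of 2 number
print(params)   # {'a': 'int number', 'b': 'int number'}
```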

We also provide two methods for extracting class objects:

class_list = CppParser.get_class_list(root_node)
# and
metadata = CppParser.get_metadata_list(root_node)

Limitations

codetext depends heavily on tree-sitter syntax:

  • We rely on tree-sitter grammars to extract the desired nodes, such as functions, classes, function names (identifiers), and class argument lists. As a result, codetext can break when tree-sitter releases an update or a grammar changes its syntax.

  • We try our best to capture every possibility, but plenty of cases remain uncovered. We welcome contributions from the community.

codetext-parser's People

Contributors

minhna1112, namcyan, nhtlongcs, nmd2k
