Giter Site home page Giter Site logo

semgwt's Introduction

SemGWT

Intro

This is an attempt to approach the problem of extracting GWT RPC method information from generated JavaScript files with Semgrep, without relying on regular expressions to parse complex JavaScript.

Design

At this point we try to provide standalone tools to implement different phases of the extraction process:

  • The preprocessor extracts pure JavaScript code from static cache files provided by GWT applications
  • Semgrep rules are used to extract RPC interface information from the preprocessed JavaScript
  • Post-processing is needed to transform the extracted data to HTTP requests (or other formats useful for the user)

The simple structure of Semgrep rules should allow easy addition of support for different GWT versions and maintenance.

Samples wanted!

One of the hurdles of GWT is the wide variety of cache file formats different versions produce. If you have any samples to share please open a PR or reach out on usual channels!

Performance

One weakness of Semgrep is that it can easily consume (almost) boundless memory. Because of this, rules must be created and tested carefully, and the process needs special parameters for larger files.

In particular, defining code block patterns with ellipses inside (e.g. function $FOO(){var $X; ...}) can have detrimental effect on performance when parsing large inputs, so you should try to avoid this and focus on sequences of statements.

Semgrep has both size and time limits by default. To analyze a 10MB JavaScript file, I had to disable timeouts, and set the size limit above the size of the file:

--timeout 0 \
--max-target-bytes 20 MB

Additionally, to prevent OOM Killer intervention (or system freeze...) it is recommended to set the memory limit. In the above case the semgrep process can easily consume 10 gigabytes of memory:

--max-memory 10240

File sizes are limited by default for a reason: memory requirement of parsing a single large file or orders of magnitude bigger than parsing the same code split into multiple files. It is highly recommended to split the input multiple files during preprocessing. The theoretical cost of this is losing inter-file references, but this shouldn't be a problem in the case of GWT.

Example

Split files

Run the preprocessor:

python3 preprocess.py path/to/HEX.cache.js

This will create a directory named HEX in the current directory with split JavaScript files in it.

Run the Semgrep rules on the files:

semgrep -c SemGWT.yaml --json split_files_dir/ > semgwt.out.json

Run post-processing on the generated JSON:

TODO

Large files

semgrep -c SemGWT.yaml \
--max-target-bytes 20MB \
--timeout 0 \
--max-memory 10240 \
samples/clean

This can take more than 20mins to complete against the said 10MB test case (can't be included in the repo unfortunately for license reasons).

TODO

  • Preprocessing of cache files to get raw JavaScript (in progress)
  • Semgrep output processing to generate proper method signatures or even sample HTTP requests
  • Multi-version support (samples needed!)

semgwt's People

Contributors

dnet avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.