parse's Introduction

This package contains several lexers and parsers written in Go. All subpackages are built to be streaming and high performance, and to be in accordance with the official (latest) specifications.

The lexers are implemented using buffer.Lexer in https://github.com/tdewolff/parse/buffer and the parsers work on top of the lexers. Some subpackages have hashes defined (using Hasher) that speed up common byte-slice comparisons.
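
Since v2, input is wrapped in a parse.Input before being handed to a lexer or parser. A minimal sketch of the three constructors, which are also the ones appearing in the issues further down:

package main

import (
	"os"

	"github.com/tdewolff/parse/v2"
)

func main() {
	// Each constructor yields a *parse.Input that the subpackage
	// lexers and parsers consume.
	_ = parse.NewInput(os.Stdin)                  // from an io.Reader
	_ = parse.NewInputString("a{color:red}")      // from a string
	_ = parse.NewInputBytes([]byte(`{"k":true}`)) // from a []byte
}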

Buffer

Reader

Reader is a wrapper around a []byte that implements the io.Reader interface. It is comparable to bytes.Reader but has slightly different semantics (and a slightly smaller memory footprint).
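
A minimal sketch of its use; buffer.NewReader wraps the given []byte (as described above) and can be read like any io.Reader:

package main

import (
	"fmt"
	"io"

	"github.com/tdewolff/parse/v2/buffer"
)

func main() {
	r := buffer.NewReader([]byte("hello world"))
	b, err := io.ReadAll(r) // behaves much like bytes.Reader
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", b)
}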

Writer

Writer is a buffer that implements the io.Writer interface and expands the buffer as needed. The reset functionality allows for better memory reuse. After calling Reset, it will overwrite the current buffer and thus reduce allocations.
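
A sketch of the Reset pattern; the NewWriter constructor and the Bytes method are assumptions on my part, so verify against the buffer package docs:

package main

import (
	"fmt"

	"github.com/tdewolff/parse/v2/buffer"
)

func main() {
	// Assumed API: NewWriter wraps a []byte and expands it as needed.
	w := buffer.NewWriter(make([]byte, 0, 16))
	w.Write([]byte("first"))
	fmt.Printf("%s\n", w.Bytes())

	// Reset keeps the allocated capacity and overwrites the old contents,
	// reducing allocations on the next pass.
	w.Reset()
	w.Write([]byte("second"))
	fmt.Printf("%s\n", w.Bytes())
}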

Lexer

Lexer is a read buffer specifically designed for building lexers. It keeps track of two positions: a start and an end position. The start position is the beginning of the current token being parsed; the end position is moved forward until a valid token is found. Calling Shift will collapse the positions to the end and return the parsed []byte.

The end position is moved with Move(int), which also accepts negative integers. One can also save the current position with Pos() int before trying to parse a token, and on failure rewind with Rewind(int), passing the previously saved position.

Peek(int) byte will peek forward (relative to the end position) and return the byte at that location. PeekRune(int) (rune, int) returns the UTF-8 rune at the given byte position and its byte length. Upon an error Peek will return 0; the user must peek at every character and not skip any, otherwise it may skip a 0 and panic on out-of-bounds indexing.

Lexeme() []byte will return the currently selected bytes, and Skip() will collapse the selection. Shift() []byte is a combination of Lexeme() []byte and Skip().

When the passed io.Reader returns an error, Err() error will return that error, even if the end of the buffer has not yet been reached.
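
Putting this together, a sketch of lexing a run of digits with Peek/Move/Shift; the NewLexer constructor taking an io.Reader is an assumption, so check the buffer package docs:

package main

import (
	"fmt"
	"strings"

	"github.com/tdewolff/parse/v2/buffer"
)

func main() {
	z := buffer.NewLexer(strings.NewReader("12345rest"))
	// Move the end position forward while we see digits; Peek returns 0
	// on an error or at the end of the buffer.
	for c := z.Peek(0); '0' <= c && c <= '9'; c = z.Peek(0) {
		z.Move(1)
	}
	fmt.Printf("token: %s\n", z.Shift()) // token: 12345
}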

StreamLexer

StreamLexer behaves like Lexer but uses a buffer pool to read in chunks from an io.Reader, retaining old buffers in memory while they are still in use and reusing them otherwise. It holds an array of buffers so that everything can be kept in memory. Calling Free(n int) frees up n bytes from the internal buffer(s). Calling ShiftLen() int returns the number of bytes that have been shifted since the previous call to ShiftLen, which can be used to specify how many bytes need to be freed from the buffer. If you don't need to keep returned byte slices around, call Free(ShiftLen()) after every Shift call.
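
A sketch of the Free(ShiftLen()) pattern just described; the NewStreamLexer constructor name is an assumption here:

package main

import (
	"fmt"
	"strings"

	"github.com/tdewolff/parse/v2/buffer"
)

func main() {
	z := buffer.NewStreamLexer(strings.NewReader("aaa bbb"))
	for {
		c := z.Peek(0)
		if c == 0 { // 0 signals an error or EOF; check z.Err()
			break
		}
		z.Move(1)
		if c == ' ' {
			fmt.Printf("%s\n", z.Shift()) // word plus the delimiter
			// Nothing references the shifted bytes anymore, so let the
			// buffer pool reuse them.
			z.Free(z.ShiftLen())
		}
	}
	fmt.Printf("%s\n", z.Shift()) // trailing word
}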

Strconv

This package contains string conversion functions, much like the standard library's strconv package, but specifically tailored for the performance needs of the minify package.

For example, the floating-point to string conversion function is approximately twice as fast as the standard library, but it is not as precise.
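
For illustration, a hedged sketch of that trade-off; the AppendFloat name and its stdlib-like shape are assumptions about this package, so verify against the strconv subpackage docs:

package main

import (
	"fmt"
	stdstrconv "strconv"

	"github.com/tdewolff/parse/v2/strconv"
)

func main() {
	// Assumed API, mirroring the standard library's AppendFloat; the
	// bool is assumed to report whether the fast path handled the value.
	b, ok := strconv.AppendFloat(nil, 0.1+0.2, -1)
	fmt.Println(string(b), ok)

	// The standard library variant: slower here, but exact.
	fmt.Println(stdstrconv.FormatFloat(0.1+0.2, 'f', -1, 64))
}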

CSS

This package is a CSS3 lexer and parser. Both follow the specification at CSS Syntax Module Level 3. The lexer takes an io.Reader and converts it into tokens until the EOF. The parser returns a parse tree of the full io.Reader input stream, but the low-level Next function can be used for stream parsing to return grammar units until the EOF.

See README here.
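
A short token loop for the lexer (note that newer versions take a *parse.Input rather than an io.Reader; see the css.NewParser issue below):

package main

import (
	"fmt"

	"github.com/tdewolff/parse/v2"
	"github.com/tdewolff/parse/v2/css"
)

func main() {
	l := css.NewLexer(parse.NewInputString("a { color: red; }"))
	for {
		tt, text := l.Next()
		if tt == css.ErrorToken { // io.EOF or a real error; inspect l.Err()
			break
		}
		fmt.Printf("%v %q\n", tt, text)
	}
}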

HTML

This package is an HTML5 lexer. It follows the specification at The HTML syntax. The lexer takes an io.Reader and converts it into tokens until the EOF.

See README here.
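
The token loop is analogous to the CSS example above:

package main

import (
	"fmt"

	"github.com/tdewolff/parse/v2"
	"github.com/tdewolff/parse/v2/html"
)

func main() {
	l := html.NewLexer(parse.NewInputString(`<p class="x">text</p>`))
	for {
		tt, data := l.Next()
		if tt == html.ErrorToken { // io.EOF or a real error; inspect l.Err()
			break
		}
		fmt.Printf("%v %q\n", tt, data)
	}
}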

JS

This package is a JS lexer (ECMA-262, edition 6.0). It follows the specification at ECMAScript Language Specification. The lexer takes an io.Reader and converts it into tokens until the EOF.

See README here.
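
The same shape for JS; this mirrors the loop used in the "JSON is parsed as valid Javascript" issue below:

package main

import (
	"fmt"
	"io"

	"github.com/tdewolff/parse/v2"
	"github.com/tdewolff/parse/v2/js"
)

func main() {
	l := js.NewLexer(parse.NewInputString("var x = 1 / 2;"))
	for {
		tt, text := l.Next()
		if tt == js.ErrorToken {
			if l.Err() != io.EOF {
				fmt.Println("lex error:", l.Err())
			}
			break
		}
		fmt.Printf("%v %q\n", tt, text)
	}
}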

JSON

This package is a JSON parser (ECMA-404). It follows the specification at JSON. The parser takes an io.Reader and converts it into tokens until the EOF.

See README here.
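
A sketch, assuming the parser exposes a Next method returning grammar units much like the CSS parser's low-level API; verify the names (NewParser, ErrorGrammar) against the json subpackage docs:

package main

import (
	"fmt"

	"github.com/tdewolff/parse/v2"
	"github.com/tdewolff/parse/v2/json"
)

func main() {
	p := json.NewParser(parse.NewInputString(`{"key": [1, 2.5, true]}`))
	for {
		gt, text := p.Next() // assumed: grammar unit type plus its bytes
		if gt == json.ErrorGrammar { // io.EOF or a real error; inspect p.Err()
			break
		}
		fmt.Printf("%v %q\n", gt, text)
	}
}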

SVG

This package contains common hashes for SVG1.1 tags and attributes.

XML

This package is an XML1.0 lexer. It follows the specification at Extensible Markup Language (XML) 1.0 (Fifth Edition). The lexer takes an io.Reader and converts it into tokens until the EOF.

See README here.
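
The token loop once more; note the issue below about xml.NewLexer switching from io.Reader to *parse.Input in newer versions:

package main

import (
	"fmt"

	"github.com/tdewolff/parse/v2"
	"github.com/tdewolff/parse/v2/xml"
)

func main() {
	l := xml.NewLexer(parse.NewInputString(`<note id="1">DUCK</note>`))
	for {
		tt, data := l.Next()
		if tt == xml.ErrorToken { // io.EOF or a real error; inspect l.Err()
			break
		}
		fmt.Printf("%v %q\n", tt, data)
	}
}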

License

Released under the MIT license.

parse's Issues

CSS: stream parser

Stream out nodes from the parser to allow infinitely sized CSS files. This also avoids ioutil.ReadAll, which should be avoided anyway! Returning a stream of nodes just from the Stylesheet node might not be sufficient, because a Ruleset or AtRule node might be very large (and defeat the streaming feature).

But flattening the entire parse tree is hard, because we somehow need to return values from which someone could build a parse tree (if one were to fully loop through the stream). So we need to return things like BeginBlock and EndBlock, and units like Declaration, SelectionRules, AtRule (without the block) or Token. SelectionRules is always followed by a block and must be ended with EndBlock, while an AtRule might be followed by a block.

The units used by the minifier are Declaration and SelectionRules; anything else (mainly blocks) can be broken up for streaming.

I have no clue how to work this out at the moment, ideas are appreciated!

add benchmark

add benchmarks, preferably comparing this to the standard library's encoding/xml

this is not necessary but would be a nice addition

Minify breaks on JSON since parse v2.3.6?

@regisphilibert reports at gohugoio/hugo#6472 that hugo --minify breaks on JSON since 59.0, which was right after Hugo updated the versioned dependency on tdewolff/parse and tdewolff/minify. The error message is:

Building sites … ERROR 2019/11/01 16:57:44 parse error:1:1: unexpected character
    1: {"output":{"data":{"created":"2010-02-11T12:46:02Z","draf...
       ^
ERROR 2019/11/01 16:57:44 parse error:1:1: unexpected character
    1: {"output":{"data":{"created":"2010-03-08T16:25:25Z","draf...
       ^

which seems to point to a bug in parse's JSON parser calling valid JSON an error.

@regisphilibert later on produced a MWE for us to test on, but with a different error message:

Error: Error building site: failed to render pages: parse error:1:1: unexpected character '<'
    1: <!DOCTYPE html><html><head><title>http://example.org/article/</title><link rel="canonical" href="http://example.org/article/"/><meta name="robots" content="noindex"><meta charset="utf-8" /><meta http-equiv="refresh" content="0; url=http://example.org/article/" /></head></html>

which in turn seems to indicate a bug in Hugo passing HTML code to minify (and to parse) as if it were JSON

Further tests revealed commit 7763141 in tdewolff/parse as the oldest version of parse in which I could trigger the error.

When you have time, please read gohugoio/hugo#6472 for more details. Unfortunately, I am still struggling to understand the code, and I have exceeded my quota for further investigation today. So I'd better turn to an expert like you to determine whether there is any bug in parse or minify at all regarding this issue.

Many thanks for your help!

err: expected ( instead of get in method definition

Hello! Getting the AST for the following code produces an error:

class Rectangle {

    // userName   TODO ERROR

    // constructor
    constructor(height, width) {
        this.height = height;
        this.width = width;

        this.userName = "hello"
    }

    userName // TODO ERROR

    // Getter
    get area() {
        return this.calcArea()
    }

    // Method
    calcArea() {
        return this.height * this.width;
    }
}

Go code:

	jsStr, err := ioutil.ReadFile("./testfile/class.js")

	if err != nil {
		fmt.Printf("ERROR %s", err)
	}

	ast, err := js.Parse(parse.NewInputBytes(jsStr))

The error:

err:expected ( instead of get in method definition on line 19 and column 5
19: get area() {
^
ast:Decl(class Rectangle Method(constructor Params(Binding(height), Binding(width)) Stmt({ Stmt((this.height)=height) Stmt((this.width)=width) Stmt((this.userName)="hello") })) Method(userName Params() Stmt({ })))

Support Go module

Hi @tdewolff,

I think you misunderstood the solution I mentioned in #41. So I opened this issue.

In fact, we don't need to release v3 to allow both tdewolff/parse and tdewolff/minify to support Go modules. And the /v2 suffix is mandatory, not optional.

Please take a look at the changes I made in my fork (aofei/parse). I tested it and it works fine (for both v1 and v2).

The changes:

  1. Create branch v1
  2. Commit 4049502
  3. Release v1.1.1 for the 4049502
  4. Create branch v2
  5. Commit 0975233
  6. Release v2.3.5 for the 0975233

I learned from russross/blackfriday.

[XML Parser] received nil when calling lexer.Text() on TextToken

Reproduce with this simple XML:

<?xml version="1.0"?>\r\t\n<session id="12345" user="1$2d#"><!DOCTYPE note>DUCK</xml>
When the lexer reaches "DUCK", calling lexer.Text() returns nil.

After a short investigation, it might be that the text buffer is not updated whenever a TextToken is returned.

Is this intentional, i.e. a feature, not a bug? :)

CSS replacement character instead of NULL

CSS dictates that a 0x00 byte is to be replaced by U+FFFD

This is not strictly required (it can remain a 0x00 byte), but the lexer must not mistake this byte for an error from Shifter.Peek, which returns 0x00 on errors. Explicit error checking is required.

SASS support

Are you planning to support sass/scss parsing in the future?

Example not working

I could make the following work

// true because this is the content of an inline style attribute
p := css.NewParser(bytes.NewBufferString("color: red;"), true)

by additionally importing
"github.com/tdewolff/parse/v2"

and replacing bytes.NewBufferString("color: red;") with

parse.NewInput(bytes.NewBufferString(minimalStyleSheet))

Thank you for this useful package, Stephan

Confused explanation of buffer.Reader

Reader is a wrapper around a []byte that implements the io.Reader interface. It is a much thinner layer than bytes.Buffer provides and is therefore faster.

But bytes.Buffer does so much more. Your comparison should be to bytes.Reader!

The only meaningful difference between your implementation and bytes.Reader is that your implementation has a Bytes method. The bodies of the Read methods are close enough to identical that there's likely no reason not to use the stdlib one. The extra methods on bytes.Reader shouldn't hurt.

bytes.Reader doesn't have Bytes, but you can extract the bytes from it with WriteTo.

You could just switch to bytes.Reader, or at least fix the text to not talk about bytes.Buffer.

EqualFold from util.go has weird condition

Given the implementation:

// EqualFold returns true when s matches case-insensitively the targetLower (which must be lowercase).
func EqualFold(s, targetLower []byte) bool {
	if len(s) != len(targetLower) {
		return false
	}
	for i, c := range targetLower {
		if s[i] != c && (c < 'A' && c > 'Z' || s[i]+('a'-'A') != c) {
			return false
		}
	}
	return true
}

Interesting parts:

// In this line:
if s[i] != c && (c < 'A' && c > 'Z' || s[i]+('a'-'A') != c) {
// This expression:
c < 'A' && c > 'Z'
// ...it's always false.

Found using gocritic linter badCond check.
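
For reference, my reading of the intended guard: the range check belongs on the lowercase letters and should use ||, so that non-letters require an exact match while letters also match their uppercase form. This is a sketch of a fix, not the project's committed one:

// EqualFold returns true when s matches case-insensitively the targetLower
// (which must be lowercase).
func EqualFold(s, targetLower []byte) bool {
	if len(s) != len(targetLower) {
		return false
	}
	for i, c := range targetLower {
		// Exact match, or c is a lowercase letter and s[i] is its
		// uppercase counterpart.
		if s[i] != c && (c < 'a' || 'z' < c || s[i]+('a'-'A') != c) {
			return false
		}
	}
	return true
}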

First character dropped from input

Given the program

package main

import (
    "log"
    "os"

    "github.com/tdewolff/parse/css"
)

func main() {
    ss, err := css.Parse(os.Stdin)
    if err != nil {
        log.Fatalln(err)
    }
    ss.Serialize(os.Stdout)
}

and the input

div{}
div{}

I would expect

div{}div{}

but running

go run test.go <test.css

I get

iv{}iv{}

Inspecting the values, it seems that the first byte of the input is always read as \n (given a multibyte UTF-8 rune, it will truncate the first byte but keep the rest).

edit: I fail at markdown

Panic: runtime error: slice bounds out of range

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "log"

    "github.com/tdewolff/buffer"
    "github.com/tdewolff/minify"
    "github.com/tdewolff/minify/css"
    "github.com/tdewolff/minify/html"
    "github.com/tdewolff/minify/js"
    "github.com/tdewolff/minify/svg"
)

func main() {
    buff := new(bytes.Buffer)
    file := buffer.NewReader([]byte(`<!DOCTYPE html>
<html lang='en' ng-app="app" ng-cloak="ng-cloak">
    <head>
        <meta charset='utf-8'>
        <meta name='viewport' content='width=device-width, initial-scale=1.0, minimum-sacle=1.0, maximum-scale=1.0, user-scalable=no'>
        <meta name='description' content=''>
        <base href='/'>
        <title>Golang Angular Materail Todo App </title>
        <link rel='shortcut icon' href='/favicon.ico'>
        <link rel='stylesheet' href='/assets/style.min.css'>
        <script src="/assets/libs.min.js">
        </script>
        <script src="/assets/app.min.js">

        </script>
    </head>
    <body ng-controller="todos" layout="column">
        <script id="new_list" type="text/ng-template">
            <md-dialog aria-label='Add Todo List' ng-cloak="ng-cloak" flex="70">
                <form name='form' ng-submit="add_list(newlist)" novalidate="novalidate">
                    <md-toolbar>
                        <div class="md-toolbar-tools">
                            <h2>Add Todo List</h2><span flex="flex">                            </span>
                            <md-button class="md-icon-button" ng-click='cancel()'>
                                <md-icon>cancel</md-icon>
                            </md-button>
                        </div>
                    </md-toolbar>
                    <md-dialog-content flex="flex">
                        <div class="md-dialog-content">
                            <md-input-container class="md-block">
                                <input ng-model='list.Name' type="text" name="Name" placeholder='Name' required="required">
                                <div ng-messages="form.Name.$error" ng-if="form.Name.$invalid && form.Name.$touched">
                                    <p ng-message="required">Name is required.</p>
                                </div>
                            </md-input-container>
                            <md-input-container class="md-block">
                                <input ng-model='list.Color' type="color" name="Color" placeholder='Color' required="required">
                                <div ng-messages="form.Color.$error" ng-if="form.Color.$invalid && form.Color.$touched">
                                    <p ng-message="required">Color is required.</p>
                                </div>
                            </md-input-container>
                        </div>
                    </md-dialog-content>
                    <md-dialog-actions layout='row'><span flex=''>                        </span>
                    </md-dialog-actions>
                </form>
            </md-dialog>
        </script>
        <script id="new_todo" type="text/ng-template">
            <md-dialog aria-label='Add Todo List' ng-cloak="ng-cloak" flex="70">
                <form name='form' ng-submit="add_list(newlist)" novalidate="novalidate">
                    <md-toolbar>
                        <div class="md-toolbar-tools">
                            <h2>Add Todo Item</h2><span flex="flex">                            </span>
                            <md-button class="md-icon-button" ng-click='cancel()'>
                                <md-icon>cancel</md-icon>
                            </md-button>
                        </div>
                    </md-toolbar>
                    <md-dialog-content flex="flex">
                        <div class="md-dialog-content">
                            <md-input-container class="md-block">
                                <input ng-model='newlist' name="Name" placeholder='Name' required="required">
                                <div ng-messages="form.Name.$error" ng-if="form.Name.$invalid && form.Name.$touched">
                                    <p ng-message="required">Name is required.</p>
                                </div>
                            </md-input-container>
                        </div>
                    </md-dialog-content>
                    <md-dialog-actions layout='row'><span flex=''>                        </span>
                        <md-button class="md-primary" type="submit">Add List


                        </md-button>
                    </md-dialog-actions>
                </form>
            </md-dialog>
        </script>
        <md-toolbar>
            <div class="md-toolbar-tools">
                <h3>My Todo Lists</h3><span flex="flex">                </span>
                <md-button class="md-icon-button" aria-label='Add' ng-click='new_list($event)'>
                    <md-icon>add</md-icon>
                </md-button>
            </div>
        </md-toolbar>
        <md-content layout="row" flex="flex">
            <div class="md-padding list" layout="row" flex="flex" ng-repeat="(name, list) in todos">
                <div class="md-whiteframe-1dp" layout="column" flex="flex">
                    <md-toolbar flex="flex">
                        <div class="md-toolbar-tools">
                            <h3>{{name}}</h3><span flex="flex">                            </span>
                            <md-button class="md-icon-button">
                                <md-icon>add</md-icon>
                            </md-button>
                        </div>
                    </md-toolbar>
                    <md-content layout="column">
                        <md-list>
                            <md-list-item ng-repeat='todo in list'>
                                <md-checkbox></md-checkbox>
                                <p> {{ todo.Content }} </p>
                            </md-list-item>
                        </md-list>
                    </md-content>
                </div>
            </div>
        </md-content>
        <!-- At the end because we need the dom. -->

    </body>
</html>`))

    m := minify.New()
    m.Add("text/html", &html.Minifier{
        KeepDefaultAttrVals: true,
        KeepWhitespace:      true,
    })
    m.AddFunc("text/css", css.Minify)
    m.AddFunc("text/javascript", js.Minify)
    m.AddFunc("image/svg+xml", svg.Minify)

    err := m.Minify("text/html", buff, ioutil.NopCloser(file))
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("buff: %s", buff.Bytes())
}

Panics with the follow trace:

2016/01/10 19:11:19 RAWTAG text/ng-template
2016/01/10 19:11:19 RAWTAG  list'>

panic: runtime error: slice bounds out of range

goroutine 1 [running]:
github.com/tdewolff/parse.Mediatype(0xc8200acaa7, 0xf, 0x559, 0x0, 0x0, 0x0, 0x0)
    /home/ome/go/src/github.com/tdewolff/parse/common.go:222 +0x4bf
github.com/tdewolff/minify/html.(*Minifier).Minify(0xc82000aca0, 0xc8200125d0, 0x7ff5356e22b8, 0xc82001a2a0, 0x7ff5356e22e0, 0xc82000acb0, 0x0, 0x0, 0x0)
    /home/ome/go/src/github.com/tdewolff/minify/html/html.go:105 +0x4d26
github.com/tdewolff/minify.(*M).MinifyMimetype(0xc8200125d0, 0xc82000acc0, 0x9, 0x10, 0x7ff5356e22b8, 0xc82001a2a0, 0x7ff5356e22e0, 0xc82000acb0, 0x0, 0x0, ...)
    /home/ome/go/src/github.com/tdewolff/minify/minify.go:116 +0x13e
github.com/tdewolff/minify.(*M).Minify(0xc8200125d0, 0x59bdc0, 0x9, 0x7ff5356e22b8, 0xc82001a2a0, 0x7ff5356e22e0, 0xc82000acb0, 0x0, 0x0)
    /home/ome/go/src/github.com/tdewolff/minify/minify.go:107 +0xec
main.main()
    /home/ome/go/src/test/test.go:142 +0x669

JS: tokenizing regular expressions

I noticed that properly tokenizing JS regular expressions is left up to the user with (*Lexer).RegExp(). This makes things difficult, as there's no easy way to differentiate between a legitimate DivToken/DivEqToken and a regexp.

I'm trying to rewrite some key identifiers while tokenizing and keeping the JS itself intact. As far as I know, this isn't possible to do with ASTs, since they cannot be converted back to valid JS.

I tried storing previous tokens and checked whether they would create a context where regular expressions can exist.

For example:

1/2/3

The first DivEqToken would not start a new regexp as 1 is not a valid delimiter. Same goes for

(1)/2/3

This actually works in most cases. However, it gets more complicated with inline if statements and for loops:

// perfectly valid JavaScript and RegExp
if(true) /x/

Hopefully, I made my issue and intentions clear :)

Thanks!

Malformed Go module

Hi, @tdewolff, I think I made a mistake in #37. I didn't realize that the major version number of this package is no longer 1.

According to the Releasing Modules (v2 or Higher), the correct module path in the go.mod file should be:

 module github.com/tdewolff/parse/v2

I'm very sorry about this. I'll submit a new PR later.

Finding class methods

First of all, thank you for an amazing package!

Question: My goal is to get all the method names from an ES class, which I was able to do but with reflect.
Is there some better way than for example doing the following:

package main

import (
	"fmt"
	"io/ioutil"

	"github.com/tdewolff/parse/v2"
	"github.com/tdewolff/parse/v2/js"
)

func main() {
	jsStr, err := ioutil.ReadFile("testClass.js")

	if err != nil {
		fmt.Println(err)
		return
	}

	ast, err := js.Parse(parse.NewInputBytes(jsStr))
	if err != nil {
		fmt.Println(err)
		return
	}

	for _, v := range ast.BlockStmt.List {
		//fmt.Println(reflect.TypeOf(v))
		//fmt.Println(v)

		switch v.(type) {
		case *js.ExportStmt:
			exprt := v.(*js.ExportStmt)

			//check if ClassDecl
			classDecl, ok := exprt.Decl.(*js.ClassDecl)
			if ok == false {
				continue
			}

			for _, v := range classDecl.Methods {
                                 //method's name
				fmt.Println(v.Name)
			}
		}
	}
}

v2.5.0 breaks signature of css.NewParser

It looks like in v2.4.4 css.NewParser was:

func NewParser(r io.Reader, isInline bool) *Parser {

But now in 2.5.0 is:

func NewParser(r *parse.Input, isInline bool) *Parser {

I have a bunch of existing code that depends on being able to pass an io.Reader for parsing. (And this change breaks semantic versioning rules.)

A couple alternative approaches:

  • Another method like NewInputParser() could accept a *parse.Input, leaving NewParser as it was before
  • Or NewParser() could retain the old signature and then inside the method do a type assertion to check if *parse.Input was provided and perform additional logic in that case.

Any thoughts or suggestions on this?

feature request: expose line number

Hi there,

I'm using the parser to check validity of something and I'd like to know where an issue occurred.

Let's use the CSS parser as an example. (This is the Lexer, but the line count could be exposed in the parser as well).

Looks like the area would be in here:

// Next returns the next Token. It returns ErrorToken when an error was encountered. Using Err() one can retrieve the error message.
func (l *Lexer) Next() (TokenType, []byte) {
	switch l.r.Peek(0) {
	case ' ', '\t', '\n', '\r', '\f':
		l.r.Move(1)
		for l.consumeWhitespace() {
		}
		return WhitespaceToken, l.r.Shift()

It would seem the lexer could keep a lineNum int and increment it here (maybe with changes to consumeWhitespace as well).

thoughts?

regards,

n

Minify XML tests break with 2.4.0

An automated rebuild of minify against parse 2.4.0 fails as:

github.com/tdewolff/minify/xml
--- FAIL: TestXML (0.00s)
    --- FAIL: TestXML/<A>x</A> (0.00s)
    xml_test.go:56:: 
            <A>x</A>
            <Ax</A>
            <A>x</A>
    --- FAIL: TestXML/<a><b>x</b></a> (0.00s)
    xml_test.go:56:: 
            <a><b>x</b></a>
            <a<bx</b></a>
            <a><b>x</b></a>
    --- FAIL: TestXML/<a><b>x_y</b></a> (0.00s)
    xml_test.go:56:: 
            <a><b>x\ny</b></a>
            <a<bx\ny</b></a>
            <a><b>x\ny</b></a>
    --- FAIL: TestXML/<a>_<![CDATA[_a_]]>_</a> (0.00s)
    xml_test.go:56:: 
            <a> <![CDATA[ a ]]> </a>
            <aa</a>
            <a>a</a>
    --- FAIL: TestXML/<a_>a</a_> (0.00s)
    xml_test.go:56:: 
            <a >a</a >
            <aa</a>
            <a>a</a>
    --- FAIL: TestXML/<?xml__version="1.0"_?> (0.00s)
    xml_test.go:56:: 
            <?xml  version="1.0" ?>
            <?xml version="1.0"
            <?xml version="1.0"?>
    --- FAIL: TestXML/<x_a="_a_____b_"/> (0.00s)
    xml_test.go:56:: 
            <x a=" a \n\r\t b "/>
            <x a=" a     b "
            <x a=" a     b "/>
    --- FAIL: TestXML/<x>&amp;&lt;&gt;</x> (0.00s)
    xml_test.go:56:: 
            <x>&amp;&lt;&gt;</x>
            <x&amp;&lt;></x>
            <x>&amp;&lt;></x>
    --- FAIL: TestXML/<!doctype_html> (0.00s)
    xml_test.go:56:: 
            <!doctype html>
            <!doctype html=
            <!doctype html=>
    --- FAIL: TestXML/<x>_<!--y-->_</x> (0.00s)
    xml_test.go:56:: 
            <x>\n<!--y-->\n</x>
            <x</x>
            <x></x>
    --- FAIL: TestXML/<style>lala{color:red}</style> (0.00s)
    xml_test.go:56:: 
            <style>lala{color:red}</style>
            <stylelala{color:red}</style>
            <style>lala{color:red}</style>
--- FAIL: TestXMLKeepWhitespace (0.00s)
    --- FAIL: TestXMLKeepWhitespace/_<div>_<i>_test_</i>_<b>_test_</b>_</div>_ (0.00s)
    xml_test.go:84:: 
             <div> <i> test </i> <b> test </b> </div> 
            <div <i test </i> <b test </b> </div>
            <div> <i> test </i> <b> test </b> </div>
    --- FAIL: TestXMLKeepWhitespace/<x>_<!--y-->_</x> (0.00s)
    xml_test.go:84:: 
            <x>\n<!--y-->\n</x>
            <x\n</x>
            <x>\n</x>
    --- FAIL: TestXMLKeepWhitespace/<style>lala{color:red}</style> (0.00s)
    xml_test.go:84:: 
            <style>lala{color:red}</style>
            <stylelala{color:red}</style>
            <style>lala{color:red}</style>
    --- FAIL: TestXMLKeepWhitespace/<x>_<?xml?>_</x> (0.00s)
    xml_test.go:84:: 
            <x> <?xml?> </x>
            <x<?xml </x>
            <x><?xml?> </x>
    --- FAIL: TestXMLKeepWhitespace/<x>_<![CDATA[_x_]]>_</x> (0.00s)
    xml_test.go:84:: 
            <x> <![CDATA[ x ]]> </x>
            <x x </x>
            <x> x </x>
    --- FAIL: TestXMLKeepWhitespace/<x>_<![CDATA[_<<<<<_]]>_</x> (0.00s)
    xml_test.go:84:: 
            <x> <![CDATA[ <<<<< ]]> </x>
            <x<![CDATA[ <<<<< ]]></x>
            <x><![CDATA[ <<<<< ]]></x>
FAIL
exit status 1
FAIL	github.com/tdewolff/minify/xml	0.004s

The first tag seems to be missing its closing >.

First selector in media query sometimes parsed incorrectly.

@media print {
    .class {
         width: calc(10% - 5rem);
    }
}

is being parsed as:
(extraneous elements omitted, some declarations simplified for presentation)

&AtRuleNode{
    Block: &BlockNode{
         Nodes: []Node{
              {&TokenNode{Delim, "."}}, //should be in RuleSetNode below
              {&RuleSetNode{/*yada yada*/}},
         },
    },
}

instead of the "." going in the ruleset node. This also happens with [class] but not #id.

Looking at parseBlock, I would guess that parseAtRule isn't unshifting the first token appropriately in all cases, though I have not verified this.

JS: distinguish regexps from division correctly

Hello, good work on the minification / parsing libs.

There is an issue with lexing, though. These are rare cases, but they're still valid JavaScript. In a JS lexer, you can't get away with one-token lookbehind to figure out whether / is the beginning of a regular expression or a division (in fact, this is what Crockford got wrong in JSMin / JSLint, and many other implementations copied those bugs instead of following the spec).

Just few examples:

(a+b)/42/i  // <-- divide (a+b) by 42 and by i
if(a+b)/42/i  // <-- regular expression /42/i

function foo() {} /42/i  // <-- function declaration followed by a regex
var foo = function foo() {} /42/i  // <-- function expression divided by 42 and i

{console.log('a');} /42/i  // <-- block statement followed by a regex
({ valueOf: () => 100 }/42/i)  // <-- object expression divided by 42 and i

/* ... */

xml.NewLexer changed API breaks tdewolff/canvas from compiling when Go finds newer dependencies

I'm not a big expert on how Go modules select dependencies; I only suspect this is due to both tdewolff/canvas and tdewolff/parse/v2 being used in one project through other modules.

The compiler issues an error:

github.com/tdewolff/canvas
../canvas/latex.go:80:19: cannot use r (type *os.File) as type *parse.Input in argument to xml.NewLexer

I assume this is due to the diamond dependency problem (or whatever it's called), where the compiler chooses a newer version for the build out of the available options.

In this case it's attempting to choose the xml.NewLexer version that accepts the newer *parse.Input argument instead of the older io.Reader, although github.com/tdewolff/parse/v2 v2.4.3 is requested in canvas's go.mod.

I don't see a solution other than patching github.com/tdewolff/canvas; I did a quick test and it seems to work (zzwx-forks/canvas@e51fdb9)

Tags are breaking dep, the official golang dep tool

Hi !

Latest tag is v2.0.0 which is almost a year old.

I am now using dep (the official golang dep tool), and when I import minify in one of my projects before running...

dep init
dep ensure -update

It breaks with...

vendor/github.com/tdewolff/minify/html/html.go:194: undefined: html.Template

This is because with dep we can only specify versions of direct dependencies (minify in this case); we can't specify versions of indirect dependencies (can't pin master for parse in this case). So the code goes out of sync.

Can you release tags for all the dependencies of the latest version of minify?

Thanks!

Question about structure

Forgive me if this is a bit too esoteric; so much so that it may not be useful discussion.

But I noticed that each of your format packages [js, json, css, html, ..] has two imports: the parse package and the buffer package.

However adding

type Lexer struct {
	r *buffer.Lexer
}

to the parse library appears to make it so all the format libraries need only a single import of parse. In most cases I would say it is best to allow independent access to the different parts of a library, but I don't see these being used separately. And parse.Lexer reads better than buffer.Lexer.

I can create a pull request showing exactly what I mean and you can decide if you think its worthwhile or not. Or you can close this topic, no hard feelings.

Thanks for the library. I have been working with parse/lexing more since I have decided to implement a full MRI Ruby in Go, and this has been helpful along with other resources in really understanding the process of parsing and lexing.

I noticed you have a stream lexer, but the parsers do not appear to have the same functionality. Just as practice, I can issue a pull request for that too if you are interested.

Incorrect ASI behaviour for templates

I've been debugging a minification issue that got reduced into basically:

`template`
whatever

This gets squashed into one line:

`template`whatever

ASI should apply just as it does to regular strings, though, and they should be kept on separate lines.

build error

Hello, I wonder where the "v2" directory is? I just built the package, and there's an error like this:
root@default:/go/src/github.com/tdewolff/parse# go build
error.go:7:2: cannot find package "github.com/tdewolff/parse/v2/buffer" in any of:
/usr/local/go/src/github.com/tdewolff/parse/v2/buffer (from $GOROOT)
/go/src/github.com/tdewolff/parse/v2/buffer (from $GOPATH)

I then tried to pull the v2 directory with git, but there's another error:
package github.com/tdewolff/parse/v2: cannot find package "github.com/tdewolff/parse/v2" in any of:
/usr/local/go/src/github.com/tdewolff/parse/v2 (from $GOROOT)
/go/src/github.com/tdewolff/parse/v2 (from $GOPATH)

How can I fix this error?

Prevent unnecessary copying due to buffer overwrite

Because the tokenizer overwrites the internal buffer whenever it reaches the end, any slice that has been returned points to that buffer and becomes invalidated. This means the client must copy all slices it has retrieved and needs to keep after calling Next.

In many cases this is not required, because the end of the buffer is not reached and thus not overwritten. Either the client can pass a function that is called whenever the buffer is about to be overwritten (so that it can copy any slices it holds), or it can report whether it holds any slices at all; if it does, the internal buffer is not overwritten but another block is allocated (not preferred).

This is not the case at the moment, for we use ioutil.ReadAll because the parser is not streaming, but this would become a problem when we stream the CSS file.

Improve perfect hashing

Quickly map []byte to numbers from 0 to N, for a pre-defined set of N strings. Make []byte => Hash fast. Move the Hasher code into this repository.

Hash the []byte, check if it is lower than N, if so verify that it is the same string using bytes.Equal. The latter could be skipped if we are sure there are no collisions below a certain length of input bytes (can we optimize the search for hash seeds for this?).

Promising hash functions are https://github.com/dgryski/go-metro, https://github.com/dgryski/go-farm, and https://github.com/cespare/xxhash. Especially farm seems to be fast for short strings (len < 50).

See http://aras-p.info/blog/2016/08/09/More-Hash-Function-Tests/ for comparisons.
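
A toy sketch of the scheme under stated assumptions (FNV-1a as a stand-in hash and a tiny fixed table): hash the input, bounds-check the candidate index, then confirm with bytes.Equal to rule out collisions.

package main

import (
	"bytes"
	"fmt"
)

// The pre-defined set of N strings; a real table would hold the known
// tags/attributes.
var words = [][]byte{[]byte("color"), []byte("width"), []byte("height")}

// Stand-in hash (FNV-1a); the issue proposes metro/farm/xxhash instead.
func hash(b []byte) uint32 {
	h := uint32(2166136261)
	for _, c := range b {
		h = (h ^ uint32(c)) * 16777619
	}
	return h
}

// slot maps hash%M to a candidate word index; a perfect-hash seed search
// would guarantee that no two words share a slot.
var slot [8]int8

func init() {
	for i := range slot {
		slot[i] = -1
	}
	for i, w := range words {
		slot[hash(w)%8] = int8(i)
	}
}

// lookup returns the word's number (0..N-1), or -1 for unknown strings.
func lookup(b []byte) int {
	i := slot[hash(b)%8]
	if i < 0 || !bytes.Equal(b, words[i]) {
		return -1
	}
	return int(i)
}

func main() {
	fmt.Println(lookup([]byte("width")), lookup([]byte("unknown")))
}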

fuzzer found crash

Was fuzzing my code which uses the CSS parser, and it found a case where it hangs. The input was:

"\xd0\xfe[\xe7\x82"

The output:

program hanged (timeout 10 seconds)

SIGABRT: abort
PC=0x4bf769 m=0 sigcode=0

goroutine 1 [running]:
github.com/tdewolff/parse/v2/buffer.(*Lexer).PeekErr(...)
        /home/naitik/usr/go/pkg/mod/github.com/tdewolff/parse/[email protected]/buffer/lexer.go:89
github.com/tdewolff/parse/v2/buffer.(*Lexer).Err(...)
        /home/naitik/usr/go/pkg/mod/github.com/tdewolff/parse/[email protected]/buffer/lexer.go:84
github.com/tdewolff/parse/v2/css.(*Lexer).Next(0xc000106d20, 0xc000000000, 0x0, 0x0, 0x0)
        /home/naitik/usr/go/pkg/mod/github.com/tdewolff/parse/[email protected]/css/lex.go:242 +0x459 fp=0xc000106bb8 sp=0xc000106b98 pc=0x4bf769
github.com/daaku/cssdalek/internal/cssselector.Parse(0x51ab60, 0xc000182bd0, 0xc000106e98, 0x46cff6, 0x5f100007, 0x15d9d523, 0xa452f4e01bb6)
        /home/naitik/workspace/cssdalek/internal/cssselector/cssselector.go:141 +0x179f fp=0xc000106e60 sp=0xc000106bb8 pc=0x4c77df
github.com/daaku/cssdalek/internal/cssselector.Fuzz(0x7efc536a7000, 0x5, 0x5, 0x3)
        /home/naitik/workspace/cssdalek/internal/cssselector/cssselector.go:191 +0xae fp=0xc000106ea8 sp=0xc000106e60 pc=0x4c817e
go-fuzz-dep.Main(0xc000106f70, 0x1, 0x1)
        go-fuzz-dep/main.go:36 +0x1ad fp=0xc000106f58 sp=0xc000106ea8 pc=0x470ced
main.main()
        github.com/daaku/cssdalek/internal/cssselector/go.fuzz.main/main.go:15 +0x52 fp=0xc000106f88 sp=0xc000106f58 pc=0x4c8332
runtime.main()
        runtime/proc.go:203 +0x1fa fp=0xc000106fe0 sp=0xc000106f88 pc=0x43228a
runtime.goexit()
        runtime/asm_amd64.s:1373 +0x1 fp=0xc000106fe8 sp=0xc000106fe0 pc=0x45c921

Ability to get position in parsers

Hi,
Is there any way to get the current position of the JSON parser (and XML as well) when it returns an item?

I thought of adding the ability to get the buffer lexer (since it has the "offset" method), or simply exposing it as a method on the parser, like a simple additional constructor that takes the buffer.Lexer so that its properties get exposed.

After an additional check, it seems the way to go would be either to change the lexer to store the last position on Shift and Skip, or to change the parser itself to return the position whenever it returns a value.

thanks ahead.

Employ concurrency

Use concurrency between the tokenizer and parser to enhance performance. I suspect this might be slower for small CSS files (and thus for inline style html attributes).

JSON is parsed as valid Javascript

When using this library with the following code:

l := js.NewLexer(parse.NewInputBytes(data))
for {
	tt, _ := l.Next()
	if tt == js.ErrorToken {
		if l.Err() != io.EOF {
			return false, "The Javascript file is not valid", l.Err()
		} else {
			return true, "", nil
		}
	}
}

and the following input

{
    "elementId": "mJn9mpMQwIYyBe",
}

No error is detected. This returns true.

JS: Error parsing increment operator in comparison

I had some trouble with increment operators and created this small PoC to demonstrate the issue:

package main

import (
        "fmt"

        "github.com/tdewolff/parse/js"
        "github.com/tdewolff/parse"
)

func main() {
        tests := []string{
                "x = 0; x++ === 1",     // works
                "x = 0; x++===1",       // does not work
                "x = 0; x++ === 1",     // works
                "x = 0; 1 ===x++",      // works
        }


        for _, t := range tests {
                _, err := js.Parse(parse.NewInputString(t))
                if err != nil {
                        fmt.Printf("Error while parsing '%s': %s\n", t, err)
                }
        }
}
env GO111MODULE=off go get github.com/tdewolff/parse/js
env GO111MODULE=off go get github.com/tdewolff/parse
env GO111MODULE=off go run main.go

Whitespace around the equality operator appears to make a difference for some reason. Let me know if there's anything I can do to help :)!

Make the parser streaming

The use-case for this package is often big files that need more performance than the non-lexing alternatives.
The current implementation reads the full file at once, which is not ideal for this situation.

AST does not model CSS functions correctly

The AST cannot model a function with whitespace-delimited arguments.

Some examples are calc, "to left top" in linear-gradient, or something like attr(data-length px).

There needs to be an extra level like Args for comma-delimited arguments, each of which is a slice of white-space delimited tokens, so that attr(id) would be something like []CD{[]WD{"id"}} and linear-gradient(to left top, black) would be

[]CD{
    []WD{"to", "left", "top"},
    []WD{"black"},
}

I'm not 100% sure that would cover all cases, but it would handle all the ones I can think of, at least.

Planned improvements

Improvements are being implemented in the lexer-refactor branch, but this will likely not be continued, as preliminary tests show an almost insignificant performance gain.

  • Add parsers that take []byte; this is faster (see #38)
  • Ensure that appending a NULL byte for lexers is not a performance issue for general use; it seems that bytes.Buffer, os.Open and ioutil.ReadAll leave room in the capacity
  • Decouple Restore from the Lexer. You can only create buffers from an io.Reader (which is read out entirely and one terminating NULL byte is reserved) or an []byte, but the latter must already terminate in NULL or get one appended. When there is capacity to append, it will overwrite the underlying buffer. If this is not desirable, the caller will need to use Restorer and after parsing call .Restore() himself to restore that overwritten byte
  • Remove z.err for Lexer? Seems better to return through NewLexer* functions
  • Make Position also use unicode newlines?
  • Better error reporting (using new Error structure), report all NULLs in JSON and XML
  • Ensure that encountering NULL from the lexer does not always mean an error! CSS, HTML, JS, JSON, XML
  • Remove buffer.Reader when not needed anymore in minify?
  • Experiment with SIMD instructions to parse 32 bytes at a time, see https://medium.com/@rajat_sriv/parsing-gigabytes-of-json-per-second-9c5a1a7b91db and https://arxiv.org/abs/1902.08318

err:identifier n has already been declared

Running the following code to get the AST produces an error.

Go code:

	ast, err := js.Parse(parse.NewInputBytes(jsStr))

JavaScript code:

const n = 100


let fi = [1, 2, 3, 4], r = []
for (let t = 0, n = fi.length; t < n; t++) {
    const n = fi[t];
    r.push(`${n}:${n + 1};`)
}
// if (n > 0) {
//     const n = 200;
//     r.push(`${n}:${n + 1};`)
// }
console.info(r)

The error:

err:identifier n has already been declared on line 6 and column 11
6: const n = fi[t];

Serialize does not serialize correctly

Given the input

@media print {
    .class {
         width: calc(10% - 5rem);
    }
}

calling serialize on the parsed *StylesheetNode returns

@media print {. class{width:calc(10%,-,5rem);}}

which adds a space between the class selector and the class name, and inserts erroneous commas in the application of calc. (The space between the class selector and the class name only happens when nested in an at-rule, but the calc error happens regardless.)

If I run the same input through your minifier, the class selector in media queries works as it should, but the calc error still happens.
