andrewstuart / goq
A declarative struct-tag-based HTML unmarshaling or scraping package for Go built on top of the goquery library
Home Page: https://godoc.org/astuart.co/goq
License: MIT License
Yesterday, we had a crash that seems to come down to a data race / concurrent map access in goq, so I ran our application after compiling it with the -race flag. It seems that the library regularly triggers data races:
WARNING: DATA RACE
Read at 0x00c0001c30e0 by goroutine 137:
runtime.mapaccess1_faststr()
/usr/local/go/src/runtime/map_faststr.go:12 +0x0
github.com/andrewstuart/goq.goqueryTag.valFunc()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:89 +0x85
github.com/andrewstuart/goq.unmarshalByType()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:210 +0x416
github.com/andrewstuart/goq.unmarshalSlice()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:332 +0x23f
github.com/andrewstuart/goq.unmarshalByType()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:204 +0x91a
github.com/andrewstuart/goq.unmarshalStruct()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:289 +0x243
github.com/andrewstuart/goq.unmarshalByType()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:202 +0x879
github.com/andrewstuart/goq.UnmarshalSelection()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:180 +0x4fc
Previous write at 0x00c0001c30e0 by goroutine 44:
runtime.mapassign_faststr()
/usr/local/go/src/runtime/map_faststr.go:202 +0x0
github.com/andrewstuart/goq.goqueryTag.valFunc()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:115 +0x296
github.com/andrewstuart/goq.unmarshalByType()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:210 +0x416
github.com/andrewstuart/goq.unmarshalSlice()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:332 +0x23f
github.com/andrewstuart/goq.unmarshalByType()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:204 +0x91a
github.com/andrewstuart/goq.unmarshalStruct()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:289 +0x243
github.com/andrewstuart/goq.unmarshalByType()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:202 +0x879
github.com/andrewstuart/goq.UnmarshalSelection()
/home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:180 +0x4fc
I'm currently investigating.
I'm having an issue with selectors, and in general they're hard to deal with because they are not documented, neither here nor in GoQuery.
I have this markup:
And I select with these structs:
type Categorie struct {
	Text string      `goquery:"a,text"`
	Link string      `goquery:"a,[href]"`
	Sub  []Categorie `goquery:"ul"`
}

type Menu struct {
	Categorie []Categorie `goquery:".menu-l1-li-hld li"`
}
I would expect the text and the href of the links to return one value each, but I get a weird result: the text has the text of every sub-item appended together, while the href does not. Is this an issue with the library?
Thanks
E.g. 'main.page.Items[0xc42019e318]' (type int): a type conversion error occurred: strconv.ParseInt: parsing "": invalid syntax
when the real error should show the extra type info 'main.page.Items[0xc42019f048].Score' (type unknown: invalid value): a custom Unmarshaler implementation threw an error: strconv.ParseInt: parsing "": invalid syntax
Add a README to help users
Is there any reason why the goquery tags don't use pure CSS selectors? It seems some special rules are required (i.e. an element selector followed by arbitrary comma-separated "value selectors"). The documentation doesn't even mention whether the "selectors" are actually CSS(3) selectors, though from the source it looks like that's the case.
I'm asking about this decision because just before finding your library I was thinking of developing something similar (i.e. a library that unmarshals pure CSS selectors (using cascadia) into /x/html.Node).
The panic stack trace is the following:
panic: runtime error: index out of range
goroutine 1 [running]:
astuart.co/goq.goqueryTag.preprocess(0x7d907c, 0x13, 0xc42025be30, 0x7)
/n/gopath/src/astuart.co/goq/unmarshal.go:40 +0x207
astuart.co/goq.unmarshalStruct(0xc42025be30, 0x8569c0, 0xc420184090, 0x199, 0x8569c0, 0x8569c0)
/n/gopath/src/astuart.co/goq/unmarshal.go:283 +0x1ac
astuart.co/goq.unmarshalByType(0xc42025be30, 0x7cbd60, 0xc420184090, 0x16, 0x7c1091, 0x9, 0x0, 0x0)
/n/gopath/src/astuart.co/goq/unmarshal.go:202 +0x793
astuart.co/goq.unmarshalMap.func1(0x0, 0xc42025be30, 0x1)
/n/gopath/src/astuart.co/goq/unmarshal.go:405 +0x435
github.com/PuerkitoBio/goquery.(*Selection).EachWithBreak(0xc42025ba70, 0xc4204e3658, 0x7c1091)
/n/gopath/src/github.com/PuerkitoBio/goquery/iteration.go:21 +0x10b
astuart.co/goq.unmarshalMap(0xc42025ba70, 0x7ffe20, 0xc42000e028, 0x195, 0x7c1091, 0x19, 0xc42000e028, 0x195)
/n/gopath/src/astuart.co/goq/unmarshal.go:390 +0x385
astuart.co/goq.unmarshalByType(0xc42025ba70, 0x7ffe20, 0xc42000e028, 0x195, 0x7c1091, 0x19, 0x195, 0x7ffe20)
/n/gopath/src/astuart.co/goq/unmarshal.go:208 +0x6ab
astuart.co/goq.unmarshalStruct(0xc42025ba40, 0x80fea0, 0xc42000e028, 0x199, 0x80fea0, 0x80fea0)
/n/gopath/src/astuart.co/goq/unmarshal.go:289 +0x245
astuart.co/goq.unmarshalByType(0xc42025ba40, 0x80fea0, 0xc42000e028, 0x199, 0x0, 0x0, 0xc42000e028, 0x199)
/n/gopath/src/astuart.co/goq/unmarshal.go:202 +0x793
astuart.co/goq.UnmarshalSelection(0xc42025ba40, 0x7cbda0, 0xc42000e028, 0x0, 0xc420400020)
/n/gopath/src/astuart.co/goq/unmarshal.go:180 +0x308
astuart.co/goq.(*Decoder).Decode(0xc420400020, 0x7cbda0, 0xc42000e028, 0xc420784000, 0x2ad98)
/n/gopath/src/astuart.co/goq/decoder.go:37 +0xc4
main.store(0xc42016e0c0)
/home/mester/twscrap/main.go:151 +0x1d3
main.startCollecting(0xc42015a280)
/home/mester/twscrap/main.go:106 +0x458
main.main()
/home/mester/twscrap/main.go:72 +0x1a4
exit status 2
The problem occurs with this structure type:
type T struct {
	A string `goquery:",[second-id]"`
}

type A struct {
	B map[string]T `goquery:"div.id,[div-id]"`
}
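The panic comes from goqueryTag.preprocess, and the struct that triggers it has a tag whose selector part is empty (",[second-id]"). A minimal sketch of parsing such tags defensively, so that an empty selector cannot lead to an out-of-range index, might look like this. It assumes the tag is a selector followed by comma-separated value selectors and ignores commas inside the selector itself; splitTag is a hypothetical helper, not how goq implements preprocess.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTag splits a tag such as "div.id,[div-id]" into the element selector
// and the value selectors that follow it. Hypothetical helper for illustration;
// it does not handle commas that belong to the CSS selector itself.
func splitTag(tag string) (selector string, values []string) {
	// strings.Split with a non-empty separator always returns at least
	// one element, so parts[0] is safe even for an empty tag.
	parts := strings.Split(tag, ",")
	selector = strings.TrimSpace(parts[0])
	for _, p := range parts[1:] {
		if p = strings.TrimSpace(p); p != "" {
			values = append(values, p)
		}
	}
	return selector, values
}

func main() {
	for _, tag := range []string{"div.id,[div-id]", ",[second-id]", "a", ""} {
		sel, vals := splitTag(tag)
		fmt.Printf("%q -> selector=%q values=%q\n", tag, sel, vals)
	}
}
```

With this shape, an empty selector is simply returned as an empty string, and the caller can decide to treat it as "use the current selection" instead of indexing into it.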
Hi,
I'm using goq and am very happy with it so far.
But now I want to extract CSS links from the <head></head> section of a page and can't get it working.
Here's the HTML:
<!DOCTYPE html>
<html lang="de">
<head>
<link rel="stylesheet" type="text/css" href="https://foo.bar/blah1.css"/>
<link rel="stylesheet" type="text/css" href="https://foo.bar/blah2.css"/>
</head>
</html>
And the code I'm trying to use:
package main

import (
	"log"
	"os"

	"astuart.co/goq"
)

type Site struct {
	CSS []string `goquery:"head > link[type='text/css'],[href]"`
}

func main() {
	fd, err := os.Open(os.Args[0])
	if err != nil {
		log.Fatalln(err)
	}
	s := &Site{}
	if err = goq.NewDecoder(fd).Decode(&s); err != nil {
		log.Fatalln(err)
	}
	log.Println(s)
}
The Site struct obj is empty after execution.
However, if I use the cascadia testing cli, the selector works:
cascadia -i sample.html -o -c "head > link[type='text/css']" -p "Link=ATTR:href"
Link
https://foo.bar/blah1.css
https://foo.bar/blah2.css
I assume I have not "translated" the selector properly to goq. If that's the case, how would I do it correctly?
thanks in advance,
Tom
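One detail in the program above may explain the empty result independently of the selector: os.Args[0] is the path of the running executable, not the first command-line argument, so the decoder would be reading the compiled binary rather than the HTML file. Below is a minimal sketch of taking the input path from the first real argument instead. inputPath is a hypothetical helper, and the goq calls are left out so that the snippet depends only on the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// inputPath returns the first command-line argument after the program name.
// args is expected to have the shape of os.Args. Hypothetical helper.
func inputPath(args []string) (string, error) {
	if len(args) < 2 {
		return "", errors.New("usage: prog <file.html>")
	}
	return args[1], nil
}

func main() {
	path, err := inputPath(os.Args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fd, err := os.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer fd.Close()
	// fd would be handed to the decoder at this point.
	fmt.Println("reading", fd.Name())
}
```

The opened file can then be passed to goq.NewDecoder as in the original program. Since s is already a pointer to Site there, passing s itself rather than &s to Decode may also be worth trying.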
Command:
GO111MODULE=on go get github.com/andrewstuart/goq@v1.0.0
Outputs:
go: finding github.com v1.0.0
go: finding github.com/andrewstuart v1.0.0
go: finding github.com/andrewstuart/goq v1.0.0
go: downloading github.com/andrewstuart/goq v1.0.0
go: extracting github.com/andrewstuart/goq v1.0.0
go get: github.com/andrewstuart/goq@v1.0.0: parsing go.mod:
module declares its path as: astuart.co/goq
but was required as: github.com/andrewstuart/goq
I fixed it with:
replace (
github.com/andrewstuart/goq => astuart.co/goq v1.0.0
)
But maybe there is a way to fix it on the repo side?
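The error message itself names the path the module declares, astuart.co/goq. An alternative to the replace directive is to require the module under that path and use the same path in import statements. This is only a sketch of the go.mod entry, assuming the astuart.co vanity import path still resolves to the repository:

```
require astuart.co/goq v1.0.0
```

Source files would then import "astuart.co/goq" rather than "github.com/andrewstuart/goq".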
This file is named unmarshal-error.go. Please rename it to unmarshal_error.go.