andrewstuart / goq

A declarative struct-tag-based HTML unmarshaling or scraping package for Go built on top of the goquery library

Home Page: https://godoc.org/astuart.co/goq

License: MIT License

Go 100.00%
golang unmarshall goquery selectors html unmarshaling unmarshaller selector struct scrape

goq's People

Contributors

andrewstuart, undefx

goq's Issues

race condition & crash

Yesterday, we had a crash that seems to come down to a data race / concurrent map access in goq, so I ran our application after compiling it with the -race flag. It seems that the library regularly creates race conditions:

WARNING: DATA RACE
Read at 0x00c0001c30e0 by goroutine 137:
  runtime.mapaccess1_faststr()
      /usr/local/go/src/runtime/map_faststr.go:12 +0x0
  github.com/andrewstuart/goq.goqueryTag.valFunc()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:89 +0x85
  github.com/andrewstuart/goq.unmarshalByType()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:210 +0x416
  github.com/andrewstuart/goq.unmarshalSlice()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:332 +0x23f
  github.com/andrewstuart/goq.unmarshalByType()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:204 +0x91a
  github.com/andrewstuart/goq.unmarshalStruct()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:289 +0x243
  github.com/andrewstuart/goq.unmarshalByType()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:202 +0x879
  github.com/andrewstuart/goq.UnmarshalSelection()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:180 +0x4fc

Previous write at 0x00c0001c30e0 by goroutine 44:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/map_faststr.go:202 +0x0
  github.com/andrewstuart/goq.goqueryTag.valFunc()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:115 +0x296
  github.com/andrewstuart/goq.unmarshalByType()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:210 +0x416
  github.com/andrewstuart/goq.unmarshalSlice()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:332 +0x23f
  github.com/andrewstuart/goq.unmarshalByType()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:204 +0x91a
  github.com/andrewstuart/goq.unmarshalStruct()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:289 +0x243
  github.com/andrewstuart/goq.unmarshalByType()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:202 +0x879
  github.com/andrewstuart/goq.UnmarshalSelection()
      /home/awishformore/.go/pkg/mod/github.com/andrewstuart/[email protected]/unmarshal.go:180 +0x4fc

I'm currently investigating.

Document selectors (issue?)

I'm having an issue with selectors, and in general they're hard to deal with because they are documented neither here nor in GoQuery.

I have this markup:

image

And I select:

type Categorie struct {
	Text string `goquery:"a,text"`
	Link string `goquery:"a,[href]"`
	Sub []Categorie `goquery:"ul"`
}

type Menu struct {
	Categorie []Categorie `goquery:".menu-l1-li-hld li"`
}

I would expect the text and href of each link to return one value each, but I get a weird result: the text has every sub-item's text appended together, while the href does not. Is this an issue with the lib?

image

Thanks

Fix bug in error where path does not fully show for non-pointers

E.g. 'main.page.Items[0xc42019e318]' (type int): a type conversion error occurred: strconv.ParseInt: parsing "": invalid syntax

when the real error should show the extra type info 'main.page.Items[0xc42019f048].Score' (type unknown: invalid value): a custom Unmarshaler implementation threw an error: strconv.ParseInt: parsing "": invalid syntax

Readme

Add a README to help users

how does it compare with pure CSS selectors?

Is there any reason why the goquery tags don't use pure CSS selectors? It seems some special rules are required (i.e. an element selector followed by arbitrary comma-separated "value selectors"). The documentation doesn't even mention whether the "selectors" are actually CSS(3) selectors, though from the source it looks like that's the case.

I'm asking about this decision because just before finding your library I was thinking of developing something similar (i.e. a library that unmarshals pure CSS selectors (using cascadia) into /x/html.Node).
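
To make the tag format described above concrete, here is a minimal sketch that splits a tag such as "a,[href]" into a selector part and a value part. It only mirrors the format as it appears in the tags quoted in these issues; parsedTag, parseTag and attr are hypothetical names, and this is not goq's own parser.

```go
package main

import (
	"fmt"
	"strings"
)

// parsedTag is a hypothetical view of a struct tag value such as "a,[href]".
type parsedTag struct {
	Selector string // part before the first comma, e.g. "a"
	Value    string // part after it, e.g. "text" or "[href]"; "" if absent
}

// parseTag splits the tag at its first comma and trims both parts.
func parseTag(tag string) parsedTag {
	parts := strings.SplitN(tag, ",", 2)
	p := parsedTag{Selector: strings.TrimSpace(parts[0])}
	if len(parts) == 2 {
		p.Value = strings.TrimSpace(parts[1])
	}
	return p
}

// attr returns the attribute name when the value part has the form "[name]".
func (p parsedTag) attr() (string, bool) {
	if len(p.Value) > 2 && strings.HasPrefix(p.Value, "[") && strings.HasSuffix(p.Value, "]") {
		return p.Value[1 : len(p.Value)-1], true
	}
	return "", false
}

func main() {
	for _, tag := range []string{"a,text", "a,[href]", "ul"} {
		p := parseTag(tag)
		name, isAttr := p.attr()
		fmt.Printf("selector=%q value=%q attr=%q isAttr=%v\n", p.Selector, p.Value, name, isAttr)
	}
}
```

One consequence of reusing the comma this way: in plain CSS a comma separates the members of a selector list ("h1, h2"), so a splitter like this one cannot tell a selector list from a value part.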

Out of range panic in embedded map structures

The panic stack trace is the following:

panic: runtime error: index out of range

goroutine 1 [running]:
astuart.co/goq.goqueryTag.preprocess(0x7d907c, 0x13, 0xc42025be30, 0x7)
	/n/gopath/src/astuart.co/goq/unmarshal.go:40 +0x207
astuart.co/goq.unmarshalStruct(0xc42025be30, 0x8569c0, 0xc420184090, 0x199, 0x8569c0, 0x8569c0)
	/n/gopath/src/astuart.co/goq/unmarshal.go:283 +0x1ac
astuart.co/goq.unmarshalByType(0xc42025be30, 0x7cbd60, 0xc420184090, 0x16, 0x7c1091, 0x9, 0x0, 0x0)
	/n/gopath/src/astuart.co/goq/unmarshal.go:202 +0x793
astuart.co/goq.unmarshalMap.func1(0x0, 0xc42025be30, 0x1)
	/n/gopath/src/astuart.co/goq/unmarshal.go:405 +0x435
github.com/PuerkitoBio/goquery.(*Selection).EachWithBreak(0xc42025ba70, 0xc4204e3658, 0x7c1091)
	/n/gopath/src/github.com/PuerkitoBio/goquery/iteration.go:21 +0x10b
astuart.co/goq.unmarshalMap(0xc42025ba70, 0x7ffe20, 0xc42000e028, 0x195, 0x7c1091, 0x19, 0xc42000e028, 0x195)
	/n/gopath/src/astuart.co/goq/unmarshal.go:390 +0x385
astuart.co/goq.unmarshalByType(0xc42025ba70, 0x7ffe20, 0xc42000e028, 0x195, 0x7c1091, 0x19, 0x195, 0x7ffe20)
	/n/gopath/src/astuart.co/goq/unmarshal.go:208 +0x6ab
astuart.co/goq.unmarshalStruct(0xc42025ba40, 0x80fea0, 0xc42000e028, 0x199, 0x80fea0, 0x80fea0)
	/n/gopath/src/astuart.co/goq/unmarshal.go:289 +0x245
astuart.co/goq.unmarshalByType(0xc42025ba40, 0x80fea0, 0xc42000e028, 0x199, 0x0, 0x0, 0xc42000e028, 0x199)
	/n/gopath/src/astuart.co/goq/unmarshal.go:202 +0x793
astuart.co/goq.UnmarshalSelection(0xc42025ba40, 0x7cbda0, 0xc42000e028, 0x0, 0xc420400020)
	/n/gopath/src/astuart.co/goq/unmarshal.go:180 +0x308
astuart.co/goq.(*Decoder).Decode(0xc420400020, 0x7cbda0, 0xc42000e028, 0xc420784000, 0x2ad98)
	/n/gopath/src/astuart.co/goq/decoder.go:37 +0xc4
main.store(0xc42016e0c0)
	/home/mester/twscrap/main.go:151 +0x1d3
main.startCollecting(0xc42015a280)
	/home/mester/twscrap/main.go:106 +0x458
main.main()
	/home/mester/twscrap/main.go:72 +0x1a4
exit status 2

The problem occurs with this structure type:

type T struct {
    A string `goquery:",[second-id]"`
}
type A struct {
    B map[string]T `goquery:"div.id,[div-id]"`
}

Selector mapping from goq => goquery => cascadia doesn't work as expected

Hi,

I'm using goq and am very happy with it so far.

But now I want to extract CSS links from the <head></head> section of a page and can't get it working.

Here's the HTML:

<!DOCTYPE html>
<html lang="de"
  <head>
    <link rel="stylesheet" type="text/css" href="https://foo.bar/blah1.css"/>
    <link rel="stylesheet" type="text/css" href="https://foo.bar/blah2.css"/>
  </head>
</html>

And the code I'm trying to use:

package main

import (
        "log"
        "os"

        "astuart.co/goq"
)

type Site struct {
        CSS []string `goquery:"head > link[type='text/css'],[href]"`
}

func main() {

        fd, err := os.Open(os.Args[0])
        if err != nil {
                log.Fatalln(err)
        }

        s := &Site{}
        if err = goq.NewDecoder(fd).Decode(&s); err != nil {
                log.Fatalln(err)
        }

        log.Println(s)
}

The Site struct obj is empty after execution.

However, if I use the cascadia testing cli, the selector works:

cascadia -i sample.html -o -c "head > link[type='text/css']" -p "Link=ATTR:href"
Link
https://foo.bar/blah1.css
https://foo.bar/blah2.css

I assume I have not "translated" the selector properly to goq. If that's the case, how would I do it correctly?

thanks in advance,
Tom
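
One thing worth checking in the program above, independent of the selector: os.Args[0] is the path of the running program, so os.Open(os.Args[0]) opens the executable rather than the HTML file. The first user-supplied argument is os.Args[1]. Separately, s is already a *Site, so Decode(&s) hands the decoder a **Site; whether goq accepts that is not confirmed here. A minimal standard-library sketch of the argument handling follows, with inputPath as a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// inputPath returns the first user-supplied argument. args[0] is the program
// itself, so the input file is expected at args[1].
func inputPath(args []string) (string, error) {
	if len(args) < 2 {
		return "", errors.New("usage: prog <file.html>")
	}
	return args[1], nil
}

func main() {
	path, err := inputPath(os.Args)
	if err != nil {
		fmt.Println(err) // no file given, nothing to open
		return
	}

	fd, err := os.Open(path)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer fd.Close()

	// The decoder call from the issue would go here, reading from fd.
	fmt.Println("opened", fd.Name())
}
```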

Module declares its path as: astuart.co/goq

Command:

GO111MODULE=on go get github.com/andrewstuart/goq@v1.0.0

Outputs:

go: finding github.com v1.0.0
go: finding github.com/andrewstuart v1.0.0
go: finding github.com/andrewstuart/goq v1.0.0
go: downloading github.com/andrewstuart/goq v1.0.0
go: extracting github.com/andrewstuart/goq v1.0.0
go get: github.com/andrewstuart/goq@v1.0.0: parsing go.mod:
	module declares its path as: astuart.co/goq
	        but was required as: github.com/andrewstuart/goq

I fixed it with:

replace (
  github.com/andrewstuart/goq => astuart.co/goq v1.0.0
)

But maybe there is a way to fix it on the repo side?
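
On the consumer side, an alternative to the replace directive is to require the module under the path it declares and import it as "astuart.co/goq", the import path used in the selector-mapping issue above. A go.mod fragment sketching this is shown below; the module name example.com/scraper and the go directive are hypothetical, and it assumes the astuart.co vanity domain still resolves for the go tool.

```
module example.com/scraper // hypothetical module name

go 1.21

require astuart.co/goq v1.0.0
```

With this in place, Go source files import "astuart.co/goq" rather than "github.com/andrewstuart/goq".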
