Getgo: a concurrent, simple and extensible web scraping framework

Getgo is a concurrent, simple and extensible web scraping framework written in Go.

Quick start

### Get Getgo

go get -u github.com/hailiang/getgo

### Define a task

This example is under the examples/goblog directory. To use Getgo to scrape structured data from a web page, define the structured data as a Go struct (golangBlogEntry) and define a corresponding task (golangBlogIndexTask).

type golangBlogEntry struct {
	Title string
	URL   string
	Tags  *string
}

type golangBlogIndexTask struct {
	// Variables in task URL, e.g. page number
}

func (t golangBlogIndexTask) Request() *http.Request {
	return getReq(`http://blog.golang.org/index`)
}

func (t golangBlogIndexTask) Handle(root *query.Node, s getgo.Storer) (err error) {
	root.Div(_Id("content")).Children(_Class("blogtitle")).For(func(item *query.Node) {
		title := item.Ahref().Text()
		url := item.Ahref().Href()
		tags := item.Span(_Class("tags")).Text()
		if url != nil && title != nil {
			store(&golangBlogEntry{Title: *title, URL: *url, Tags: tags}, s, &err)
		}
	})
	return
}

### Run the task

Use util.Run to run the task and print all results to standard output.

	util.Run(golangBlogIndexTask{})

To store the parsed results in a database, pass a storage backend that satisfies the getgo.Tx interface to getgo.Run.

Understand Getgo

getgo.Task is an interface that represents an HTTP crawler task: it provides an HTTP request and a method to handle the HTTP response.

type Task interface {
	Requester
	Handle(resp *http.Response) error
}

type Requester interface {
	Request() *http.Request
}

A getgo.Runner is responsible for running a getgo.Task. Two concrete runners are provided: SequentialRunner and ConcurrentRunner.

type Runner interface {
	Run(task Task) error // Run runs a task
	Close()              // Close closes the runner
}

A task that stores data in a storage backend should satisfy the getgo.StorableTask interface.

type StorableTask interface {
	Requester
	Handle(resp *http.Response, s Storer) error
}

A storage backend is simply an object that satisfies the getgo.Tx interface.

type Storer interface {
	Store(v interface{}) error
}

type Tx interface {
	Storer
	Commit() error
	Rollback() error
}
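As an illustration of how small a backend can be, the following sketch implements the Tx shape with an in-memory buffer. The interfaces are copied locally from the definitions above, and memTx is a hypothetical type that is not part of Getgo; a real backend would write to a database transaction instead of a slice.

```go
package main

import (
	"errors"
	"fmt"
)

// Storer and Tx mirror the interfaces shown in this README.
type Storer interface {
	Store(v interface{}) error
}

type Tx interface {
	Storer
	Commit() error
	Rollback() error
}

// memTx is a hypothetical in-memory backend. Stored values stay pending
// until Commit moves them to committed; Rollback discards them.
type memTx struct {
	pending   []interface{}
	committed []interface{}
}

func (t *memTx) Store(v interface{}) error {
	if v == nil {
		return errors.New("memTx: cannot store nil")
	}
	t.pending = append(t.pending, v)
	return nil
}

func (t *memTx) Commit() error {
	t.committed = append(t.committed, t.pending...)
	t.pending = nil
	return nil
}

func (t *memTx) Rollback() error {
	t.pending = nil
	return nil
}

// Compile-time check that *memTx satisfies Tx.
var _ Tx = (*memTx)(nil)

func main() {
	tx := &memTx{}
	_ = tx.Store("discarded")
	_ = tx.Rollback()
	_ = tx.Store("kept")
	_ = tx.Commit()
	fmt.Println(tx.committed) // prints [kept]
}
```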

See getgo.Run to understand how a StorableTask is combined with a storage backend and adapted into a normal Task so that a Runner can run it.

Getgo currently provides a PostgreSQL storage backend, and it is not hard to support more backends (see the getgo/db package for details).

The easier way to define a task for an HTML page is to satisfy getgo.HTMLTask rather than getgo.Task. Adapters convert an HTMLTask to a Task internally so that a Runner can run it. The Handle method of HTMLTask receives an already parsed HTML DOM object (parsed by the html-query package).

type HTMLTask interface {
	Requester
	Handle(root *query.Node, s Storer) error
}

Similarly, a task for retrieving a JSON page should satisfy the getgo.TextTask interface. Its Handle method receives an io.Reader, which can be decoded with the encoding/json package.

type TextTask interface {
	Requester
	Handle(r io.Reader, s Storer) error
}

getgo's Issues

Does not compile

When I tried your example, the response was:

../github.com/hailiang/html-query/expr/auto_expr.go:8: import /Users/........./go/pkg/darwin_amd64/code.google.com/p/go.net/html/atom.a: object is [darwin amd64 go1.3.1 X:precisestack] expected [darwin amd64 go1.3.3 X:precisestack]

Thank you.
