scrap's People

Contributors

maddiem4

scrap's Issues

More builtin bucket types

We have a few in the ABCC scraper that deserve to be in mainline, with full test suites and such. There are also complementary types that deserve to be created.

// Bucket that contains buckets - only returning true if all sub-buckets
// return true. Buckets are tested in order.
type AllBucket struct {
    Children []scrap.Bucket
}

// Bucket that contains buckets - only returning true if any sub-bucket
// returns true. Buckets are tested in order.
//
// Behavior is short-circuiting. A later bucket will not be checked if an
// earlier bucket succeeds. So while the first bucket will be tried every
// time, the last bucket may never be tried, depending on the results
// of earlier buckets.
type AnyBucket struct {
    Children []scrap.Bucket
}

// Bucket that rejects all URLs with a specific prefix, and
// accepts everything else.
type RejectPrefixBucket string

// Reject URLs that are exact matches
type RejectExactBucket string

// Reject URLs with a specific postfix
type RejectPostfixBucket string
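
A rough sketch of how these types could behave. The `Bucket` interface below is a stand-in for `scrap.Bucket`, and its `Check(url string) bool` method is an assumption, since the real interface is not shown here. Short-circuiting in `AllBucket` is also assumed; the comments above only promise it for `AnyBucket`.

```go
package main

import (
	"fmt"
	"strings"
)

// Bucket is a hypothetical stand-in for scrap.Bucket. The method name
// and signature are assumptions made for this sketch.
type Bucket interface {
	Check(url string) bool
}

// AllBucket accepts a URL only if every child accepts it.
// Children are tested in order.
type AllBucket struct {
	Children []Bucket
}

func (b AllBucket) Check(url string) bool {
	for _, child := range b.Children {
		if !child.Check(url) {
			return false // stop at the first rejection
		}
	}
	return true
}

// AnyBucket accepts a URL as soon as one child accepts it.
// Later children are not consulted after a success.
type AnyBucket struct {
	Children []Bucket
}

func (b AnyBucket) Check(url string) bool {
	for _, child := range b.Children {
		if child.Check(url) {
			return true
		}
	}
	return false
}

// RejectPrefixBucket rejects URLs that start with the given prefix.
type RejectPrefixBucket string

func (b RejectPrefixBucket) Check(url string) bool {
	return !strings.HasPrefix(url, string(b))
}

// RejectExactBucket rejects URLs that match exactly.
type RejectExactBucket string

func (b RejectExactBucket) Check(url string) bool {
	return url != string(b)
}

// RejectPostfixBucket rejects URLs that end with the given postfix.
type RejectPostfixBucket string

func (b RejectPostfixBucket) Check(url string) bool {
	return !strings.HasSuffix(url, string(b))
}

func main() {
	filter := AllBucket{Children: []Bucket{
		RejectPrefixBucket("http://example.com/admin"),
		RejectPostfixBucket(".pdf"),
	}}
	fmt.Println(filter.Check("http://example.com/index.html"))  // true
	fmt.Println(filter.Check("http://example.com/admin/login")) // false
	fmt.Println(filter.Check("http://example.com/report.pdf"))  // false
}
```

In this sketch an empty `AllBucket` accepts everything and an empty `AnyBucket` accepts nothing. Whether mainline should behave that way is an open choice.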

Record redirects

Use the pluggable http.Client.CheckRedirect callback to record redirect chain information, so that RouteActions can access that data.

For now, include this in ScraperRequest.Stats. At some point, though, we'll probably want a separate ServerResponse object, with a Parse function and raw response data, etc.

ServerResponse.Grep

Search the content of the response for the given pattern.

Should be compatible with the backup copy mechanism.

Support 401 Auth for all requests

I could implement a more fine-grained solution, but I think the bulk solution will be more than enough for now, and I'd rather not implement complexity that I'm not actually going to use.

Node.Text()

Get the text content of a node and its descendants.
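
A sketch of the traversal on a deliberately simplified tree. The `Node` struct here is hypothetical and is not scrap's actual `Node`; it only shows the depth-first concatenation that `Text()` would perform.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a simplified, hypothetical tree node. A node with an empty
// Tag is a text node whose content is in Data.
type Node struct {
	Tag      string
	Data     string
	Children []*Node
}

// Text returns the concatenated text of the node and all of its
// descendants, in document order.
func (n *Node) Text() string {
	var sb strings.Builder
	n.appendText(&sb)
	return sb.String()
}

// appendText walks the subtree depth-first, collecting text nodes.
func (n *Node) appendText(sb *strings.Builder) {
	if n == nil {
		return
	}
	if n.Tag == "" {
		sb.WriteString(n.Data)
	}
	for _, child := range n.Children {
		child.appendText(sb)
	}
}

func main() {
	// Equivalent to: <p>Hello <b>big</b> world</p>
	p := &Node{Tag: "p", Children: []*Node{
		{Data: "Hello "},
		{Tag: "b", Children: []*Node{{Data: "big"}}},
		{Data: " world"},
	}}
	fmt.Println(p.Text()) // Hello big world
}
```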

Add referer info to ScraperRequest object

Frankly, it would be nice to have a generic request context provided by the queuer, which could be basic, or involve more detailed construction of the request to convey even more contextual information (which could also be useful for bucketing).

The important thing right now is referer info. So maybe just a struct with a Referer string and a Misc map[string]interface{}.
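
Roughly what that struct could look like. `RequestContext`, the `Context` field, the reduced `ScraperRequest`, and the `queueLink` helper are all hypothetical names for this sketch; only the `Referer` string and `Misc` map come from the idea above.

```go
package main

import "fmt"

// RequestContext carries information about how a request came to be
// queued. Hypothetical name; the two fields follow the proposal.
type RequestContext struct {
	Referer string                 // URL of the page that linked here
	Misc    map[string]interface{} // free-form data supplied by the queuer
}

// ScraperRequest is reduced to the fields this sketch needs.
type ScraperRequest struct {
	URL     string
	Context RequestContext
}

// queueLink builds the request for a link found on the parent's page,
// recording the parent URL as the referer.
func queueLink(parent ScraperRequest, link string) ScraperRequest {
	return ScraperRequest{
		URL: link,
		Context: RequestContext{
			Referer: parent.URL,
			Misc:    map[string]interface{}{},
		},
	}
}

func main() {
	parent := ScraperRequest{URL: "http://example.com/"}
	child := queueLink(parent, "http://example.com/about")
	child.Context.Misc["depth"] = 1
	fmt.Printf("%s (referer %s)\n", child.URL, child.Context.Referer)
}
```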

Change RouteAction signature to use ServerResponse

We should hand downstream code a more flexible object to work with, without giving up much of the current convenience: something that does not parse HTML when the downstream code doesn't WANT to parse it.

func Handler(req ScraperRequest, resp ServerResponse) {
    root := resp.Parse() // Existing use case, consumes resp.Body to make Node
    // Guard against pages without a <title> before indexing.
    if titles := root.Find("title"); len(titles) > 0 {
        req.Remarks.Printf("title = %s\n", titles[0].Text())
    }
    req.Remarks.Printf("Started %v, lasted %v\n",
        req.Stats.Started,
        req.Stats.Duration,
    )

    // ContentLength is -1 when the server did not declare a length.
    expectedBytes := resp.Response.ContentLength
    gotBytes := resp.BytesRead
    if expectedBytes >= 0 && gotBytes != expectedBytes {
        req.Remarks.Printf("Expected %d bytes, got %d\n",
            expectedBytes,
            gotBytes,
        )
    }

    root.Find("a").Queue()
}

var myRegexp = regexp.MustCompile("foo")

func GreppyHandler(req ScraperRequest, resp ServerResponse) {
    data, err := ioutil.ReadAll(resp.Body) // Or, resp.ReadAll()
    if err != nil {
        req.Remarks.Printf("Could not read body: %v\n", err)
        return
    }
    matches := myRegexp.FindAll(data, -1)
    req.Remarks.Printf("Found %d matches\n", len(matches))
}

This also demonstrates some unrelated, but planned, API improvements (for example, Node.Text()).
