scaws

This project contains components and extensions for using Scrapy on Amazon Web Services (AWS).

Requirements

  • Scrapy 0.13 or above
  • boto 1.8 or above

Install

Download and run: python setup.py install

Available components

SimpleDB stats collector

Module: scaws.statscol

A Stats Collector that persists stats to Amazon SimpleDB, using one SimpleDB item per scraping run (i.e. it keeps a history of all scraping runs). The data is persisted to the SimpleDB domain specified by the STATS_SDB_DOMAIN setting. The domain is created if it doesn't exist.

In addition to the existing stats keys, the following keys are added at persistence time:

  • spider: the spider name (so you can use it later for querying stats for that spider)
  • timestamp: the timestamp when the stats were persisted

Both the spider and timestamp are used to generate the SimpleDB item name in order to avoid overwriting stats of previous scraping runs.
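The naming scheme described above can be sketched as follows. This is only an illustration: the function name build_sdb_item and the exact "spider_timestamp" item name format are assumptions, not taken from the scaws source.

```python
from datetime import datetime


def build_sdb_item(spider_name, stats, timestamp):
    """Return (item_name, attributes) for one scraping run.

    Hypothetical helper: the real collector may format the name differently.
    """
    # Copy the collected stats so the caller's dict is not modified.
    attributes = dict(stats)
    # Extra keys added at persistence time.
    attributes["spider"] = spider_name
    attributes["timestamp"] = timestamp
    # Combining spider name and timestamp gives each run its own item,
    # so a new run does not overwrite the stats of a previous one.
    item_name = "%s_%s" % (spider_name, timestamp.isoformat())
    return item_name, attributes


name, attrs = build_sdb_item(
    "example", {"item_scraped_count": 3}, datetime(2012, 1, 2, 3, 4, 5)
)
print(name)
```

Two runs of the same spider started at different times get different item names, which is what preserves the history.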

As required by SimpleDB, datetimes are stored in ISO 8601 format and numbers are zero-padded to 16 digits. Negative numbers are not currently supported.
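SimpleDB compares values as strings, which is why the encoding matters. Here is a minimal sketch of such an encoding, assuming a hypothetical helper named encode_stat_value; the actual scaws implementation may differ.

```python
from datetime import datetime


def encode_stat_value(value):
    """Encode a stat value as a string that sorts correctly as text.

    Hypothetical helper, for illustration only.
    """
    if isinstance(value, datetime):
        # ISO 8601 strings sort chronologically when compared as text.
        return value.isoformat()
    if isinstance(value, bool):
        # bool is a subclass of int, so handle it before the int branch.
        return str(value)
    if isinstance(value, int):
        if value < 0:
            raise ValueError("negative numbers are not supported")
        # Zero-pad to 16 digits so that "9" does not sort after "10".
        return "%016d" % value
    return str(value)


print(encode_stat_value(42))
print(encode_stat_value(datetime(2012, 1, 2, 3, 4, 5)))
```

Without the padding, the plain strings "9" and "10" would compare in the wrong order. With it, the string order matches the numeric order.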

This Stats Collector requires the boto library.

This Stats Collector can be configured through the following settings:

STATS_SDB_DOMAIN

Default: 'scrapy_stats'

A string containing the SimpleDB domain to use for collecting the stats.

STATS_SDB_ASYNC

Default: False

If True, communication with SimpleDB is performed asynchronously. If False, blocking I/O is used instead. Blocking I/O is the default because asynchronous communication can result in the stats not being persisted if the Scrapy engine shuts down while the request is still in progress (for example, when you run only one spider in a process and then exit).
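As a sketch, a project's settings.py could enable the collector like this. STATS_CLASS is the standard Scrapy setting for choosing a stats collector. The class name SimpledbStatsCollector is an assumption and should be checked against the scaws.statscol module.

```python
# settings.py (illustrative fragment)

# Assumed class name inside scaws.statscol; verify it before use.
STATS_CLASS = "scaws.statscol.SimpledbStatsCollector"

# SimpleDB domain where the stats are stored (shown with its default value).
STATS_SDB_DOMAIN = "scrapy_stats"

# Keep blocking I/O so the stats are written before the process exits.
STATS_SDB_ASYNC = False
```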

Contributors

  • mohsinhijazee
  • pablohoffman
