This project forked from scrapy-plugins/scrapy-jsonschema

Scrapy schema validation pipeline and Item builder using JSON Schema

License: BSD 3-Clause "New" or "Revised" License

scrapy-jsonschema

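This plugin provides two features based on JSON Schema and the jsonschema Python library: a way to build Scrapy Item classes from a JSON schema, and an item pipeline that validates items against that schema.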

Installation

Install scrapy-jsonschema using pip:

$ pip install scrapy-jsonschema

Configuration

Add JsonSchemaValidatePipeline by including it in ITEM_PIPELINES in your settings.py file:

ITEM_PIPELINES = {
    ...
    'scrapy_jsonschema.JsonSchemaValidatePipeline': 100,
}

Here, priority 100 is just an example. Set its value depending on other pipelines you may have enabled already.
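
As a rough illustration of choosing the priority: Scrapy runs item pipelines in ascending order of these values, so a pipeline that cleans up items would typically get a lower number than the validator. In this sketch, myproject.pipelines.CleanupPipeline is a hypothetical pipeline of your own, not something this plugin provides.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    # Hypothetical project pipeline; runs first because 50 < 100.
    'myproject.pipelines.CleanupPipeline': 50,
    # Validation runs afterwards, on the cleaned-up item.
    'scrapy_jsonschema.JsonSchemaValidatePipeline': 100,
}
```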

Usage

Let's assume that you are working with the JSON schema below, which describes products that each require an integer ID, a name, and a price strictly greater than zero (this example is taken from the JSON Schema website):

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Product",
    "description": "A product from Acme's catalog",
    "type": "object",
    "properties": {
        "id": {
            "description": "The unique identifier for a product",
            "type": "integer"
        },
        "name": {
            "description": "Name of the product",
            "type": "string"
        },
        "price": {
            "type": "number",
            "minimum": 0,
            "exclusiveMinimum": true
        }
    },
    "required": ["id", "name", "price"]
}

You can define a scrapy.Item from this schema by subclassing scrapy_jsonschema.item.JsonSchemaItem and setting its jsonschema class attribute to the schema. This attribute should be a Python dict (note that JSON's "true" became True below; you can use Python's json module to load a JSON Schema from a string):

from scrapy_jsonschema.item import JsonSchemaItem


class ProductItem(JsonSchemaItem):
    jsonschema = {
        "$schema": "http://json-schema.org/draft-04/schema#",
        "title": "Product",
        "description": "A product from Acme's catalog",
        "type": "object",
        "properties": {
            "id": {
                "description": "The unique identifier for a product",
                "type": "integer"
            },
            "name": {
                "description": "Name of the product",
                "type": "string"
            },
            "price": {
                "type": "number",
                "minimum": 0,
                "exclusiveMinimum": True
            }
        },
        "required": ["id", "name", "price"]
    }
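
If the schema is kept as JSON text rather than written out as a Python literal, the json module can do the conversion. The following is a minimal sketch: SCHEMA_JSON is a shortened, hypothetical stand-in for the schema above, and in practice the text might just as well be read from a file.

```python
import json

# Hypothetical, shortened copy of the product schema, kept as JSON text.
SCHEMA_JSON = """
{
    "type": "object",
    "properties": {
        "price": {"type": "number", "minimum": 0, "exclusiveMinimum": true}
    },
    "required": ["price"]
}
"""

# json.loads turns JSON's true/false/null into Python's True/False/None.
schema = json.loads(SCHEMA_JSON)

print(schema["properties"]["price"]["exclusiveMinimum"])  # True
```

The resulting dict can then be assigned to the jsonschema class attribute of a JsonSchemaItem subclass.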

You can then use this item class like any regular Scrapy item (note that assigning a field that is not in the schema raises a KeyError):

>>> item = ProductItem()
>>> item['foo'] = 3
(...)
KeyError: 'ProductItem does not support field: foo'

>>> item['name'] = 'Some name'
>>> item['name']
'Some name'

If you use this item definition in a spider and the pipeline is enabled, generated items that do not follow the schema will be dropped. In the (unrealistic) example spider below, the first item only contains "name"; "id" and "price" are missing:

import scrapy

# ProductItem is the JsonSchemaItem subclass defined above.


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/']

    def parse(self, response):
        yield ProductItem({
            "name": response.css('title::text').extract_first()
        })

        yield ProductItem({
            "id": 1,
            "name": response.css('title::text').extract_first(),
            "price": 9.99
        })

When you run this spider, the item with missing fields is dropped and you should see lines like these in the logs:

2017-01-20 12:34:23 [scrapy.core.scraper] WARNING: Dropped: schema validation failed:
 id: 'id' is a required property
price: 'price' is a required property

{'name': u'Example Domain'}

The second item conforms to the schema, so it is logged like any regular scraped item:

2017-01-20 12:34:23 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.com/>
{'id': 1, 'name': u'Example Domain', 'price': 9.99}
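
The two outcomes above can be reproduced outside Scrapy by validating plain dicts with the jsonschema library directly. This is only a minimal sketch of the underlying check, not the pipeline's actual implementation: the schema is a shortened copy of the product schema, and error_messages is a helper name made up for this example.

```python
from jsonschema import Draft4Validator

# Shortened copy of the product schema (descriptions omitted).
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0, "exclusiveMinimum": True},
    },
    "required": ["id", "name", "price"],
}

validator = Draft4Validator(schema)


def error_messages(data):
    """Return the sorted validation error messages for a plain dict."""
    return sorted(error.message for error in validator.iter_errors(data))


incomplete = {"name": "Example Domain"}
complete = {"id": 1, "name": "Example Domain", "price": 9.99}

# Two "... is a required property" messages, one per missing field.
for message in error_messages(incomplete):
    print(message)

# No messages at all for the item that conforms to the schema.
print(error_messages(complete))
```

A price of exactly 0 would also be rejected here, because "exclusiveMinimum" makes the minimum a strict bound.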

The item pipeline also updates Scrapy stats with a few counters under the jsonschema/ namespace, one per field that failed validation:

2017-01-20 12:34:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{...
 'item_dropped_count': 1,
 'item_dropped_reasons_count/DropItem': 1,
 'item_scraped_count': 1,
 'jsonschema/errors/id': 1,
 'jsonschema/errors/price': 1,
 ...}
2017-01-20 12:34:23 [scrapy.core.engine] INFO: Spider closed (finished)
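
Because these counters share the jsonschema/errors/ prefix, per-field failure counts are easy to pull out of a stats dict. The sketch below works on a plain dict copied from the log above; in a running crawl a similar dict should be available from the stats collector's get_stats() method. The helper name schema_error_counts is made up for this example.

```python
PREFIX = "jsonschema/errors/"

# Counters copied from the log above (entries elided there are left out).
stats = {
    "item_dropped_count": 1,
    "item_dropped_reasons_count/DropItem": 1,
    "item_scraped_count": 1,
    "jsonschema/errors/id": 1,
    "jsonschema/errors/price": 1,
}


def schema_error_counts(stats):
    """Map each field name to its validation error count."""
    return {
        key[len(PREFIX):]: value
        for key, value in stats.items()
        if key.startswith(PREFIX)
    }


print(schema_error_counts(stats))  # {'id': 1, 'price': 1}
```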
