scrapy-plugins / scrapy-jsonschema
Scrapy schema validation pipeline and Item builder using JSON Schema
License: BSD 3-Clause "New" or "Revised" License
JsonSchemaItem is a subclass of scrapy.item.DictItem, while a recent enough HubstorageExtension checks whether an item is a scrapy.Item (which is a subclass of DictItem too): scrapinghub/scrapinghub-entrypoint-scrapy@52e5362
Version being used: 2a1d434
Sample code to test with:
# coding: utf8

from scrapy_jsonschema import JsonSchemaItem, JsonSchemaValidatePipeline


class TestItem(JsonSchemaItem):
    jsonschema = {
        "$schema": "http://json-schema.org/draft-04/schema",
        "type": "object",
        "properties": {
            "test": {
                "type": "array",
                "items": {
                    "type": "string",
                },
            },
        },
    }


def test():
    item = TestItem()
    item['test'] = [1]
    stats = type('', (), {})
    stats.inc_value = lambda x: x
    pipeline = JsonSchemaValidatePipeline(stats)
    pipeline.process_item(item, None)


if __name__ == '__main__':
    test()
Expected result: An exception DropItem raised with message "schema validation failed: xxx".
Actual result:
Traceback (most recent call last):
  File "test.py", line 32, in <module>
    test()
  File "test.py", line 27, in test
    pipeline.process_item(item, None)
  File "/home/pengyu/temp/virtualenv/lib/python2.7/site-packages/scrapy_jsonschema/pipeline.py", line 33, in process_item
    path = '.'.join(absolute_path)
TypeError: sequence item 1: expected string, int found
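The TypeError suggests that the error path holds array indices as integers (here, index 0 of "test"), which str.join rejects. Below is a minimal sketch of one possible fix, assuming the path is a sequence of mixed strings and integers, as in jsonschema's ValidationError.absolute_path. The helper name format_error_path is hypothetical and not part of scrapy-jsonschema.

```python
from collections import deque


def format_error_path(absolute_path):
    """Join a validation error path into a dotted string.

    Hypothetical helper: converts every element to str first, because
    array indices appear in the path as integers.
    """
    return '.'.join(str(part) for part in absolute_path)


# Path for the first element of the "test" array, as in the report above.
print(format_error_path(deque(['test', 0])))  # prints "test.0"
```

Converting each element with str() keeps the dotted format for object keys and simply renders array positions as numbers.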
I noticed a bug with arrays of objects in JSON schemas: if I accidentally misspell a property, or add a property to my item that is not yet in the schema, the item still passes validation. For example:
{
  "type": "object",
  "properties": {
    "contacts": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "title": {
            "type": "string"
          },
          "name": {
            "type": "string"
          },
          "phone": {
            "type": "string"
          },
          "email": {
            "type": "string"
          }
        }
      }
    }
  }
}
The pipeline validates that "contacts" is present and not misspelt, but if I accidentally put "telephone" instead of "phone" (or make a similar mistake) inside the array of objects, the item is still considered valid.
Please let me know if this is intended, or if there is an available workaround.
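In JSON Schema, keys that are not listed under "properties" are allowed by default, so this behaviour follows from the schema itself. Setting "additionalProperties": false on an object schema makes a validator reject unknown keys such as "telephone". The sketch below shows one way to apply that to every object schema, assuming only the "properties", "items" and "definitions" keywords need to be walked. The helper forbid_extra_properties is hypothetical and not part of scrapy-jsonschema.

```python
import copy


def forbid_extra_properties(schema):
    """Return a copy of `schema` with additionalProperties=False on object schemas.

    Hypothetical helper. It only walks "properties", "items" and
    "definitions"; other keywords are left untouched.
    """
    schema = copy.deepcopy(schema)
    _walk(schema)
    return schema


def _walk(node):
    if not isinstance(node, dict):
        return
    if node.get("type") == "object" or "properties" in node:
        # Respect an explicit setting if the schema author already made one.
        node.setdefault("additionalProperties", False)
    for key in ("properties", "definitions"):
        children = node.get(key)
        if isinstance(children, dict):
            for child in children.values():
                _walk(child)
    items = node.get("items")
    if isinstance(items, list):
        for child in items:
            _walk(child)
    else:
        _walk(items)


# Abbreviated version of the schema from the report above.
schema = {
    "type": "object",
    "properties": {
        "contacts": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "phone": {"type": "string"},
                },
            },
        },
    },
}
strict = forbid_extra_properties(schema)
```

The resulting strict schema could then be used as the jsonschema attribute of the item class. Writing "additionalProperties": false by hand on the relevant objects has the same effect.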
Steps to Reproduce:
scrapy==2.2.0
scrapy_jsonschema==0.3.0
import scrapy_jsonschema
It would error out like the one below:
@asadurski has pointed out that it was due to this commit from scrapy: scrapy/scrapy@5256eae
It seems like it would make a lot more sense to drop only the fields that don't match the schema instead of the entire item, or at least to make this behaviour configurable.
If the dropping of a field causes the item to not have the required fields, then drop the entire item.
Any thoughts on this approach?
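The proposal could be sketched roughly as follows. This is not how scrapy-jsonschema currently behaves. The function drop_invalid_fields and its parameters are hypothetical, and the sketch assumes each validation error exposes a path whose first element is the offending top-level field (as jsonschema's ValidationError.absolute_path does for errors inside a property).

```python
def drop_invalid_fields(item, error_paths, required=()):
    """Remove top-level fields that failed validation.

    `error_paths` is an iterable of paths (sequences of keys/indices),
    one per validation error. Returns the cleaned dict, or None when
    the whole item should be dropped.
    """
    cleaned = dict(item)
    for path in error_paths:
        path = list(path)
        if not path:
            # The error concerns the item as a whole, so it cannot be
            # repaired by removing a single field.
            return None
        cleaned.pop(path[0], None)
    # Dropping a field may have removed something the schema requires.
    if any(name not in cleaned for name in required):
        return None
    return cleaned


# "title" is invalid but optional: only that field is removed.
kept = drop_invalid_fields(
    {"url": "https://example.com/book", "title": 42},
    error_paths=[["title"]],
    required=["url"],
)

# "url" is invalid and required: the whole item is dropped.
dropped = drop_invalid_fields(
    {"url": 1, "title": "A Book"},
    error_paths=[["url"]],
    required=["url"],
)
```

In a pipeline, a None result would correspond to raising DropItem, while a cleaned dict would be passed on.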
Scrapy logs the following:
ScrapyDeprecationWarning: scrapy.item.DictItem is deprecated, please use scrapy.item.Item instead
Currently, using JSON Schema items with dataclasses (or attrs) doesn't work. Here's a quick example:
from dataclasses import dataclass
from typing import Optional

from scrapy_jsonschema.item import JsonSchemaItem


class BookSchemaItem(JsonSchemaItem):
    jsonschema = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "title": "Book",
        "description": "A Book item extracted from books.toscrape.com",
        "type": "object",
        "properties": {
            "url": {
                "description": "Book's URL",
                "type": "string",
                "pattern": "^https?://[\\S]+$"
            },
            "title": {
                "description": "Book's title",
                "type": "string"
            }
        },
        "required": ["url"]
    }


@dataclass
class BookItem(BookSchemaItem):
    url: str
    title: Optional[str] = None
It's mostly because of how scrapy-jsonschema tries to define the fields in the item via https://github.com/scrapy-plugins/scrapy-jsonschema/blob/master/scrapy_jsonschema/item.py#L77-L79, which is different from how dataclasses and attrs create the fields.
We should have a better way of defining the JSON Schema inside dataclasses by re-writing some portions of the library.
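One possible direction, sketched under assumptions, is to derive the dataclass fields from the schema with the standard library's dataclasses.make_dataclass. The function dataclass_from_schema and the type mapping are hypothetical. The sketch does not integrate with JsonSchemaItem or the validation pipeline, and it assumes property names are valid Python identifiers.

```python
from dataclasses import field, fields, make_dataclass
from typing import Any, Optional

# Assumed mapping from JSON Schema primitive types to Python types.
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": float,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def dataclass_from_schema(name, schema):
    """Build a dataclass whose fields mirror the schema's properties."""
    required = set(schema.get("required", []))
    required_fields, optional_fields = [], []
    for prop, subschema in schema.get("properties", {}).items():
        json_type = subschema.get("type")
        # "type" may also be a list of types; fall back to Any in that case.
        py_type = _JSON_TYPES.get(json_type, Any) if isinstance(json_type, str) else Any
        if prop in required:
            required_fields.append((prop, py_type))
        else:
            optional_fields.append((prop, Optional[py_type], field(default=None)))
    # Fields without defaults must precede fields with defaults.
    cls = make_dataclass(name, required_fields + optional_fields)
    cls.jsonschema = schema  # keep the schema available for validation
    return cls


book_schema = {
    "type": "object",
    "properties": {
        "url": {"type": "string"},
        "title": {"type": "string"},
    },
    "required": ["url"],
}
BookItem = dataclass_from_schema("BookItem", book_schema)
book = BookItem(url="https://books.toscrape.com/")
```

Here the schema stays the single source of truth, and the dataclass fields are generated from it rather than declared twice.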
The following code produces an error, but I feel this should absolutely be possible to do:
from scrapy_jsonschema.item import JsonSchemaItem


class LinktItem(JsonSchemaItem):
    def __init__(self, jsonschema):
        self.jsonschema = jsonschema


jsonschema = {
    'properties': {
        'url': {
            'description': 'full URL to the item',
            'type': 'string',
            'format': 'url'
        },
        "sourceName": {
            "description": "name of source website",
            "type": "string"
        }
    },
    'definitions': {
        'non-empty-string': {
            'type': 'string',
            'minLength': 1
        }
    }
}

test = LinktItem(jsonschema)
print(test)
Error produced:
Traceback (most recent call last):
  File "/workspace/gencrawl/spiders/test.py", line 4, in <module>
    class LinktItem(JsonSchemaItem):
  File "/usr/local/lib/python3.6/site-packages/scrapy_jsonschema/item.py", line 77, in __new__
    '{} must contain "jsonschema" attribute'.format(cls.__name__)
ValueError: LinktItem must contain "jsonschema" attribute
As we can see in __init__, LinktItem does set jsonschema.
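The traceback shows the error being raised from __new__ on the "class LinktItem" line, that is, while the class is being created, before any instance exists and before __init__ can run. An attribute assigned in __init__ is therefore too late for that check. A possible workaround is to build the class dynamically with the schema as a class attribute. The sketch below uses a stand-in metaclass (SchemaCheckingMeta) and base class (BaseSchemaItem) that only imitate the check; both names, and make_item_class, are hypothetical and not the library's actual code.

```python
class SchemaCheckingMeta(type):
    """Stand-in metaclass: checks for "jsonschema" at class-creation time."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # Skip the check for the base class itself.
        is_base = not any(isinstance(base, SchemaCheckingMeta) for base in bases)
        if not is_base and not hasattr(cls, "jsonschema"):
            raise ValueError(
                '{} must contain "jsonschema" attribute'.format(name)
            )
        return cls


class BaseSchemaItem(metaclass=SchemaCheckingMeta):
    """Stand-in for JsonSchemaItem."""


def make_item_class(name, jsonschema, base=BaseSchemaItem):
    """Create a subclass of `base` with `jsonschema` as a class attribute."""
    # Call the base's metaclass so the schema is present when it checks.
    return type(base)(name, (base,), {"jsonschema": jsonschema})


jsonschema = {
    "properties": {
        "url": {"description": "full URL to the item", "type": "string"},
    },
}
LinkItem = make_item_class("LinkItem", jsonschema)
print(LinkItem.jsonschema["properties"]["url"]["type"])  # prints "string"
```

With the real library, passing JsonSchemaItem as base would be the analogous call, assuming its metaclass accepts the usual (name, bases, namespace) arguments.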