scrapy-jsonschema's People

Contributors

burnzz, elacuesta, gallaecio, redapple, starrify, vmruiz

scrapy-jsonschema's Issues

Incorrectly handling error path in the pipeline when handling arrays.

Version being used: 2a1d434
Sample code to test with:

# coding: utf8

from scrapy_jsonschema import JsonSchemaItem, JsonSchemaValidatePipeline


class TestItem(JsonSchemaItem):
    jsonschema = {
        "$schema": "http://json-schema.org/draft-04/schema",
        "type": "object",
        "properties": {
            "test": {
                "type": "array",
                "items": {
                    "type": "string",
                },
            },
        },
    }


def test():
    item = TestItem()
    item['test'] = [1]
    stats = type('', (), {})
    stats.inc_value = lambda x: x
    pipeline = JsonSchemaValidatePipeline(stats)
    pipeline.process_item(item, None)


if __name__ == '__main__':
    test()

Expected result: a DropItem exception is raised with the message "schema validation failed: xxx".
Actual result:

Traceback (most recent call last):
  File "test.py", line 32, in <module>
    test()
  File "test.py", line 27, in test
    pipeline.process_item(item, None)
  File "/home/pengyu/temp/virtualenv/lib/python2.7/site-packages/scrapy_jsonschema/pipeline.py", line 33, in process_item
    path = '.'.join(absolute_path)
TypeError: sequence item 1: expected string, int found
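
The traceback points at `'.'.join(absolute_path)`: for array items the error path contains integer indices, and `str.join` only accepts strings. A minimal sketch of one possible fix follows, which is not necessarily what the project adopted. `format_error_path` is a hypothetical helper name, and the `deque` stands in for a path that mixes property names with array indices.

```python
from collections import deque


def format_error_path(absolute_path):
    """Join path elements with dots, converting integer indices to strings."""
    return ".".join(str(part) for part in absolute_path)


# Stand-in for the path of an error raised on the first element of "test".
path = format_error_path(deque(["test", 0]))
```

With the sample item above, the path would then read `test.0` instead of raising a TypeError.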

schema doesn't check for misspelt/new fields within array of objects

I noticed a bug with arrays of objects in JSON schemas: if I accidentally misspell a property, or add a property to my item that is not yet in the schema, the item still passes validation. For example:

{
   "type":"object",
   "properties":{
      "contacts":{
         "type":"array",
         "items":{
            "type":"object",
            "properties":{
               "title":{
                  "type":"string"
               },
               "name":{
                  "type":"string"
               },
               "phone":{
                  "type":"string"
               },
               "email":{
                  "type":"string"
               }
            }
         }
      }
   }
}

The schema validates that contacts is present and not misspelt, but if I accidentally use "telephone" instead of "phone" (or make a similar mistake) inside one of the objects in the array, the item is still considered valid.

Please let me know if this is intended, or if there is an available workaround.
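
For context, JSON Schema allows undeclared properties by default; an object schema rejects them only when it sets `"additionalProperties": false`. The sketch below is one possible workaround under that assumption: a hypothetical helper, `forbid_extra_properties`, which adds the keyword to every object schema that declares `properties`, including those nested under `items`. It is not part of scrapy-jsonschema.

```python
import copy


def forbid_extra_properties(schema):
    """Return a copy of `schema` in which every subschema declaring
    "properties" also sets "additionalProperties" to False (unless already set)."""
    result = copy.deepcopy(schema)

    def visit(node):
        if not isinstance(node, dict):
            return
        props = node.get("properties")
        if isinstance(props, dict):
            node.setdefault("additionalProperties", False)
            for sub in props.values():
                visit(sub)
        items = node.get("items")
        if isinstance(items, dict):
            visit(items)
        elif isinstance(items, list):
            for sub in items:
                visit(sub)

    visit(result)
    return result


schema = {
    "type": "object",
    "properties": {
        "contacts": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "phone": {"type": "string"},
                    "email": {"type": "string"},
                },
            },
        }
    },
}
strict_schema = forbid_extra_properties(schema)
```

With `strict_schema`, a contact containing "telephone" should fail validation in any conforming validator, because "telephone" is not a declared property.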

Drop non-conforming fields instead of whole items

It seems like it would make more sense to drop only the fields that don't match the schema instead of the entire item, or at least to make this behavior configurable.

If dropping a field leaves the item without its required fields, then the entire item should be dropped.

Any thoughts on this approach?
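
To make the proposal concrete, here is a rough sketch of the described behavior. It is not existing library code. It assumes a validator has already produced error paths such as `("title",)` or `("contacts", 0, "phone")`. `drop_invalid_fields` is a hypothetical name, and the local `DropItem` class is a stand-in for `scrapy.exceptions.DropItem` so the sketch runs without Scrapy.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


def drop_invalid_fields(item, error_paths, required=()):
    """Remove the top-level field named by each error path, then make sure
    the required fields survived; otherwise drop the whole item."""
    cleaned = dict(item)
    for path in error_paths:
        path = list(path)
        if not path:
            # An empty path means the error concerns the item as a whole.
            raise DropItem("schema validation failed for the whole item")
        cleaned.pop(path[0], None)
    missing = [name for name in required if name not in cleaned]
    if missing:
        raise DropItem("missing required fields: {}".format(", ".join(missing)))
    return cleaned


item = {"url": "https://example.com/book", "title": 42}
cleaned = drop_invalid_fields(item, [("title",)], required=["url"])
```

Here only the invalid "title" field is removed; had "url" been the invalid field, the required-field check would have dropped the whole item.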

Support item definitions using dataclasses

Currently, using JSON Schema items with dataclasses (or attrs) doesn't work. Here's a quick example:

from dataclasses import dataclass
from typing import Optional

from scrapy_jsonschema.item import JsonSchemaItem


class BookSchemaItem(JsonSchemaItem):
    jsonschema = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "title": "Book",
        "description": "A Book item extracted from books.toscrape.com",
        "type": "object",
        "properties": {
            "url": {
                "description": "Book's URL",
                "type": "string",
                "pattern": "^https?://[\\S]+$"
            },
            "title": {
                "description": "Book's title",
                "type": "string"
            }
        },
        "required": ["url"]
    }


@dataclass
class BookItem(BookSchemaItem):
    url: str
    title: Optional[str] = None

This is mostly because scrapy-jsonschema defines the item's fields in https://github.com/scrapy-plugins/scrapy-jsonschema/blob/master/scrapy_jsonschema/item.py#L77-L79, which differs from how dataclasses and attrs create their fields.

We should provide a better way to define the JSON Schema inside dataclasses, which would mean rewriting some portions of the library.
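
As an illustration only, one conceivable shape for such an API is to attach the schema to a plain dataclass as a `ClassVar`, which keeps it out of the generated `__init__` and out of `dataclasses.fields()`. This is a hypothetical design that does not use `JsonSchemaItem`; a pipeline would still have to validate `asdict(item)` against `jsonschema` itself.

```python
from dataclasses import asdict, dataclass, fields
from typing import ClassVar, Optional


@dataclass
class BookItem:
    # ClassVar: shared by the class, not treated as a dataclass field.
    jsonschema: ClassVar[dict] = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "url": {"type": "string", "pattern": "^https?://[\\S]+$"},
            "title": {"type": "string"},
        },
        "required": ["url"],
    }

    url: str
    title: Optional[str] = None


book = BookItem(url="http://books.toscrape.com/")
field_names = [f.name for f in fields(BookItem)]
payload = asdict(book)  # the plain dict a validator would receive
```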

Unable to instantiate class at runtime

The following code produces an error, but I feel this should absolutely be possible to do:

from scrapy_jsonschema.item import JsonSchemaItem

class LinktItem(JsonSchemaItem):
    def __init__(self, jsonschema):
        self.jsonschema = jsonschema


jsonschema = {
    'properties': {
        'url': {
            'description': 'full URL to the item',
            'type': 'string',
            'format': 'url'
        },
        "sourceName": {
            "description": "name of source website",
            "type": "string"
        }
    },
    'definitions': {
        'non-empty-string': {
            'type': 'string',
            'minLength': 1
        }
    }
}

test = LinktItem(jsonschema)
print(test)

Error produced:

Traceback (most recent call last):
  File "/workspace/gencrawl/spiders/test.py", line 4, in <module>
    class LinktItem(JsonSchemaItem):
  File "/usr/local/lib/python3.6/site-packages/scrapy_jsonschema/item.py", line 77, in __new__
    '{} must contain "jsonschema" attribute'.format(cls.__name__)
ValueError: LinktItem must contain "jsonschema" attribute

As we can see in __init__, LinktItem does set jsonschema.
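
The traceback shows the ValueError being raised from `__new__` while the `class` statement itself executes, which suggests the check happens at class creation, before any `__init__` can run. Under that reading, the schema has to be a class attribute, and a class can still be built at runtime by passing the schema in the class namespace. The sketch below demonstrates the idea with a stand-in metaclass (`SchemaMeta`, `SchemaItemBase`, and `make_item_class` are hypothetical names, not the library's code).

```python
class SchemaMeta(type):
    """Stand-in metaclass that mimics the check seen in the traceback."""

    def __new__(mcs, name, bases, namespace):
        # Only subclasses (classes with bases) must provide a schema.
        if bases and "jsonschema" not in namespace:
            raise ValueError(
                '{} must contain "jsonschema" attribute'.format(name)
            )
        return super().__new__(mcs, name, bases, namespace)


class SchemaItemBase(metaclass=SchemaMeta):
    """Stand-in for JsonSchemaItem."""


def make_item_class(name, jsonschema):
    """Create an item class at runtime with the schema as a class attribute."""
    return SchemaMeta(name, (SchemaItemBase,), {"jsonschema": jsonschema})


LinkItem = make_item_class(
    "LinkItem",
    {"properties": {"url": {"type": "string"}}},
)
```

Assuming the real metaclass behaves like this stand-in, the equivalent call with the library would pass `JsonSchemaItem` as the base and the schema dict as the `jsonschema` entry of the namespace.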
