scrapinghub / web-poet Goto Github PK

View Code? Open in Web Editor NEW

91.0 9.0 14.0 1.13 MB

Web scraping Page Objects core library

Home Page: https://web-poet.readthedocs.io/en/stable/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

web-scraping python page-objects hacktoberfest

web-poet's Introduction

Scrapinghub command line client

shub is the Scrapinghub command line client. It allows you to deploy projects or dependencies, schedule spiders, and retrieve scraped data or logs without leaving the command line.

Requirements

Python >= 3.6

Installation

If you have pip installed on your system, you can install shub from the Python Package Index:

pip install shub

Please note that if you are using Python < 3.6, you should pin shub to 2.13.0 or lower.

We also supply stand-alone binaries. You can find them in our latest GitHub release.

Documentation

Documentation is available online via Read the Docs: https://shub.readthedocs.io/, or in the docs directory.

web-poet's People

Contributors

Stargazers

Watchers

Forkers

zanachka sibiryakov ejulio gc-ss bhuvanesh15 kiyky gallaecio wanderer163 wrar serhii73 sjdex ogiaquino bradley39e akshayphilar

web-poet's Issues

Support async providers

https://github.com/victor-torres/core-po/blob/4e05d3e98547490e4f073aad439109eb49cc7f18/core_po/builder.py#L46

Async providers are supported on scrapy-po using Twisted's inlineCallbacks and Scrapy's maybeDeferred_coro method.

How should we implement async support here?

cc @kmike

include more attributes on web_poet.page_inputs.ResponseData

This issue aims to discuss adding new fields in addition to the existing html and url as proposed by @gatufo:

cookies
headers
status_code

time freezing doesn't work properly with zyte-common-items

I tried to use scrapy-poet's savefixture together with Product item from zyte-common-items. The result:

meta.json:

{
    "frozen_time": "2023-01-31T17:25:54.362413+00:00"
}

output.json:

    "metadata": {
        "dateDownloaded": "2023-01-31T17:25:55Z",
        "probability": 1.0
    },

And then, comparison fails when running pytest, on a freshly generated fixture:

E         -  'metadata': {'dateDownloaded': '2023-01-31T17:25:55Z', 'probability': 1.0},
E         ?                                                    ^
E         +  'metadata': {'dateDownloaded': '2023-01-31T17:25:54Z', 'probability': 1.0},
E         ?                                                    ^

Seems to be some issue with rounding, but I'm not sure; haven't checked it. Another possibile explanation is that without freezing the time during test generation the time can change.

Implement RulesRegistry.resolve

There is RulesRegistry.page_cls_for_item that gives you the page object class to use given an output item class.

It would be nice to have a method that, in addition to supporting an input item class, can also support an input page object class, i.e. return the page object class itself, or an override if one exists in the rules registry.

Use default black options

Currently we override line length to be 120. I'm not sure what's the reason for this, and propose to use black defaults.

A way to run just one test case

I see three useful ways to run tests: by passing the fixtures directory / not passing anything, by passing the fixtures/<page object> directory and by passing the fixtures/<page object>/<test case> directory. Out of these, the last use case doesn't work, instead running all test cases in fixtures/<page object>, because collect_file_hook() always returns a WebPoetFile corresponding to fixtures/<page object>. On the other hand, if WebPoetFile is like a Python file with multiple test functions and one test case is like a single test function, then the correct usage would be fixtures/<page object>::<test case>? It currenty returns "directory argument cannot contain ::".

Guidelines for using item_cls_fields: True or False by default?

@gatufo has some concerns about the current guidelines:

#53 (comment)

For me when combining both Item and Page Objects, Item defines the fields that what you want to extract, and Page Objects become just providers of those fields. That's why I think the default value should be True here to allow (or encourage) this use by default.

Split documentation by target audience

At the moment we kind of mix documentation for page object writers and for framework writers. I think it would be best to split the two at the root of the documentation, so that the top-level sections of the table of contents become:

Getting started
Writing page objects
Writing frameworks
Reference

So, for example, the content under “Framework expectations” in the “Additional requests” topic would be split into a separate page, under “Writing frameworks”.

Allowing subfields to be extracted

Stemming from scrapinghub/scrapy-poet#111 where we'd want to implement the API in web-poet itself regarding extracting data from a subset of fields.

API

The main directives that we want to support are:

include_fields
exclude_fields

Using both directives together should not be allowed.

via page object instance

item = partial_item_from_page_obj(product_page, include_fields=["x", "y"])
print(item)  # ProductItem(x=1, y=2, z=None)
print(type(product_page))  # ProductPage

This API assumes we already have an instance of the page object with the appropriate response data in it. Moreover, the item class can be inferred from the page object definition:

class ProductPage(WebPage[ProductItem]):
    ...

Arguably, we could also support page object classes as long as the URL is provided for the response data to be downloaded by the configured downloader.

via item class

Conversely, we could also support directly asking for the item class instead of the page object as long as we have access to the ApplyRule to infer their relationships. Unlike the page object, a single item class could have relationships to multiple page objects, depending on the URL.

But this means that the response should still be downloaded and the downloader is configured.

item = partial_item_from_item_cls(
    ProductItem, include_fields=["x", "y"], url="https://example.com"
)

There are several combinations of scenarios for this type of API.

Page object setup

A. The page object has all fields using the `@field` decorator

This is quite straightforward to support since we can easily do:

from web_poet.fields import get_fields_dict
from web_poet.utils import ensure_awaitable


fields = get_fields_dict(page_obj)
item_dict = item_from_fields_sync(page_obj, item_cls=dict, skip_nonitem_fields=False)
item = page_obj.item_cls(
    **{
        name: await ensure_awaitable(item_dict[name])
        for name in item_dict
        if name in field_names
    }
)

We basically derive all the fields from the page object and call them one-by-one.

B. The page object doesn't use the `@field` decorator but solely utilizes the `.to_item()` method

Alternatively, the page object can be defined as:

class ProductPage(WebPage[ProductItem]):
    def to_item(self) -> ProductItem:
        return ProductItem(x=1, y=2, z=3)

The methodology mentioned in scenario A above won't work here since calling get_fields_dict(page_obj) would result in an empty dict.

Instead, we can simply call the page object's .to_item() method and just include/exclude the given fields from there.

C. The page object has some fields using the `@field` decorator while some fields are populated inside the `.to_item()` method

class ProductPage(WebPage[ProductItem]):
    @field
    def x(self) -> int:
        return 1
        
    def to_item(self) -> ProductItem:
        return ProductItem(x=self.x, y=2, z=3)

This scenario is much harder since calling get_fields_dict(page_obj) would result in a non-empty dict: {'x': FieldInfo(name='x', meta=None, out=None)}.

We could try to check if the page object has overridden the .to_item() method by something like page_obj.__class__.to_item == web_poet.ItemPage.to_item. However, we're also not sure if it has added any new fields at all or has simply overridden it to add some post-processing or validation operations. Either way, the resulting field value from the .to_item() method (if it's overridden) could be totally different than calling the field directly.

We could also detect this scenario whenever some fields specified in include_fields=[...] or exclude_fields=[...] are not present in get_fields_dict(page_obj). If so, we can simply call the .to_item() method and include/exclude fields from there.

However, it's a wasteful operation since some fields could be expensive (i.e. having additional requests) and that's why they want to be excluded in the first place. But then, they were still unintentionally called via the .to_item() method.

In this case, we'd rely on the page object developer to design their page objects well and ensure that our docs highlight this caveat.

But still, there's the question of how to handle fields specified in include_fields=[...] or exclude_fields=[...] that are not existing at all. Let's tackle this in the further sections below (Spoiler: it'd be great to not support scenario C).

Handling field presence

I. Some fields specified in `include_fields=[...]` are not existing

An example would be:

@attrs.define
class SomeItem:
    x: Optional[int] = None
    y: Optional[int] = None
    
class SomePage(WebPage[SomeItem]):
    @field
    def x(self) -> int:
        return 1
        
    @field
    def y(self) -> int:
        return 2
        
partial_item_from_page_obj(some_page, include_field=["y", "z"])

For this case, we can simply ignore producing the z field value since the page object does not support it.

Moreover, if all of the given fields are not existing at all, partial_item_from_page_obj(some_page, include_fields=["z"]), an empty item would be returned.

Note that this is could be related to scenario C above and we have to be careful since a given field might be declared without using the @field decorator.

class SomePage(WebPage[SomeItem]):
    @field
    def x(self) -> int:
        return 1
        
    def to_item(self) -> SomeItem:
        return SomeItem(x=1, y=2)
        
partial_item_from_page_obj(some_page, include_fields=["y"])

Because of these types of scenarios, it'd be hard to fully trust deriving the fields from a page object via fields = get_fields_dict(page_obj).

SOLUTION 1: we can make it clear to our users via our docs that we will only call .to_item() if the page object explicitly doesn't use any @field decorators. This means that we won't be supporting scenario C at all.

II. Some fields specified in `exclude_fields=[..]` are not existing

The same case with scenario I where we can simply ignore non-existing fields.

However, it has the same problem about supporting .to_item() for scenario C, since there might be some fields that's using the @field decorator while the rest are produced via the .to_item() method.

To err on the side of caution, it could simply call .to_item() and then removing the fields declared in exclude_fields=[...]. Better yet, go with SOLUTION 1 above as well.

III. No fields were given in `include_fields=[...]`

For this, we could simply return an item with empty fields.

If any fields are required but are missing (i.e. None), we simply let Python error it out: TypeError: __init__() missing 1 required positional argument: ....

IV. No fields were given in `exclude_fields=[...]`

We could return the item with full fields, basically calling the .to_item().

Item setup

1. The item has all fields marked as `Optional`

There's no issue with this since including or excluding fields won't result into errors like TypeError: __init__() missing 1 required positional argument: ....

All of the above examples above has this item setup.

2. The item has fields marked as required

For example:

@attrs.define
class SomeItem:
    x: int
    y: int
    
class SomePage(ItemPage[SomeItem]):
    def x(self) -> int:
        return 1
        
    def y(self) -> int:
        return 2
        
partial_item_from_page_obj(some_page, include_fields=["x"])

Unlike in scenario 1, this results in TypeError: __init__() missing 1 required positional argument : ... since the y field is required.

One solution is to allow overriding the item class that the page object is returning which removes the required fields. Here's an example:

@attrs.define
class SomeSmallerItem:
    x: int
    
partial_item_from_page_obj(some_page, include_fields=["x"], item_cls=SomeSmallerItem)

The other API could be:

partial_item_from_item_cls(
    SomeItem, include_fields=["x"], url="https://example.com", replace_item_cls=SomeSmallerItem,
)

Summary

We only support page object setups of scenarios A and B while support for C is dropped.

This makes it easier for the item setups in scenario I and II. Scenario III and IV should work whatever the case may be.

The item setup for scenario 1 is straightforward while scenario 2 needs a way to replace/override the item that a page object returns with a smaller version.

IDE-friendly unit tests

Right now tests created for page objects aren't discoverable by pytest, therefore IDEs (e.g. VS Code) aren't aware of tests presence, though there might be other unit tests discoverable by IDE. This results in the next problems:

Even if you need to run tests just for a single netloc you've got to run either all tests or to specify a path to a given test, which is not very convinient if you work with few different netlocs at a time.
web-poet implicitly registers a pytest plugin which, sometimes, is not needed. Disabling is possible, but you've got to use something like -p no:web-poet. Which is again makes the whole process less natural for the IDE.
No standard way to run a single test for a single URL.
Overall, the whole approach to unit tests varies drastically from the usual way pytest tests are created and used.

Would be really nice to have IDE-friendly tests for page objects that are:

discoverable by pytest and IDE without the need to configure some low-level stuff.
possible to run and debug in IDE a single test for a single URL.
pytest plugin could be easily disabled if needed.

Returns doesn't work in some subclasses

Similar to zytedata/zyte-common-items#49, even though the code is different, it's also not recursive and fails on e.g. this:

def test_returns_inheritance() -> None:
    @attrs.define
    class MyItem:
        name: str

    class BasePage(ItemPage[MyItem]):
        @field
        def name(self):
            return "hello"

    MetadataT = TypeVar("MetadataT")

    class HasMetadata(Generic[MetadataT]):
        pass

    class DummyMetadata:
        pass

    class Page(BasePage, HasMetadata[DummyMetadata]):
        pass

    page = Page()
    assert page.item_cls is MyItem

Mechanism for Page Objects declaring how HttpResponse is acquired

For this discussion, we'll focus on the subclasses of web_poet.WebPage which requires the web_poet.HttpResponse as a dependency.

Problem

There are some scenarios where we might need to perform some operation or do an extra step so that a Page Object can properly acquire the right HttpResponse dependency.

For example, some websites may require an API token when requesting a page. How does the Page Object declare which token to use to acquire the HttpResponse? Could the Page Object somehow know how to retrieve the API Key from somewhere? Does it know how to acquire a fresh API Key when it stops working? This example could also apply to web pages having the specific need to use some request headers like cookies.

Another variation of the problem would be HttpResponses acquired using POST requests like in search forms. This means that a request body must be properly used, as well as the request headers reflecting the data contents properly (e.g. Content-Type: application/json).

Note that web_poet.PageParams exist which could hold the things needed by a Page Object like tokens or cookies. However, it's not applicable to our particular use case since those things would only be present when the Page Object is instantiated. Currently, web_poet.PageParams serves the purpose of providing extra data to the Page Object (e.g. max paginations, currency conversion value, etc) which affects how it parses the data. What we essentially need is the means to specifically build (or at least declare instructions) the HttpResponse dependency that is needed by a Page Object.

Status Quo

Currently, the problem could be solved in a way using scrapy and scrapy-poet. Here's an example:

# Module for Page Objects

import attrs
import web_poet

@attrs.define
class TokenPage(web_poet.WebPage):
    @property
    def token(self):
        return self.response.css("script::text").re_first(r'"token":"(.+?)",')

@attrs.define
class SearchApiPage(web_poet.WebPage):
    def to_item(self):
        return {
            "total_results": self.response.json().get("totalResults")
        }

# Module for the Scrapy Spider

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/"]

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_poet.InjectionMiddleware": 543,
        }
    }

    def parse(self, response, page: TokenPage):
        yield response.follow(
            "https://search-api.example.com/?q=somequery",
            self.parse_search_page,
            headers={"Authorization": f"Bearer {page.token}"},
        )

    def parse_search_page(self, response, page: SearchApiPage):
        return page.to_item()

In this example, we're ultimately interested in retrieving the total number of results on the search query https://search-api.example.com/?q=somequery. However, requesting such a page requires the authorization header bearing a particular token. The token is acquired when visiting any regular pages (i.e. not an API) and parsing it inside the HTML document.

Note that this example is the minimal example that we could have. The solution we must arrive should also be able to support cases when we want to:

Perform a POST request instead of GET for the search API,
Cache the token somewhere so we don't need to revisit the page and parse it again, or
Have a mechanism to invalidate the cache and retrieve a fresh set of tokens.

Objective

The solution presented above only works when the Page Objects are used in the context of Scrapy spiders. The spider is what binds the Page Objects together like building blocks in order to acquire the right response. The spider should also be aware about the sequence of Page Objects to use, as well as how to use the parsed field from a Page Object to feed into the next Page Object.

The source of the problems above would be that Page Objects doesn't have a way to provide instructions on how to build its dependencies in a generic way.

Possible Approaches

Approach 1

Page Objects could have an alternative constructor which contains the actual implementation about how to build its dependencies. For example, the SearchApiPage above could directly use TokenPage inside its alternative constructor to acquire the token needed for its Authorization header.

I'm not too fond of this idea since it puts a lot of emphasis on the Page Object to be able to determine how the fulfill the dependencies of the other Page Objects it needs to use. The Page Object class could get a lot more complex, de-emphasizing its very purpose of focusing on data extraction.

Approach 2

Use the provider mechanism of scrapy-poet.

This means that a provider would be created for the TokenPage so that it can be injected when another Page Objects ask for it in their constructor. However, this only applies when scrapy-poet is used. This makes the Page Objects not portable outside of its realm. Although other framework implementations could copy the approach.

Another downside is that the provider itself is very specific to the set of Page Objects it caters. When another bunch of Page Objects for other sites are introduced needing another variety of building instructions, the written providers could get more complex.

Lastly, this doesn't solve our problem of having the ability to determine how to acquire the HttpResponse. For example, do we need a GET or POST request for that? What are the headers necessary for requesting the HttpResponse? What's the request body?

Approach 3

Similar to web_poet.OverrideRule (API reference), there should be a similar structure to declare the instruction rules on how to build the dependencies of a Page Object. Frameworks implementing web-poet should properly read and abide to this. For example, scrapy-poet would need to update its providers (e.g. HttpResponseProvider) to read such rules.

The minimum things that we need from this instruction rule declaration are:

URL pattern rule — (instance of url_matcher.matcher.Patterns) To determine which URLs the instruction rule would apply. It could be the case that a single PO could handle different types of URLs where different instructions are needed.
Page Object — (cls) The PO we're providing instructions for.
Request Instructions — (dict) Contains the instructions about how the HttpResponse for the given PO is acquired.

From our initial Page Objects, the instruction rule declaration could be something like
(we could make the structure of this data a bit better.):

[   
    {
        "url_pattern": url_matcher.Patterns(include=["example.com"]),
        "page_object": TokenPage,
        "request": {
            "method": "GET",
        }
    },
    {
        "url_pattern": url_matcher.Patterns(include=["search-api.example.com"]),
        "page_object": SearchApiPage,
        "dependencies": [TokenPage],
        "request": {
            "method": "GET",
            "headers": {"Authorization": f"Bearer {TokenPage.token}"},
        }
    }
]

This means that frameworks implementing web-poet could read such instruction rules and know how to build the POs. The rules could be declared somewhere like in a configuration file or perhaps similar to how Overrides are handled. They could also be simply declared as class variables directly in the Page Object class itself.

There's also the potential to extend this rule declaration to possibly include some interactions between two or more POs. However, I'm not exactly sure if this is a common use case and it may cause the instructions to be more complex.

It's also not clear how this approach could cache the TokenPage. Perhaps that could be left to the implementing framework.

I'm also thinking if we should extend such instruction rules to serve non-HttpResponse dependencies. Although it might make the rule structure a bit more complex to serve generic use cases.

In any case, I believe this could be a good starting point in thinking how to solve the problem of declaring how HttpResponses of Page Objects are acquired since :

Page Objects should be independent of the instruction rules and don't know they exist.
- This makes the instruction rules completely optional (similar to Overrides).
Conversely, the instruction rules simply denote how to fulfill the HttpResponses for a given Page Object.
- Page Objects doesn't care how the HttpResponse was acquired at all. It is simply given. This enables Page Objects to focus on data extraction.

Fix mypy issues

Currently, a lot of the tests that we have don't have type annotations, which means the tests aren't scrutinized by mypy. Aside from annotating all of the test functions, we could set check_untyped_defs = true in the config.

In either way, we'd need to address all of the mypy issues that will be raised.

Proposal: Utility functions that interacts with the rules

Background

Following the acceptance of #27, developers could now use URL patterns to declare which Page Objects would work on specific URL patterns (reference code).

Problem

For large code bases, there might be hundreds of Page Objects which in turn could also result in hundreds of OverrideRule created using the @handle_urls annotation.

This could be unwieldy especially when they're spread out across multiple different subpackages and submodules within a Page Object Project. A project could utilize other Page Objects from other external packages, leading to a deeper roots.

Moreover, overlapping rules (e.g. POs improving on older POs) could add another layer of complexity. It should be immediately clear which PO would be executed according to URL pattern and priority.

Idea

There should be some sort of collection of utility functions that could interact with the List[OverrideRule] from the registry. Suppose that we have:

from web_poet import default_registry, consume_modules

consume_modules(my_page_objects, some_other_project, another_project)
rules = default_registry.get_overrides()

We could then have something like:

from web_poet import rule_match

# Explore which OverrideRules are matches a given URL.
rule_match.find(rules, url="https://example.com/product/electronics?id=123")
# Returns: [OverrideRule_1, OverrideRule_2, OverrideRule_3, OverrideRule_4]

# It could also narrow down the search
rule_match.find(rules, url="https://example.com/product/electronics?id=123", overridden=ProductPage)
# Returns: [OverrideRule_2, OverrideRule_4]

# Finding the rules for a given set of criteria could result in multiple OverrideRules.
# This could be POs improving on older POs which could also improve on other POs.

# However, what we would ultimately want is the Final rule that has the highest priority
rule_match.final(rules, url="https://example.com/product/electronics?id=123", overridden=ProductPage)
# Returns: OverrideRule_2

This could help lead in creating test suites in projects that utilize other Page Object projects:

assert ImprovedProductPage == rule_match.final(
    rules, "https://example.com/product/electronics?id=123", overridden=ProductPage
).use

Other Notes:

I see that the rule_match.find() is quite similar to how the PageObjectRegistry.search_override() method behaves (reference).
- Refactoring it to a function (instead of a method) could cover developer use cases wherein the List[OverrideRule] is not created by the default_registry (or some custom registry). For example, it could merely be a simple configuration file containing all of the List[OverrideRule] that is manually maintained.
- However, in any case, the rule_match.find() that is explored above aims to have an actual URL instead of a Pattern (which PageObjectRegistry.search_overrides() has)

ZyteItemAdapter usage in fixtures

As found in #134, if ZyteItemAdapter is added to ItemAdapter when the fixture is generated, like documented at https://zyte-common-items.readthedocs.io/en/latest/setup.html#configuration , the generated output.json will not contain fields that are empty, and when running the test you need to also add ZyteItemAdapter to ItemAdapter, e.g. via conftest.py. At the same time, we may want to actually emit empty fields.

introduce more pages in addition to WebPage: JsonWebPage, CsvWebPage, etc

This issue aims to discuss adding more page handlers as proposed by @gatufo.

Currently, we only have web_poet.pages.WebPage which supports selectors like css() and xpath() on HTML pages.

However, there are also other types of pages we could support:

JsonWebPage which could have jmespath queries
CsvWebPage which could have a "unified" way to read them (usually used as seed data inputs in spiders)
PlainTextWebPage which could also have a "unified" way to read newline-separated data (also usually use for seed URL inputs)

`_Url` to inherit from `str`

There was a previous discussion about this before in one of the PRs.

I'm re-opening this for tracking since this part of w3lib.util.to_unicode breaks: https://github.com/scrapy/w3lib/blob/master/w3lib/util.py#L46-L49

In particular, doing something like:

from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor()
link_extractor.extract_links(response)

where response is a web_poet.page_inputs.http.HttpResponse instance and not scrapy.http.Response.

The full stacktrace would be:

File "/usr/local/lib/python3.10/site-packages/scrapy/linkextractors/[lxmlhtml.py](http://lxmlhtml.py/)", line 239, in extract_links
    base_url = get_base_url(response)
  File "/usr/local/lib/python3.10/site-packages/scrapy/utils/[response.py](http://response.py/)", line 27, in get_base_url
    _baseurl_cache[response] = html.get_base_url(
  File "/usr/local/lib/python3.10/site-packages/w3lib/[html.py](http://html.py/)", line 323, in get_base_url
    return safe_url_string(baseurl)
  File "/usr/local/lib/python3.10/site-packages/w3lib/[url.py](http://url.py/)", line 141, in safe_url_string
    decoded = to_unicode(url, encoding=encoding, errors="percentencode")
  File "/usr/local/lib/python3.10/site-packages/w3lib/[util.py](http://util.py/)", line 47, in to_unicode
    raise TypeError(
TypeError: to_unicode must receive bytes or str, got ResponseUrl

Other alternatives could be adjusting Scrapy code instead to cast str(response.url) for every use.

PEP 561 compliance

https://peps.python.org/pep-0561/ should be followed in order for the web-poet to properly work on mypy. Otherwise, the following error is raised: Skipping analyzing "web_poet": module is installed, but missing library stubs or py.typed marker.

Links pointing to same document

Hello 👋
Please ignore if I understood it wrong.
When I check the doc tutorial.rst,
links like from_ground_up, scrapy:topics-index are pointing to same tutorial.rst.
Should it point somewhere else or my understanding is wrong?

Consider removing the limitation of only having a single PO in the registry

Stemming off from the discussion in #84 (comment).

Document super().foo issues and workaround

https://stackoverflow.com/a/76654866/939364

Seen in a production page object.

Map HttpResponse.status in the example implementation

Update

web-poet/web_poet/example.py

Lines 35 to 41 in 35c67d6

    
           def _get_http_response(url: str) -> HttpResponse: 
        
               response = requests.get(url) 
        
               return HttpResponse( 
        
                   response.url, 
        
                   body=response.content, 
        
                   headers=response.headers, 
        
               )

to also map the response status code into the returned HttpResponse object.

Discussion for supporting Scrapy's LinkExtractor

One neat feature inside Scrapy is it's LinkExtractors functionality. We usually try to use this whenever we want links to be extracted inside a given page.

Inside web-poet, we can attempt to use it as:

from scrapy.linkextractors import LinkExtractor
from web_poet.pages import ItemWebPage

class SomePage(ItemWebPage):

    def to_item(self):
        return {
            'links': LinkExtractor(
                allow=r'some-website.com/product/tt\d+/$'
                process_value=some_processor,
                restrict_xpaths=f'//div[@id="products"]//span',
            ).extract_links(self.response)  # expects a Scrapy Response instance
        }

The problem lies in the extract_links() method since it actually expects a Scrapy Response instance. On the current scope, we only have access to web-poet's ResponseData instead.

At the moment, we could simply re-work the logic to avoid using LinkExtractors altogether. However, there might be some cases wherein it's a much better option.

With this in mind, this issue attempts to be a starting point to open up these discussion points:

Given Scrapy's current roadmap on its LinkExtractors and web-poet being decoupled away from Scrapy itself, is it worth supporting LinkExtractors?
or instead, should we update LinkExtractors itself so it would be compatible with web-poet?

Typing information is lost when @field decorator is used

class MyPage(ItemPage):
    @field
    def name(self) -> str:
        return "123"


page = MyPage()
page.name  # type is Any, not str

`skip_nonitem_fields=True` doesn't work when the page object is an attrs class

Currently, this works fine:

import attrs
from web_poet import HttpResponse, Returns, ItemPage, field

@attrs.define
class BigItem:
    x: int
    y: int

class BigPage(ItemPage[BigItem]):
    @field
    def x(self):
        return 1

    @field
    def y(self):
        return 2

@attrs.define
class SmallItem:
    x: int

class SmallXPage(BigPage, Returns[SmallItem], skip_nonitem_fields=True):
    pass

page = SmallXPage()
item = await page.to_item()

print(page._skip_nonitem_fields)  # True
print(item)  # SmallItem(x=1)

However, if we define an attrs class to have some page dependencies, it doesn't work:

from web_poet import PageParams

@attrs.define
class SmallPage(BigPage, Returns[SmallItem], skip_nonitem_fields=True):
    params: PageParams

page = SmallPage(params=PageParams())
print(page._skip_nonitem_fields)  # False
item = await page.to_item()  # TypeError: __init__() got an unexpected keyword argument 'y'

From the examples above, this stems from page._skip_nonitem_fields being set to False when the page object is defined as an attrs class.

"Apply rules" instead of overrides

hey! I’m exploring how to make the following work:

from zyte_common_items import Product
from web_poet import ItemPage, handle_urls


@handle_urls("example.com")
class ExamplePage(ItemPage[Product]):
    # ...

In other words, how to make @handle_urls("example.com") work for an ItemPage[Product] subclass without a need to use instead_of in handle_urls, and without a need to use a base page object for instead_of.

I can see 2 main approaches here.

Approach 1: support is directly

In the example, handle_urls doesn't really define any override rule. Instead, we have a declaration that ExamplePage can return Product item for example.com page. This information should be enough to allow creation of a scrapy-poet provider for items:

def parse(self, response: DummyResponse, product: Product):
    # ...

We know the website, we know which item is needed, and can use Page Object registry to find a right page object, according to domain and priority.

To implement it, web-poet needs to be slightly refactored:

We should rename “overrides”, “override rule” to something like “apply rules”. This includes changes to class names, module names, method names, and changes to documentation as well. In addition to the old ApplyRule(for_patterns=..., use=ExampleProductPage, instead_of=GenericProductPage), it’d be possible to specify ApplyRule(for_patterns=..., use=ExampleProductPage, to_return=Product)
handle_urls decorator would pick up to_return=Product from the ItemPage[Product] automatically (probably unless instead_of argument is defined? but that's not clear, there are use cases for supporting both at the same time).

When implementing it, we should make sure that priorities work as intended. For example, it should be possible to define and configure a Page Object which provides Product instances using AutoExtract API, and have it behaving in a sensible way (enabled for all domains, custom Page Objects should take priority over this default one).

Approach 2: define standard generic Page Object classes

In this case, "override rules" stay override rules. There is a registry of {item_type: generic_page_object_class} defined somewhere, e.g. {Product: ProductPage}. Or, maybe, {Product: [ProductPage, AnotherGenericProductPage]}. There should be an API to extend and modify this registry. handle_urls looks up this registry, and picks instead_of argument based on it.

Pros/cons

Overall, I like an approach with items more, because it seems to provide a nicer API:

less indirection;
enables a use of items as dependencies in callbacks;
only requires standardization of item classes, not base page objects.

A risk here is that we may decide that standardized page objects are actually very important. For example, unlike items, they allow to extract only some of the fields. They may also provide some methods other than to_item which can be used in overrides.

A challenge with standard base classes is how much logic you put there. On one hand, there is a lot of useful logic which can be put in a base class, and which can save developer time. For example, default implementation for some of the attributes, or helper methods. But the more powerful your base class is, the more you need to assume about the implementation. So, it might be wise to have these "standard" page objects mostly as "markers", to ensure that a wide range of page objects is compatible with each other. But, if we do so, we still need to put the "powerful" logic somewhere.

If you have separate "marker" base class used for overrides, and feature-full base class used in a project, usage becomes less straightforward - you'd probably need to do something like this - the original challenge is not solved:

# your project:
@handle_urls("example.com", instead_of=ProductPage)
class ExamplePageObject(PowerfulProductPage):
    # ...

# generic spider:
def parse_product(self, response: DummyResponse, page: ProductPage):
    # ...

Alternative is

# your project:
@handle_urls("example.com")
class ExamplePageObject(PowerfulProductPage):
    # ...

# generic spider:
def parse_product(self, response: DummyResponse, page: PowerfulProductPage):
    # ...

But it defeats the purpose of defining a standard ProductPage which is not tied to a project-specific spider - generic spider above doesn't support pages which are just ProductPage subclasses, it needs its own, PowerfulProductPage subclasses. It also requires solving a case when a page objects which uses handle_urls is not a strict subclass of PowerfulProductPage.

With "items" approach it's not an issue - as soon as page objects return the same item class, they're considered compatible, there is no need for them to share exactly same base class, besides having ItemPage[ItemClass] or Returns[ItemClass] somewhere in hierarchy.

Not extracting all fields with Approach 1

The main cons I see with the Approach 1 is a case where not all item fields are required in a generic spider.

@handle_urls("example.com")
class ExamplePage(ItemPage[Product]):
    # ...


def parse(self, response: DummyResponse, product: Product):
    # only some product attributes are used, not all of them

There is no ProductPage anymore, and if we define it, it's not stated that ExamplePage can be used instead of ProductPage. There are some possible solutions to it, e.g. using typing.Annotated:

def parse(self, response: DummyResponse, product: Annotated[Product, pick_fields(["price", "name"])):
    # ...

There could be also a "reverse" registry, {ProductPage: Product}, which scrapy-poet can use to pick a right Page Object from the registry if ProductPage is requested, though this approach has some challenges.

Using custom methods

Users might want to have a Page Object with some custom methods, and use it in a crawler:

class ProductListPage(ItemPage[ProductList]):
    def links(self):
        # example utility method
        return [self.paginationNext.url] + [product.url for product in self.products]

def parse(self, response: DummyResponse, page: ProductListPage):
    yield from scrapy.follow_all(page.links())
    # ...

This is compatible with both Approach 1 and Approach 2, if ProductListPage uses ProductList as a dependency:

@attrs.define
class ProductListPage(ItemPage[ProductList]):
    item: ProductList
    def links(self):
        # example utility method
        return [self.item.paginationNext.url] + [product.url for product in self.item.products]

    def to_item(self):
        return self.item

In this case, one can define POs for ProductList using Approach 1, and they will be used automatically to create item dependency. It's also possible to override ProductListPage itself, if it's desired to have a custom links implementation. So, both pages below would be applied properly:

# for foo.com we state that "FooProductListPage returns productList" 
@handle_urls("foo.com")
class FooProductListPage(ItemPage[ProductList]):
    ...

# for bar.com we override ProductListPage itself
@handle_urls("bar.com", overrides=ProductListPage)
class FooProductListPage(ProductListPage):
    def links(self):
        # ...

Conclusion

Approach 1 is tempting because it seems to provide a better API - one can receive items in callbacks, and just use them; there is no need e.g. to use item = await ensure_deferred(page.to_item()) in every callback, there is no need to use async def parse. It also gives fully typing support, which can be hit or miss with page objects.

For me it's also much easier to understand - there are no "overrides" anymore, no semi-magical replacement of one class with another. We're just telling "this Page Object returns Product item on example.com website", and then, when Product item is needed on example.com website, we know how to get it.

Approach 2 looks more powerful, especially if web-poet fields are used heavily, and a spider actually needs to access only some of the fields, but not all of them.

Currently I'm +0.25 to implement, follow and recommend Approach 1, mostly because of it's simplicity, even if it doesn't support some features which Approach 2 gives. But I do see advantages of Approach 2 as well, so it's not a strong opinon.

Thoughts?

All per-field tests fail if one of the fields raise an exception

Currently tests work like this:

to_item is called
a test is generated and executed for each field

But if to_item raises an exception, then no fields are available; a test is still generated for each field, and they all fail, each with a huge traceback.

To make things worse, in case of an exception to_item output is not memoized, so to_item is called again for each field, making it all slow.

I think that calling to_item (as opposed to accessing individual fields) is still the way to go, because it gives the final output, and that's what we should be testing.

So, a proposal on how to solve it:

Add one more generted test, which cheks that to_item doesn't raise an exception. It'd be passing in most cases. If it fails, it'd contain a full traceback.
If this "no exceptions" test fails, per-field tests should be skipped. This way we won't show a huge traceback per each field.
It should be possible to still run a per-field skipped test explicitly. The use case: user works on a particular field, and an exception happens in this field. When this test is run, we sould show the full traceback.

We should aslo make sure that we don't call to_item multuple times if it fails, but it can be a separate ticket.

Installs "tests" and "tests_extra" top-level modules

Running pip install web-poet also installs "tests" and "tests_extra" top-level modules:

$ ls -ld .env/lib/python3.10/site-packages/tests*
drwxr-xr-x 5 wrar wrar 100 ноя 11 17:36 .env/lib/python3.10/site-packages/tests
drwxr-xr-x 4 wrar wrar 100 ноя 11 17:36 .env/lib/python3.10/site-packages/tests_extra

I don't think it's intentional?

Introduce alternative constructors to handle nested dependencies [migrate logic from scrapy-poet]

Background

Given the following PO structure below:

import attr

from web_poet.pages import Injectable
from web_poet.page_inputs import ResponseData


@attr.define
class HTMLFromResponse(Injectable):
    response: ResponseData


@attr.define
class WebPage(Injectable):
    response: ResponseData


@attr.define
class HTMLWebPage(WebPage):
    html: HTMLFromResponse

The following would not work since HTMLWebPage is now a subclass of WebPage and it effectively requires both response: ResponseData and html: HTMLFromResponse when using its constructor:

>>> response = ResponseData(url='https://example.com/', html='Example Content')
>>> page = HTMLWebPage(response)
TypeError: __init__() missing 1 required positional argument: 'html'

We'll need to provide both of the required constructor arguments:

>>> response = ResponseData(url='https://example.com/', html='Example Content')
>>> html = HTMLFromResponse(response)
>>> page = HTMLWebPage(response, html)

This is a bit tedious since underneath the code, the actual core dependency in the tree would only be ResponseData. If the PO we're instantiating has a deeply nested depedency structure, it would be hard to keep track of all the necessary constructor arguments.

However, when POs are used in a Scrapy Project which uses the InjectionMiddleware provided by https://github.com/scrapinghub/scrapy-poet, this doesn't become a problem since it takes care of handling all necessary dependencies for the PO (since it uses https://github.com/scrapinghub/andi underneath):

import scrapy
from poet_injection_in_scrapy.page_objects import HTMLWebPage


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_poet.InjectionMiddleware": 543,
	}
    }

    # scrapy-poet provides all the necessary dependencies needed by HTMLWebPage
    def parse(self, response, page: HTMLWebPage):
    	pass

Problem

@gatufo raise a good point about using POs outside the context of a Scrapy project, but ultimately withholds access to the dependency resolution conveniently provided by https://github.com/scrapinghub/scrapy-poet.

Nonetheless, this also expands the use cases supported by POs beyond the spider, like using it in a script, deploying it behind an API, etc.

Proposal

This issue aims to discuss and explore the possibilities of moving the necesary injection logic already implemented in scrapy-poet (reference module) into web-poet itself.

The said migrated logic could then be accessed via the alternative constructor named from_response() (see example below).

>>> response = ResponseData(url='https://example.com/', html='Example Content')
>>> page = HTMLWebPage.from_response(response)

from_response() could be renamed to something else, but this closely follows Scrapy's conventions in its alternative constructors like from_crawler(), from_settings(), etc.

The @property methods of injected dependencies doesn't work

While exploring @victor-torres's suggestion in #10 , I've noticed that the injected dependencies methods decorated with a @property method doesn't work.

So, given the reproducible code snippet below, it's met with:

# stacktrace lines truncated

 File "/Users/path/omitted/site-packages/scrapy/utils/response.py", line 21, in get_base_url
    text = response.text[0:4096]
TypeError: 'property' object is not subscriptable

from scrapy import Spider
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from web_poet.pages import ItemWebPage

class QuotesListingPage(ItemWebPage):
    scrapy_response = Response

    def to_item(self):
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': LinkExtractor(
                restrict_css='.author + a').extract_links(self.scrapy_response),
        }

class QuotesBaseSpider(Spider):
    name = 'quotes'
    allowed_domains = ['http://quotes.toscrape.com/']
    start_urls = ['http://quotes.toscrape.com//']

    def parse(self, response, page: QuotesListingPage):
        return page.to_item()

I think this bug blocks us a bit since I'm currently creating a Provider for ResponseMeta since our Page Objects needs some information stored there.

@field removes doc

Related to #87.

web_poet.fields doesn't work properly with subclasses

The implementation uses a class attribute to store information about fields. But when an object is subclassed, the attribute is already present, so it's reused. It means that any field defined in a subclass also changes the fields in the base class.

Import some of the functionalities from autoextract_poet.item

It appears that some users are using autoextract_poet without the intention of using https://docs.zyte.com/automatic-extraction.html. Interestingly, the only use case was to utilize these functionalities into their Page Object projects:

These functionalites offer the convenience of automaticallly converting the data format of the items for nested Items.

We might want to consider transferring such functionality into web-poe itself.

create a Class Factory to customize arguments for injected dependencies

This issue aims to discuss customizable arguments to injected dependencies as proposed by @gatufo.

This is in-line with the idea proposed in #18.

The idea is to create a Class Factory that allows us to have a finer control for creating the injected dependencies. Here's an example:

@attr.define
class JsonPage(WebPage):
   semantic_data: ClassFactory(JsonParser, library='json')

@attr.define
class AnotherJsonPage(WebPage):
   semantic_data: ClassFactory(JsonParser, library='demjson')

In the example above, the ClassFactory returns a class which points to the appropriate implementation of the JSON Parser. A class should be returned to comply with the type annotation format which is later used for type inference on andi's build plan.

"build status" badge in README is broken

It seems it links to a non-existing workflow. We should probably add Windows and Linux badges instead.

BOM should take precedence over Content-Type header when detecting the encoding

As explained in scrapy/w3lib#189 and scrapy/scrapy#5601, BOM should take a precedence over Content-Type headers when detecting an encoding.

Currently web-poet.HttpResponse prefers Content-Type header:

import codecs
import web_poet

body = codecs.BOM + "Привет".encode('utf8')
headers = {"Content-Type": "text/html; charset=cp1251"}
resp = web_poet.HttpResponse(url="http://example.com", headers=headers, body=body, status=200)

print(resp.encoding) # cp1251, expected utf-8
print(resp.text) # яюРџСЂРёРІРµС‚ expected 'Привет'

A way to control the name of the folder at which tests are created

There's a way to specify where to create the tests https://scrapy-poet.readthedocs.io/en/stable/testing.html#configuring-the-test-location, but actual paths may look like this: some_project/tests/fixtures/some_project.page_objects.homedepot.com.products.HomedepotComProductPage/test-1 which is way too long and it would be nice to shorten it down to just some_project/tests/fixtures/homedepot.com.products.HomedepotComProductPage/test-1 or even to some_project/tests/fixtures/HomedepotComProductPage/test-1.
So it'd be nice to have a way to control the name of the folder where tests are created at.

	def _get_http_response(url: str) -> HttpResponse:
	response = requests.get(url)
	return HttpResponse(
	response.url,
	body=response.content,
	headers=response.headers,
	)

scrapinghub / web-poet Goto Github PK

web-poet's Introduction

Scrapinghub command line client

Requirements

Installation

Documentation

web-poet's People

Contributors

Stargazers

Watchers

Forkers

web-poet's Issues

API

via page object instance

via item class

Page object setup

A. The page object has all fields using the @field decorator

B. The page object doesn't use the @field decorator but solely utilizes the .to_item() method

C. The page object has some fields using the @field decorator while some fields are populated inside the .to_item() method

Handling field presence

I. Some fields specified in include_fields=[...] are not existing

II. Some fields specified in exclude_fields=[..] are not existing

III. No fields were given in include_fields=[...]

IV. No fields were given in exclude_fields=[...]

Item setup

1. The item has all fields marked as Optional

2. The item has fields marked as required

Summary

Problem

Status Quo

Objective

Possible Approaches

Approach 1

Approach 2

Approach 3

Background

Problem

Idea

Other Notes:

Approach 1: support is directly

Approach 2: define standard generic Page Object classes

Pros/cons

Not extracting all fields with Approach 1

Using custom methods

Conclusion

Background

Problem

Proposal

Recommend Projects

Recommend Topics

Recommend Org

A. The page object has all fields using the `@field` decorator

B. The page object doesn't use the `@field` decorator but solely utilizes the `.to_item()` method

C. The page object has some fields using the `@field` decorator while some fields are populated inside the `.to_item()` method

I. Some fields specified in `include_fields=[...]` are not existing

II. Some fields specified in `exclude_fields=[..]` are not existing

III. No fields were given in `include_fields=[...]`

IV. No fields were given in `exclude_fields=[...]`

1. The item has all fields marked as `Optional`