
Comments (11)

Gallaecio avatar Gallaecio commented on June 9, 2024

I guess we could add a warning the first time the limit is reached, in case users want to consider increasing it (to improve performance if they can spare the memory). But we should also allow disabling that warning.
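A warn-once guard along those lines could look roughly like this (a pure-Python sketch, not Scrapy code; the class name and setting are hypothetical):

```python
import warnings


class BackoutWarner:
    """Warn once, the first time the scraper-slot cap is exceeded (sketch)."""

    def __init__(self, max_active_size, enabled=True):
        self.max_active_size = max_active_size   # e.g. SCRAPER_SLOT_MAX_ACTIVE_SIZE
        self.enabled = enabled                   # allow disabling the warning
        self._warned = False

    def check(self, active_size):
        # Fire a single warning the first time the limit is reached
        if self.enabled and not self._warned and active_size > self.max_active_size:
            self._warned = True
            warnings.warn(
                f"Scraper slot paused: active_size={active_size} exceeds "
                f"max_active_size={self.max_active_size}; consider raising "
                "the limit if you can spare the memory."
            )
```

In a real implementation this check would hook into the engine's backout path; here it only illustrates the warn-once-plus-opt-out behavior being proposed.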

from scrapy.

djay avatar djay commented on June 9, 2024

@Gallaecio I'd suggest the stats addon show an indication only when over max_active_size. Currently there is nothing to show that anything is even in the queue, so it wasn't obvious whether the problem was items being dropped via errors, the queue being paused, or something external slowing requests way down.


GeorgeA92 avatar GeorgeA92 commented on June 9, 2024

When you hit the SCRAPER_SLOT_MAX_ACTIVE_SIZE requests stop being processed silently with no warning.

Expected behavior: 
An error when you hit max_active_size. Stats addon to show number of active requests.

This is a core feature by design, not a bug. If the application accumulates a certain amount of unprocessed responses, it stops sending new requests (to prevent RAM overhead).
If the application really got stuck because of SCRAPER_SLOT_MAX_ACTIVE_SIZE, that most likely means it got stuck because of a CPU bottleneck (response processing rate slower than request sending rate), in which case this is most likely a duplicate of #5302.

If you are deferring items in a pipeline that depend on other requests finishing before completing then you can get into a deadlock with little explanation why.

It depends on how exactly the new requests are made inside the pipeline (if I correctly interpret the meaning of "deferring items in a pipeline that depend on other requests..").
For example, image requests from Scrapy's ImagePipeline don't affect scraper.slot.active_size.
The scraper slot is meant to track responses that will be processed by spider callback methods. Responses to other requests (made from a pipeline) are not expected to reach the scraper (or to increase its active_size value).
Also, requests from the ImagePipeline do not appear on the scheduler, which means DownloaderAwarePriorityQueue is not applied to them.

Can you please provide more detailed info, at least the dumped stats log output (the final log entry starting with [scrapy.statscollectors] INFO: Dumping Scrapy stats:)?

as this preferences less active sites so the bottleneck site causing...

Is this a broadcrawl case?


GeorgeA92 avatar GeorgeA92 commented on June 9, 2024

def _needs_backout(self) -> bool:
assert self.slot is not None # typing
assert self.scraper.slot is not None # typing
return (
not self.running
or bool(self.slot.closing)
or self.downloader.needs_backout()
or self.scraper.slot.needs_backout()
)

This code fragment defines the conditions under which the application sends (or does not send) a new request:

  1. not self.running or bool(self.slot.closing) - the closespider signal was triggered (not this case)
  2. self.downloader.needs_backout() - the downloader queue reached the size of CONCURRENT_REQUESTS
  3. self.scraper.slot.needs_backout() - the application accumulated unprocessed responses: the SCRAPER_SLOT_MAX_ACTIVE_SIZE check.
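The third condition can be illustrated with a stripped-down model (illustrative only; these are simplified names, not the actual Scrapy classes):

```python
# Simplified model of the scraper-slot backout check (not Scrapy internals;
# the real logic lives in scrapy.core.engine and scrapy.core.scraper).

class ScraperSlot:
    def __init__(self, max_active_size=5_000_000):
        self.max_active_size = max_active_size  # SCRAPER_SLOT_MAX_ACTIVE_SIZE
        self.active_size = 0                    # bytes of responses being processed

    def add_response(self, size):
        self.active_size += size

    def finish_response(self, size):
        self.active_size -= size

    def needs_backout(self):
        # True once unprocessed response bytes exceed the cap
        return self.active_size > self.max_active_size


slot = ScraperSlot(max_active_size=100)
slot.add_response(60)
print(slot.needs_backout())  # False: under the cap, requests keep flowing
slot.add_response(60)
print(slot.needs_backout())  # True: the engine silently stops sending new requests
slot.finish_response(60)
print(slot.needs_backout())  # False again once responses are processed
```

Note that nothing in this path logs anything when needs_backout() flips to True, which is exactly the silent pause being discussed.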

@djay
How did you conclude that the issue is caused by SCRAPER_SLOT_MAX_ACTIVE_SIZE (and not a busy downloader)?
If it is really a broad crawl, I'd bet on the downloader being bottlenecked by timeout exceptions (not scraper slot active size), as mentioned at https://docs.scrapy.org/en/latest/topics/broad-crawls.html?highlight=broad-crawls#reduce-download-timeout

But it will only become clear after at least a review of the [scrapy.statscollectors] INFO: Dumping Scrapy stats log entry I requested.


djay avatar djay commented on June 9, 2024

Can You please provide.. more detailed info at least dump stats log output

...
2024-01-17 00:16:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:17:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:18:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:19:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:20:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:21:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:22:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:23:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:24:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)

For example image requests from Scrapy ImagePipeline - doesn't affect scraper.slot.active_size

I will look into how that is done, because this is the issue I'm having. To collect more information to include in an item, I have a pipeline that generates a lot of requests that do increase active_size, and this creates the deadlock. Increasing SCRAPER_SLOT_MAX_ACTIVE_SIZE only holds off the problem for longer.

But the point of the ticket was more that it was very hard to work out what was going on, and a warning would have helped a lot.


GeorgeA92 avatar GeorgeA92 commented on June 9, 2024

2024-01-17 00:16:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:17:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:18:58..

As I mentioned earlier, it can still be caused by timeouts. A request that raised a timeout exception (no Response returned) doesn't count as crawled. During the period that looks idle in the provided log entries, the application may send hundreds or even thousands of requests (with no responses). That is why I asked for the [scrapy.statscollectors] INFO: Dumping Scrapy stats log entry, not log entries from [scrapy.extensions.logstats].
Alternatively, you can try the PeriodicLog extension, which logs the required stats periodically (and post its output here), documented at https://docs.scrapy.org/en/latest/topics/extensions.html#periodic-log-extension.
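For reference, enabling that extension is a settings change along these lines (a sketch based on the linked docs; check the exact setting names against your Scrapy version, as PeriodicLog is a relatively recent addition):

```python
# settings.py (sketch; see the periodic-log-extension docs linked above)
EXTENSIONS = {
    "scrapy.extensions.periodic_log.PeriodicLog": 0,
}
PERIODIC_LOG_STATS = True   # periodically dump stat values
PERIODIC_LOG_DELTA = True   # also log deltas since the previous dump
LOGSTATS_INTERVAL = 60.0    # seconds between dumps
```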

To collect more information to include in an item I have a pipeline that generates a lot of requests that do increase active_size and this creates the deadlock

We still don't know this for sure. Calling new requests from a pipeline is not very good practice. And, as I mentioned earlier, requests originating from a pipeline may not reach the scheduler or the scraper (and so do not directly affect its slot active size). It would be good to share the pipeline code here too.


djay avatar djay commented on June 9, 2024

We still don't know this for sure.

99% sure. If I increase SCRAPER_SLOT_MAX_ACTIVE_SIZE it solves the problem. I also used the debugger and AUTOTHROTTLE_DEBUG=True to show that no requests are happening.

Calling new requests from a pipeline is not very good practice.

That may be, but my main point is that there is no warning about this. It's not in the documentation. It's not in the log messages. And really, if you are stopping the queue from being processed because some limit is hit, I think it's reasonable to assume there should be a log message for it, right?

The documentation only says a pipeline can return a Deferred, and that seems like a reasonable way to combine information from two different crawls into a single item. I don't actually spawn a request in the pipeline; I do that in another middleware. That middleware then sets a special value in the item: a deferred that fires when the history crawl has completed, plus a callback that replaces the special values on the item with the real values from the history.
This seemed to be what returning a Deferred was designed for. What is unexpected is that doing so keeps the original request active, even though neither the request nor the response is involved in the pipeline. I would not have expected that, and that is partly why it was so confusing that using a Deferred in the pipeline caused a problem.

from itemadapter import ItemAdapter
from twisted.internet.defer import Deferred


class DeferHistoryPipeline:
    """ Add in history once the async history crawl is complete """
    def process_item(self, item, spider):
        i = ItemAdapter(item)
        d = None
        callbacks = []
        # Find placeholder field values of the form (Deferred, callback)
        for value in i.values():
            if isinstance(value, tuple) and isinstance(value[0], Deferred):
                d, callback = value
                callbacks.append(callback)
        if not callbacks:
            return item

        def do_callbacks(history):
            # Replace the placeholders with real values from the history crawl
            for needs_item in callbacks:
                needs_item(i, history)
            return item

        # Hold the item in the pipeline until the history Deferred fires
        return d.addBoth(do_callbacks)

Note I also tried to solve this by giving my history requests higher priority, but it doesn't seem to be respected, so that didn't work. If priority worked, the history requests would complete before new items are generated and deferred waiting on even more history requests, so active_size wouldn't grow as fast.
Alternatively, the requests could be forgotten by the pipeline step so they don't increase active_size.
Either of these would be nice. But really, all I'm asking for is a warning, so others don't waste as much time as I did trying to debug something which is apparently not catered for.
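The deadlock described above can be sketched abstractly (a pure-Python toy, not Scrapy internals; all names are made up): items held in the pipeline keep their bytes counted in active_size, and once active_size passes the cap the engine stops fetching, including the very sub-requests those items are waiting on.

```python
# Toy model of the deadlock (illustrative; not Scrapy's real scheduler).
# Each held item keeps `size` bytes counted against active_size until its
# dependency sub-request completes; the engine refuses new requests once
# active_size exceeds max_active_size.

MAX_ACTIVE_SIZE = 100

active_size = 0
held_items = []       # items parked in the pipeline, waiting on a sub-request
pending_subreqs = []  # sub-requests those items depend on


def receive_response(size):
    """A spider response arrives; its item is held waiting on one sub-request."""
    global active_size
    active_size += size
    held_items.append(size)
    pending_subreqs.append(size)


def engine_can_fetch():
    return active_size <= MAX_ACTIVE_SIZE


# Spider responses keep arriving faster than sub-requests are served:
for _ in range(3):
    receive_response(40)

# active_size is now 120 > 100, so the engine backs out -- but the only way
# to shrink active_size is to complete the sub-requests it now refuses to send.
print(engine_can_fetch(), active_size)  # -> False 120 (deadlocked)
```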


djay avatar djay commented on June 9, 2024

Something like this would be helpful.

2024-01-17 00:17:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min) - paused (213433 > SCRAPER_SLOT_MAX_ACTIVE_SIZE)
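Building that suffix would be a small change; a hypothetical helper (a sketch, not actual Scrapy logstats code) might look like:

```python
def logstats_suffix(active_size, max_active_size):
    """Hypothetical suffix for the logstats line when the scraper is paused."""
    if active_size > max_active_size:
        return f" - paused ({active_size} > SCRAPER_SLOT_MAX_ACTIVE_SIZE)"
    return ""


print(logstats_suffix(213433, 5_000_000))  # prints an empty string (not paused)
print(logstats_suffix(213433, 100_000))    # prints the "- paused" suffix
```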


djay avatar djay commented on June 9, 2024

Also requests from the ImagePipeline - do not appear on scheduler which means that DownloaderAwarePriorityQueue didn't applied to them.

It seems like the quoted comment isn't the case, according to the documentation below and the code - https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/media.py

It also seems like it is standard practice to use a Deferred and spawn requests the item depends on from the pipeline - see https://docs.scrapy.org/en/latest/topics/media-pipeline.html

When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains “locked” at that particular pipeline stage until the files have finish downloading (or fail for some reason).

What this isn't telling you is that if the files are from a single domain and the other URLs aren't, then priority won't really help: with any kind of scheduling that limits requests per domain, requests from other domains will skip ahead of the priority requests, create more file requests, and you can end up in a deadlock. And have no idea why.


djay avatar djay commented on June 9, 2024

Also requests from the ImagePipeline - do not appear on scheduler which means that DownloaderAwarePriorityQueue didn't applied to them.

ok. I think I see that self.crawler.engine.download instead of engine.crawl is the special sauce that is needed to prevent this

https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/media.py#L144

self.slot.remove_request(request)


djay avatar djay commented on June 9, 2024

crawler.engine.download doesn't seem to help. Even if it avoids adding to the active size, the active size still grows and eventually blocks all requests, thus preventing it from shrinking.

Why is a deferred item in a pipeline still considered part of the active requests? That is the problem here. All that's left in the pipeline is the item. The request and response aren't accessible; they are done and can't be reversed.

