Comments (11)
I guess we could add a warning the first time the limit is reached, in case users want to consider increasing it (to improve performance if they can spare the memory). But we should also allow disabling that warning.
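As a sketch of the warn-once idea (everything here, including the `ScraperSlotMonitor` name and the `warn_enabled` toggle, is hypothetical, not an existing Scrapy API):

```python
import logging

logger = logging.getLogger(__name__)


class ScraperSlotMonitor:
    """Warn the first time active_size crosses the limit.

    The warn_enabled flag stands in for a hypothetical setting that
    would let users disable the warning.
    """

    def __init__(self, max_active_size, warn_enabled=True):
        self.max_active_size = max_active_size
        self.warn_enabled = warn_enabled
        self._warned = False

    def check(self, active_size):
        """Return True if a warning was emitted on this call."""
        if (self.warn_enabled and not self._warned
                and active_size > self.max_active_size):
            self._warned = True
            logger.warning(
                "Scraper slot active_size (%d) exceeded "
                "SCRAPER_SLOT_MAX_ACTIVE_SIZE (%d); pausing new requests. "
                "Consider raising the limit if memory allows.",
                active_size, self.max_active_size)
            return True
        return False
```

The `_warned` flag makes it a one-shot warning, so a long pause doesn't flood the log.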
from scrapy.
@Gallaecio I'd suggest the stats addon show an indication only when over max_active_size. Currently nothing shows that there is even anything in the queue, so it wasn't obvious whether the problem was items being dropped via errors, the queue being paused, or something external slowing requests way down.
When you hit SCRAPER_SLOT_MAX_ACTIVE_SIZE, requests silently stop being processed, with no warning.
Expected behavior:
An error when you hit max_active_size, and the stats addon showing the number of active requests.
This is a core feature by design, not a bug. If the application accumulates a certain amount of unprocessed responses, it stops sending new requests (to prevent RAM overhead).
If the application really got stuck because of SCRAPER_SLOT_MAX_ACTIVE_SIZE, it most likely means it got stuck because of a CPU bottleneck (the response processing rate is slower than the request sending rate), and in this case it's most likely a duplicate of #5302.
If you are deferring items in a pipeline that depend on other requests finishing before they complete, you can get into a deadlock with little explanation why.
It depends on how exactly the new requests are made inside the pipeline (if I correctly interpret the meaning of "deferring items in a pipeline that depend on other requests..").
For example, image requests from Scrapy's ImagesPipeline don't affect scraper.slot.active_size.
The scraper slot is only expected to track responses aimed at the spider callback methods. Responses from other requests (made from a pipeline) are not expected to reach the scraper (and to increase its active_size value).
Also, requests from the ImagesPipeline do not go through the scheduler, which means DownloaderAwarePriorityQueue is not applied to them.
Can you please provide more detailed info, at least the stats dump log output (the ending log entry starting with [scrapy.statscollectors] INFO: Dumping Scrapy stats:)?
as this preferences less active sites so the bottleneck site causing...
Is this a broad crawl case?
Lines 197 to 205 in d5233bb
This code fragment defines the conditions under which the application sends (or does not send) a new request:
- not self.running or bool(self.slot.closing): the closespider signal was triggered (not this case)
- self.downloader.needs_backout(): the downloader queue reached the size of CONCURRENT_REQUESTS
- self.scraper.slot.needs_backout(): the application accumulated some unprocessed responses (the SCRAPER_SLOT_MAX_ACTIVE_SIZE check)
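Those conditions can be sketched as a pure function (argument names are simplified for illustration; the real checks live on the engine, downloader, and scraper objects):

```python
def needs_backout(running, slot_closing, downloader_active,
                  concurrent_requests, scraper_active_size, max_active_size):
    """Simplified mirror of the engine's decision to stop feeding new
    requests; argument names are illustrative, not the actual Scrapy API."""
    return (
        not running
        or slot_closing  # closespider signal triggered
        # downloader queue reached CONCURRENT_REQUESTS
        or downloader_active >= concurrent_requests
        # unprocessed responses exceed SCRAPER_SLOT_MAX_ACTIVE_SIZE
        or scraper_active_size > max_active_size
    )
```

Note that the last condition is the one that pauses a crawl silently: nothing distinguishes it from the downloader simply being busy unless you inspect the slot yourself.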
@djay
How did you conclude that the issue is because of SCRAPER_SLOT_MAX_ACTIVE_SIZE (and not a busy downloader)?
If it is really a broad crawl, I'd bet on the version with a downloader bottleneck from timeout exceptions (not scraper slot active size), mentioned at https://docs.scrapy.org/en/latest/topics/broad-crawls.html?highlight=broad-crawls#reduce-download-timeout
But it will only be clear after at least a review of the [scrapy.statscollectors] INFO: Dumping Scrapy stats log entry I requested.
Can You please provide.. more detailed info at least dump stats log output
...
2024-01-17 00:16:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:17:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:18:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:19:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:20:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:21:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:22:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:23:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:24:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
For example image requests from Scrapy ImagePipeline - doesn't affect scraper.slot.active_size
I will look into how that is done, because this is the issue I'm having. To collect more information to include in an item, I have a pipeline that generates a lot of requests that do increase active_size, and this creates the deadlock. Increasing SCRAPER_SLOT_MAX_ACTIVE_SIZE prevents the problem for longer.
But the point of the ticket was more that it was very hard to work out what was going on, and a warning would have helped a lot.
2024-01-17 00:16:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:17:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:18:58..
As I mentioned earlier, it still can be caused by timeouts. A request that returned a timeout exception (no Response returned) doesn't count as crawled. During the period that looks idle from the provided log entries, the application may send hundreds or even thousands of requests (with no responses). That is why I asked for the [scrapy.statscollectors] INFO: Dumping Scrapy stats
log entry, not log entries from [scrapy.extensions.logstats].
Alternatively, you can try the PeriodicLog
extension, which logs the required stats periodically (and post its output here), mentioned at https://docs.scrapy.org/en/latest/topics/extensions.html#periodic-log-extension.
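For reference, enabling it is just a settings change (assuming Scrapy >= 2.11, where the extension was added; the interval shown is only an example value):

```python
# settings.py: enable the PeriodicLog extension
EXTENSIONS = {
    "scrapy.extensions.periodic_log.PeriodicLog": 0,
}
PERIODIC_LOG_STATS = True   # periodically log current stat values
PERIODIC_LOG_DELTA = True   # also log deltas since the previous entry
LOGSTATS_INTERVAL = 60.0    # seconds between entries (example value)
```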
To collect more information to include in an item I have a pipeline that generates a lot of requests that do increase active_size and this creates the deadlock
We still don't know this for sure. Making new requests from a pipeline is not very good practice. And as I mentioned earlier, requests originating from a pipeline may not reach the scheduler or the scraper (they do not directly affect its slot active size). It would be good to share the pipeline code here too.
We still don't know this for sure.
99% sure. If I increase SCRAPER_SLOT_MAX_ACTIVE_SIZE, it solves the problem. I also used the debugger and AUTOTHROTTLE_DEBUG=True to show that no requests are happening.
Calling new requests from pipeline is not very good practice.
That may be, but my main point is that there is no warning about this. It's not in the documentation. It's not in the log messages. And really, if you are stopping the queue from getting processed by hitting some limit, I think it's reasonable to assume there should be some log message for that, right?
The documentation only says a pipeline can return a Deferred, and that seems like a reasonable way to combine information from two different crawls into a single item. I don't actually spawn a request in the pipeline; I do it in another middleware. That then sets a special value in the item, which includes a deferred for when the history crawl has been completed and a callback to set values on the item, replacing the special values with the real values from the history.
This seemed like what returning a Deferred was designed for. What is unexpected is that doing so keeps the original request still active, despite the request or response not being involved in the pipeline. I would not have expected that, and that is partly why it was so confusing that using a Deferred in the pipeline caused a problem.
from itemadapter import ItemAdapter
from twisted.internet.defer import Deferred

class DeferHistoryPipeline:
    """Add in history once the async history crawl is complete."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        d = None
        callbacks = []
        # Placeholder values are (Deferred, callback) tuples set by the
        # middleware that spawned the history request.
        for value in adapter.values():
            if isinstance(value, tuple) and isinstance(value[0], Deferred):
                d, callback = value
                callbacks.append(callback)
        if not callbacks:
            return item

        def do_callbacks(history):
            # Replace the placeholder values with the real history values.
            for needs_item in callbacks:
                needs_item(adapter, history)
            return item

        # Defer the item until the history crawl finishes.
        return d.addBoth(do_callbacks)
Note I also tried to solve this by setting my history requests to a higher priority, but it doesn't seem to be respected, so that didn't work. If priority worked, the history requests would be completed before new items are generated and deferred waiting for even more history requests, so the active_size wouldn't increase as fast.
Or the requests could be forgotten by the pipeline step, so they didn't increase the active_size.
Either of these would be nice. But really all I'm asking for is a warning, so others don't waste as much time as I did trying to debug something which is apparently not catered for.
Something like this would be helpful.
2024-01-17 00:17:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min) - paused (213433 > SCRAPER_SLOT_MAX_ACTIVE_SIZE)
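A sketch of how such a line could be assembled (the "paused" suffix is the proposed behavior, not an existing Scrapy feature, and the function name is hypothetical):

```python
def format_logstats_line(pages, items, pages_per_min, items_per_min,
                         active_size, max_active_size):
    """Build a LogStats-style message, appending a pause notice when the
    scraper slot is over its configured limit (proposed behavior)."""
    msg = (f"Crawled {pages} pages (at {pages_per_min} pages/min), "
           f"scraped {items} items (at {items_per_min} items/min)")
    if active_size > max_active_size:
        msg += f" - paused ({active_size} > SCRAPER_SLOT_MAX_ACTIVE_SIZE)"
    return msg
```

A real implementation would presumably live in the LogStats extension, which already has access to the crawler and could read the scraper slot's active_size each interval.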
Also requests from from ImagePipeline - do not appear on scheduler which means that DownloaderAwarePriorityQueue didn't applied to them.
It seems like the quoted comment isn't the case, according to the documentation below and the code: https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/media.py
It also seems like it is standard practice to use Deferred and spawn requests that depend on it from the pipeline - see https://docs.scrapy.org/en/latest/topics/media-pipeline.html
When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains “locked” at that particular pipeline stage until the files have finish downloading (or fail for some reason).
What this isn't telling you is that if the files are from a single domain and the other URLs aren't, then priority won't really help: if you have any kind of scheduling that limits requests per domain, requests from other domains will skip ahead of the priority requests, create more file requests, and you can end up in a deadlock. And have no idea why.
Also requests from from ImagePipeline - do not appear on scheduler which means that DownloaderAwarePriorityQueue didn't applied to them.
ok. I think I see that self.crawler.engine.download instead of engine.crawl is the special sauce that is needed to prevent this:
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/media.py#L144
Line 312 in d5233bb
crawler.engine.download doesn't seem to help. Even if it avoids adding to the active size, the active size still grows and eventually blocks all requests, thus preventing it from shrinking.
Why is a deferred item in a pipeline still considered part of the active requests? That is the problem here. All that's left in the pipeline is the item. The request and response aren't accessible. They are done. They can't be reversed.