I have been trying to figure out how to scrape all the pages provided by the following

Not all links are scraped and slow scraping (scrapy-playwright issue, closed)

scrapy-plugins commented on May 22, 2024

Not all links are scraped and slow scraping

from scrapy-playwright.

Comments (3)

elacuesta commented on May 22, 2024 1

It seems to grab the request URL from the XHR in the Network tab as well

Indeed, just like in a browser, all resources (images, scripts, stylesheets, etc) are downloaded. This causes increased memory consumption, and slower download speeds because your bandwidth is shared for all these additional resources. There's already #26 to work on some way to restrict the routes to be downloaded.
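One way to restrict what gets downloaded, sketched here under the assumption that the installed scrapy-playwright version supports the `PLAYWRIGHT_ABORT_REQUEST` setting (a predicate that receives each Playwright request and returns True to abort it). The set of blocked resource types and the name `should_abort_request` are illustrative choices, not something taken from this thread:

```python
# Resource types that a text-only scrape usually does not need.
# This set is an assumption; trim it for sites whose content depends on them.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}


def should_abort_request(request):
    """Return True for browser requests that should not be downloaded.

    `request` is expected to expose `resource_type`, as Playwright's
    Request object does.
    """
    return request.resource_type in BLOCKED_RESOURCE_TYPES


# Fragment for settings.py or a spider's `custom_settings`.
custom_settings = {
    "PLAYWRIGHT_ABORT_REQUEST": should_abort_request,
}
```

Blocking stylesheets or scripts can change what a page renders, so it is worth checking a few responses by hand before applying this to a full crawl.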

Regarding the multiple Chromium processes, those are most likely one process per page (window). You are passing playwright_include_page=True for all requests, but you're not closing the pages afterwards. playwright_include_page is only necessary if you intend to do additional manipulation of the page in the callback, which you are not, so I'd recommend removing it to allow the download handler to close the pages after they're used.
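A minimal sketch of both options described above: requesting the page only when the callback needs it, and closing it explicitly when it is requested. `build_meta` and `parse_with_page` are hypothetical names; in a real spider `parse_with_page` would be an `async def` method used as the request callback, and the meta dict would be passed to `scrapy.Request`:

```python
def build_meta(needs_page=False):
    """Build the Request.meta dict for a scrapy-playwright request.

    Hypothetical helper: the page object is requested only when the
    callback will actually use it.
    """
    meta = {"playwright": True}
    if needs_page:
        meta["playwright_include_page"] = True
    return meta


async def parse_with_page(response):
    """Callback for requests sent with playwright_include_page=True."""
    page = response.meta["playwright_page"]
    try:
        title = await page.title()
    finally:
        # The download handler leaves the page open when it was handed
        # to the callback, so it has to be closed here.
        await page.close()
    return {"url": response.url, "title": title}
```

With `build_meta()` (no page requested), the download handler closes each page itself, which is the simpler choice when the callback only reads `response`.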


lime-n commented on May 22, 2024

I believe the memory issues are caused by all the garbage collection during the scraping process. I get a lot of responses for links that I never requested. I'm guessing this is how scrapy-playwright works? It seems to grab the request URL from the XHR in the Network tab as well, which consumes a lot of resources during the process.

On another note, is there a way to implement this on a virtual machine? I only had 55 links but I'm preparing to work with 1000, and I'm unsure my PC can handle the processing.

It seems to work if I create the requests for the page URLs beforehand, save them as a pandas dataframe and then read them into start_requests. However, I'm not sure why not all URLs work with the current implementation.
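A rough sketch of that workaround, using the standard-library csv module in place of pandas so it runs without extra dependencies. The column name `url` and the helper names are assumptions, not details from the original spider:

```python
import csv


def load_urls(csv_file, column="url"):
    """Yield non-empty, de-duplicated URLs from a CSV file-like object."""
    seen = set()
    for row in csv.DictReader(csv_file):
        url = (row.get(column) or "").strip()
        if url and url not in seen:
            seen.add(url)
            yield url


def request_kwargs(urls):
    """Yield keyword arguments for scrapy.Request, one dict per URL."""
    for url in urls:
        yield {"url": url, "meta": {"playwright": True}}
```

Inside a spider's start_requests, each dict could then be unpacked as `scrapy.Request(**kwargs, callback=self.parse)`.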


sick-pupil commented on May 22, 2024

I have the same issue! When I scrape many pages, some of them fail and are not scraped. It happens whether I scrape 5, 10, or 50 pages, which seems very strange.
By the way, I use WSL to run my script.

