
Comments (3)

elacuesta commented on May 22, 2024

It seems to grab the request URL from the XHR in the networks tabs also

Indeed, just like in a browser, all resources (images, scripts, stylesheets, etc.) are downloaded. This increases memory consumption and slows downloads, because your bandwidth is shared among all these additional resources. There's already #26 to work on some way to restrict the routes to be downloaded.
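As a rough sketch of what restricting downloads could look like: later scrapy-playwright releases document a `PLAYWRIGHT_ABORT_REQUEST` setting that takes a predicate receiving the Playwright request. Whether your installed version supports it is an assumption to check, and the function name and the set of blocked resource types below are only illustrative.

```python
# Resource types to skip. This particular set is an illustrative assumption;
# blocking "script" or "xhr" may break pages that render content with JavaScript.
BLOCKED_RESOURCE_TYPES = {"image", "stylesheet", "font", "media"}


def should_abort_request(request) -> bool:
    """Return True for Playwright requests that should not be downloaded.

    `request` is expected to expose a `resource_type` attribute,
    as Playwright's Request object does.
    """
    return request.resource_type in BLOCKED_RESOURCE_TYPES


# In settings.py (assumes a scrapy-playwright version that supports this setting):
PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```

With something like this in place, images and stylesheets would be dropped before they consume bandwidth, while the document itself and its scripts still load.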

Regarding the multiple Chromium processes, those are most likely one process per page (window). You are passing playwright_include_page=True for all requests, but you're not closing them afterwards. playwright_include_page is only necessary if you intend to do additional manipulation of the page in the callback, which you are not, so I'd recommend removing it to allow the download handler to close the pages after they're used.
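To illustrate the two options, here is a minimal sketch. The helper names `playwright_meta` and `parse_with_page` are hypothetical, not part of scrapy-playwright; only the `playwright`, `playwright_include_page` and `playwright_page` meta keys come from the library. The first helper builds request meta without the page by default; the second shows a callback that closes the page itself when the page was requested.

```python
def playwright_meta(include_page: bool = False) -> dict:
    """Build the meta dict for a request rendered through Playwright."""
    meta = {"playwright": True}
    if include_page:
        # Only ask for the page object when the callback really needs it;
        # the callback is then responsible for closing it.
        meta["playwright_include_page"] = True
    return meta


async def parse_with_page(response):
    """Callback for requests sent with playwright_include_page=True."""
    page = response.meta["playwright_page"]
    try:
        title = await page.title()
    finally:
        # Close the page explicitly so it does not stay open after the callback.
        await page.close()
    return {"url": response.url, "title": title}
```

In a spider this would be used as `scrapy.Request(url, meta=playwright_meta())` for plain rendering, and `meta=playwright_meta(include_page=True)` together with `callback=parse_with_page` only where the page is actually manipulated.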

from scrapy-playwright.

lime-n commented on May 22, 2024

I believe the memory issues are caused by all the garbage collection during the scraping process. I get a lot of responses for links that I haven't requested. I'm guessing this is how the scrapy-playwright implementation works? It seems to grab the request URL from the XHR in the networks tabs also, which consumes a lot of memory during the process.

On another note, is there a way to run this on a virtual machine? I only had 55 links, but I'm preparing to work with 1000, and I'm not sure my PC can handle the processing.

It seems to work if I create the requests for the URL pages beforehand, save them as a pandas DataFrame, and then read them into start_requests. However, I'm not sure why not all URLs work with the current implementation.


sick-pupil commented on May 22, 2024

I have the same issue! When I scrape several pages, some of them fail and are not scraped, whether I scrape 5, 10, or 50 pages. It seems very strange.
By the way, I use WSL to run my script.

