Comments (15)

xanrag commented on May 22, 2024

Hi @xanrag, are you still getting the "Aborted (core dumped)" error?

I'm not sure. When I run Scrapy in Celery as a separate process, it doesn't log to the file when it crashes. Something is still going on, though: occasionally it stalls and keeps putting out the same page/item count indefinitely, and I have another issue where it doesn't kill the Chrome process correctly, but I'll investigate more and open another issue for that if I find anything. (A week of use spawned a quarter of a million zombie processes...)

phongtnit commented on May 22, 2024

Hi @xanrag, are you still getting the "Aborted (core dumped)" error?

I added export NODE_OPTIONS=--max-old-space-size=8192 to my ~/.profile file and ran the Scrapy script. However, the Aborted (core dumped) error still occurs when scrapy-playwright has crawled more than 10k URLs, sometimes around 100k URLs.

elacuesta commented on May 22, 2024

If I have to close it manually, can you point me to the documentation about this?

https://github.com/scrapy-plugins/scrapy-playwright#closing-a-context-during-a-crawl

elacuesta commented on May 22, 2024

There is no need to patch the handler code; closing a context can be done using the existing API. I understand it might seem a bit verbose, but I don't want to create a whole DSL around this to handle context/page creation/deletion.
The new error is because you're trying to download pages with an already closed context, which makes sense if you're closing the context immediately after downloading each page. It's hard to say without knowing exactly what self.get_domain returns (I suppose something involving urllib.parse.urlparse(url).netloc, but I'm just guessing), but I suspect you might have some URLs in your list that correspond to the same domain(s). I think you could probably get good performance by grouping URLs in batches (say, 1K per context) and closing each context after that, but that might be too complex; a quick solution to download one response per domain and have non-clashing contexts would be to pass a uuid.uuid4() object as the context name for each URL.
Given that the underlying Allocation failed - JavaScript heap out of memory seems to be an upstream issue, I don't see much else we can do on this side to prevent it.
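
For illustration, a minimal sketch of that last suggestion (a uuid-named context per request) could look like this; it assumes the playwright_include_page / playwright_page meta keys described in the scrapy-playwright README, and the start URL is just a placeholder:

import uuid

import scrapy


class OneContextPerUrlSpider(scrapy.Spider):
    name = "one_context_per_url"
    start_urls = ["https://example.com"]  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    # a unique context name per request, so contexts never clash
                    "playwright_context": str(uuid.uuid4()),
                    # expose the page so its context can be closed afterwards
                    "playwright_include_page": True,
                },
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        # ... extract items here ...
        await page.close()
        # release the context once its only page is done
        await page.context.close()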

xanrag commented on May 22, 2024

Hmm, I got the same error after a few hours when scraping just a single domain. Could it be related to error #15, which pops up a fair bit?
Is there any way I can increase the memory heap?

Context '1': new page created, page count is 1 (1 for all contexts)
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
1: 0xa18150 node::Abort() [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
2: 0xa1855c node::OnFatalError(char const*, char const*) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
3: 0xb9715e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
4: 0xb974d9 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
5: 0xd54755 [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
6: 0xd650a8 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
7: 0xd2bd9d v8::internal::Factory::NewFixedArrayWithFiller(v8::internal::RootIndex, int, v8::internal::Object, v8::internal::AllocationType) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
8: 0xd2be90 v8::internal::Handle<v8::internal::FixedArray> v8::internal::Factory::NewFixedArrayWithMap<v8::internal::FixedArray>(v8::internal::RootIndex, int, v8::internal::AllocationType) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
9: 0xf5abd0 v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::Allocate(v8::internal::Isolate*, int, v8::internal::AllocationType) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
10: 0xf5ac81 v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::Rehash(v8::internal::Isolate*, v8::internal::Handle<v8::internal::OrderedHashMap>, int) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
11: 0xf5b2cb v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::EnsureGrowable(v8::internal::Isolate*, v8::internal::Handle<v8::internal::OrderedHashMap>) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
12: 0x1051b38 v8::internal::Runtime_MapGrow(int, unsigned long*, v8::internal::Isolate*) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
13: 0x140a8f9 [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
Aborted
Crawled 4171 pages (at 8 pages/min), scraped 114799 items (at 186 items/min)

elacuesta commented on May 22, 2024

Are you using a single context for this domain? If so, you're falling into microsoft/playwright#6319.

This seems like an issue on the Node.js side of things. I'm no JS developer, so take the following with a grain of salt, but from what I've found you should be able to increase the memory limit by setting NODE_OPTIONS=--max-old-space-size=<size> as an environment variable.
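
For example, one (untested) way to do that from the Scrapy side, assuming the variable only needs to be set before Playwright's Node driver process starts, would be to export it at the top of settings.py:

import os

# --max-old-space-size is in megabytes; 8192 is just an example value
os.environ.setdefault("NODE_OPTIONS", "--max-old-space-size=8192")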

Sources and further reading:

xanrag commented on May 22, 2024

Thank you, setting the NODE_OPTIONS variable seems to have solved the memory issue, and it can run for 24h+ without crashing in a single context.

phongtnit commented on May 22, 2024

Thank you, setting the NODE_OPTIONS variable seems to have solved the memory issue, and it can run for 24h+ without crashing in a single context.

Hi @xanrag, how did you fix the JavaScript heap out of memory error? Which options did you set up?

xanrag commented on May 22, 2024

Hi @xanrag, how did you fix the JavaScript heap out of memory error? Which options did you set up?

Just the memory setting. I added this to my docker-compose file and it seems to work:
environment:
- NODE_OPTIONS=--max-old-space-size=8192

phongtnit commented on May 22, 2024

Hi @xanrag, how did you fix the JavaScript heap out of memory error? Which options did you set up?

Just the memory setting. I added this to my docker-compose file and it seems to work:
environment:

  • NODE_OPTIONS=--max-old-space-size=8192

Thanks @xanrag, I will try testing my script with the new env setting.

hi-time commented on May 22, 2024

@phongtnit
I ran into the issue too, so I created a context per page and closed the page and context together, like you did.
But now I'm facing a different issue: a Chrome process fork error after more than 7000 pages. I'm looking into it now.

Hi @xanrag, are you still getting the "Aborted (core dumped)" error?

I added export NODE_OPTIONS=--max-old-space-size=8192 to my ~/.profile file and ran the Scrapy script. However, the Aborted (core dumped) error still occurs when scrapy-playwright has crawled more than 10k URLs, sometimes around 100k URLs.

elacuesta commented on May 22, 2024

9fe18b5

Stijn-B commented on May 22, 2024

@elacuesta Hey, I'm having this problem where my computer starts freezing after 1-2 hours of running my crawler. I'm pretty sure it's due to the Playwright issue you linked (microsoft/playwright#6319), where it keeps taking up more and more memory. It seems like a workaround is to recreate the page every x minutes, but I'm not sure how to do this.

I'm already doing all playwright requests with playwright_context="new" and that doesn't fix it.

I'm new to this; can you give me some pointers on how to create a new page or context (?) every x minutes? I'm currently unable to figure this out from the documentation on my own.

I've added my spider in case you're interested.

spider
import logging
from typing import Optional

import bs4
import scrapy
from scrapy_playwright.page import PageMethod

from jobscraper import storage
from jobscraper.items import CybercodersJob

class CybercodersSpider(scrapy.Spider):
    name = 'cybercoders'
    allowed_domains = ['cybercoders.com']

    loading_delay = 2500

    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0'}
    request_meta = dict(
        playwright=True,
        playwright_context="new",
        # You can define page actions (https://playwright.dev/python/docs/api/class-page)
        playwright_page_methods=[
            PageMethod("wait_for_timeout", loading_delay)
            # TODO instead of waiting, wait for the page to load (look for a specific element)
        ]
    )

    def get_search_url(self, page: Optional[int] = 1) -> str:
        page_string = f"page={page}&" if page else ""
        return f"https://www.cybercoders.com/jobs/?{page_string}&worklocationtypeid=3"

    def start_requests(self):
        yield scrapy.http.Request(
            self.get_search_url(),
            headers=self.headers,
            cb_kwargs={'page': 1},
            meta=self.request_meta,
            callback=self.parse
        )

    def parse(self, response, **kwargs):
        """
        Parses the job-search page
        """

        # get all job_links
        job_links = response.css('div.job-title a::attr(href)').getall()

        # If there are no job links on the page, the page is empty so we can stop
        if not job_links:
            return

        # Go to the next search page
        yield scrapy.http.Request(
            self.get_search_url(kwargs['page'] + 1),
            headers=self.headers,
            cb_kwargs={'page': kwargs['page'] + 1},
            meta=self.request_meta,
            callback=self.parse
        )

        # Go to each job page
        for link in job_links:
            job_id = link.split('/')[-1]
            if job_id and storage.has_job_been_scraped(CybercodersJob, job_id):
                continue
            yield response.follow("https://www.cybercoders.com" + link, callback=self.parse_job, headers=self.headers,
                                  meta=self.request_meta)

    def parse_job(self, response, **kwargs):
        """
        Parses a job page
        """

        try:
            soup = bs4.BeautifulSoup(response.body, 'html.parser')

            details = dict(
                id=response.url.split('/')[-1],
                url=response.url,
                description=soup.find('div', class_='job-details-content').find('div',
                                                                                class_='job-details') if soup.find(
                    'div', class_='job-details-content') else None,
                title=response.css('div.job-title h1::text').get() if response.css('div.job-title h1::text') else None,
                skills=response.css('div.skills span.skill-name::text').getall() if response.css(
                    'div.skills span.skill-name::text') else None,
                location=response.css('div.job-info-main div.location span::text').get() if response.css(
                    'div.job-info-main div.location span::text') else None,
                compensation=response.css('div.job-info-main div.wage span::text').get() if response.css(
                    'div.job-info-main div.wage span::text') else None,
                posted_date=response.css('div.job-info-main div.posted span::text').get() if response.css(
                    'div.job-info-main div.posted span::text') else None,
            )

            for key in ['title', 'description', 'url']:
                if details[key] is None:
                    logging.warning(f"Missing value for {key} in {response.url}")

            yield CybercodersJob(
                **details
            )

        except Exception as e:
            logging.error(f"Something went wrong parsing {response.url}: {e}")

elacuesta commented on May 22, 2024

Passing playwright_context="new" for all requests will not create a new context for each request; it will only make all requests go through a single context named "new".
I'd recommend generating randomly named contexts, maybe using random or uuid. That said, one context per request is probably too much; perhaps a good middle ground would be one context for each listing page and its derived links, i.e. use the same context for the response.follow calls but generate a new one for the requests that increment the listing page number.
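
Applied to the spider above, that middle ground could look roughly like the following sketch. It assumes the listing page's own context name is still available via response.meta["playwright_context"], and it omits the has_job_been_scraped filter from the original for brevity; everything else stays as in the original spider.

import uuid

import scrapy


class CybercodersSpider(scrapy.Spider):
    # name, allowed_domains, headers, request_meta, get_search_url,
    # start_requests and parse_job stay exactly as in the spider above

    def parse(self, response, **kwargs):
        job_links = response.css('div.job-title a::attr(href)').getall()
        if not job_links:
            return

        # a fresh, randomly named context for the next listing page
        next_meta = {**self.request_meta,
                     "playwright_context": str(uuid.uuid4())}
        yield scrapy.http.Request(
            self.get_search_url(kwargs['page'] + 1),
            headers=self.headers,
            cb_kwargs={'page': kwargs['page'] + 1},
            meta=next_meta,
            callback=self.parse,
        )

        # the derived job pages reuse the context that loaded this listing page
        job_meta = {**self.request_meta,
                    "playwright_context": response.meta.get(
                        "playwright_context", "default")}
        for link in job_links:
            yield response.follow(
                "https://www.cybercoders.com" + link,
                callback=self.parse_job,
                headers=self.headers,
                meta=job_meta,
            )

Contexts created this way still need to be closed at some point (see the "closing a context during a crawl" section linked earlier in the thread), otherwise they will keep accumulating in the browser.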

Stijn-B commented on May 22, 2024

@elacuesta Oh ok, good idea. Thanks!
@elacuesta Oh ok, good idea. Thanks!
After looking online, I'm not 100% sure whether I have to close a context manually or if just using a new playwright_context="new-name" is enough. If I have to close it manually, can you point me to the documentation about this?
