Comments (15)
@xanrag Hi, are you still getting the "Aborted (core dumped)" error?
I'm not sure. When I run Scrapy in Celery as a separate process, it doesn't log to the file when it crashes. Something is still going on, though: occasionally it stops making progress and keeps printing the same page/item count indefinitely without exiting. I also have another issue where it doesn't kill the Chrome process correctly, but I'll investigate more and open a separate issue for that if I find anything. (A week of use spawned a quarter of a million zombie processes...)
from scrapy-playwright.
> @xanrag Hi, did you get "Aborted (core dumped)" error anymore?

I added `export NODE_OPTIONS=--max-old-space-size=8192` to my `~/.profile` file and ran the Scrapy script. However, the `Aborted (core dumped)` error still occurs once Scrapy Playwright has crawled more than 10k URLs, sometimes around 100k URLs.
> If I have to close it manually, can you point me to the documentation about this?

https://github.com/scrapy-plugins/scrapy-playwright#closing-a-context-during-a-crawl
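Following the pattern in the linked README section, the idea is to ask scrapy-playwright to expose the Page object to the callback (via `playwright_include_page`) and then close the page and its context there. A minimal sketch (the context name `my_context` is just a placeholder):

```python
# Request meta asking scrapy-playwright to hand the Playwright Page
# object to the callback, so its context can be closed there.
def make_meta(context_name: str) -> dict:
    return {
        "playwright": True,
        "playwright_context": context_name,   # named context, closed later
        "playwright_include_page": True,      # expose the Page in response.meta
    }

# In the spider callback the page (and its whole context) is then
# closed explicitly, roughly as in the README:
#
#     async def parse(self, response):
#         page = response.meta["playwright_page"]
#         await page.close()
#         await page.context.close()  # closes the named context
#         ...

meta = make_meta("my_context")
```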
There is no need to patch the handler code; closing a context can be done using the existing API. I understand it might seem a bit verbose, but I don't want to create a whole DSL around context/page creation and deletion.

The new error occurs because you're trying to download pages with an already closed context, which makes sense if you're closing the context immediately after downloading each page. It's hard to say without knowing exactly what `self.get_domain` returns (I suppose something involving `urllib.parse.urlparse(url).netloc`, but I'm just guessing), but I suspect you might have some URLs in your list that correspond to the same domain(s). You could probably get good performance by grouping URLs in batches (say, 1k per context) and closing each context once its batch is done, but that might be too complex; a quick solution to download one response per domain with non-clashing contexts would be to pass a `uuid.uuid4()` value as the context name for each URL.

Given that the underlying `Allocation failed - JavaScript heap out of memory` seems to be an upstream issue, I don't see much else we can do on this side to prevent it.
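The batching idea above can be sketched as a small helper that assigns one randomly named context per batch of URLs; every URL in a batch shares a context, so the context can be closed once the whole batch has been downloaded. This is an illustrative sketch, not code from the project:

```python
import uuid
from itertools import islice


def batched_context_names(urls, batch_size=1000):
    """Yield (url, context_name) pairs, one fresh context name per batch.

    URLs in the same batch share a context; uuid4 names avoid clashes
    between batches (and between runs).
    """
    it = iter(urls)
    while batch := list(islice(it, batch_size)):
        name = str(uuid.uuid4())  # non-clashing context name for this batch
        for url in batch:
            yield url, name


# Usage: pass the name in the request meta, e.g.
#   meta={"playwright": True, "playwright_context": name}
pairs = list(batched_context_names(
    [f"https://example.com/{i}" for i in range(5)], batch_size=2))
```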
Hmm, I got the same error after a few hours when scraping just a single domain. Could it be related to error #15 which pops up a fair bit?
Any way I can increase the memory heap?
```
Context '1': new page created, page count is 1 (1 for all contexts)
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: 0xa18150 node::Abort() [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 2: 0xa1855c node::OnFatalError(char const*, char const*) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 3: 0xb9715e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 4: 0xb974d9 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 5: 0xd54755 [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 6: 0xd650a8 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 7: 0xd2bd9d v8::internal::Factory::NewFixedArrayWithFiller(v8::internal::RootIndex, int, v8::internal::Object, v8::internal::AllocationType) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 8: 0xd2be90 v8::internal::Handle<v8::internal::FixedArray> v8::internal::Factory::NewFixedArrayWithMap<v8::internal::FixedArray>(v8::internal::RootIndex, int, v8::internal::AllocationType) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
 9: 0xf5abd0 v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::Allocate(v8::internal::Isolate*, int, v8::internal::AllocationType) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
10: 0xf5ac81 v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::Rehash(v8::internal::Isolate*, v8::internal::Handle<v8::internal::OrderedHashMap>, int) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
11: 0xf5b2cb v8::internal::OrderedHashTable<v8::internal::OrderedHashMap, 2>::EnsureGrowable(v8::internal::Isolate*, v8::internal::Handle<v8::internal::OrderedHashMap>) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
12: 0x1051b38 v8::internal::Runtime_MapGrow(int, unsigned long*, v8::internal::Isolate*) [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
13: 0x140a8f9 [/home/garnax/project/lib/python3.9/site-packages/playwright/driver/node]
Aborted
Crawled 4171 pages (at 8 pages/min), scraped 114799 items (at 186 items/min)
```
Are you using a single context for this domain? If so, you're falling into microsoft/playwright#6319.
This seems like an issue on the Node.js side of things. I'm no JS developer, so take the following with a grain of salt, but from what I've found you should be able to increase the memory limit by setting `NODE_OPTIONS=--max-old-space-size=<size>` as an environment variable.
Sources and further reading:
- npm/npm#12238 (comment)
- https://medium.com/the-node-js-collection/node-options-has-landed-in-8-x-5fba57af703d
- https://nodejs.org/dist/latest-v8.x/docs/api/cli.html#cli_node_options_options
- https://nodejs.org/api/cli.html#cli_max_old_space_size_size_in_megabytes
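For instance, one way to apply this (the 8192 MB value is just an example; tune it to the host's available RAM) is to export the variable in the shell that launches the crawl:

```shell
# Raise the old-space heap limit of the Node.js process that runs
# the Playwright driver; the size is given in megabytes.
export NODE_OPTIONS=--max-old-space-size=8192
# then launch the crawl from the same shell, e.g.:
#   scrapy crawl myspider   (hypothetical spider name)
```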
Thank you, setting `NODE_OPTIONS` seems to have solved the memory issue and it can run for 24h+ without crashing in a single context.
> Thank you, setting the NODE_OPTIONS seems to have solved the memory issue and It can run for 24h+ without crashing in a single context.

Hi @xanrag, how did you fix the `JavaScript heap out of memory` error? Which options did you set?
> Hi @xanrag How did you do to fix the JavaScript heap out of memory error? Which options do you setup?

Just the memory setting; I added this to my docker-compose and it seems to work:

```yaml
environment:
  - NODE_OPTIONS=--max-old-space-size=8192
```
> > Hi @xanrag How did you do to fix the JavaScript heap out of memory error? Which options do you setup?
>
> Just the memory setting, I added this to my docker-compose and it seems to work:
>
> ```yaml
> environment:
>   - NODE_OPTIONS=--max-old-space-size=8192
> ```

Thanks @xanrag, I will try to test my script with the new env setting.
@phongtnit I ran into this issue too, so I created a context per page and closed the page and the context at the same time, like you did. However, I'm now facing a different issue: a Chrome process fork error after more than 7,000 pages. I'm still looking into it.
@elacuesta Hey, I'm having this problem where my computer starts freezing after 1-2 hours of running my crawler. I'm pretty sure it's due to the Playwright issue you linked (microsoft/playwright#6319), where it takes up more and more memory. It seems like a workaround is to recreate the page every x minutes, but I'm not sure how to do this.

I'm already making all Playwright requests with `playwright_context="new"` and that doesn't fix it.

I'm new to this; can you give me pointers on how I can create a new page or context (?) every x minutes? I'm currently unable to figure this out from the documentation on my own.
I've added my spider in case you're interested:
```python
import logging
from typing import Optional

import bs4
import scrapy
from scrapy_playwright.page import PageMethod

from jobscraper import storage
from jobscraper.items import CybercodersJob


class CybercodersSpider(scrapy.Spider):
    name = 'cybercoders'
    allowed_domains = ['cybercoders.com']
    loading_delay = 2500
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0'}
    request_meta = dict(
        playwright=True,
        playwright_context="new",
        # You can define page actions (https://playwright.dev/python/docs/api/class-page)
        playwright_page_methods=[
            PageMethod("wait_for_timeout", loading_delay)
            # TODO instead of waiting, wait for the page to load (look for a specific element)
        ]
    )

    def get_search_url(self, page: Optional[int] = 1) -> str:
        page_string = f"page={page}&" if page else ""
        return f"https://www.cybercoders.com/jobs/?{page_string}&worklocationtypeid=3"

    def start_requests(self):
        yield scrapy.http.Request(
            self.get_search_url(),
            headers=self.headers,
            cb_kwargs={'page': 1},
            meta=self.request_meta,
            callback=self.parse
        )

    def parse(self, response, **kwargs):
        """
        Parses the job-search page
        """
        # get all job_links
        job_links = response.css('div.job-title a::attr(href)').getall()
        # If there are no job links on the page, the page is empty so we can stop
        if not job_links:
            return
        # Go to the next search page
        yield scrapy.http.Request(
            self.get_search_url(kwargs['page'] + 1),
            headers=self.headers,
            cb_kwargs={'page': kwargs['page'] + 1},
            meta=self.request_meta,
            callback=self.parse
        )
        # Go to each job page
        for link in job_links:
            job_id = link.split('/')[-1]
            if job_id and storage.has_job_been_scraped(CybercodersJob, job_id):
                continue
            yield response.follow("https://www.cybercoders.com" + link, callback=self.parse_job,
                                  headers=self.headers, meta=self.request_meta)

    def parse_job(self, response, **kwargs):
        """
        Parses a job page
        """
        try:
            soup = bs4.BeautifulSoup(response.body, 'html.parser')
            details = dict(
                id=response.url.split('/')[-1],
                url=response.url,
                description=soup.find('div', class_='job-details-content').find(
                    'div', class_='job-details') if soup.find('div', class_='job-details-content') else None,
                title=response.css('div.job-title h1::text').get() if response.css('div.job-title h1::text') else None,
                skills=response.css('div.skills span.skill-name::text').getall() if response.css(
                    'div.skills span.skill-name::text') else None,
                location=response.css('div.job-info-main div.location span::text').get() if response.css(
                    'div.job-info-main div.location span::text') else None,
                compensation=response.css('div.job-info-main div.wage span::text').get() if response.css(
                    'div.job-info-main div.wage span::text') else None,
                posted_date=response.css('div.job-info-main div.posted span::text').get() if response.css(
                    'div.job-info-main div.posted span::text') else None,
            )
            for key in ['title', 'description', 'url']:
                if details[key] is None:
                    logging.warning(f"Missing value for {key} in {response.url}")
            yield CybercodersJob(**details)
        except Exception as e:
            logging.error(f"Something went wrong parsing {response.url}: {e}")
```
Passing `playwright_context="new"` for all requests will not make a new context for each request; it will only make all requests go through a single context named "new". I'd recommend generating randomly named contexts, maybe using `random` or `uuid`. That said, one context per request is probably too much; a good middle ground would be one context for each listing page and its derived links, i.e. use the same context for the `response.follow` calls but generate a new one for the requests that increment the listing page number.
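That suggestion could look roughly like the following sketch (illustrative only, not the asker's spider): generate a fresh context name when requesting each listing page, and reuse that same meta for the job links followed from it.

```python
import uuid


def listing_meta() -> dict:
    """Fresh, randomly named context for one listing page and its job links."""
    return {"playwright": True, "playwright_context": str(uuid.uuid4())}


# In the spider (sketch):
#   - request listing page N with meta=listing_meta()      -> new context
#   - response.follow(job_link, meta=response.meta)        -> same context
#   - request listing page N+1 with meta=listing_meta()    -> rotate context

meta_page_1 = listing_meta()
meta_page_2 = listing_meta()
```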
@elacuesta Oh ok, good idea. Thanks!

After looking online I'm not 100% sure whether I have to close a context manually or if just using a new `playwright_context="new-name"` is enough. If I have to close it manually, can you point me to the documentation about this?