thomasdondorf / puppeteer-cluster
Puppeteer Pool, run a cluster of instances in parallel
License: MIT License
Some websites have IP-blocking measures, so I have to change the IP for each instance after a while. Is there a way I can use puppeteer-cluster to do that without starting a new cluster?
Is it possible to use "puppeteer-core" instead of "puppeteer" for the sake of not having to specify the environment variable to exclude a chrome download? I have to manually remove the chrome package from my distribution.
Greenkeeper has created 9 branches so far and is not even able to update the package-lock file on its own.
It's been enabled for barely two weeks and I already have more work cleaning up after it than value it provided. I will remove it, but I might check out Dependabot, which claims to be able to update package-lock files. Maybe that one will work better. Otherwise I will just update the dependencies on my own.
Right now most of the logic of the library is in Cluster.ts. It should be split up.
Also need to fix the code smells (or at least some): https://codeclimate.com/github/thomasdondorf/puppeteer-cluster/issues
All windows close after the queue ends; I want to keep the window active.
Cool project -- would you consider adding it to awesome-puppeteer?
Hi,
thanks for this awesome library :)
Unfortunately, I do not seem to get it to work, as none of the importing / requiring mechanisms seem to work:
const { Cluster } = require('puppeteer-cluster'); -> Cluster = undefined
import { Cluster } from 'puppeteer-cluster'; -> Cluster = undefined
import Cluster from 'puppeteer-cluster'; -> Cluster = {}
I'm on Node v8.11.4
What am I doing wrong?
I have tried running puppeteer-cluster with the AWS Lambda optimised Chrome binaries from https://github.com/Kikobeats/aws-lambda-chrome, but I'm running into errors.
Is it possible to run the cluster on AWS Lambda?
So far the domain extraction just takes the hostname from Node.js, which includes subdomains.
It should use TLD.js to make it work with normal top-level domains and also with multi-part suffixes like *.co.uk.
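For illustration, a naive registrable-domain lookup with a tiny hard-coded suffix set; a real fix should use TLD.js or the full Public Suffix List as the issue suggests. The `getRegistrableDomain` name and the suffix set are assumptions made for this sketch only:

```javascript
// Illustration only: a naive registrable-domain lookup with a tiny,
// hard-coded suffix set. A real implementation should use tld.js or
// the full Public Suffix List instead.
const MULTI_PART_SUFFIXES = new Set(['co.uk', 'com.au', 'co.jp']);

function getRegistrableDomain(hostname) {
  const parts = hostname.split('.');
  // Check whether the last two labels form a known multi-part suffix
  const lastTwo = parts.slice(-2).join('.');
  const suffixLength = MULTI_PART_SUFFIXES.has(lastTwo) ? 2 : 1;
  // Keep the suffix plus one more label: that is the registrable domain
  return parts.slice(-(suffixLength + 1)).join('.');
}
```

With this, `foo.bar.co.uk` and `bar.co.uk` map to the same domain, so sameDomainDelay-style logic would treat them as one site.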
The devDependency ts-jest was updated from 23.1.4 to 23.10.0. This version is covered by your current version range and after updating it in your project the build failed.
ts-jest is a devDependency of this project. It might not break your production code or affect downstream projects, but probably breaks your build or test tools, which may prevent deploying or publishing.
Release notes (ts-jest, reloaded!): there is now a Slack community where you can find some instant help, plus support for new jest, typescript and babel versions.
The new version differs by 293 commits.
0e5ffed
chore(release): 23.10.0
3665609
Merge pull request #734 from huafu/appveyor-optimizations
45d44d1
Merge branch 'master' into appveyor-optimizations
76e2fe5
ci(appveyor): cache npm versions as well
191c464
ci(appveyor): try to improve appveyor's config
0f31b42
Merge pull request #733 from huafu/fix-test-snap
661853a
Merge branch 'master' into fix-test-snap
aa7458a
Merge pull request #731 from kulshekhar/dependabot/npm_and_yarn/tslint-plugin-prettier-2.0.0
70775f1
ci(lint): run lint scripts in series instead of parallel
a18e919
style(fix): exclude package.json from tslint rules
011b580
test(config): stop using snapshots for pkg versions
7e5a3a1
build(deps-dev): bump tslint-plugin-prettier from 1.3.0 to 2.0.0
fbe90a9
Merge pull request #730 from kulshekhar/dependabot/npm_and_yarn/@types/node-10.10.1
a88456e
build(deps-dev): bump @types/node from 10.9.4 to 10.10.1
54fd239
Merge pull request #729 from kulshekhar/dependabot/npm_and_yarn/prettier-1.14.3
There are 250 commits in total.
See the full diff
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
Is there any way to limit the number of tasks used per browser instance? I'm thinking of something along the lines (perhaps) of tasksPerInstance: 1000
, and then the cluster would track the number of tasks run in a specific browser instance and, whenever that limit is reached, kill that browser instance and launch another, as a (potential) shield against browser memory growth. It's a technique I've seen used in other process-pooling models (I think some of the Apache web server modules let you specify a maximum number of requests a worker process will serve before it is terminated and replaced with a fresh process).
@kanxue660 reported high CPU usage: #11 (comment)
I am interested in exploring using puppeteer cluster in a Jest test context.
I am not able to import or require it without getting an Unexpected identifier error on that line.
import Cluster from 'puppeteer-cluster';
// or
const { Cluster } = require('puppeteer-cluster');
Error:
static async launch(options) {
^^^^^^
SyntaxError: Unexpected identifier
Thanks...
I'm trying to run puppeteer in a cluster using this library; however, when I try the following I get no errors, but the plugin itself doesn't load. The same arguments work perfectly with puppeteer directly.
Anyone have an idea why this is happening?
cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_PAGE,
maxConcurrency: 2,
monitor: false,
puppeteerOptions: {
headless: true,
args: [
'--no-sandbox',
'--disable-gpu',
'--enable-usermedia-screen-capturing',
'--allow-http-screen-capture',
'--auto-select-desktop-capture-source=ppc',
'--load-extension=' + __dirname+'/chrome-plugin',
'--disable-extensions-except=' + __dirname+'/chrome-plugin',
'--disable-infobars',
'--window-size=1920,1080',
],
}
});
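A likely cause, stated as an assumption rather than a confirmed diagnosis: classic headless Chrome ignores --load-extension, which is why the same arguments can work with puppeteer directly (presumably headful) but not with headless: true here. Running the cluster headful is the usual workaround:

```javascript
// Sketch: extensions require a headful Chrome; classic headless mode
// silently ignores --load-extension and --disable-extensions-except.
puppeteerOptions: {
  headless: false, // assumption: headful is acceptable for this workload
  args: [
    '--load-extension=' + __dirname + '/chrome-plugin',
    '--disable-extensions-except=' + __dirname + '/chrome-plugin',
  ],
},
```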
Hello,
Is it possible to get the version of browser ?
I do This
const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_CONTEXT,
  maxConcurrency: 2,
  monitor: true,
  retryLimit: 0,
  timeout: 180000,
});
await cluster.queue('www.example.com', main);
// Display browser version
// console.log(cluster.browser.version()) ?
const main = async ({ page, data: url }) => {
await page.goto(url);
const results = await page.evaluate(async () => {
debugger;
let title = document.title;
return title;
}).then((data) => {
console.log(data);
});
};
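A possible approach, assuming your puppeteer version exposes page.browser() (which returns the Browser the page belongs to): read the version from inside the task. The sketch takes a page-like object so the idea is easy to test without launching Chrome; `reportVersion` is an illustrative name:

```javascript
// Sketch: read the browser version from inside a task via page.browser().
// Written against a page-like object, so it works with any puppeteer Page.
const reportVersion = async ({ page, data: url }) => {
  const version = await page.browser().version();
  console.log(`Rendering ${url} with ${version}`);
  return version;
};
```

Used as a task function, this would print something like `HeadlessChrome/70.0.3538.77` once per queued job.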
Related code:
const cluster = await Cluster.launch({
puppeteerOptions: {
headless: true,
ignoreHTTPSErrors: env.IGNORE_HTTPS || false,
args: ['--disable-http2'],
timeout: env.PUPPETEER_TIMEOUT || 60000, //attempt 2
},
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: parseInt(env.MAX_WORKER) || 4,
skipDuplicateUrls: false,
monitor: env.MONITOR === 'true' || false,
timeout: env.PUPPETEER_TIMEOUT || 60000, //attempt 1
});
I've tried setting timeout both in the cluster launch options and in puppeteerOptions; both failed. The log says the timeout is still 30000.
app:cluster:err TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded
app:cluster:err at Promise.then (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/FrameManager.js:1276:21)
app:cluster:err -- ASYNC --
app:cluster:err at Frame.<anonymous> (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/helper.js:144:27)
app:cluster:err at Page.goto (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/Page.js:624:49)
app:cluster:err at Page.<anonymous> (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/helper.js:145:23)
app:cluster:err at GenericHandler.processPage (/home/bambang/project/om-screenshoot/src/handlers/v1base.js:47:21)
app:cluster:err at GenericHandler.process (/home/bambang/project/om-screenshoot/src/handlers/v1base.js:94:16)
app:cluster:err at module.exports (/home/bambang/project/om-screenshoot/src/handlers/site.js:27:24)
app:cluster:err at Worker.<anonymous> (/home/bambang/project/om-screenshoot/node_modules/puppeteer-cluster/dist/Worker.js:56:54)
app:cluster:err at Generator.next (<anonymous>)
app:cluster:err at fulfilled (/home/bambang/project/om-screenshoot/node_modules/puppeteer-cluster/dist/Worker.js:4:58)
app:cluster:err at process.internalTickCallback (internal/process/next_tick.js:77:7) +789ms
Any guidance on how to trace or fix this issue?
Can I use this in micro / express / etc. and have an endpoint process a "screenshot" task and return a value when the task completes? Is this a thing?
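This can be done by bridging the queue and the HTTP response with a Promise. A sketch under stated assumptions: `cluster` is any object with a queue(data, task) method, and `makeScreenshotHandler` is an illustrative name, not an API of micro or express:

```javascript
// Sketch: wrap cluster.queue in a Promise so an HTTP handler can await the
// screenshot. The handler signature matches express-style (req, res).
function makeScreenshotHandler(cluster) {
  return async (req, res) => {
    const buffer = await new Promise((resolve, reject) => {
      cluster.queue(req.query.url, async ({ page, data: url }) => {
        try {
          await page.goto(url);
          resolve(await page.screenshot());
        } catch (err) {
          reject(err);
        }
      });
    });
    res.send(buffer);
  };
}
```

With express this would be registered as e.g. `app.get('/screenshot', makeScreenshotHandler(cluster))`.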
Hi guys,
I'm trying to use Express to wrap a little REST API on top of puppeteer, but I see the only way to add a new URL is via the cluster queue. My concern is that if I make parallel requests, I will receive the wrong answer, i.e. the content of another URL.
My question is: is it possible to run synchronous tasks?
Thanks, and sorry for my bad English.
Hello there.
I'm testing puppeteer-cluster in a project I'm working on and I have problem in headless mode.
Because I cannot send the original code, I tried to reproduce the problem using the simple Queuing functions example. When headless is false it works like a charm. When set to true, nothing happens.
Am I missing something?
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 3,
puppeteerOptions: {
headless: true
},
monitor: true
});
await cluster.queue(async ({ page }) => {
await page.goto('http://www.wikipedia.org');
await page.screenshot({path: 'wikipedia.png'});
});
await cluster.queue(async ({ page }) => {
await page.goto('https://www.google.com/');
const pageTitle = await page.evaluate(() => document.title);
console.log('google');
});
await cluster.queue(async ({ page }) => {
await page.goto('https://www.imdb.com/');
console.log('IMDB');
});
await cluster.idle();
await cluster.close();
})();
puppeteer v1.9.0,
puppeteer-cluster v0.11.2
I will appreciate your help. Thank you
This goes away from the traditional idea of "New browser per task" or "New page per task". This one is more about keeping a cluster of pages open the entire time and periodically refreshing them.
Why would I want to do this, you ask?
Let's say I have a page that has d3 charts and I want to turn all the charts into images (my actual product isn't d3 charts). If the charts update in real time and I want a screenshot every 5 minutes (assuming there are 100s of charts), opening a page / browser each time takes a while. If I just kept the tab open and kept screenshotting, then I'd have the screenshots a lot sooner.
Now for my more techy way: I'm exposing a function to the site I'm screenshotting, and that function retrieves arguments from puppeteer/chrome to render specific items on the page.
Pseudo-code:
// browser
if (typeof window.getRenderOpts === 'function') {
window.getRenderOpts().then((opts) => updateChart(opts));
}
// puppeteer
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
async function getPageAndLock(): Promise<Page> {
// ... gets a page that's idle, or waits till one becomes idle ...
}
async function pageIsReady(): Promise<Page> {
// ...
}
... async (req, res) => {
const page = await getPageAndLock();
await page.evaluate(`render(${JSON.stringify({ /* ... */ })})`);
const screenshot = await page.screenshot(/* ... */);
pageIsReady(page);
res.send(screenshot);
}
It's probably out of the scope of this library, but I'm not sure if anyone would be interested in this type of concurrency.
I did benchmarks of "New browser per task", "New page per task", and "Same page per task"; keeping the page open and taking screenshots periodically is A LOT FASTER. I can dig those benchmarks up if you want me to. This was when I was experimenting.
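The keep-pages-open idea boils down to a small resource pool. A minimal sketch, generic over the resource type (a puppeteer Page in practice); `ResourcePool` and its method names are illustrative, not library API:

```javascript
// Minimal sketch of the "keep pages open" idea: a pool that hands out an
// idle resource (e.g. a puppeteer Page) and queues callers until one frees up.
class ResourcePool {
  constructor(resources) {
    this.idle = [...resources];
    this.waiters = [];
  }

  // Resolves with a resource, waiting if none is currently idle
  acquire() {
    if (this.idle.length > 0) {
      return Promise.resolve(this.idle.pop());
    }
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  // Hands the resource to the next waiter, or marks it idle again
  release(resource) {
    const next = this.waiters.shift();
    if (next) {
      next(resource);
    } else {
      this.idle.push(resource);
    }
  }
}
```

This corresponds to the getPageAndLock / pageIsReady pair in the pseudo-code above: acquire locks a page, release makes it available again.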
Hi,
I am having a little trouble figuring out a way to return a variable from a queued function.
Given the sample function-queuing-complex.js example, I have tried using both return and resolve in extractTitle, since I read in the README that cluster.queue returns a Promise. Both resulted in undefined being returned. A Promise.all doesn't seem to work either. Is this a bug or am I doing something wrong?
const extractTitle = async ({ page, data: url }) => {
await page.goto(url);
const pageTitle = await page.evaluate(() => document.title);
// How do I return pageTitle to use outside this async function?
};
const task1 = await cluster.queue("https://reddit.com/", extractTitle);
const task2 = await cluster.queue("https://twitter.com/", extractTitle);
Promise.all([task1, task2]).then(result => console.log(result)); // returns undefined
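One workaround sketch, assuming only that cluster.queue accepts (data, taskFunction): wrap the call in a Promise that is resolved from inside the task function. `queueWithResult` is an illustrative helper name, not part of the library (later versions add cluster.execute for this exact use case):

```javascript
// Workaround sketch: cluster.queue resolves once the job is *queued*, not
// with the task's return value. Wrapping it in a Promise that resolves from
// inside the task gives the result back to the caller.
function queueWithResult(cluster, data, task) {
  return new Promise((resolve, reject) => {
    cluster.queue(data, async (args) => {
      try {
        resolve(await task(args));
      } catch (err) {
        reject(err);
        throw err; // let the cluster's own error handling see it too
      }
    });
  });
}
```

With this, `await queueWithResult(cluster, url, extractTitle)` yields the page title instead of undefined, and Promise.all over several such calls collects all results.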
const { Cluster } = require('../dist');
(async () => {
// Create a cluster with 2 workers
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_PAGE,
maxConcurrency: 2,
puppeteerOptions: {headless: false}
});
// Define a task (in this case: screenshot of page)
await cluster.task(async ({ page, data: url }) => {
await page.goto(url);
const path = url.replace(/[^a-zA-Z]/g, '_') + '.png';
await page.screenshot({ path });
console.log(`Screenshot of ${url} saved: ${path}`);
});
// Add some pages to queue
await cluster.queue('https://www.google.com');
await cluster.queue('https://www.wikipedia.org');
await cluster.queue('https://github.com/');
// Shutdown after everything is done
await cluster.idle();
await cluster.close();
})();
This only generated screenshots for Wikipedia and GitHub. The browser also hung for some time.
My use case is the following: create a cluster with Cluster.CONCURRENCY_BROWSER
and never close it.
const { connect } = require('amqplib');
const { Cluster } = require('puppeteer-cluster');
const { crawler, puppeteerOptions, redis } = require('./docroot');
const { Resource } = require('./docroot/Component');
(async ({ RABBITMQ_USER, RABBITMQ_PASS, RABBITMQ_HOST, RABBITMQ_PORT, RABBITMQ_QUEUE, RABBITMQ_THREADS, REDIS_LIST }) => {
const cluster = await Cluster.launch({
monitor: true,
concurrency: Cluster.CONCURRENCY_BROWSER,
maxConcurrency: Number(RABBITMQ_THREADS),
puppeteerOptions,
});
const channel = await (await connect(`amqp://${RABBITMQ_USER}:${RABBITMQ_PASS}@${RABBITMQ_HOST}:${RABBITMQ_PORT}`)).createChannel();
channel.assertQueue(RABBITMQ_QUEUE, {
durable: false,
});
await cluster.task(async ({ data, page }) => {
const { resource, message } = data;
const metadata = await crawler.crawl(resource, page);
await redis.rpush(REDIS_LIST, JSON.stringify(metadata));
channel.ack(message);
});
channel.consume(RABBITMQ_QUEUE, message => {
const content = JSON.parse(message.content.toString('utf8'));
const resource = new Resource(content.resource);
if (Array.isArray(content.links_to_check_for)) {
resource.setLinks(content.links_to_check_for);
}
cluster.queue({ resource, message });
});
})(process.env);
As you can see above, the cluster's queue gets filled once RabbitMQ sends something. This means the process is kind of a daemon and shouldn't be stopped. I'm wondering whether the pages the cluster creates should be closed (await page.close() after const metadata = await crawler.crawl(resource, page);) once they are not needed anymore, or is that done automatically?
I'm thinking about what kind of functionality this library should provide before it should be released as v1. I might edit the list in the future:
- Rethink sameDomainDelay and skipDuplicateUrls. Detection of domains should use TLD.js, for example.
- Documentation should be better, and there should be a way to provide the URL without using data or { url: ... }
- Make CONCURRENCY_BROWSER the default as it is more robust?
- Add a cluster.execute function which executes a job directly (for Cluster.queue, for example)
- Rename: concurrency should be concurrencyType
- Rename: maxConcurrency maybe maxWorkers?

A common use-case would be to have many different tests spread out over multiple files.
This seems to be exactly what I need to speed up my tests - but I don't understand how to utilise it to run tests in different files in parallell.
Ex;
One test suite in home/tests/e2e/LoginPage.test.js
Another test suite in loan/tests/e2e/OverviewPage.test.js
I understand I could use it within the same test suite - but what about running different test suites in parallel?
A failing expect call will not lead to an error if it gets caught. See jestjs/jest#3917 for discussion. This might currently lead to failing tests that are not reported, as the generous error handling catches them.
Three options:
- Rename taskerror to error, which will make sure that Node.js crashes in that case. Users will have to take care of the error handler then.
- Add an option like throwOnTaskerror so that task errors will not get caught.

It's a guessing game right now to figure out the correct options based on CPU / Memory.
It'd be nice to know the best options for puppeteer-cluster on various systems.
I'm trying to figure out the best deployment strategy for "real-time" processing. Many small instances vs fewer large instances, etc.
Ref: https://docs.browserless.io/blog/2018/06/04/puppeteer-best-practices.html
When I run puppeteer-cluster with 100 URLs, it only crawls 98 or 99 of them.
Here is my code:
const { Cluster } = require('puppeteer-cluster');

var link = [];
var total = 0;
var start = 3;
const size = process.argv[2];
for (let i = 0; i < size; i++) {
  link.push(process.argv[start++]);
}

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 50,
    timeout: 400000,
    monitor: false
  });
  await cluster.task(async ({ page, data: url }) => {
    const response = await page.goto(url, { timeout: 100000, waitUntil: 'networkidle2' });
    console.log(response.url());
    if (response.status() == 404) {
      console.log('program encountered error');
      return;
    }
    total++; // counts the number of urls
    const hrefs = await page.evaluate(() => {
      const anchors = document.querySelectorAll('a');
      return [].map.call(anchors, a => a.href);
    });
  });
  for (let i = 0; i < size; i++) { await cluster.queue(link[i]); }
  await cluster.idle();
  await cluster.close();
  console.log(total);
  process.exit(0);
})();
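One likely explanation is that a navigation failed for one or two URLs and the error was swallowed by the cluster's error handling. puppeteer-cluster emits a taskerror event for failed jobs; the sketch below (with `trackFailures` as an illustrative helper) makes the missing URLs visible, so successes plus failures add up to 100:

```javascript
// Sketch: record failed jobs so the final count accounts for every URL.
// cluster.on('taskerror', (err, data)) is the real puppeteer-cluster event;
// the failures array is just illustration.
function trackFailures(cluster) {
  const failures = [];
  cluster.on('taskerror', (err, data) => {
    failures.push({ url: data, message: err.message });
    console.log(`Error crawling ${data}: ${err.message}`);
  });
  return failures;
}
```

Calling `trackFailures(cluster)` right after launch, then logging the array after `cluster.idle()`, shows exactly which one or two URLs were dropped and why.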
At a loss on how to deal with this scenario.
https://stackoverflow.com/questions/46669788/how-to-handle-popups-in-puppeteer
puppeteer/puppeteer#2968 (comment)
You might need to listen for 'targetcreated' on the browser context to track the popup and do some actions on it.
There is a requirement: a database table contains a lot of URLs that need to be visited one by one. I want to read them in batches to prevent using too much memory, while the program keeps running and only reads from the database every once in a while, then executes. Can you add an idle event for this loop, so that the database is read only when the cluster is idle? I wonder if this approach is feasible.
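The idle-based batching described above can be sketched with cluster.idle(), which resolves once the queue is empty. `readBatch` is a hypothetical function standing in for the database read (returning an empty array when there is nothing left):

```javascript
// Sketch: read a slice of URLs, queue them, wait for the cluster to drain,
// then read the next slice. readBatch is a hypothetical database accessor.
async function processInBatches(cluster, readBatch) {
  let processed = 0;
  while (true) {
    const batch = await readBatch();
    if (batch.length === 0) break;
    batch.forEach((url) => cluster.queue(url));
    await cluster.idle(); // only go back to the database once the queue is empty
    processed += batch.length;
  }
  return processed;
}
```

This keeps at most one batch of URLs in memory at a time, which matches the "read the database only when idle" requirement.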
| Branch | Build failing 🚨 |
|---|---|
| Dependency | debug |
| Current Version | 3.1.0 |
| Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
debug is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
A long-awaited release to debug is available now: 3.2.0.
- chrome.storage support (or make the storage backend pluggable): 71d2aa7
- supports-color@5: 285dfe1
- enable() (#517): ab5083f

Huge thanks to @DanielRuf, @EirikBirkeland, @KyleStay, @Qix-, @abenhamdine, @alexey-pelykh, @DiegoRBaquero, @febbraro, @kwolfy, and @TooTallNate for their help!
The new version differs by 25 commits.
dec4b15
3.2.0
3ca2331
clean up builds
9f4f8f5
remove needless command aliases in makefile
623c08e
no longer checking for BROWSER=1
57cde56
fix tests
62822f1
clean up makefile
833b6f8
fix tests
ba8a424
move to XO (closes #397)
2d2509e
add .editorconfig
853853f
bump vulnerable packages
7e1d5d9
add yarn-error.log to .gitignore
e43e5fe
add instance extends feature (#524)
207a6a2
Fix nwjs support (#569)
05b0ceb
add Node.js 10, remove Node.js 4 (#583)
02b9ea9
Add TVMLKit support (#579)
There are 25 commits in total.
See the full diff
The code is as follows
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 2,
sameDomainDelay:20*1000 //Will it be delayed for 20 seconds?
});
await cluster.task(async ({ page, data: url }) => {
await page.goto(url);
const screen = await page.screenshot();
// Store screenshot, do something else
});
await cluster.queue('http://www.google.com/a.html');
await cluster.queue('http://www.google.com/b.html');
await cluster.queue('http://www.google.com/c.html');
// many more pages
await cluster.idle();
await cluster.close();
})();
My question is: if a.html opens first, will b.html and c.html then each be delayed by 20 seconds before opening?
I don't understand how sameDomainDelay is used.
This software seems really interesting and useful.
Do you have any plans on changing your open source license from the GNU General Public License 3.0 to something else, such as Apache License 2.0, BSD or MIT?
I'm asking since many individuals and organisations cannot use GPL-licensed software. Thanks.
Is there any way for it to listen to some sort of queue for work?
I need pagination (page turning); how can I do it?
This error is happening. Puppeteer version 1.9 seems to be the reason.
Edit: I just wrote crappy code (forgot an await
). The fix is on the way.
Something like:
cluster.on('monitor', (data) => {
console.log(data);
});
Hello,
I have a few questions; I'm not sure if I should have created multiple issues.
Question: using the code below (example code), when I am debugging, the browser window closes suddenly, not letting me finish stepping through my code. Am I missing a config option?
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 2,
monitor: true,
retryLimit: 0,
puppeteerOptions: {
headless: false,
devtools: true,
defaultViewport: {
width: 1920,
height: 1080
}
}
});
await cluster.queue('www.example.com', main);
const main = async ({ page, data: url }) => {
await page.goto(url);
const results = await page.evaluate(async () => {
debugger;
let title = document.title;
return title;
}).then((data) => {
console.log(data);
});
};
thanks
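A likely cause, stated as an assumption rather than a confirmed diagnosis: puppeteer-cluster aborts a task after its default 30-second timeout, which tears down the page mid-debugging. Raising the cluster-level timeout option (sketch below, with an arbitrarily large value) should keep the window open while stepping through code:

```javascript
// Sketch: raise the cluster's task timeout while debugging, so the worker
// is not torn down before you finish stepping through the task.
const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_CONTEXT,
  maxConcurrency: 2,
  monitor: true,
  retryLimit: 0,
  // assumption: the default 30s task timeout is what closes the window
  timeout: 24 * 60 * 60 * 1000, // effectively unlimited for a debug session
  puppeteerOptions: {
    headless: false,
    devtools: true,
  },
});
```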
I'm trying out puppeteer-cluster with the minimal.js example. I'm getting the following error:
D:\Developpement\NodeJS\minimal>node minimal.js
internal/modules/cjs/loader.js:583
throw err;
^
Error: Cannot find module '../dist'
at Function.Module._resolveFilename (internal/modules/cjs/loader.js:581:15)
at Function.Module._load (internal/modules/cjs/loader.js:507:25)
at Module.require (internal/modules/cjs/loader.js:637:17)
at require (internal/modules/cjs/helpers.js:22:18)
at Object. (D:\Developpement\NodeJS\minimal\minimal.js:1:83)
at Module._compile (internal/modules/cjs/loader.js:689:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
at Module.load (internal/modules/cjs/loader.js:599:32)
at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
at Function.Module._load (internal/modules/cjs/loader.js:530:3)
With my configuration the directory ../dist does not exist.
I have
24/01/2019 15:10 .
24/01/2019 15:10 ..
24/01/2019 15:10 minimal
24/01/2019 15:04 node_modules
I replaced const { Cluster } = require('../dist'); with const { Cluster } = require('puppeteer-cluster'); and now it's OK.
First of all, thank you for your work!
I have a minor suggestion for improving the type checking of the cluster.queue() method.
Now we have this:
public async queue(
data: JobData | TaskFunction,
taskFunction?: TaskFunction,
): Promise<void> {
...
}
As one can inspect, JobData is of type any, and it is used both as the first argument to the cluster.queue() method and as the data property of the TaskFunctionArguments interface. This approach does not provide sufficient type checking when we call the cluster.queue() method with two arguments. I'd suggest using generic types here, like this:
type QueueFunction<T> = (arg: QueueFunctionArguments<T>) => Promise<void>;
interface QueueFunctionArguments<T> {
page: puppeteer.Page;
data: T;
worker: {
id: number;
};
}
public async queue<T>(
data: T | TaskFunction,
taskFunction?: QueueFunction<T>,
): Promise<void> {
...
}
I'm gonna document some puppeteer-cluster test runs, to see how the different concurrency types and options work together.
Feel free to add your own runs
First of all, great project!
I tried the example while setting maxConcurrency to 50 or 100. What I noticed is that when set to 100, the program hangs somewhere pretty much every time. When set to 50, the program hangs sometimes. Not sure what causes this issue. Thanks for any input.
Cool project, but I am confused about why you use await in your example:
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 2,
});
// Is `await` necessary?
await cluster.task(async ({ page, data: url }) => {
await page.goto(url);
const screen = await page.screenshot();
// Store screenshot, do something else
});
await cluster.queue('http://www.google.com/');
await cluster.queue('http://www.wikipedia.org/');
// many more pages
await cluster.idle();
await cluster.close();
})();
When you define a task or add some queues, why await? I tried removing them and it seems OK to do that.
Hello There.
I have multiple instances of puppeteer that scrape data from some sites. After scraping, each instance uses process.send() to output the data so that it can be saved to a database. I would love to know if it's possible to listen to the data/messages sent by each instance so that they can be saved to the DB, the same way we have the cluster.on('taskerror') event handler, and how to implement it. Regards.
Currently the library does not catch asynchronously thrown errors. That means code like this can lead to errors:
page.on('dialog', async dialog => {
await dialog.dismiss();
});
The correct way right now is to put a try catch block around the code inside the function. This is a problem, as the library might still come to a stop when the code is badly written.
One option: use process.on('uncaughtException') and/or process.on('unhandledRejection') to handle all kinds of errors. This might interfere with bigger applications that have this kind of handling already built in.
Not sure which one is the way to go. Open for ideas and opinions.
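The try/catch workaround mentioned above can be written as a small helper so errors thrown inside the async event handler are caught locally instead of surfacing as unhandled rejections. `attachDialogHandler` and the `onError` parameter are illustrative names, not library API:

```javascript
// Sketch of the documented workaround: a try/catch around the body of an
// async event handler, so a failing dismiss() cannot crash the process.
function attachDialogHandler(page, onError = console.error) {
  page.on('dialog', async (dialog) => {
    try {
      await dialog.dismiss();
    } catch (err) {
      onError('dismissing dialog failed:', err);
    }
  });
}
```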
Just came across a memory leak issue which hangs the program after some time.
concurrency: Cluster.CONCURRENCY_BROWSER,
workerCreationDelay: 200,
maxConcurrency: 20
Here is the scenario: I have a number of URLs that I want to open in parallel in bulk, but if the URLs are under the same domain name, each page needs to be delayed a few seconds before opening, to avoid being blocked by the target webmaster. How do I do that? The following settings do not seem to work:
concurrency: Cluster.CONCURRENCY_PAGE,
maxConcurrency: 10,
retryLimit: 5, // retry 5 times on failure
retryDelay: 2000, // 2 second interval between retries
sameDomainDelay: 30*1000, // delay pages under the same domain name; seems to have no effect
skipDuplicateUrls: true, // skip duplicate urls
workerCreationDelay: 500, // delay between opening tabs
I want to create a Node server for scraping using puppeteer (pass a search term in a GET request to scrape Google search results).
Currently my server is not able to process more than 5 parallel requests; beyond that it goes out of memory.
I'm looking for an index of the worker: simply the number of the current worker calling the task function.
The code below always captures the screenshot to the same file, screen.png:
const puppeteer = require('puppeteer-core');
const {
Cluster,
} = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 2,
puppeteer,
puppeteerOptions: {
executablePath: 'C:\\Users\\..\\AppData\\Local\\Google\\Chrome SxS\\Application\\chrome.exe',
},
});
await cluster.task(async ({
page,
data: url,
}) => {
await page.goto(url);
await page.screenshot({
path: './screen.png',
});
// Store screenshot, do something else
});
await cluster.queue('http://www.google.com/');
await cluster.queue('http://www.wikipedia.org/');
// many more pages
await cluster.idle();
await cluster.close();
})();
I want something like:
await cluster.task(async ({
page,
data: url,
wIndex,
}) => {
await page.goto(url);
await page.screenshot({
path: `./screen_${wIndex}.png`,
});
// Store screenshot, do something else
});
With wIndex
is a number of current Worker.
A simple solution for this example can be achieved by using the URL of the current job (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/minimal.js).
But what if it works with the same URL for each job?
P.S.: I also want to launch puppeteer with different launch options for each Worker.
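Until a worker index is exposed in the task arguments, a counter closure gives unique filenames even when every job uses the same URL. `makePathFactory` is an illustrative helper, not library API:

```javascript
// Workaround sketch: a counter closure yields a unique screenshot path per
// job, so no worker index is needed even with identical URLs.
function makePathFactory(prefix = './screen') {
  let count = 0;
  return () => `${prefix}_${count++}.png`;
}

// Usage inside the task (illustrative):
// const nextPath = makePathFactory();
// await cluster.task(async ({ page, data: url }) => {
//   await page.goto(url);
//   await page.screenshot({ path: nextPath() });
// });
```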