thomasdondorf / puppeteer-cluster
Puppeteer Pool, run a cluster of instances in parallel
License: MIT License
Some websites have IP-blocking measures, so I have to change the IP for each instance after a while. Is there a way I can use puppeteer-cluster to do that without starting a new cluster?
Is it possible to use "puppeteer-core" instead of "puppeteer" for the sake of not having to specify the environment variable to exclude a chrome download? I have to manually remove the chrome package from my distribution.
Greenkeeper has created 9 branches so far and is not even able to update the package-lock file on its own.
It's been enabled for barely two weeks and I already have more work cleaning up after it than value it provided. I will remove it, but I might check out Dependabot, which claims to be able to update package-lock files. Maybe that one will work better. Otherwise I will just update the dependencies on my own.
Right now most of the logic of the library is in Cluster.ts. It should be split up.
Also need to fix the code smells (or at least some): https://codeclimate.com/github/thomasdondorf/puppeteer-cluster/issues
All windows close after the queue ends; I want to keep the window active.
Cool project -- would you consider adding it to awesome-puppeteer?
Hi,
thanks for this awesome library :)
Unfortunately, I do not seem to get it to work, as none of the importing / requiring mechanisms seem to work:
const { Cluster } = require('puppeteer-cluster'); -> Cluster = undefined
import { Cluster } from 'puppeteer-cluster'; -> Cluster = undefined
import Cluster from 'puppeteer-cluster'; -> Cluster = {}
I'm on Node v8.11.4
What am I doing wrong?
I have tried running puppeteer-cluster with the AWS Lambda optimised Chrome binaries from https://github.com/Kikobeats/aws-lambda-chrome, but I'm running into errors.
Is it possible to run the cluster on AWS Lambda?
So far the domain extraction just takes the hostname from Node.js, which includes subdomains.
It should use TLD.js to make it work with normal top-level domains and also with multi-part suffixes like *.co.uk.
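For illustration, a naive registrable-domain lookup with a tiny hard-coded suffix set; a real fix should use TLD.js or the full Public Suffix List as the issue suggests. The `getRegistrableDomain` name and the suffix set are assumptions made for this sketch only:

```javascript
// Illustration only: a naive registrable-domain lookup with a tiny,
// hard-coded suffix set. A real implementation should use tld.js or
// the full Public Suffix List instead.
const MULTI_PART_SUFFIXES = new Set(['co.uk', 'com.au', 'co.jp']);

function getRegistrableDomain(hostname) {
  const parts = hostname.split('.');
  // Check whether the last two labels form a known multi-part suffix
  const lastTwo = parts.slice(-2).join('.');
  const suffixLength = MULTI_PART_SUFFIXES.has(lastTwo) ? 2 : 1;
  // Keep the suffix plus one more label: that is the registrable domain
  return parts.slice(-(suffixLength + 1)).join('.');
}
```

With this, `foo.bar.co.uk` and `bar.co.uk` map to the same domain, so sameDomainDelay-style logic would treat them as one site.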
The devDependency ts-jest was updated from 23.1.4 to 23.10.0. This version is covered by your current version range and after updating it in your project the build failed.
ts-jest is a devDependency of this project. It might not break your production code or affect downstream projects, but probably breaks your build or test tools, which may prevent deploying or publishing.
Release notes (ts-jest, reloaded!): there is now a Slack community where you can find some instant help, plus support for new jest, typescript and babel versions.
The new version differs by 293 commits.
0e5ffed
chore(release): 23.10.0
3665609
Merge pull request #734 from huafu/appveyor-optimizations
45d44d1
Merge branch 'master' into appveyor-optimizations
76e2fe5
ci(appveyor): cache npm versions as well
191c464
ci(appveyor): try to improve appveyor's config
0f31b42
Merge pull request #733 from huafu/fix-test-snap
661853a
Merge branch 'master' into fix-test-snap
aa7458a
Merge pull request #731 from kulshekhar/dependabot/npm_and_yarn/tslint-plugin-prettier-2.0.0
70775f1
ci(lint): run lint scripts in series instead of parallel
a18e919
style(fix): exclude package.json from tslint rules
011b580
test(config): stop using snapshots for pkg versions
7e5a3a1
build(deps-dev): bump tslint-plugin-prettier from 1.3.0 to 2.0.0
fbe90a9
Merge pull request #730 from kulshekhar/dependabot/npm_and_yarn/@types/node-10.10.1
a88456e
build(deps-dev): bump @types/node from 10.9.4 to 10.10.1
54fd239
Merge pull request #729 from kulshekhar/dependabot/npm_and_yarn/prettier-1.14.3
There are 250 commits in total.
See the full diff
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
Is there any way to limit the number of tasks used per browser instance? I'm thinking of something along the lines (perhaps) of tasksPerInstance: 1000
, and then the cluster would track the number of tasks run in a specific browser instance and, whenever that limit is reached, kill that browser instance and launch another, as a (potential) shield against browser memory growth. It's a technique I've seen used in other process-pooling models (I think some of the Apache web server modules let you specify a maximum number of requests a worker process will serve before it is terminated and replaced with a fresh process).
@kanxue660 reported high CPU usage: #11 (comment)
I am interested in exploring using puppeteer cluster in a Jest test context.
I am not able to import or require it without getting an Unexpected identifier error on that line.
import Cluster from 'puppeteer-cluster';
// or
const { Cluster } = require('puppeteer-cluster');
Error:
static async launch(options) {
^^^^^^
SyntaxError: Unexpected identifier
Thanks...
I'm trying to run puppeteer in a cluster using this library; however, when I try the following I get no errors, but the plugin itself doesn't load. The same arguments work perfectly with puppeteer directly.
Anyone have an idea why this is happening?
cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_PAGE,
maxConcurrency: 2,
monitor: false,
puppeteerOptions: {
headless: true,
args: [
'--no-sandbox',
'--disable-gpu',
'--enable-usermedia-screen-capturing',
'--allow-http-screen-capture',
'--auto-select-desktop-capture-source=ppc',
'--load-extension=' + __dirname+'/chrome-plugin',
'--disable-extensions-except=' + __dirname+'/chrome-plugin',
'--disable-infobars',
'--window-size=1920,1080',
],
}
});
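A likely cause, stated as an assumption rather than a confirmed diagnosis: classic headless Chrome ignores --load-extension, which is why the same arguments can work with puppeteer directly (presumably headful) but not with headless: true here. Running the cluster headful is the usual workaround:

```javascript
// Sketch: extensions require a headful Chrome; classic headless mode
// silently ignores --load-extension and --disable-extensions-except.
puppeteerOptions: {
  headless: false, // assumption: headful is acceptable for this workload
  args: [
    '--load-extension=' + __dirname + '/chrome-plugin',
    '--disable-extensions-except=' + __dirname + '/chrome-plugin',
  ],
},
```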
Hello,
Is it possible to get the version of browser ?
I do This
const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_CONTEXT,
  maxConcurrency: 2,
  monitor: true,
  retryLimit: 0,
  timeout: 180000,
});
await cluster.queue('www.example.com', main);
// Display browser version
// console.log(cluster.browser.version()) ?
const main = async ({ page, data: url }) => {
await page.goto(url);
const results = await page.evaluate(async () => {
debugger;
let title = document.title;
return title;
}).then((data) => {
console.log(data);
});
};
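A possible approach, assuming your puppeteer version exposes page.browser() (which returns the Browser the page belongs to): read the version from inside the task. The sketch takes a page-like object so the idea is easy to test without launching Chrome; `reportVersion` is an illustrative name:

```javascript
// Sketch: read the browser version from inside a task via page.browser().
// Written against a page-like object, so it works with any puppeteer Page.
const reportVersion = async ({ page, data: url }) => {
  const version = await page.browser().version();
  console.log(`Rendering ${url} with ${version}`);
  return version;
};
```

Used as a task function, this would print something like `HeadlessChrome/70.0.3538.77` once per queued job.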
Related code:
const cluster = await Cluster.launch({
puppeteerOptions: {
headless: true,
ignoreHTTPSErrors: env.IGNORE_HTTPS || false,
args: ['--disable-http2'],
timeout: env.PUPPETEER_TIMEOUT || 60000, //attempt 2
},
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: parseInt(env.MAX_WORKER) || 4,
skipDuplicateUrls: false,
monitor: env.MONITOR === 'true' || false,
timeout: env.PUPPETEER_TIMEOUT || 60000, //attempt 1
});
I've tried setting timeout both in the cluster launch options and in puppeteerOptions; both failed. The log says the timeout is still 30000.
app:cluster:err TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded
app:cluster:err at Promise.then (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/FrameManager.js:1276:21)
app:cluster:err -- ASYNC --
app:cluster:err at Frame.<anonymous> (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/helper.js:144:27)
app:cluster:err at Page.goto (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/Page.js:624:49)
app:cluster:err at Page.<anonymous> (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/helper.js:145:23)
app:cluster:err at GenericHandler.processPage (/home/bambang/project/om-screenshoot/src/handlers/v1base.js:47:21)
app:cluster:err at GenericHandler.process (/home/bambang/project/om-screenshoot/src/handlers/v1base.js:94:16)
app:cluster:err at module.exports (/home/bambang/project/om-screenshoot/src/handlers/site.js:27:24)
app:cluster:err at Worker.<anonymous> (/home/bambang/project/om-screenshoot/node_modules/puppeteer-cluster/dist/Worker.js:56:54)
app:cluster:err at Generator.next (<anonymous>)
app:cluster:err at fulfilled (/home/bambang/project/om-screenshoot/node_modules/puppeteer-cluster/dist/Worker.js:4:58)
app:cluster:err at process.internalTickCallback (internal/process/next_tick.js:77:7) +789ms
Any guidance on how to trace or fix this issue?
Can I use this in micro / express / etc. and have an endpoint process a "screenshot" task and return a value when the task completes? Is this a thing?
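This can be done by bridging the queue and the HTTP response with a Promise. A sketch under stated assumptions: `cluster` is any object with a queue(data, task) method, and `makeScreenshotHandler` is an illustrative name, not an API of micro or express:

```javascript
// Sketch: wrap cluster.queue in a Promise so an HTTP handler can await the
// screenshot. The handler signature matches express-style (req, res).
function makeScreenshotHandler(cluster) {
  return async (req, res) => {
    const buffer = await new Promise((resolve, reject) => {
      cluster.queue(req.query.url, async ({ page, data: url }) => {
        try {
          await page.goto(url);
          resolve(await page.screenshot());
        } catch (err) {
          reject(err);
        }
      });
    });
    res.send(buffer);
  };
}
```

With express this would be registered as e.g. `app.get('/screenshot', makeScreenshotHandler(cluster))`.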
Hi guys,
I'm trying to use Express to wrap a little REST API on top of puppeteer, but I see the only way to add a new URL is via the cluster queue. My concern is that if I make parallel requests, I will receive the wrong answer, i.e. the content of another URL.
My question is: is it possible to run synchronous tasks?
Thanks, and sorry for my bad English.
Hello there.
I'm testing puppeteer-cluster in a project I'm working on and I have problem in headless mode.
Because I cannot send the original code, I tried to reproduce the problem using the simple Queuing functions example. When headless is false it works like a charm. When set to true, nothing happens.
Am I missing something?
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 3,
puppeteerOptions: {
headless: true
},
monitor: true
});
await cluster.queue(async ({ page }) => {
await page.goto('http://www.wikipedia.org');
await page.screenshot({path: 'wikipedia.png'});
});
await cluster.queue(async ({ page }) => {
await page.goto('https://www.google.com/');
const pageTitle = await page.evaluate(() => document.title);
console.log('google');
});
await cluster.queue(async ({ page }) => {
await page.goto('https://www.imdb.com/');
console.log('IMDB');
});
await cluster.idle();
await cluster.close();
})();
puppeteer v1.9.0,
puppeteer-cluster v0.11.2
I will appreciate your help. Thank you
This goes away from the traditional idea of "New browser per task" or "New page per task". This one is more about keeping a cluster of pages open the entire time and periodically refreshing them.
Why would I want to do this, you ask?
Let's say I have a page that has d3 charts and I want to turn all the charts into images (my actual product isn't d3 charts). If the charts update in real time and I want a screenshot every 5 minutes (assuming there are 100s of charts), opening a page / browser each time takes a while. If I just kept the tab open and kept screenshotting, then I'd have the screenshots a lot sooner.
Now for my more techy way: I'm exposing a function to the site I'm screenshotting, and that function retrieves arguments from puppeteer/chrome to render specific items on the page.
Pseudo-code:
// browser
if (typeof window.getRenderOpts === 'function') {
window.getRenderOpts().then((opts) => updateChart(opts));
}
// puppeteer
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
async function getPageAndLock(): Promise<Page> {
// ... gets a page that's idle, or waits till one becomes idle ...
}
async function pageIsReady(): Promise<Page> {
// ...
}
... async (req, res) => {
const page = await getPageAndLock();
await page.evaluate(`render(${JSON.stringify({ /* ... */ })})`);
const screenshot = await page.screenshot(/* ... */);
pageIsReady(page);
res.send(screenshot);
}
It's probably out of the scope of this library, but I'm not sure if anyone would be interested in this type of concurrency.
I did benchmarks of "New browser per task", "New page per task", and "Same page per task"; keeping the page open and taking screenshots periodically is A LOT FASTER. I can dig those benchmarks up if you want me to. This was when I was experimenting.
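The keep-pages-open idea boils down to a small resource pool. A minimal sketch, generic over the resource type (a puppeteer Page in practice); `ResourcePool` and its method names are illustrative, not library API:

```javascript
// Minimal sketch of the "keep pages open" idea: a pool that hands out an
// idle resource (e.g. a puppeteer Page) and queues callers until one frees up.
class ResourcePool {
  constructor(resources) {
    this.idle = [...resources];
    this.waiters = [];
  }

  // Resolves with a resource, waiting if none is currently idle
  acquire() {
    if (this.idle.length > 0) {
      return Promise.resolve(this.idle.pop());
    }
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  // Hands the resource to the next waiter, or marks it idle again
  release(resource) {
    const next = this.waiters.shift();
    if (next) {
      next(resource);
    } else {
      this.idle.push(resource);
    }
  }
}
```

This corresponds to the getPageAndLock / pageIsReady pair in the pseudo-code above: acquire locks a page, release makes it available again.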
Hi,
I am having a little trouble figuring out a way to return a variable from a queued function.
Given the sample function-queuing-complex.js example, I have tried using both return and resolve in extractTitle, since I read in the README that cluster.queue returns a Promise. Both resulted in undefined being returned. A Promise.all doesn't seem to work either. Is this a bug or am I doing something wrong?
const extractTitle = async ({ page, data: url }) => {
await page.goto(url);
const pageTitle = await page.evaluate(() => document.title);
// How do I return pageTitle to use outside this async function?
};
const task1 = await cluster.queue("https://reddit.com/", extractTitle);
const task2 = await cluster.queue("https://twitter.com/", extractTitle);
Promise.all([task1, task2]).then(result => console.log(result)); // returns undefined
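One workaround sketch, assuming only that cluster.queue accepts (data, taskFunction): wrap the call in a Promise that is resolved from inside the task function. `queueWithResult` is an illustrative helper name, not part of the library (later versions add cluster.execute for this exact use case):

```javascript
// Workaround sketch: cluster.queue resolves once the job is *queued*, not
// with the task's return value. Wrapping it in a Promise that resolves from
// inside the task gives the result back to the caller.
function queueWithResult(cluster, data, task) {
  return new Promise((resolve, reject) => {
    cluster.queue(data, async (args) => {
      try {
        resolve(await task(args));
      } catch (err) {
        reject(err);
        throw err; // let the cluster's own error handling see it too
      }
    });
  });
}
```

With this, `await queueWithResult(cluster, url, extractTitle)` yields the page title instead of undefined, and Promise.all over several such calls collects all results.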
const { Cluster } = require('../dist');
(async () => {
// Create a cluster with 2 workers
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_PAGE,
maxConcurrency: 2,
puppeteerOptions: {headless: false}
});
// Define a task (in this case: screenshot of page)
await cluster.task(async ({ page, data: url }) => {
await page.goto(url);
const path = url.replace(/[^a-zA-Z]/g, '_') + '.png';
await page.screenshot({ path });
console.log(`Screenshot of ${url} saved: ${path}`);
});
// Add some pages to queue
await cluster.queue('https://www.google.com');
await cluster.queue('https://www.wikipedia.org');
await cluster.queue('https://github.com/');
// Shutdown after everything is done
await cluster.idle();
await cluster.close();
})();
This only generated screenshots for Wikipedia and GitHub. The browser also hung for some time.
My use case is the following: create a cluster with Cluster.CONCURRENCY_BROWSER
and never close it.
const { connect } = require('amqplib');
const { Cluster } = require('puppeteer-cluster');
const { crawler, puppeteerOptions, redis } = require('./docroot');
const { Resource } = require('./docroot/Component');
(async ({ RABBITMQ_USER, RABBITMQ_PASS, RABBITMQ_HOST, RABBITMQ_PORT, RABBITMQ_QUEUE, RABBITMQ_THREADS, REDIS_LIST }) => {
const cluster = await Cluster.launch({
monitor: true,
concurrency: Cluster.CONCURRENCY_BROWSER,
maxConcurrency: Number(RABBITMQ_THREADS),
puppeteerOptions,
});
const channel = await (await connect(`amqp://${RABBITMQ_USER}:${RABBITMQ_PASS}@${RABBITMQ_HOST}:${RABBITMQ_PORT}`)).createChannel();
channel.assertQueue(RABBITMQ_QUEUE, {
durable: false,
});
await cluster.task(async ({ data, page }) => {
const { resource, message } = data;
const metadata = await crawler.crawl(resource, page);
await redis.rpush(REDIS_LIST, JSON.stringify(metadata));
channel.ack(message);
});
channel.consume(RABBITMQ_QUEUE, message => {
const content = JSON.parse(message.content.toString('utf8'));
const resource = new Resource(content.resource);
if (Array.isArray(content.links_to_check_for)) {
resource.setLinks(content.links_to_check_for);
}
cluster.queue({ resource, message });
});
})(process.env);
As you can see above, the cluster's queue gets filled once RabbitMQ sends something. This means the process is kind of a daemon and shouldn't be stopped. I'm wondering whether the pages the cluster creates should be closed (await page.close() after const metadata = await crawler.crawl(resource, page);) once they are not needed anymore, or is that done automatically?
I'm thinking about what kind of functionality this library should provide before it should be released as v1. I might edit the list in the future:
- Rethink sameDomainDelay and skipDuplicateUrls. Detection of domains should use TLD.js, for example.
- Documentation should be better, and there should be a way to provide the URL without using data or { url: ... }
- Make CONCURRENCY_BROWSER the default as it is more robust?
- Add a cluster.execute function which executes a job directly (for Cluster.queue, for example)
- Rename: concurrency should be concurrencyType
- Rename: maxConcurrency maybe maxWorkers?

A common use-case would be to have many different tests spread out over multiple files.
This seems to be exactly what I need to speed up my tests - but I don't understand how to utilise it to run tests in different files in parallell.
Ex;
One test suite in home/tests/e2e/LoginPage.test.js
Another test suite in loan/tests/e2e/OverviewPage.test.js
I understand I could use it within the same test suite - but what about running different test suites in parallel?
A failing expect call will not lead to an error if it gets caught. See jestjs/jest#3917 for discussion. This might currently lead to failing tests that are not reported, as the generous error handling catches them.
Three options:
- Rename taskerror to error, which will make sure that Node.js crashes in that case. Users will have to take care of the error handler then.
- Add an option like throwOnTaskerror so that task errors will not get caught.

It's a guessing game right now to figure out the correct options based on CPU / Memory.
It'd be nice to know the best options for puppeteer-cluster on various systems.
I'm trying to figure out the best deployment strategy for "real-time" processing. Many small instances vs fewer large instances, etc.
Ref: https://docs.browserless.io/blog/2018/06/04/puppeteer-best-practices.html
When I run puppeteer-cluster with 100 URLs, it only crawls 98 or 99 of them.
Here is my code:
const { Cluster } = require('puppeteer-cluster');

var link = [];
var total = 0;
var start = 3;
const size = process.argv[2];
for (let i = 0; i < size; i++) {
  link.push(process.argv[start++]);
}

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 50,
    timeout: 400000,
    monitor: false
  });
  await cluster.task(async ({ page, data: url }) => {
    const response = await page.goto(url, { timeout: 100000, waitUntil: 'networkidle2' });
    console.log(response.url());
    if (response.status() == 404) {
      console.log('program encountered error');
      return;
    }
    total++; // counts the number of urls
    const hrefs = await page.evaluate(() => {
      const anchors = document.querySelectorAll('a');
      return [].map.call(anchors, a => a.href);
    });
  });
  for (let i = 0; i < size; i++) { await cluster.queue(link[i]); }
  await cluster.idle();
  await cluster.close();
  console.log(total);
  process.exit(0);
})();
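One likely explanation is that a navigation failed for one or two URLs and the error was swallowed by the cluster's error handling. puppeteer-cluster emits a taskerror event for failed jobs; the sketch below (with `trackFailures` as an illustrative helper) makes the missing URLs visible, so successes plus failures add up to 100:

```javascript
// Sketch: record failed jobs so the final count accounts for every URL.
// cluster.on('taskerror', (err, data)) is the real puppeteer-cluster event;
// the failures array is just illustration.
function trackFailures(cluster) {
  const failures = [];
  cluster.on('taskerror', (err, data) => {
    failures.push({ url: data, message: err.message });
    console.log(`Error crawling ${data}: ${err.message}`);
  });
  return failures;
}
```

Calling `trackFailures(cluster)` right after launch, then logging the array after `cluster.idle()`, shows exactly which one or two URLs were dropped and why.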
At a loss on how to deal with this scenario.
https://stackoverflow.com/questions/46669788/how-to-handle-popups-in-puppeteer
puppeteer/puppeteer#2968 (comment)
You might need to listen for 'targetcreated' on the browser context to track the popup and do some actions on it.
There is a requirement: a database table contains a lot of URLs that need to be visited one by one. I want to read them in batches to prevent using too much memory, while the program keeps running and only reads from the database every once in a while, then executes. Can you add an idle event for this loop, so that the database is read only when the cluster is idle? I wonder if this approach is feasible.
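The idle-based batching described above can be sketched with cluster.idle(), which resolves once the queue is empty. `readBatch` is a hypothetical function standing in for the database read (returning an empty array when there is nothing left):

```javascript
// Sketch: read a slice of URLs, queue them, wait for the cluster to drain,
// then read the next slice. readBatch is a hypothetical database accessor.
async function processInBatches(cluster, readBatch) {
  let processed = 0;
  while (true) {
    const batch = await readBatch();
    if (batch.length === 0) break;
    batch.forEach((url) => cluster.queue(url));
    await cluster.idle(); // only go back to the database once the queue is empty
    processed += batch.length;
  }
  return processed;
}
```

This keeps at most one batch of URLs in memory at a time, which matches the "read the database only when idle" requirement.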
| Branch | Build failing 🚨 |
|---|---|
| Dependency | debug |
| Current Version | 3.1.0 |
| Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
debug is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
A long-awaited release to debug is available now: 3.2.0.
- chrome.storage support (or make the storage backend pluggable): 71d2aa7
- supports-color@5: 285dfe1
- enable() (#517): ab5083f

Huge thanks to @DanielRuf, @EirikBirkeland, @KyleStay, @Qix-, @abenhamdine, @alexey-pelykh, @DiegoRBaquero, @febbraro, @kwolfy, and @TooTallNate for their help!
The new version differs by 25 commits.
dec4b15
3.2.0
3ca2331
clean up builds
9f4f8f5
remove needless command aliases in makefile
623c08e
no longer checking for BROWSER=1
57cde56
fix tests
62822f1
clean up makefile
833b6f8
fix tests
ba8a424
move to XO (closes #397)
2d2509e
add .editorconfig
853853f
bump vulnerable packages
7e1d5d9
add yarn-error.log to .gitignore
e43e5fe
add instance extends feature (#524)
207a6a2
Fix nwjs support (#569)
05b0ceb
add Node.js 10, remove Node.js 4 (#583)
02b9ea9
Add TVMLKit support (#579)
There are 25 commits in total.
See the full diff
The code is as follows
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 2,
sameDomainDelay:20*1000 //Will it be delayed for 20 seconds?
});
await cluster.task(async ({ page, data: url }) => {
await page.goto(url);
const screen = await page.screenshot();
// Store screenshot, do something else
});
await cluster.queue('http://www.google.com/a.html');
await cluster.queue('http://www.google.com/b.html');
await cluster.queue('http://www.google.com/c.html');
// many more pages
await cluster.idle();
await cluster.close();
})();
My question is: if a.html opens first, will b.html and c.html then each be delayed by 20 seconds before opening?
I don't understand how sameDomainDelay is used.
This software seems really interesting and useful.
Do you have any plans on changing your open source license from the GNU General Public License 3.0 to something else, such as Apache License 2.0, BSD or MIT?
I'm asking since many individuals and organisations cannot use GPL-licensed software. Thanks.
Is there any way for it to listen to some sort of queue for work?
I need pagination (page turning); how can I do it?
This error is happening. Puppeteer version 1.9 seems to be the reason.
Edit: I just wrote crappy code (forgot an await
). The fix is on the way.
Something like:
cluster.on('monitor', (data) => {
console.log(data);
});
Hello,
I have a few questions; I'm not sure if I should have created multiple issues.
Question: using the code below (example code), when I am debugging, the browser window closes suddenly, not letting me finish stepping through my code. Am I missing a config option?
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 2,
monitor: true,
retryLimit: 0,
puppeteerOptions: {
headless: false,
devtools: true,
defaultViewport: {
width: 1920,
height: 1080
}
}
});
await cluster.queue('www.example.com', main);
const main = async ({ page, data: url }) => {
await page.goto(url);
const results = await page.evaluate(async () => {
debugger;
let title = document.title;
return title;
}).then((data) => {
console.log(data);
});
};
thanks
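A likely cause, stated as an assumption rather than a confirmed diagnosis: puppeteer-cluster aborts a task after its default 30-second timeout, which tears down the page mid-debugging. Raising the cluster-level timeout option (sketch below, with an arbitrarily large value) should keep the window open while stepping through code:

```javascript
// Sketch: raise the cluster's task timeout while debugging, so the worker
// is not torn down before you finish stepping through the task.
const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_CONTEXT,
  maxConcurrency: 2,
  monitor: true,
  retryLimit: 0,
  // assumption: the default 30s task timeout is what closes the window
  timeout: 24 * 60 * 60 * 1000, // effectively unlimited for a debug session
  puppeteerOptions: {
    headless: false,
    devtools: true,
  },
});
```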
I'm trying out puppeteer-cluster with the minimal.js example. I'm getting the following error:
D:\Developpement\NodeJS\minimal>node minimal.js
internal/modules/cjs/loader.js:583
throw err;
^
Error: Cannot find module '../dist'
at Function.Module._resolveFilename (internal/modules/cjs/loader.js:581:15)
at Function.Module._load (internal/modules/cjs/loader.js:507:25)
at Module.require (internal/modules/cjs/loader.js:637:17)
at require (internal/modules/cjs/helpers.js:22:18)
at Object. (D:\Developpement\NodeJS\minimal\minimal.js:1:83)
at Module._compile (internal/modules/cjs/loader.js:689:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
at Module.load (internal/modules/cjs/loader.js:599:32)
at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
at Function.Module._load (internal/modules/cjs/loader.js:530:3)
With my configuration the directory ../dist does not exist.
I have
24/01/2019 15:10 .
24/01/2019 15:10 ..
24/01/2019 15:10 minimal
24/01/2019 15:04 node_modules
I replaced const { Cluster } = require('../dist'); with const { Cluster } = require('puppeteer-cluster'); and now it's OK.
First of all, thank you for your work!
I have a minor suggestion for improving the type checking of the cluster.queue() method.
Now we have this:
public async queue(
data: JobData | TaskFunction,
taskFunction?: TaskFunction,
): Promise<void> {
...
}
As one can inspect, JobData is of type any, and it is used both as the first argument to the cluster.queue() method and as the data property of the TaskFunctionArguments interface. This approach does not provide sufficient type checking when we call the cluster.queue() method with two arguments. I'd suggest using generic types here, like this:
type QueueFunction<T> = (arg: QueueFunctionArguments<T>) => Promise<void>;
interface QueueFunctionArguments<T> {
page: puppeteer.Page;
data: T;
worker: {
id: number;
};
}
public async queue<T>(
data: T | TaskFunction,
taskFunction?: QueueFunction<T>,
): Promise<void> {
...
}
I'm gonna document some puppeteer-cluster test runs, to see how the different concurrency types and options work together.
Feel free to add your own runs
First of all, great project!
I tried the example while setting maxConcurrency to 50 or 100. What I noticed is that when set to 100, the program hangs somewhere pretty much every time. When set to 50, the program hangs sometimes. Not sure what causes this issue. Thanks for any input.
Cool project, but I am confused about why you use await in your example:
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 2,
});
// Is `await` necessary?
await cluster.task(async ({ page, data: url }) => {
await page.goto(url);
const screen = await page.screenshot();
// Store screenshot, do something else
});
await cluster.queue('http://www.google.com/');
await cluster.queue('http://www.wikipedia.org/');
// many more pages
await cluster.idle();
await cluster.close();
})();
When you define a task or add some queues, why await? I tried removing them and it seems OK to do that.
Hello There.
I have multiple instances of puppeteer that scrape data from some sites. After scraping, each instance uses process.send() to output the data so that it can be saved to a database. I would love to know if it's possible to listen to the data/messages sent by each instance so that they can be saved to the DB, the same way we have the cluster.on('taskerror') event handler, and how to implement it. Regards.
Currently the library does not catch asynchronously thrown errors. That means code like this can lead to errors:
page.on('dialog', async dialog => {
await dialog.dismiss();
});
The correct way right now is to put a try catch block around the code inside the function. This is a problem, as the library might still come to a stop when the code is badly written.
One option: use process.on('uncaughtException') and/or process.on('unhandledRejection') to handle all kinds of errors. This might interfere with bigger applications that have this kind of handling already built in.
Not sure which one is the way to go. Open for ideas and opinions.
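The try/catch workaround mentioned above can be written as a small helper so errors thrown inside the async event handler are caught locally instead of surfacing as unhandled rejections. `attachDialogHandler` and the `onError` parameter are illustrative names, not library API:

```javascript
// Sketch of the documented workaround: a try/catch around the body of an
// async event handler, so a failing dismiss() cannot crash the process.
function attachDialogHandler(page, onError = console.error) {
  page.on('dialog', async (dialog) => {
    try {
      await dialog.dismiss();
    } catch (err) {
      onError('dismissing dialog failed:', err);
    }
  });
}
```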
Just came across a memory leak issue which hangs the program after some time.
concurrency: Cluster.CONCURRENCY_BROWSER,
workerCreationDelay: 200,
maxConcurrency: 20
Here is the scenario: I have a number of URLs that I want to open in parallel in bulk, but if the URLs are under the same domain name, each page needs to be delayed a few seconds before opening, to avoid being blocked by the target webmaster. How do I do that? The following settings do not seem to work:
concurrency: Cluster.CONCURRENCY_PAGE,
maxConcurrency: 10,
retryLimit: 5, // retry 5 times on failure
retryDelay: 2000, // 2 second interval between retries
sameDomainDelay: 30*1000, // delay pages under the same domain name; seems to have no effect
skipDuplicateUrls: true, // skip duplicate urls
workerCreationDelay: 500, // delay between opening tabs
I want to create a Node server for scraping using puppeteer (pass a search term in a GET request to scrape Google search results).
Currently my server is not able to process more than 5 parallel requests; beyond that it goes out of memory.
I'm looking for an index of the worker: simply the number of the current worker calling the task function.
The code below always captures the screenshot to the same file, screen.png:
const puppeteer = require('puppeteer-core');
const {
Cluster,
} = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 2,
puppeteer,
puppeteerOptions: {
executablePath: 'C:\\Users\\..\\AppData\\Local\\Google\\Chrome SxS\\Application\\chrome.exe',
},
});
await cluster.task(async ({
page,
data: url,
}) => {
await page.goto(url);
await page.screenshot({
path: './screen.png',
});
// Store screenshot, do something else
});
await cluster.queue('http://www.google.com/');
await cluster.queue('http://www.wikipedia.org/');
// many more pages
await cluster.idle();
await cluster.close();
})();
I want something like:
await cluster.task(async ({
page,
data: url,
wIndex,
}) => {
await page.goto(url);
await page.screenshot({
path: `./screen_${wIndex}.png`,
});
// Store screenshot, do something else
});
With wIndex
is a number of current Worker.
A simple solution for this example can be achieved by using the URL of the current job (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/minimal.js).
But what if it works with the same URL for each job?
P.S.: I also want to launch puppeteer with different launch options for each Worker.
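Until a worker index is exposed in the task arguments, a counter closure gives unique filenames even when every job uses the same URL. `makePathFactory` is an illustrative helper, not library API:

```javascript
// Workaround sketch: a counter closure yields a unique screenshot path per
// job, so no worker index is needed even with identical URLs.
function makePathFactory(prefix = './screen') {
  let count = 0;
  return () => `${prefix}_${count++}.png`;
}

// Usage inside the task (illustrative):
// const nextPath = makePathFactory();
// await cluster.task(async ({ page, data: url }) => {
//   await page.goto(url);
//   await page.screenshot({ path: nextPath() });
// });
```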