apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Home Page: https://crawlee.dev

License: Apache License 2.0

Languages: JavaScript 4.48% HTML 0.01% CSS 0.37% TypeScript 52.75% Dockerfile 0.65% MDX 41.75%
Topics: web-scraping web-crawling npm headless-chrome puppeteer automation apify scraping crawling crawler headless scraper web-crawler javascript nodejs playwright typescript

crawlee's Introduction

Crawlee
A web scraping and browser automation library


Crawlee covers your crawling and scraping end-to-end and helps you build reliable scrapers. Fast.

Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project's needs.

Crawlee is available as the crawlee NPM package.

👉 View full documentation, guides and examples on the Crawlee project website 👈

Installation

We recommend visiting the Introduction tutorial in the Crawlee documentation for more information.

Crawlee requires Node.js 16 or higher.

With Crawlee CLI

The fastest way to try Crawlee out is to use the Crawlee CLI and choose the Getting started example. The CLI will install all the necessary dependencies and add boilerplate code for you to play with.

npx crawlee create my-crawler
cd my-crawler
npm start

Manual installation

If you prefer to add Crawlee to your own project, try the example below. Because it uses PlaywrightCrawler, we also need to install Playwright. It's not bundled with Crawlee to keep the install size small.

npm install crawlee playwright

import { PlaywrightCrawler, Dataset } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
    // Uncomment this option to see the browser window.
    // headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

By default, Crawlee stores data to ./storage in the current working directory. You can override this directory via Crawlee configuration. For details, see Configuration guide, Request storage and Result storage.
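
For example, one way to change the directory (a minimal sketch; the CRAWLEE_STORAGE_DIR environment variable is one of the options covered in the Configuration guide) is to set it when starting your script:

CRAWLEE_STORAGE_DIR=./my-storage node my-crawler.js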

🛠 Features

  • Single interface for HTTP and headless browser crawling
  • Persistent queue for URLs to crawl (breadth & depth first)
  • Pluggable storage of both tabular data and files
  • Automatic scaling with available system resources
  • Integrated proxy rotation and session management
  • Lifecycles customizable with hooks
  • CLI to bootstrap your projects
  • Configurable routing, error handling and retries
  • Dockerfiles ready to deploy
  • Written in TypeScript with generics

👾 HTTP crawling

  • Zero config HTTP2 support, even for proxies
  • Automatic generation of browser-like headers
  • Replication of browser TLS fingerprints
  • Integrated fast HTML parsers: Cheerio and JSDOM
  • Yes, you can scrape JSON APIs as well

💻 Real browser crawling

  • JavaScript rendering and screenshots
  • Headless and headful support
  • Zero-config generation of human-like fingerprints
  • Automatic browser management
  • Use Playwright and Puppeteer with the same interface
  • Chrome, Firefox, Webkit and many others

Usage on the Apify platform

Crawlee is open-source and runs anywhere, but since it's developed by Apify, it's easy to set up on the Apify platform and run in the cloud. Visit the Apify SDK website to learn more about deploying Crawlee to the Apify platform.

Support

If you find any bug or issue with Crawlee, please submit an issue on GitHub. For questions, you can ask on Stack Overflow, in GitHub Discussions or you can join our Discord server.

Contributing

Your code contributions are welcome, and you'll be praised to eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see CONTRIBUTING.md.

License

This project is licensed under the Apache License 2.0 - see the LICENSE.md file for details.

crawlee's People

Contributors

aarontjdev, andreybykov, b4nan, barjin, cybairfly, davidjohnbarton, dependabot-preview[bot], dependabot[bot], drobnikj, fnesveda, foxt451, gippy, janbuchar, jancurn, jbartadev, metalwarrior665, mnmkng, mstephen19, mtrunkat, mvolfik, novotnyj, petrpatek, pocesar, renovate-bot, renovate[bot], souravjain540, strajk, szmarczak, vbartonicek, vladfrangu


crawlee's Issues

Move build files to top-level dir for NPM publish

Update publish.sh to move files from build to the top-level package directory, so that we can use:

const utils = require('apify/utils');

rather than:

const utils = require('apify/build/utils');

If PuppeteerCrawler fails to close Page, a successful page is considered failed and repeated

This error happens in the actor apify/har-files-for-url-list. The code in handlePageFunction normally succeeds, but then PuppeteerCrawler's page.close() call fails with Error: Protocol error (Target.closeTarget): Target closed, and the page is reclaimed and recrawled again. IMHO, once handlePageFunction finishes, we should consider the request succeeded, no matter what state the Page object is in.

2018-07-14T18:10:28.313Z HAR of http://www.apify.com/ saved successfully.
2018-07-14T18:10:28.325Z WARNING: PuppeteerPool: browser is retired already {"id":1}
2018-07-14T18:10:28.373Z ERROR: BasicCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"http://www.apify.com/","retryCount":1}
2018-07-14T18:10:28.375Z {"name":"Error","message":"Protocol error (Target.closeTarget): Target closed.","stack":"Error: Protocol error (Target.closeTarget): Target closed.\n    at Promise (/home/myuser/node_modules/puppeteer/lib/Connection.js:86:56)\n    at new Promise (<anonymous>)\n    at Connection.send (/home/myuser/node_modules/puppeteer/lib/Connection.js:85:12)\n    at Page.close (/home/myuser/node_modules/puppeteer/lib/Page.js:888:38)\n    at _bluebird2.default.race.finally (/home/myuser/node_modules/apify/build/puppeteer_crawler.js:257:50)\n    at PassThroughHandlerContext.finallyHandler (/home/myuser/node_modules/bluebird/js/release/finally.js:56:23)\n    at PassThroughHandlerContext.tryCatcher (/home/myuser/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/home/myuser/node_modules/bluebird/js/release/promise.js:512:31)\n    at Promise._settlePromise (/home/myuser/node_modules/bluebird/js/release/promise.js:569:18)\n    at Promise._settlePromise0 (/home/myuser/no... [line-too-long]
2018-07-14T18:10:28.384Z INFO: Launching Puppeteer {"args":["--no-sandbox","--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"],"headless":true,"proxyUrl":"http://groups-BUYPROXIES68277:<redacted>@172.31.16.10:8011/"}

Unhandled promise rejection warning in unit tests

Unit tests show the following warnings:

    Apify.openRequestQueue
      ✓ should open a local request queue when process.env[ENV_VARS.LOCAL_EMULATION_DIR] is set
---------------------------------------------------------------------
------------- WARNING: Unhandled promise rejection !!!! -------------
---------------------------------------------------------------------
{ Error: ENOENT: no such file or directory, scandir '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-2/handled'
  cause: 
   { Error: ENOENT: no such file or directory, scandir '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-2/handled'
     errno: -2,
     code: 'ENOENT',
     syscall: 'scandir',
     path: '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-2/handled' },
  isOperational: true,
  errno: -2,
  code: 'ENOENT',
  syscall: 'scandir',
  path: '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-2/handled' }
      ✓ should reuse cached request queue instances (105ms)
      ✓ should open default request queue when queueIdOrName is not provided
      ✓ should open remote queue when process.env[ENV_VARS.LOCAL_EMULATION_DIR] is NOT set
---------------------------------------------------------------------
------------- WARNING: Unhandled promise rejection !!!! -------------
---------------------------------------------------------------------
{ Error: ENOENT: no such file or directory, scandir '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-4/handled'
  cause: 
   { Error: ENOENT: no such file or directory, scandir '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-4/handled'
     errno: -2,
     code: 'ENOENT',
     syscall: 'scandir',
     path: '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-4/handled' },
  isOperational: true,
  errno: -2,
  code: 'ENOENT',
  syscall: 'scandir',
  path: '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-4/handled' }

Remove info about AutoscaledPool failed from log

Hide this kind of information from the log, so we don't scare users so much.

2018-07-03T09:20:35.543Z ERROR: AutoscaledPool._autoscale() function failed {}
2018-07-03T09:21:33.653Z {"name":"Error","message":"ESRCH: no such process, read","stack":"Error: ESRCH: no such process, read","errno":-3,"code":"ESRCH","syscall":"read"}

await Apify.openRequestQueue( only [a-z0-9-]+ ) --> not explicit on local?

await Apify.openRequestQueue( Math.random().toString() );

This error is thrown in an actor, but not locally:

2018-06-18T10:01:21.456Z User function threw an exception:
2018-06-18T10:01:21.457Z ApifyError: Invalid value provided: Name can only contain letters 'a' through 'z', the digits '0' through '9', and the hyphen ('-') but only in the middle of the string (e.g. 'my-value-1')
2018-06-18T10:01:21.459Z at exports.newApifyErrorFromResponse (/home/myuser/node_modules/apify-client/build/utils.js:82:12)
2018-06-18T10:01:21.461Z at Request._request2.default.(anonymous function) [as _callback] (/home/myuser/node_modules/apify-client/build/utils.js:157:50)
2018-06-18T10:01:21.462Z at Request.self.callback (/home/myuser/node_modules/request/request.js:185:22)
2018-06-18T10:01:21.464Z at emitTwo (events.js:126:13)
2018-06-18T10:01:21.465Z at Request.emit (events.js:214:7)
2018-06-18T10:01:21.467Z at Request. (/home/myuser/node_modules/request/request.js:1157:10)
2018-06-18T10:01:21.469Z at emitOne (events.js:116:13)
2018-06-18T10:01:21.470Z at Request.emit (events.js:211:7)
2018-06-18T10:01:21.472Z at IncomingMessage. (/home/myuser/node_modules/request/request.js:1079:12)
2018-06-18T10:01:21.473Z at Object.onceWrapper (events.js:313:30)
2018-06-18T10:01:21.475Z at emitNone (events.js:111:20)
2018-06-18T10:01:21.476Z at IncomingMessage.emit (events.js:208:7)
2018-06-18T10:01:21.478Z at endReadableNT (_stream_readable.js:1064:12)
2018-06-18T10:01:21.479Z at _combinedTickCallback (internal/process/next_tick.js:138:11)
2018-06-18T10:01:21.481Z at process._tickCallback (internal/process/next_tick.js:180:9)


Add persistStateKey to RequestList

This should simplify the usage of RequestList:

const requestList = new Apify.RequestList({
    sources: [ ... ],
    persistStateKey: 'my-request-list-state',
});

await requestList.initialize();

Proxy configuration redesign + smarter function Apify.getProxy()

Currently we have the Apify.getApifyProxy() function, but it doesn't do much; it only compiles a string.

It would be good to have a new async function called Apify.getProxyInfo() with the same parameters, which would do the following:

  • If APIFY_PROXY_PASSWORD is not set but APIFY_TOKEN is, it would obtain the proxy password via the API. This will make it easier for people to use it locally (less configuration).
  • Check that the user has access to Apify Proxy (e.g. that the trial has not expired), and throw a user-friendly error if not. Otherwise the error might be very obscure.
  • Check that the user has the right to access the selected proxy groups, and throw a user-friendly error if not.

The function could return an object that is compatible with the result of Node.js URL parsing (https://nodejs.org/api/url.html#url_url_strings_and_url_objects). Some people need more than just the URL, and forcing them to parse the result themselves is awkward.

{
  proxyUrl: 'http://groups-BLABLA,session-123:[email protected]:8000',
  hostname: 'proxy.apify.com',
  port: 8000,
  auth: 'groups-BLABLA,session-123:my_password',
  username: 'groups-BLABLA,session-123',
  password: 'my_password',
  protocol: 'http',
  ...
}

But maybe it's okay to call the function Apify.getProxyUrl(). People can always call this:

const proxyUrl = new URL(await Apify.getProxyUrl());
console.log(proxyUrl.hostname);

or

const url = require('url');

...

const proxyUrl = url.parse(await Apify.getProxyUrl());
console.log(proxyUrl.hostname);

What are your thoughts?

Integrate Chrome debugger into live view

If you start Chrome as shown below, the Chrome debugger will listen on port 9222.

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=remote-profile


First, we need to test on the Apify platform that it runs correctly when opened on the live-view container port and displayed in an iframe.


The second step will be to integrate it into live view. Each Chrome/Chromium instance will have its debugger running on a different port, and live view will have to proxy connections from different paths on the live container port to the individual debugger instances.

Apify.call() cannot be called twice with the same options object

const opts = {};

Apify.call(actId, input, opts);
Apify.call(actId, input, opts); // this call fails synchronously

Any call made after the first one with the same opts object fails. Most probably the function mutates the opts object somewhere down the line, which prevents reusing it in a second call.
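
Until this is fixed, a simple workaround (a sketch, not an official recommendation) is to give each call its own shallow copy of the options:

const opts = {};

// Pass a fresh copy of the options to each call so that any internal
// mutation of the object cannot affect subsequent calls.
Apify.call(actId, input, { ...opts });
Apify.call(actId, input, { ...opts }); // no longer fails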

Add PuppeteerPool.retire(browser) method

Use case:

I am scraping a website with concurrency higher than one and have multiple tabs open in each browser. I can detect in handlePageFunction or gotoFunction that I was blocked by anti-scraping protection, so I need to retry the URL with a different IP address.

Workaround:

The easiest way is to kill the browser using browser.close() and throw an error so that the request gets retried in a new browser. The problem is that this kills all the open tabs, even though they may still be processing successfully opened pages.

Proper implementation:

Add a puppeteerPool.retire(browser) or browser.retire() method, and in the case mentioned above call:

handlePageFunction({ browser, puppeteerPool, request }) {
    // ...
    puppeteerPool.retire(browser);
    throw new Error('Request was blocked!');
}

BUG: Apify.setValue() throws on keys that include slashes

Attempting to save to the key-value store using a key containing / will fail, since the slash is interpreted literally as a path separator.

ENOENT: no such file or directory, open '/Users/.../apify_local/key-value-stores/default/http:/partiwa-adiputra.com/invoice/nD.json
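
Until keys are validated or escaped by the SDK, a workaround is to sanitize them on the caller's side. A minimal sketch, assuming the caller replaces disallowed characters (the replacement scheme is just an illustration):

// Replace characters such as '/' and ':' so that the local emulation
// does not treat parts of the key as directories.
const toSafeKey = (key) => key.replace(/[^a-zA-Z0-9!\-_.'()]/g, '-');

await Apify.setValue(toSafeKey('http://partiwa-adiputra.com/invoice/nD'), { some: 'data' });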

Add web server providing live view of Puppeteer instances at port 1234

Motivation

Currently, the act run console contains tabs providing info about the run and its default storages.

We want to add one more tab, "Live view", that enables an act to provide its own custom information page. The main use case we currently have for this is to provide a live view of Puppeteer browser windows. This way, users will get better insight into what's happening during the act run.

Implementation at Apify platform

If the act starts an HTTP web server on port 1234, the Apify platform will add a tab with an iframe displaying its content.

Implementation in SDK

If a Puppeteer instance gets started using Apify.launchPuppeteer(), start a web server providing an HTML page with a live view, i.e. a stream from that browser.

If multiple instances get started, the index page should contain a list of running browsers, each linked to its live view.

To simplify the first version, replace the live view of the browser with a screenshot and HTML code updated at a regular interval.
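
A rough sketch of what that first version could look like (port 1234 as above; the server, the refresh interval, and the single page variable are illustrative assumptions, not the actual SDK implementation):

const http = require('http');

// `page` is assumed to be an open Puppeteer Page obtained elsewhere.
// Keep the latest screenshot in memory and refresh it at a regular interval.
let lastScreenshot = null;
setInterval(async () => {
    try {
        lastScreenshot = await page.screenshot({ type: 'jpeg', quality: 60 });
    } catch (err) {
        // The page might be navigating or already closed; keep the old image.
    }
}, 2000);

// Serve a trivial HTML page that embeds the screenshot and reloads itself.
http.createServer((req, res) => {
    if (req.url === '/screenshot.jpg' && lastScreenshot) {
        res.writeHead(200, { 'Content-Type': 'image/jpeg' });
        return res.end(lastScreenshot);
    }
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.end('<html><head><meta http-equiv="refresh" content="2"></head>'
        + '<body><img src="/screenshot.jpg"></body></html>');
}).listen(1234);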

Create utility function downloadListOfUrls

This function should look like:

downloadListOfUrls({ url, encoding, offset, limit, skipDuplicates, urlRegExp })

and it should return a promise resolving to an array of URLs. The function should preserve the order of URLs.

This function is useful for many crawling use cases, so it's worth having in utils. RequestList should use it internally too.
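
A possible sketch of the function, assuming the got HTTP client and a simple default URL regex (both are illustrative choices, not the final implementation):

const got = require('got');

// The default regex is only an illustration; a real implementation would
// probably use a more robust URL pattern.
const URL_REGEX = /https?:\/\/[^\s"'<>]+/g;

async function downloadListOfUrls({ url, encoding = 'utf8', offset = 0, limit, skipDuplicates = false, urlRegExp = URL_REGEX } = {}) {
    const { body } = await got(url, { encoding });
    // Extract URLs in the order in which they appear in the downloaded text.
    let urls = body.match(urlRegExp) || [];
    if (skipDuplicates) urls = [...new Set(urls)];
    // Apply offset and limit after deduplication, preserving order.
    return urls.slice(offset, limit !== undefined ? offset + limit : undefined);
}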

Better shutdown behavior on MIGRATION event

If a MIGRATION event is received, we should first abort the crawler and then persist the request list. Otherwise, there might be duplicate requests in the dataset, caused by inconsistency between the dataset and the request list.
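
A sketch of the proposed ordering (the abort() method is hypothetical; the 'migrating' event and requestList.persistState() follow the SDK's existing conventions):

Apify.events.on('migrating', async () => {
    // Stop fetching and processing new requests first...
    await crawler.abort(); // hypothetical method
    // ...and only then persist the request list, so its state matches the dataset.
    await requestList.persistState();
});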

Apify.events unit test randomly fails

The test case "should send persist state events in regular interval" fails about 50% of the time, depending on the system workload. The unit test should be deterministic.

Add default environment variables needed for local development to SDK

Currently, many environment variables, such as APIFY_DEFAULT_REQUEST_QUEUE_ID or APIFY_LOCAL_EMULATION_DIR, are set by apify-cli, so to run SDK scripts without the CLI, one must manually set those (often more than 5) env vars.

We should move the defaults to the SDK so that users are not forced to use the CLI or to manually define the env vars.

Pidusage package breaks Travis builds with >2.0.10

The problem is that we can't use the Docker environment, because we need sudo to be able to install Chrome, etc.

It's temporarily fixed by pinning pidusage to version 2.0.9:

    it('throws correct error message when process not found', () => {
        const NONEXISTING_PID = 9999;
        const promise = pidusage(NONEXISTING_PID);

        return expect(promise).to.be.rejectedWith(utils.PID_USAGE_NOT_FOUND_ERROR);
    });
4) pidusage NPM package throws correct error message when process not found:
      AssertionError: expected promise to be rejected with an error including 'No maching pid found' but got 'Invalid path'
      + expected - actual
      -Invalid path
      +No maching pid found

Keep functions strictly synchronous or asynchronous

We have functions throughout the codebase that may complete (or fail) both synchronously and asynchronously, such as requestList.initialize():

initialize() {
    if (this.isLoading) {
        throw new Error('RequestList sources are already loading or were loaded.');
    }
    this.isLoading = true;

    return Promise
        .mapSeries(this.sources, (source) => { /* ... */ });
}

Best practices suggest that one should avoid such contracts, because error handling and API consumption become unpredictable.

I suggest enforcing strict sync/async contracts. A good time to fix this would be when migrating the codebase to async/await syntax, since that removes some of the pain by itself; see the sketch below.
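
For illustration, a sketch of an async-only initialize(), so callers always get a rejected promise rather than a synchronous throw (the _addSource() helper is hypothetical):

async initialize() {
    // Throwing inside an async function always surfaces as a rejected promise,
    // so the error-handling contract is the same for every failure mode.
    if (this.isLoading) {
        throw new Error('RequestList sources are already loading or were loaded.');
    }
    this.isLoading = true;

    for (const source of this.sources) {
        await this._addSource(source); // hypothetical per-source helper
    }
}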

Feature idea: Add exponential back-off AutoscaledPool

It would be great to have an optional flag for AutoscaledPool that reduces the pool's concurrency when errors occur. If the pool performs an operation that might overload a target website or another resource (e.g. an API returning rate-limiting errors), we should reduce the load to ensure that at least some work gets done.
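
A rough sketch of what the back-off could look like, assuming the pool exposes a mutable desiredConcurrency and a count of consecutive errors (both are assumptions, not the current API):

// Divide the desired concurrency by an exponentially growing factor for each
// consecutive error, never dropping below the configured minimum.
function applyBackOff(pool, consecutiveErrors, { factor = 2, minConcurrency = 1 } = {}) {
    const reduced = Math.floor(pool.desiredConcurrency / factor ** consecutiveErrors);
    pool.desiredConcurrency = Math.max(minConcurrency, reduced);
}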

new Apify.PuppeteerCrawler did not work with useApifyProxy and apifyProxyGroups

I can't launch Puppeteer when I use:

const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    launchPuppeteerOptions: {
        useApifyProxy: true,
        apifyProxyGroups: ['myGroup'],
    },
    // ...

It ended with the following error:

Error: Cannot combine "opts.useApifyProxy" with "opts.proxyUrl"!
    at Object.exports.launchPuppeteer (/Users/Drobnik/WebstormProjects/actor-acts/acts/crawler-barnebys-co-uk-cars/node_modules/apify/build/puppeteer.js:152:52)
    at refreshCookies (/Users/Drobnik/WebstormProjects/actor-acts/acts/crawler-barnebys-co-uk-cars/main.js:40:33)
    at Apify.main (/Users/Drobnik/WebstormProjects/actor-acts/acts/crawler-barnebys-co-uk-cars/main.js:73:15)
    at <anonymous>

AutoscaledPool cannot grow anymore

Hi,

I made some experiments with the AutoscaledPool and found that it has issues when the worker function depends on an external source.

There are two main issues:

  • First issue: when the number of workers in the pool decreases because a worker function returned null, the pool is unable to grow again.

The following example illustrates it: you initially have 2 workers running; while they are running, more words are added for processing, but one of the workers terminates before new data is available, so all the remaining words end up being processed by only 1 worker, when we would expect the pool to grow again.

// words to be processed by the worker
let words = ['foo', 'bar'];
// words to be added later for processing
let moreWords = ['baz', 'qux', 'quux', 'corge', 'grault'];

// add a new word at a given interval
let addWords;
addWords = setInterval(() => {
    words.push(moreWords.shift());
    if (moreWords.length === 0) {
        clearInterval(addWords);
    }
}, 900);

// print words
const pool = new Apify.AutoscaledPool({
    maxConcurrency: 6,
    minConcurrency: 4,
    workerFunction: () => {
        let word = words.shift();

        if (word) {
            console.log('init ' + word);
            return new Promise((resolve) => {
                console.log('start ' + word);
                setTimeout(() => {
                    console.log(word);
                    resolve();
                }, 1000);
            });
        } else {
            return null;
        }
    },
});

await pool.run();

To address this issue, the simplest solution I can imagine would be the ability to notify the pool that new data is available for the worker function, so that the pool tries to grow again. That could be achieved by a simple method call, e.g. pool.tryAddWorker(), or by a callback function that is polled while workers are waiting; when it returns true, a new worker is added (unless maxConcurrency is reached), like this:

new Apify.AutoscaledPool({
    maxConcurrency: 6,
    minConcurrency: 4,
    workerFunction: () => { /* ... */ },
    hasData: () => { return words.length > 0; },
});

  • Second issue: if the pool completes but the external data source still has data to process, the pool will stop and the remaining data will never be processed.

To reproduce this issue, use the same code as above, but make the time in setInterval greater than the timeout in the promise.

To address this issue, I could imagine having the ability to flag the pool as "not ready to stop", in which case it would stay alive and wait for an opportunity to grow until the flag is cleared.

Make ProxyChain usable externally

If we want to use ProxyChain e.g. with Chrome CDP, it's too painful now. Make ProxyChain more generic and document it so that it can be generally used.

Add process monitoring to Live View

For many crawling use cases it would be great to be able to see how many resources each process running in the actor container consumes, something like what the Linux command top does. In the first version, it could be just a list of processes refreshed e.g. every second. In a later version, we can turn it into a graph.

BTW, the SDK already uses some process monitoring for AutoscaledPool; we might reuse it.

Fix AutoscaledPool implementation

The current implementation of AutoscaledPool has several important problems that need to be fixed.

  • The _autoscale() function is called using setInterval, so if the system is under load, many interval events accumulate, _autoscale() runs perpetually, and everything chokes. In particular, since getMemoryInfo spawns another process, it might take a lot of time. We should use betterSetInterval from apify-shared-js for this (please write the missing unit test for it!)
  • Since the _autoscale() function might not be called at regular intervals, its functionality cannot depend on the interval constants (SCALE_UP_INTERVAL/SCALE_DOWN_INTERVAL); it needs to work purely with time-based constants. BTW, changing the constants now causes the unit tests to fail...
  • The interval for _autoscale() should be more like one second rather than 200 milliseconds, to reduce overhead
  • The tracking of an overloaded CPU should use the blocked NPM package in addition to the cpuInfo event on the platform. There should be no assumption about the periodicity of these events, so we should only store their last value and the time when they were received, and use our own sampling at a fixed interval (e.g. once per second) to create isCpuOverloadedSnapshots (see the sketch after this list). This way we know what the data in isCpuOverloadedSnapshots mean.
  • Use async/await instead of promise chains
  • Make auto-scaling work even on a local machine, not only on the Apify platform!!!
  • The scaling logic should be simplified, so that it can be easily described to the users. We need to discuss how exactly, but we can take inspiration e.g. from AWS autoscaling
  • Maybe replace ignoreMainProcess with ignoreChildProcesses - I think the main process should always be considered, since adding new tasks (e.g. Puppeteer instances) will always add some resource usage to it. It would make more sense to add an ignoreChildProcesses flag, e.g. if the caller knows that their pool tasks do not spawn any processes.
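
A rough sketch of the fixed-interval CPU sampling mentioned in the list above (the event source and the snapshot shape are assumptions, not the actual implementation):

const { EventEmitter } = require('events');

// In the SDK this would be the platform events emitter; here it is a stand-in.
const events = new EventEmitter();

// Remember only the last received value and the time it arrived.
let lastCpuInfo = { isCpuOverloaded: false, receivedAt: Date.now() };
events.on('cpuInfo', (data) => {
    lastCpuInfo = { isCpuOverloaded: data.isCpuOverloaded, receivedAt: Date.now() };
});

// Sample at our own fixed interval, independently of how often events arrive.
const isCpuOverloadedSnapshots = [];
setInterval(() => {
    isCpuOverloadedSnapshots.push({ ...lastCpuInfo, sampledAt: Date.now() });
}, 1000);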

Enable getValue/setValue primitives for named key-value stores

Maybe we should have a function like Apify.openKeyValueStore(nameOrId), which would open or create a key-value store and return an object with getValue and setValue functions that follow the same logic as the Apify object. This would make it much easier to use named key-value stores from acts.
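
A sketch of how the proposed API might be used (the function does not exist yet, so the names below are illustrative):

// Open (or create) a named key-value store and use it like the Apify object.
const store = await Apify.openKeyValueStore('my-named-store');
await store.setValue('LAST-RESULT', { foo: 'bar' });
const lastResult = await store.getValue('LAST-RESULT');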

AutoscaledPool doesn't work locally

Issues:

  • macOS usually reports that there is not enough memory, even though it can be freed
  • There is enough memory on Windows, but we are not scaling based on CPU, so the CPU gets overloaded

Solution:

  • Investigate Mac once again
  • Compute used memory based on the Node.js process and its child processes only
  • Create a CPU-overloaded event emitter, as we have on the Apify platform
