apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Home Page: https://crawlee.dev

License: Apache License 2.0

Languages: JavaScript 4.48% HTML 0.01% CSS 0.37% TypeScript 52.75% Dockerfile 0.65% MDX 41.75%
Topics: web-scraping web-crawling npm headless-chrome puppeteer automation apify scraping crawling crawler headless scraper web-crawler javascript nodejs playwright typescript

crawlee's Introduction

Crawlee
A web scraping and browser automation library


Crawlee covers your crawling and scraping end-to-end and helps you build reliable scrapers. Fast.

Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project's needs.

Crawlee is available as the crawlee NPM package.

👉 View full documentation, guides and examples on the Crawlee project website 👈

Installation

We recommend visiting the Introduction tutorial in the Crawlee documentation for more information.

Crawlee requires Node.js 16 or higher.

With Crawlee CLI

The fastest way to try Crawlee out is to use the Crawlee CLI and choose the Getting started example. The CLI will install all the necessary dependencies and add boilerplate code for you to play with.

npx crawlee create my-crawler
cd my-crawler
npm start

Manual installation

If you prefer to add Crawlee to your own project, try the example below. Because it uses PlaywrightCrawler, we also need to install Playwright. It's not bundled with Crawlee to keep the install size small.

npm install crawlee playwright

import { PlaywrightCrawler, Dataset } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
    // Uncomment this option to see the browser window.
    // headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

By default, Crawlee stores data to ./storage in the current working directory. You can override this directory via Crawlee configuration. For details, see Configuration guide, Request storage and Result storage.
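
For example, one way to change the directory (a minimal sketch; the CRAWLEE_STORAGE_DIR environment variable is one of the options covered in the Configuration guide) is to set it when starting your script:

CRAWLEE_STORAGE_DIR=./my-storage node my-crawler.js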

🛠 Features

  • Single interface for HTTP and headless browser crawling
  • Persistent queue for URLs to crawl (breadth & depth first)
  • Pluggable storage of both tabular data and files
  • Automatic scaling with available system resources
  • Integrated proxy rotation and session management
  • Lifecycles customizable with hooks
  • CLI to bootstrap your projects
  • Configurable routing, error handling and retries
  • Dockerfiles ready to deploy
  • Written in TypeScript with generics

👾 HTTP crawling

  • Zero config HTTP2 support, even for proxies
  • Automatic generation of browser-like headers
  • Replication of browser TLS fingerprints
  • Integrated fast HTML parsers: Cheerio and JSDOM
  • Yes, you can scrape JSON APIs as well

💻 Real browser crawling

  • JavaScript rendering and screenshots
  • Headless and headful support
  • Zero-config generation of human-like fingerprints
  • Automatic browser management
  • Use Playwright and Puppeteer with the same interface
  • Chrome, Firefox, Webkit and many others

Usage on the Apify platform

Crawlee is open-source and runs anywhere, but since it's developed by Apify, it's easy to set up on the Apify platform and run in the cloud. Visit the Apify SDK website to learn more about deploying Crawlee to the Apify platform.

Support

If you find any bug or issue with Crawlee, please submit an issue on GitHub. For questions, you can ask on Stack Overflow, in GitHub Discussions or you can join our Discord server.

Contributing

Your code contributions are welcome, and you'll be praised to eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see CONTRIBUTING.md.

License

This project is licensed under the Apache License 2.0 - see the LICENSE.md file for details.

crawlee's People

Contributors

aarontjdev, andreybykov, b4nan, barjin, cybairfly, davidjohnbarton, dependabot-preview[bot], dependabot[bot], drobnikj, fnesveda, foxt451, gippy, janbuchar, jancurn, jbartadev, metalwarrior665, mnmkng, mstephen19, mtrunkat, mvolfik, novotnyj, petrpatek, pocesar, renovate-bot, renovate[bot], souravjain540, strajk, szmarczak, vbartonicek, vladfrangu


crawlee's Issues

Move build files to top-level dir for NPM publish

Update publish.sh to move files from build to the top-level package directory, so that we can use:

const utils = require('apify/utils');

rather than:

const utils = require('apify/build/utils');

If PuppeteerCrawler fails to close Page, a successful page is considered failed and repeated

This error happens in the actor apify/har-files-for-url-list. The code in handlePageFunction normally succeeds, but then PuppeteerCrawler's page.close() call fails with Error: Protocol error (Target.closeTarget): Target closed, and the page is reclaimed and recrawled again. IMHO, once handlePageFunction finishes, we should consider the request succeeded, no matter what state the Page object is in.

2018-07-14T18:10:28.313Z HAR of http://www.apify.com/ saved successfully.
2018-07-14T18:10:28.325Z WARNING: PuppeteerPool: browser is retired already {"id":1}
2018-07-14T18:10:28.373Z ERROR: BasicCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"http://www.apify.com/","retryCount":1}
2018-07-14T18:10:28.375Z {"name":"Error","message":"Protocol error (Target.closeTarget): Target closed.","stack":"Error: Protocol error (Target.closeTarget): Target closed.\n    at Promise (/home/myuser/node_modules/puppeteer/lib/Connection.js:86:56)\n    at new Promise (<anonymous>)\n    at Connection.send (/home/myuser/node_modules/puppeteer/lib/Connection.js:85:12)\n    at Page.close (/home/myuser/node_modules/puppeteer/lib/Page.js:888:38)\n    at _bluebird2.default.race.finally (/home/myuser/node_modules/apify/build/puppeteer_crawler.js:257:50)\n    at PassThroughHandlerContext.finallyHandler (/home/myuser/node_modules/bluebird/js/release/finally.js:56:23)\n    at PassThroughHandlerContext.tryCatcher (/home/myuser/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/home/myuser/node_modules/bluebird/js/release/promise.js:512:31)\n    at Promise._settlePromise (/home/myuser/node_modules/bluebird/js/release/promise.js:569:18)\n    at Promise._settlePromise0 (/home/myuser/no... [line-too-long]
2018-07-14T18:10:28.384Z INFO: Launching Puppeteer {"args":["--no-sandbox","--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"],"headless":true,"proxyUrl":"http://groups-BUYPROXIES68277:<redacted>@172.31.16.10:8011/"}

Unhandled promise rejection warning in unit tests

Unit tests show the following warnings:

    Apify.openRequestQueue
      ✓ should open a local request queue when process.env[ENV_VARS.LOCAL_EMULATION_DIR] is set
---------------------------------------------------------------------
------------- WARNING: Unhandled promise rejection !!!! -------------
---------------------------------------------------------------------
{ Error: ENOENT: no such file or directory, scandir '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-2/handled'
  cause: 
   { Error: ENOENT: no such file or directory, scandir '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-2/handled'
     errno: -2,
     code: 'ENOENT',
     syscall: 'scandir',
     path: '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-2/handled' },
  isOperational: true,
  errno: -2,
  code: 'ENOENT',
  syscall: 'scandir',
  path: '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-2/handled' }
      ✓ should reuse cached request queue instances (105ms)
      ✓ should open default request queue when queueIdOrName is not provided
      ✓ should open remote queue when process.env[ENV_VARS.LOCAL_EMULATION_DIR] is NOT set
---------------------------------------------------------------------
------------- WARNING: Unhandled promise rejection !!!! -------------
---------------------------------------------------------------------
{ Error: ENOENT: no such file or directory, scandir '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-4/handled'
  cause: 
   { Error: ENOENT: no such file or directory, scandir '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-4/handled'
     errno: -2,
     code: 'ENOENT',
     syscall: 'scandir',
     path: '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-4/handled' },
  isOperational: true,
  errno: -2,
  code: 'ENOENT',
  syscall: 'scandir',
  path: '/Users/jan/Projects/apify-js/tmp/local-emulation-dir/request-queues/some-id-4/handled' }

Remove info about AutoscaledPool failed from log

Hide this kind of information from the log, so we don't scare users so much.

2018-07-03T09:20:35.543Z ERROR: AutoscaledPool._autoscale() function failed {}
2018-07-03T09:21:33.653Z {"name":"Error","message":"ESRCH: no such process, read","stack":"Error: ESRCH: no such process, read","errno":-3,"code":"ESRCH","syscall":"read"}

await Apify.openRequestQueue( only [a-z0-9-]+ ) --> not explicit on local?

await Apify.openRequestQueue( Math.random().toString() );

This error is thrown in an actor, but not locally:

2018-06-18T10:01:21.456Z User function threw an exception:
2018-06-18T10:01:21.457Z ApifyError: Invalid value provided: Name can only contain letters 'a' through 'z', the digits '0' through '9', and the hyphen ('-') but only in the middle of the string (e.g. 'my-value-1')
2018-06-18T10:01:21.459Z at exports.newApifyErrorFromResponse (/home/myuser/node_modules/apify-client/build/utils.js:82:12)
2018-06-18T10:01:21.461Z at Request._request2.default.(anonymous function) [as _callback] (/home/myuser/node_modules/apify-client/build/utils.js:157:50)
2018-06-18T10:01:21.462Z at Request.self.callback (/home/myuser/node_modules/request/request.js:185:22)
2018-06-18T10:01:21.464Z at emitTwo (events.js:126:13)
2018-06-18T10:01:21.465Z at Request.emit (events.js:214:7)
2018-06-18T10:01:21.467Z at Request. (/home/myuser/node_modules/request/request.js:1157:10)
2018-06-18T10:01:21.469Z at emitOne (events.js:116:13)
2018-06-18T10:01:21.470Z at Request.emit (events.js:211:7)
2018-06-18T10:01:21.472Z at IncomingMessage. (/home/myuser/node_modules/request/request.js:1079:12)
2018-06-18T10:01:21.473Z at Object.onceWrapper (events.js:313:30)
2018-06-18T10:01:21.475Z at emitNone (events.js:111:20)
2018-06-18T10:01:21.476Z at IncomingMessage.emit (events.js:208:7)
2018-06-18T10:01:21.478Z at endReadableNT (_stream_readable.js:1064:12)
2018-06-18T10:01:21.479Z at _combinedTickCallback (internal/process/next_tick.js:138:11)
2018-06-18T10:01:21.481Z at process._tickCallback (internal/process/next_tick.js:180:9)


Add persistStateKey to RequestList

This should simplify the usage of RequestList:

const requestList = new Apify.RequestList({
    sources: [ ... ],
    persistStateKey: 'my-request-list-state',
});

await requestList.initialize();

Proxy configuration redesign + smarter function Apify.getProxy()

Currently we have the Apify.getApifyProxy() function, but it doesn't do much; it only compiles a string.

It would be good to have a new async function called Apify.getProxyInfo() with the same parameters, which would do the following:

  • If APIFY_PROXY_PASSWORD is not set but APIFY_TOKEN is, it would obtain the proxy password via the API. This will make it easier for people to use it locally (less configuration).
  • Check that the user has access to Apify Proxy (e.g. that the trial has not expired), and throw a user-friendly error if not. Otherwise the error might be very obscure.
  • Check that the user has the right to access the selected proxy groups, and throw a user-friendly error if not.

The function could return an object that is compatible with the result of Node.js URL parsing (https://nodejs.org/api/url.html#url_url_strings_and_url_objects). Some people need more than just the URL, and forcing them to parse the result themselves is awkward.

{
  proxyUrl: 'http://groups-BLABLA,session-123:[email protected]:8000',
  hostname: 'proxy.apify.com',
  port: 8000,
  auth: 'groups-BLABLA,session-123:my_password',
  username: 'groups-BLABLA,session-123',
  password: 'my_password',
  protocol: 'http',
  ...
}

But maybe it's okay to call the function Apify.getProxyUrl(). People can always call this:

const proxyUrl = new URL(await Apify.getProxyUrl());
console.log(proxyUrl.hostname);

or

const url = require('url');

...

const proxyUrl = url.parse(await Apify.getProxyUrl());
console.log(proxyUrl.hostname);

What are your thoughts?

Integrate Chrome debugger into live view

If you start Chrome as shown below, the Chrome debugger will listen on port 9222.

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=remote-profile


First, we need to test on the Apify platform that it runs correctly when opened on the live-view container port and displayed in an iframe.


The second step will be to integrate it into live view. Each Chrome/Chromium instance will have its debugger running on a different port, and live view will have to proxy connections from different paths on the live container port to the individual debugger instances.

Apify.call() cannot be called twice with the same options object

const opts = {};

Apify.call(actId, input, opts);
Apify.call(actId, input, opts); // this call fails synchronously

Any call made after the first one with the same opts object fails. Most probably the function mutates the opts object somewhere down the line, which prevents reusing it in a second call.
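
Until this is fixed, a simple workaround (a sketch, not an official recommendation) is to give each call its own shallow copy of the options:

const opts = {};

// Pass a fresh copy of the options to each call so that any internal
// mutation of the object cannot affect subsequent calls.
Apify.call(actId, input, { ...opts });
Apify.call(actId, input, { ...opts }); // no longer fails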

Add PuppeteerPool.retire(browser) method

Use case:

I am scraping a website with concurrency higher than one and have multiple tabs open in each browser. I can detect in handlePageFunction or gotoFunction that I was blocked by anti-scraping protection, so I need to retry the URL with a different IP address.

Workaround:

The easiest way is to kill the browser using browser.close() and throw an error so that the request gets retried in a new browser. The problem is that this kills all the open tabs, even though they may still be processing successfully opened pages.

Proper implementation:

Add a puppeteerPool.retire(browser) or browser.retire() method, and in the case mentioned above call:

handlePageFunction({ browser, puppeteerPool, request }) {
    // ...
    puppeteerPool.retire(browser);
    throw new Error('Request was blocked!');
}

BUG: Apify.setValue() throws on keys that include slashes

Attempting to save to the key-value store using a key containing / will fail, since the slash is interpreted literally as a path separator.

ENOENT: no such file or directory, open '/Users/.../apify_local/key-value-stores/default/http:/partiwa-adiputra.com/invoice/nD.json
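
Until keys are validated or escaped by the SDK, a workaround is to sanitize them on the caller's side. A minimal sketch, assuming the caller replaces disallowed characters (the replacement scheme is just an illustration):

// Replace characters such as '/' and ':' so that the local emulation
// does not treat parts of the key as directories.
const toSafeKey = (key) => key.replace(/[^a-zA-Z0-9!\-_.'()]/g, '-');

await Apify.setValue(toSafeKey('http://partiwa-adiputra.com/invoice/nD'), { some: 'data' });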

Add web server providing live view of Puppeteer instances at port 1234

Motivation

Currently, the act run console contains tabs providing info about the run and its default storages.

We want to add one more tab, "Live view", that enables an act to provide its own custom information page. The main use case we currently have for this is to provide a live view of Puppeteer browser windows. This way, users will get better insight into what's happening during the act run.

Implementation at Apify platform

If the act starts an HTTP web server on port 1234, the Apify platform will add a tab with an iframe displaying its content.

Implementation in SDK

If a Puppeteer instance gets started using Apify.launchPuppeteer(), start a web server providing an HTML page with a live view, i.e. a stream from that browser.

If multiple instances get started, the index page should contain a list of running browsers, each linked to its live view.

To simplify the first version, replace the live view of the browser with a screenshot and HTML code updated at a regular interval.
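
A rough sketch of what that first version could look like (port 1234 as above; the server, the refresh interval, and the single page variable are illustrative assumptions, not the actual SDK implementation):

const http = require('http');

// `page` is assumed to be an open Puppeteer Page obtained elsewhere.
// Keep the latest screenshot in memory and refresh it at a regular interval.
let lastScreenshot = null;
setInterval(async () => {
    try {
        lastScreenshot = await page.screenshot({ type: 'jpeg', quality: 60 });
    } catch (err) {
        // The page might be navigating or already closed; keep the old image.
    }
}, 2000);

// Serve a trivial HTML page that embeds the screenshot and reloads itself.
http.createServer((req, res) => {
    if (req.url === '/screenshot.jpg' && lastScreenshot) {
        res.writeHead(200, { 'Content-Type': 'image/jpeg' });
        return res.end(lastScreenshot);
    }
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.end('<html><head><meta http-equiv="refresh" content="2"></head>'
        + '<body><img src="/screenshot.jpg"></body></html>');
}).listen(1234);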

Create utility function downloadListOfUrls

This function should look like:

downloadListOfUrls({ url, encoding, offset, limit, skipDuplicates, urlRegExp })

and it should return a promise resolving to an array of URLs. The function should preserve the order of URLs.

This function is useful for many crawling use cases, so it's worth having in utils. RequestList should use it internally too.
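
A possible sketch of the function, assuming the got HTTP client and a simple default URL regex (both are illustrative choices, not the final implementation):

const got = require('got');

// The default regex is only an illustration; a real implementation would
// probably use a more robust URL pattern.
const URL_REGEX = /https?:\/\/[^\s"'<>]+/g;

async function downloadListOfUrls({ url, encoding = 'utf8', offset = 0, limit, skipDuplicates = false, urlRegExp = URL_REGEX } = {}) {
    const { body } = await got(url, { encoding });
    // Extract URLs in the order in which they appear in the downloaded text.
    let urls = body.match(urlRegExp) || [];
    if (skipDuplicates) urls = [...new Set(urls)];
    // Apply offset and limit after deduplication, preserving order.
    return urls.slice(offset, limit !== undefined ? offset + limit : undefined);
}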

Better shutdown behavior on MIGRATION event

If a MIGRATION event is received, we should first abort the crawler and then persist the request list. Otherwise, there might be duplicate requests in the dataset, caused by inconsistency between the dataset and the request list.
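
A sketch of the proposed ordering (the abort() method is hypothetical; the 'migrating' event and requestList.persistState() follow the SDK's existing conventions):

Apify.events.on('migrating', async () => {
    // Stop fetching and processing new requests first...
    await crawler.abort(); // hypothetical method
    // ...and only then persist the request list, so its state matches the dataset.
    await requestList.persistState();
});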

Apify.events unit test randomly fails

The test case "should send persist state events in regular interval" fails about 50% of the time, depending on the system workload. The unit test should be deterministic.

Add default environment variables needed for local development to SDK

Currently, many environment variables, such as APIFY_DEFAULT_REQUEST_QUEUE_ID or APIFY_LOCAL_EMULATION_DIR, are set by apify-cli, so to run SDK scripts without the CLI, one must manually set those (often more than 5) env vars.

We should move the defaults to the SDK so that users are not forced to use the CLI or to manually define the env vars.

Pidusage package breaks Travis builds with >2.0.10

The problem is that we can't use the Docker environment, because we need sudo to be able to install Chrome, etc.

It's temporarily fixed by pinning pidusage to version 2.0.9:

    it('throws correct error message when process not found', () => {
        const NONEXISTING_PID = 9999;
        const promise = pidusage(NONEXISTING_PID);

        return expect(promise).to.be.rejectedWith(utils.PID_USAGE_NOT_FOUND_ERROR);
    });
4) pidusage NPM package throws correct error message when process not found:
      AssertionError: expected promise to be rejected with an error including 'No maching pid found' but got 'Invalid path'
      + expected - actual
      -Invalid path
      +No maching pid found

Keep functions strictly synchronous or asynchronous

We have functions throughout the codebase that may complete (or fail) both synchronously and asynchronously, such as requestList.initialize():

initialize() {
    if (this.isLoading) {
        throw new Error('RequestList sources are already loading or were loaded.');
    }
    this.isLoading = true;

    return Promise
        .mapSeries(this.sources, (source) => { /* ... */ });
}

Best practices suggest that one should avoid such contracts, because error handling and API consumption become unpredictable.

I suggest enforcing strict sync/async contracts. A good time to fix this would be when migrating the codebase to async/await syntax, since that removes some of the pain by itself; see the sketch below.
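
For illustration, a sketch of an async-only initialize(), so callers always get a rejected promise rather than a synchronous throw (the _addSource() helper is hypothetical):

async initialize() {
    // Throwing inside an async function always surfaces as a rejected promise,
    // so the error-handling contract is the same for every failure mode.
    if (this.isLoading) {
        throw new Error('RequestList sources are already loading or were loaded.');
    }
    this.isLoading = true;

    for (const source of this.sources) {
        await this._addSource(source); // hypothetical per-source helper
    }
}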

Feature idea: Add exponential back-off AutoscaledPool

It would be great to have an optional flag for AutoscaledPool that reduces the pool's concurrency when errors occur. If the pool performs an operation that might overload a target website or another resource (e.g. an API returning rate-limiting errors), we should reduce the load to ensure that at least some work gets done.
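
A rough sketch of what the back-off could look like, assuming the pool exposes a mutable desiredConcurrency and a count of consecutive errors (both are assumptions, not the current API):

// Divide the desired concurrency by an exponentially growing factor for each
// consecutive error, never dropping below the configured minimum.
function applyBackOff(pool, consecutiveErrors, { factor = 2, minConcurrency = 1 } = {}) {
    const reduced = Math.floor(pool.desiredConcurrency / factor ** consecutiveErrors);
    pool.desiredConcurrency = Math.max(minConcurrency, reduced);
}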

new Apify.PuppeteerCrawler did not work with useApifyProxy and apifyProxyGroups

I can't launch Puppeteer when I use:

const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    launchPuppeteerOptions: {
        useApifyProxy: true,
        apifyProxyGroups: ['myGroup'],
    },
    // ...

It ended with the following error:

Error: Cannot combine "opts.useApifyProxy" with "opts.proxyUrl"!
    at Object.exports.launchPuppeteer (/Users/Drobnik/WebstormProjects/actor-acts/acts/crawler-barnebys-co-uk-cars/node_modules/apify/build/puppeteer.js:152:52)
    at refreshCookies (/Users/Drobnik/WebstormProjects/actor-acts/acts/crawler-barnebys-co-uk-cars/main.js:40:33)
    at Apify.main (/Users/Drobnik/WebstormProjects/actor-acts/acts/crawler-barnebys-co-uk-cars/main.js:73:15)
    at <anonymous>

AutoscaledPool cannot grow anymore

Hi,

I made some experiments with the AutoscaledPool and found that it has issues when the worker function depends on an external source.

There are two main issues:

  • First issue: when the number of workers in the pool decreases because a worker function returned null, the pool is unable to grow again.

The following example illustrates it: you initially have 2 workers running; while they are running, more words are added for processing, but one of the workers terminates before new data is available, so all the remaining words end up being processed by only 1 worker, when we would expect the pool to grow again.

// words to be processed by the worker
let words = ['foo', 'bar'];
// words to be added later for processing
let moreWords = ['baz', 'qux', 'quux', 'corge', 'grault'];

// add a new word at a given interval
let addWords;
addWords = setInterval(() => {
    words.push(moreWords.shift());
    if (moreWords.length === 0) {
        clearInterval(addWords);
    }
}, 900);

// print words
const pool = new Apify.AutoscaledPool({
    maxConcurrency: 6,
    minConcurrency: 4,
    workerFunction: () => {
        let word = words.shift();

        if (word) {
            console.log('init ' + word);
            return new Promise((resolve) => {
                console.log('start ' + word);
                setTimeout(() => {
                    console.log(word);
                    resolve();
                }, 1000);
            });
        } else {
            return null;
        }
    },
});

await pool.run();

To address this issue, the simplest solution I can imagine would be the ability to notify the pool that new data is available for the worker function, so that the pool tries to grow again. That could be achieved by a simple method call, e.g. pool.tryAddWorker(), or by a callback function that is polled while workers are waiting; when it returns true, a new worker is added (unless maxConcurrency is reached), like this:

new Apify.AutoscaledPool({
    maxConcurrency: 6,
    minConcurrency: 4,
    workerFunction: () => { /* ... */ },
    hasData: () => { return words.length > 0; },
});

  • Second issue: if the pool completes but the external data source still has data to process, the pool will stop and the remaining data will never be processed.

To reproduce this issue, use the same code as above, but make the time in setInterval greater than the timeout in the promise.

To address this issue, I could imagine having the ability to flag the pool as "not ready to stop", in which case it would stay alive and wait for an opportunity to grow until the flag is cleared.

Make ProxyChain usable externally

If we want to use ProxyChain e.g. with Chrome CDP, it's too painful now. Make ProxyChain more generic and document it so that it can be generally used.

Add process monitoring to Live View

For many crawling use cases it would be great to be able to see how many resources each process running in the actor container consumes, something like what the Linux command top does. In the first version, it could be just a list of processes refreshed e.g. every second. In a later version, we can turn it into a graph.

BTW, the SDK already uses some process monitoring for AutoscaledPool; we might reuse it.

Fix AutoscaledPool implementation

The current implementation of AutoscaledPool has several important problems that need to be fixed.

  • The _autoscale() function is called using setInterval, so if the system is under load, many interval events accumulate, _autoscale() runs perpetually, and everything chokes. In particular, since getMemoryInfo spawns another process, it might take a lot of time. We should use betterSetInterval from apify-shared-js for this (please write the missing unit test for it!)
  • Since the _autoscale() function might not be called at regular intervals, its functionality cannot depend on the interval constants (SCALE_UP_INTERVAL/SCALE_DOWN_INTERVAL); it needs to work purely with time-based constants. BTW, changing the constants now causes the unit tests to fail...
  • The interval for _autoscale() should be more like one second rather than 200 milliseconds, to reduce overhead
  • The tracking of an overloaded CPU should use the blocked NPM package in addition to the cpuInfo event on the platform. There should be no assumption about the periodicity of these events, so we should only store their last value and the time when they were received, and use our own sampling at a fixed interval (e.g. once per second) to create isCpuOverloadedSnapshots (see the sketch after this list). This way we know what the data in isCpuOverloadedSnapshots mean.
  • Use async/await instead of promise chains
  • Make auto-scaling work even on a local machine, not only on the Apify platform!!!
  • The scaling logic should be simplified, so that it can be easily described to the users. We need to discuss how exactly, but we can take inspiration e.g. from AWS autoscaling
  • Maybe replace ignoreMainProcess with ignoreChildProcesses - I think the main process should always be considered, since adding new tasks (e.g. Puppeteer instances) will always add some resource usage to it. It would make more sense to add an ignoreChildProcesses flag, e.g. if the caller knows that their pool tasks do not spawn any processes.
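
A rough sketch of the fixed-interval CPU sampling mentioned in the list above (the event source and the snapshot shape are assumptions, not the actual implementation):

const { EventEmitter } = require('events');

// In the SDK this would be the platform events emitter; here it is a stand-in.
const events = new EventEmitter();

// Remember only the last received value and the time it arrived.
let lastCpuInfo = { isCpuOverloaded: false, receivedAt: Date.now() };
events.on('cpuInfo', (data) => {
    lastCpuInfo = { isCpuOverloaded: data.isCpuOverloaded, receivedAt: Date.now() };
});

// Sample at our own fixed interval, independently of how often events arrive.
const isCpuOverloadedSnapshots = [];
setInterval(() => {
    isCpuOverloadedSnapshots.push({ ...lastCpuInfo, sampledAt: Date.now() });
}, 1000);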

Enable getValue/setValue primitives for named key-value stores

Maybe we should have a function like Apify.openKeyValueStore(nameOrId), which would open or create a key-value store and return an object with getValue and setValue functions that follow the same logic as the Apify object. This would make it much easier to use named key-value stores from acts.
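
A sketch of how the proposed API might be used (the function does not exist yet, so the names below are illustrative):

// Open (or create) a named key-value store and use it like the Apify object.
const store = await Apify.openKeyValueStore('my-named-store');
await store.setValue('LAST-RESULT', { foo: 'bar' });
const lastResult = await store.getValue('LAST-RESULT');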

AutoscaledPool doesn't work locally

Issues:

  • macOS usually reports that there is not enough memory, even though it can be freed
  • There is enough memory on Windows, but we are not scaling based on CPU, so the CPU gets overloaded

Solution:

  • Investigate Mac once again
  • Compute used memory based on the Node.js process and its child processes only
  • Create a CPU-overloaded event emitter, as we have on the Apify platform
