apify / apify-sdk-js

Apify SDK monorepo

Home Page: https://docs.apify.com/sdk/js

License: Apache License 2.0

JavaScript 18.71% Dockerfile 0.51% TypeScript 56.42% CSS 0.70% MDX 23.65%
actor apify javascript nodejs sdk typescript

apify-sdk-js's Introduction

Apify SDK monorepo

Apify SDK is the core set of tools and utilities that we've built to help make your interaction with the Apify Platform easier. This monorepo holds all the components and tools that we've created for it!

Would you like to work with us on Crawlee, Apify SDK or similar projects? We are hiring!

package version
apify NPM version

Apify SDK

Apify SDK provides the tools required to run your own Apify Actors! The crawlers and scraping related tools, previously included in Apify SDK (v2), have been split into a brand-new module - crawlee (which you can use outside Apify too!), while keeping the Apify specific parts in this module!

Support

If you find any bug or issue with the Apify SDK, please submit an issue on GitHub. For questions, you can ask on Stack Overflow or contact [email protected]

Contributing

Your code contributions are welcome, and you'll be praised to eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see CONTRIBUTING.md.

License

This project is licensed under the Apache License 2.0 - see the LICENSE.md file for details.

apify-sdk-js's People

Contributors

andreybykov, b4nan, barjin, drobnikj, fnesveda, foxt451, jancurn, jirimoravcik, metalwarrior665, mnmkng, natashalekh, renovate[bot], szmarczak, tc-mo, vladfrangu, zpelechova

apify-sdk-js's Issues

Warn about SDK vs crawlee version mismatch

Since the SDK depends on crawlee, users might configure their dependencies so those two don't fit each other, which ends up with multiple @crawlee/core installs, which in turn might cause weird issues, since there will be two global configs. This could potentially result in the crawlee storage methods (like Dataset.pushData) working with the memory storage even when on platform.

We should validate this in the Actor.init call: there must be only a single crawlee installation.
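One way such a check could look is sketched below. It is only an illustration, not the SDK's implementation: the helper names `findCrawleeCoreInstalls` and `checkSingleCrawleeInstall` are hypothetical, and the sketch assumes the caller can supply a list of loaded module file paths.

```javascript
// Hypothetical helper (not part of the SDK): collects the distinct install
// directories of `@crawlee/core` found in a list of module file paths.
function findCrawleeCoreInstalls(modulePaths) {
    const marker = 'node_modules/@crawlee/core/';
    const roots = new Set();
    for (const rawPath of modulePaths) {
        const path = rawPath.replace(/\\/g, '/'); // normalize Windows separators
        const index = path.lastIndexOf(marker);
        if (index !== -1) roots.add(path.slice(0, index + marker.length));
    }
    return [...roots].sort();
}

// Hypothetical check: warns when more than one installation is detected and
// returns true only if at most one copy was found.
function checkSingleCrawleeInstall(modulePaths, warn = console.warn) {
    const installs = findCrawleeCoreInstalls(modulePaths);
    if (installs.length > 1) {
        warn(`Found ${installs.length} installations of @crawlee/core:\n${installs.join('\n')}`);
    }
    return installs.length <= 1;
}
```

In a CommonJS process, `Object.keys(require.cache)` would be one possible source for the path list. Whether the real check should warn or throw is left open here.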

Better CLI integration

Feature

Add the CLI as apify dependency and define the NPM CLI script there too. This way it will be possible to do npx apify create.

Motivation

Right now we either have to install the CLI globally or use npx apify-cli ....

Ideal solution or implementation, and any additional constraints

...

Alternative solutions or implementations

No response

Other context

Crawlee does this already: https://github.com/apify/crawlee/blob/master/packages/crawlee/src/cli.ts

Better error handling and logging for Apify proxy configuration

Feature

We want to change the behavior of Actor.createProxyConfiguration.

Current behavior:

  • throws an error if a user is not logged in (doesn't provide the proxy password)
  • throws an error if the user doesn't have access to the Apify proxy

Suggested change:
We would like to just print a warning to the console instead (or create some safer createProxyConfiguration) on the local environment. The Apify proxy just wouldn't be used at all, the configuration would do nothing.

Also, we proposed adding an additional warning - if the user has access to the Apify proxy but can't use it locally (this must be explained properly - free plan users).

We also propose an improved copy for the original error.

  • It must be clear from the first error that the user can simply run the apify login CLI command to get the proxy password.

Motivation

Make Apify templates as seamless as possible. Now, a developer has to think about where it can be run and catch different errors. It is not clear from the error what to do and we have feedback from real users that struggled with this for quite some time.

Ideal solution or implementation, and any additional constraints

Make Actor.createProxyConfiguration print warnings instead of throwing errors in the local environment.
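A minimal sketch of the "safer" variant, under stated assumptions: `createProxyConfigurationSafely` is a hypothetical wrapper, `createConfiguration` stands in for whatever call creates the real proxy configuration, and `isLocal` is assumed to be determined by the caller.

```javascript
// Hypothetical wrapper (not an SDK API): locally, a failure to create the
// proxy configuration is downgraded to a warning and no proxy is used.
async function createProxyConfigurationSafely(createConfiguration, { isLocal, warn = console.warn } = {}) {
    try {
        return await createConfiguration();
    } catch (error) {
        // On the platform, keep failing loudly.
        if (!isLocal) throw error;
        warn(
            `Apify Proxy will not be used: ${error.message} `
            + 'If you are not logged in, run "apify login" to get the proxy password.',
        );
        return undefined;
    }
}
```

Callers would then have to treat an `undefined` result as "run without the proxy".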

Alternative solutions or implementations

No response

Other context

No response

Add new options to generic scrapers

Which scraper is the feature request for?

cheerio-scraper

Feature

We have new features in website content crawler that are worth porting to the generic scrapers, namely:

  • exclude globs
  • cookie modal closing
  • infinite scroll

Motivation

.

Ideal solution or implementation, and any additional constraints

.

Alternative solutions or implementations

No response

Other context

No response

Better detection of missed `Actor.init` calls

  • if we see use of storages or other APIs from the Apify SDK that normally require Actor.init to be called, we should print a warning (or info log) about this (“maybe you forgot to run Actor.init?”)
  • if we see an explicit Actor.init call after that happens, we know this is wrong, and we can throw an error about it

https://apifier.slack.com/archives/CA6EBU3CM/p1667554982496199?thread_ts=1667553774.192519&cid=CA6EBU3CM

We already have one PR (#35) that does the automatic init calls, so we could use it as inspiration: the same places should now print a warning (first point), and we will need a check in the init method (second point).
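The two points could be sketched roughly as follows. `InitGuard` and its method names are hypothetical and only illustrate the bookkeeping; where exactly the SDK would call them is an assumption.

```javascript
// Hypothetical bookkeeping class (not part of the SDK).
class InitGuard {
    constructor(warn = console.warn) {
        this.initialized = false;
        this.usedBeforeInit = [];
        this.warn = warn;
    }

    // First point: call this from every API that normally requires Actor.init.
    noteUsage(apiName) {
        if (this.initialized) return;
        this.usedBeforeInit.push(apiName);
        this.warn(`${apiName} was used before Actor.init(). Maybe you forgot to run Actor.init?`);
    }

    // Second point: call this at the start of the init method.
    markInitialized() {
        if (this.usedBeforeInit.length > 0) {
            throw new Error(`Actor.init() was called after using: ${this.usedBeforeInit.join(', ')}`);
        }
        this.initialized = true;
    }
}
```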

Docs: ApifyEnv type missing most properties

Issue description

Docs page:
https://sdk.apify.com/api/apify/class/Actor#getEnv

Return value
https://api.apify.com/v2/key-value-stores/emJruom398sXAWxCP/records/ENV

We don't have to type everything but definitely the core things with IDs, anything that is useful for devs

Code sample

No response

Package version

latest

Node.js version

node 16

Operating system

No response

Actor or run link

No response

I have tested this on the next release

No response

Other context

No response

Apify.call should "survive" migration

Right now, Apify.call doesn't save any state upon migration, so if the user is not careful, they will lose the reference to the called runs and the runs will be called again.

The current solution is to save the run ID into the KV store and run a waiting loop manually.

Ideally, Apify.call would do this automatically. It is not clear whether this can be done without any extra parameters, or whether the user would have to specify it with a string param.
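The manual workaround described above might look something like this sketch. All names are hypothetical: `store` is any object with async `getValue` and `setValue` methods, `stateKey` is the string param mentioned above, and `startRun` and `waitForRun` stand in for the calls that start a run and wait for it to finish.

```javascript
// Hypothetical helper: starts a run only if no run ID was persisted yet,
// then waits for that run. After a migration, the persisted ID is reused.
async function callWithPersistedRun(store, stateKey, startRun, waitForRun) {
    let runId = await store.getValue(stateKey);
    if (!runId) {
        const run = await startRun();
        runId = run.id;
        // Persist immediately so a restarted process does not start a second run.
        await store.setValue(stateKey, runId);
    }
    return waitForRun(runId);
}
```

A short window remains between starting the run and persisting its ID, so this reduces duplicate calls rather than ruling them out.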

Provide better guidance about Apify.main()

Seems quite a lot of people end up copy-pasting code and using Apify.main() in situations where it shouldn't be used. We should probably improve the docs somehow or perhaps even get rid of the function altogether, and replace it with e.g. Apify.setup(). Dunno...

https://stackoverflow.com/questions/62357401/express-stops-listening-when-using-puppeteer
https://stackoverflow.com/questions/58543882/is-there-a-way-to-use-apify-main-without-it-exiting-the-node-js-process-on-com
https://stackoverflow.com/questions/56977763/how-to-use-apify-on-google-cloud-functions

Where is the Google Maps Web scraper?

Which scraper is the feature request for?

puppeteer-scraper

Feature

The code used to be hosted here on GitHub, but now it has disappeared.

The current link on the Apify page leads to a 404 error.

Apify page: https://apify.com/compass/crawler-google-places#resources-on-how-to-scrape-google-maps

404: https://github.com/drobnikj/crawler-google-places/blob/master/CHANGELOG.md

Motivation

I want to be able to use and contribute to the scraper.

Ideal solution or implementation, and any additional constraints

Make the code public again

Alternative solutions or implementations

No response

Other context

No response

Flaky implementation of some Actor tests

Issue description

Some of the tests for Actor don't work correctly. The problem is in how the getValue method is mocked on top of defaultStore: the expect calls are inside the mock function, but the mock function is never called, so the test passes even if there is a bug. See https://github.com/apify/apify-sdk-js/blob/master/test/apify/actor.test.ts#L983
In addition, the test for Actor.pushData is testing the setValue method.

I just discovered this during #83, but don't have time to fix it.

Code sample

No response

Package version

3.0.2

Node.js version

all

Operating system

No response

Actor or run link

No response

I have tested this on the next release

No response

Other context

No response

Add possibility to configure Headers for Webhooks

Feature

Some time ago, a feature was added to the Apify Platform that allows setting request headers when adding a Webhook integration to an Actor.

This has not been reflected in the SDK yet. People are asking for it on Discord.

Motivation

Keep the capabilities of the Apify SDK in sync with the Platform.

Ideal solution or implementation, and any additional constraints

Possibly just add headers parameter - something like:

await Actor.addWebhook({
    eventTypes: ['ACTOR.RUN.SUCCEEDED'],
    requestUrl: process.env.RUN_SUCCEEDED_WEBHOOK_URL,
    idempotencyKey: process.env.APIFY_ACTOR_RUN_ID,
    // Proposed new parameter
    headers: {
        Authorization: `Bearer ${process.env.APIFY_API_TOKEN}`,
    },
});

Alternative solutions or implementations

No response

Other context

image

Cleanup and finalize the fail/exit/abort methods

Feature

  • actor.fail() is just a wrapper around actor.exit() providing a default exit code 1, but this behavior is not documented
  • actor.exit(statusMessage) should save the status message to the cloud prior to the process exit, using Actor.setStatusMessage with isStatusMessageTerminal set to true
  • actor.abort() saves the status message, but without isStatusMessageTerminal set to true
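The intended relationship between exit and fail could be sketched as below. This is an illustration only: `createLifecycle` is a hypothetical factory, and `setStatusMessage` and `exitProcess` are injected stand-ins so the sketch does not depend on the real SDK or terminate the process.

```javascript
// Hypothetical factory; the injected functions stand in for the real
// status-message call and for process.exit.
function createLifecycle({ setStatusMessage, exitProcess }) {
    // Saves the status message as terminal before the process exits.
    async function exit(statusMessage, { exitCode = 0 } = {}) {
        if (statusMessage) {
            await setStatusMessage(statusMessage, { isStatusMessageTerminal: true });
        }
        exitProcess(exitCode);
    }

    // Thin wrapper around exit() with a default exit code of 1.
    async function fail(statusMessage, options = {}) {
        return exit(statusMessage, { ...options, exitCode: options.exitCode ?? 1 });
    }

    return { exit, fail };
}
```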

Motivation

To comply with the specification.

Ideal solution or implementation, and any additional constraints

No response

Alternative solutions or implementations

No response

Other context

No response

Support input schema defaults in `Actor.getInput()`

Currently, the input schema is ignored locally, and to respect the defaults inside it, you need to use the Apify CLI to run the actor, which creates the INPUT.json file based on those defaults when missing.

Implement input schema support on the SDK level, so this works on the fly if the INPUT.json file is missing. If it's there, we should still fill in the missing fields that have defaults in the input schema.

The goal of this PR is to unify the local vs platform behavior. As a nice side effect, there shouldn't be any need to use the Apify CLI to run the actor, as npm start should behave the same with this implemented.
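A minimal sketch of filling in the defaults, assuming the input schema lists its fields under `properties` and each field may carry a `default` value. `applyInputSchemaDefaults` is a hypothetical helper name, and reading the schema and INPUT.json files is left out.

```javascript
// Hypothetical helper: returns a copy of `input` where every field missing
// from the input is filled with the default from the input schema.
function applyInputSchemaDefaults(inputSchema, input) {
    const result = { ...(input ?? {}) };
    for (const [name, field] of Object.entries(inputSchema?.properties ?? {})) {
        if (result[name] === undefined && field.default !== undefined) {
            // Copy the default so callers cannot mutate the schema object.
            result[name] = JSON.parse(JSON.stringify(field.default));
        }
    }
    return result;
}
```

Passing `null` as the input covers the case where the INPUT.json file is missing; values the user provided are never overwritten.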

Docs: Some ApiLinks should point to Crawlee

Issue description

CleanShot 2022-09-29 at 09 17 50@2x

e.g. https://github.com/apify/apify-sdk-js/blob/master/docs/examples/cheerio_crawler.mdx?plain=1#L10

Correct destination is
https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler
not
https://sdk.apify.com/api/3.0/cheerio-crawler/class/CheerioCrawler

Code sample

No response

Package version

latest

Node.js version

not relevant

Operating system

No response

Actor or run link

No response

I have tested this on the next release

No response

Other context

otherwise Crawlee is awesome! :yay:

Flaky tests for websockets

After the vitest refactor, sometimes the CI fails with the following error:

⎯⎯⎯⎯⎯⎯ Unhandled Errors ⎯⎯⎯⎯⎯⎯

Vitest caught 1 unhandled error during the test run.
This might cause false positive tests. Resolve unhandled errors to make sure your tests are not affected.

⎯⎯⎯⎯⎯ Uncaught Exception ⎯⎯⎯⎯⎯
Error: listen EADDRINUSE: address already in use :::9099
 ❯ Server.setupListenHandle [as _listen2] node:net:1463:16
 ❯ listenInCluster node:net:1511:12
 ❯ Server.listen node:net:1599:7
 ❯ new WebSocketServer node_modules/ws/lib/websocket-server.js:90:20
 ❯ test/apify/events.test.ts:15:15
     13| 
     14|     beforeEach(() => {
     15|         wss = new WebSocket.Server({ port: 9099 });
       |               ^
     16|         vitest.useFakeTimers();
     17|         process.env[ACTOR_ENV_VARS.EVENTS_WEBSOCKET_URL] = 'ws://local…
 ❯ node_modules/@vitest/runner/dist/index.js:135:14
 ❯ node_modules/@vitest/runner/dist/index.js:58:26
 ❯ node_modules/@vitest/runner/dist/index.js:582:59
 ❯ callSuiteHook node_modules/@vitest/runner/dist/index.js:582:47

⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯
Serialized Error: { code: 'EADDRINUSE', errno: -98, syscall: 'listen', address: '::', port: 9099 }
This error originated in "test/apify/events.test.ts" test file. It doesn't mean the error was thrown inside the file itself, but while it was running.
The latest test that might've caused the error is "should send persist state events in regular interval". It might mean one of the following:
- The error was thrown, while Vitest was running this test.
- This was the last recorded test before the error was thrown, if error originated after test finished its execution.

https://github.com/apify/apify-sdk-js/actions/runs/7004682693/job/19052930717?pr=257
