disane87 / docudigger

Website scraper for getting invoices automagically as pdf (useful for taxes or DMS)

Home Page: https://blog.disane.dev

License: MIT License

JavaScript 9.58% Shell 1.50% TypeScript 88.79% Batchfile 0.13%
dms invoices nodejs scraping

docudigger's Introduction

Welcome to docudigger πŸ‘‹

Document scraper for getting invoices automagically as pdf (useful for taxes or DMS)

🏠 Homepage

Prerequisites

  • npm >=9.1.2
  • node >=18.12.1

Configuration

All settings can be changed via CLI arguments or environment variables (even when using Docker).

| Setting | Description | Default value |
| --- | --- | --- |
| AMAZON_USERNAME | Your Amazon username | null |
| AMAZON_PASSWORD | Your Amazon password | null |
| AMAZON_TLD | Amazon top-level domain | de |
| AMAZON_YEAR_FILTER | Only extracts invoices from this year (e.g. 2023) | 2023 |
| AMAZON_PAGE_FILTER | Only extracts invoices from this page (e.g. 2) | null |
| ONLY_NEW | Tracks already scraped documents and starts a new run at the last scraped one | true |
| FILE_DESTINATION_FOLDER | Destination path for all scraped documents | ./documents/ |
| FILE_FALLBACK_EXTENSION | Fallback extension when no extension can be determined | .pdf |
| DEBUG | Debug flag (sets the log level to DEBUG) | false |
| SUBFOLDER_FOR_PAGES | Creates subfolders for every scraped page/plugin | false |
| LOG_PATH | Sets the log path | ./logs/ |
| LOG_LEVEL | Log level (see https://github.com/winstonjs/winston#logging-levels) | info |
| RECURRING | Flag for executing the script periodically. Needs RECURRING_PATTERN to be set. Defaults to true when using the Docker container | false |
| RECURRING_PATTERN | Cron pattern for periodic execution. Needs RECURRING set to true | */30 * * * * |
| TZ | Timezone used for Docker environments | Europe/Berlin |
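As an illustration only (this is a sketch, not docudigger's actual code), the settings above could be read from the environment with their documented defaults like this:

```javascript
// Hypothetical sketch: reading docudigger-style settings from the
// environment, falling back to the defaults listed in the table above.
function loadConfig(env = process.env) {
    return {
        amazonTld: env.AMAZON_TLD ?? 'de',
        amazonYearFilter: env.AMAZON_YEAR_FILTER ? Number(env.AMAZON_YEAR_FILTER) : 2023,
        onlyNew: (env.ONLY_NEW ?? 'true') === 'true',
        fileDestinationFolder: env.FILE_DESTINATION_FOLDER ?? './documents/',
        fileFallbackExtension: env.FILE_FALLBACK_EXTENSION ?? '.pdf',
        recurring: (env.RECURRING ?? 'false') === 'true',
        recurringPattern: env.RECURRING_PATTERN ?? '*/30 * * * *',
        logLevel: env.LOG_LEVEL ?? 'info',
    };
}

const config = loadConfig({ AMAZON_TLD: 'co.uk', RECURRING: 'true' });
console.log(config.amazonTld, config.recurring, config.recurringPattern);
// → co.uk true */30 * * * *
```

Function and property names here are assumptions for illustration; only the variable names and defaults come from the table.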

Install

⚠️ Attention: There is no need to install this locally. Just use npx

Usage

πŸ”¨ Make sure you have an .env file present (with the variables from above) in the work directory or use the appropriate cli arguments.

πŸš‘ If you want to use an .env file, make sure you use env-cmd (https://www.npmjs.com/package/env-cmd)

$ npx docudigger COMMAND
running command...

$ npx docudigger (--version)
@disane-dev/docudigger/2.0.2 linux-x64 node-v18.16.1

$ npx docudigger --help [COMMAND]
USAGE
  $ docudigger COMMAND

docudigger scrape all

Scrapes all websites periodically (default for docker environment)

USAGE
  $ npx docudigger scrape all [--json] [--logLevel trace|debug|info|warn|error] [-d] [-l <value>] [-c <value> -r]

FLAGS
  -c, --recurringCron=<value>  [default: * * * * *] Cron pattern to execute periodically
  -d, --debug
  -l, --logPath=<value>        [default: ./logs/] Log path
  -r, --recurring
  --logLevel=<option>          [default: info] Specify level for logging.
                               <options: trace|debug|info|warn|error>

GLOBAL FLAGS
  --json  Format output as json.

DESCRIPTION
  Scrapes all websites periodically

EXAMPLES
  $ docudigger scrape all

docudigger scrape amazon

Used to get invoices from amazon

USAGE
  $ npx docudigger scrape amazon -u <value> -p <value> [--json] [--logLevel trace|debug|info|warn|error] [-d] [-l
    <value>] [-c <value> -r] [--fileDestinationFolder <value>] [--fileFallbackExentension <value>] [-t <value>]
    [--yearFilter <value>] [--pageFilter <value>] [--onlyNew]

FLAGS
  -c, --recurringCron=<value>        [default: * * * * *] Cron pattern to execute periodically
  -d, --debug
  -l, --logPath=<value>              [default: ./logs/] Log path
  -p, --password=<value>             (required) Password
  -r, --recurring
  -t, --tld=<value>                  [default: de] Amazon top level domain
  -u, --username=<value>             (required) Username
  --fileDestinationFolder=<value>    [default: ./data/] Destination path for all scraped documents
  --fileFallbackExentension=<value>  [default: .pdf] Fallback extension when no extension can be determined
  --logLevel=<option>                [default: info] Specify level for logging.
                                     <options: trace|debug|info|warn|error>
  --onlyNew                          Gets only new invoices
  --pageFilter=<value>               Only extracts invoices from this results page
  --yearFilter=<value>               Only extracts invoices from this year

GLOBAL FLAGS
  --json  Format output as json.

DESCRIPTION
  Used to get invoices from amazon

  Scrapes amazon invoices

EXAMPLES
  $ docudigger scrape amazon

Docker

docker run \
  -e AMAZON_USERNAME='[YOUR MAIL]' \
  -e AMAZON_PASSWORD='[YOUR PW]' \
  -e AMAZON_TLD='de' \
  -e AMAZON_YEAR_FILTER='2020' \
  -e AMAZON_PAGE_FILTER='1' \
  -e LOG_LEVEL='info' \
  -v "C:/temp/docudigger/:/home/node/docudigger" \
  ghcr.io/disane87/docudigger

Dev-Time πŸͺ²

NPM

npm install
# Adjust the created .env file to your needs
npm run start

Author

πŸ‘€ Marco Franke

🀝 Contributing

Contributions, issues and feature requests are welcome!
Feel free to check the issues page. You can also take a look at the contributing guide.

Show your support

Give a ⭐️ if this project helped you!


This README was generated with ❀️ by readme-md-generator

docudigger's People

Contributors

dependabot[bot], disane87, fwartner, renovate[bot], semantic-release-bot, wehrmannit


docudigger's Issues

[WR] vodafone.de

Which website do you want to crawl?

http://vodafone.de

What services do they provide?

Contracts (with recurring calculation in a defined interval)

In which industry does the company operate?

Telecommunication (like Vodafone)

If you choose other...

No response

Does the webpage have/needs an authentication?

  • Yes, the website needs an authentication

Do they provide a two factor auth?

  • Yes, the website needs a second factor

Would you provide us the credentials (privately)?

  • Yes, I would share my credentials

Are you willing to collaborate to get this scraper up and running?

  • Yes, I would collaborate on this actively.

What color (hex) represents the company?

#E60000

Your recorded code

import url from 'url';
import { createRunner } from '@puppeteer/replay';

export async function run(extension) {
    const runner = await createRunner(extension);

    await runner.runBeforeAllSteps();

    await runner.runStep({
        type: 'setViewport',
        width: 1134,
        height: 1284,
        deviceScaleFactor: 1,
        isMobile: false,
        hasTouch: false,
        isLandscape: false
    });

    await runner.runStep({
        type: 'navigate',
        url: 'https://www.vodafone.de/',
        assertedEvents: [
            {
                type: 'navigation',
                url: 'https://www.vodafone.de/',
                title: ''
            }
        ]
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                'aria/MeinVodafone',
                'aria/[role="generic"]'
            ],
            [
                'li.item-myvf span.icon'
            ],
            [
                'xpath///*[@id="top"]/div/header/nav/div/div[2]/div/div/ul[2]/li[2]/a/span[1]'
            ],
            [
                'pierce/li.item-myvf span.icon'
            ]
        ],
        offsetY: 16,
        offsetX: 6.5,
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                'aria/Login[role="button"]'
            ],
            [
                '#meinVodafoneOverlay button'
            ],
            [
                'xpath///*[@id="mdd-login-form"]/fieldset/button'
            ],
            [
                'pierce/#meinVodafoneOverlay button'
            ]
        ],
        offsetY: 10,
        offsetX: 27.90625,
        assertedEvents: [
            {
                type: 'navigation',
                url: 'https://www.vodafone.de/meinvodafone/services/',
                title: ''
            }
        ]
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                'li:nth-of-type(1) svg.icon-arrow-down-i-xsml'
            ],
            [
                'xpath///*[@id="dashboard:mobile"]/svg[1]'
            ],
            [
                'pierce/li:nth-of-type(1) svg.icon-arrow-down-i-xsml'
            ]
        ],
        offsetY: 7.015625,
        offsetX: 9.5,
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                'li:nth-of-type(1) div.tiles > a:nth-of-type(1) svg'
            ],
            [
                'xpath///*[@id="content"]/div[2]/div/div/section/div/div/div/div[3]/div[2]/ul/li[1]/div/div/div[1]/a[1]/div/div[1]/svg'
            ],
            [
                'pierce/li:nth-of-type(1) div.tiles > a:nth-of-type(1) svg'
            ]
        ],
        offsetY: 63.609375,
        offsetX: 22.484375,
        assertedEvents: [
            {
                type: 'navigation',
                url: 'https://www.vodafone.de/meinvodafone/services/ihre-rechnungen/rechnungen',
                title: ''
            }
        ]
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                'aria/Mehr anzeigen[role="button"]'
            ],
            [
                '#content button'
            ],
            [
                'xpath///*[@id="billoverviewWrapperId"]/bill-overview-history/bill-history/div/div[2]/div/div/div/div[2]/vf-table-brix/div[2]/div/button'
            ],
            [
                'pierce/#content button'
            ],
            [
                'text/Mehr anzeigen'
            ]
        ],
        offsetY: 10,
        offsetX: 44.375,
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                'tr:nth-of-type(1) > td:nth-of-type(4) span:nth-of-type(2) > svg'
            ],
            [
                'xpath///*[@id="billoverviewWrapperId"]/bill-overview-history/bill-history/div/div[2]/div/div/div/div[2]/vf-table-brix/div[2]/table/tbody/tr[1]/td[4]/div/span[2]/svg'
            ],
            [
                'pierce/tr:nth-of-type(1) > td:nth-of-type(4) span:nth-of-type(2) > svg'
            ]
        ],
        offsetY: 13.5,
        offsetX: 22.34375,
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                'tr:nth-of-type(1) > td:nth-of-type(5) span:nth-of-type(2) use'
            ],
            [
                'xpath///*[@id="billoverviewWrapperId"]/bill-overview-history/bill-history/div/div[2]/div/div/div/div[2]/vf-table-brix/div[2]/table/tbody/tr[1]/td[5]/div/span[2]/svg/use'
            ],
            [
                'pierce/tr:nth-of-type(1) > td:nth-of-type(5) span:nth-of-type(2) use'
            ]
        ],
        offsetY: 10.5,
        offsetX: 13.45843505859375,
    });

    await runner.runAfterAllSteps();
}

if (process && import.meta.url === url.pathToFileURL(process.argv[1]).href) {
    run()
}

The automated release is failing 🚨

🚨 The automated release from the main branch failed. 🚨

I recommend you give this issue a high priority, so other packages depending on you can benefit from your bug fixes and new features again.

You can find below the list of errors reported by semantic-release. Each one of them has to be resolved in order to automatically publish your package. I’m sure you can fix this πŸ’ͺ.

Errors are usually caused by a misconfiguration or an authentication problem. With each error reported below you will find explanation and guidance to help you to resolve it.

Once all the errors are resolved, semantic-release will release your package the next time you push a commit to the main branch. You can also manually restart the failed CI job that runs semantic-release.

If you are not sure how to resolve this, here are some links that can help you:

If those don’t help, or if this issue is reporting something you think isn’t right, you can always ask the humans behind semantic-release.


DOCKER_USERNAME and DOCKER_PASSWORD environment variables must be set

Unfortunately this error doesn't have any additional information. Feel free to kindly ask the author of the @overtheairbrew/semantic-release-dockerbuildx plugin to add more helpful information.


Good luck with your project ✨

Your semantic-release bot πŸ“¦πŸš€

Amazon rolls out frontend changes in waves

Versions of the Amazon plugin can therefore be too new for some users and too old for others. It's a bit like Schrödinger's cat, unfortunately.

I can also confirm this error with a DE address from Amazon.

But I must admit that I have docudigger running in Portainer (converted via https://www.composerize.com/ ).
I have made NO adjustments, except for the e-mail, password and log level (set to debug).
Is it possible that you still get an "old" view of Amazon? They like to roll out changes step by step.

[0] [info] [2024-02-03 23-02:30] [scrape:all]: runAll
[0] [warn] [2024-02-03 23-02:30] [scrape:amazon]: process.json not found. Full run needed. OnlyNew deactivated.
[0] [info] [2024-02-03 23-02:34] [scrape:amazon]: Logged in
[0] [info] [2024-02-03 23-02:36] [scrape:amazon]: First possible year: 2020
[0] [info] [2024-02-03 23-02:36] [scrape:amazon]: Last possible year: 2020
[0] [info] [2024-02-03 23-02:36] [scrape:amazon]: Selecting start year 2020
[0] Error: No element found for selector: select[name="orderFilter"]
[0] docudigger scrape all exited with code 1

UrsprΓΌnglich gepostet von @Sky-Dragon in #446 (comment)

Clicking popup for invoices fails several times

[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Order date: 25. Juni 2023
[0] [error] [2023-07-05 10-07:11] [scrape:amazon]:      Couldn't get popover #a-popover-2 within 2000ms. Skipping
[0] [warn] [2023-07-05 10-07:11] [scrape:amazon]:       No invoices found. Order may be undelivered. Check again later.
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Processing "2" orders
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Order date: 15. Juni 2023
[0] [error] [2023-07-05 10-07:11] [scrape:amazon]:      Couldn't get popover #a-popover-3 within 2000ms. Skipping
[0] [warn] [2023-07-05 10-07:11] [scrape:amazon]:       No invoices found. Order may be undelivered. Check again later.
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Processing "3" orders
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Order date: 13. Juni 2023
[0] [error] [2023-07-05 10-07:11] [scrape:amazon]:      Couldn't get popover #a-popover-4 within 2000ms. Skipping
[0] [warn] [2023-07-05 10-07:11] [scrape:amazon]:       No invoices found. Order may be undelivered. Check again later.
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Processing "4" orders
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Order date: 13. Juni 2023
[0] [error] [2023-07-05 10-07:11] [scrape:amazon]:      Couldn't get popover #a-popover-5 within 2000ms. Skipping
[0] [warn] [2023-07-05 10-07:11] [scrape:amazon]:       No invoices found. Order may be undelivered. Check again later.
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Processing "5" orders
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Order date: 13. Juni 2023
[0] [error] [2023-07-05 10-07:11] [scrape:amazon]:      Couldn't get popover #a-popover-6 within 2000ms. Skipping
[0] [warn] [2023-07-05 10-07:11] [scrape:amazon]:       No invoices found. Order may be undelivered. Check again later.
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Processing "6" orders
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Order date: 6. Juni 2023
[0] [error] [2023-07-05 10-07:11] [scrape:amazon]:      Couldn't get popover #a-popover-7 within 2000ms. Skipping
[0] [warn] [2023-07-05 10-07:11] [scrape:amazon]:       No invoices found. Order may be undelivered. Check again later.
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Processing "7" orders
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Order date: 6. Juni 2023
[0] [error] [2023-07-05 10-07:11] [scrape:amazon]:      Couldn't get popover #a-popover-8 within 2000ms. Skipping
[0] [warn] [2023-07-05 10-07:11] [scrape:amazon]:       No invoices found. Order may be undelivered. Check again later.
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Processing "8" orders
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Order date: 5. Juni 2023
[0] [error] [2023-07-05 10-07:11] [scrape:amazon]:      Couldn't get popover #a-popover-9 within 2000ms. Skipping
[0] [warn] [2023-07-05 10-07:11] [scrape:amazon]:       No invoices found. Order may be undelivered. Check again later.
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Processing "9" orders
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Order date: 4. Juni 2023
[0] [error] [2023-07-05 10-07:11] [scrape:amazon]:      Couldn't get popover #a-popover-10 within 2000ms. Skipping
[0] [warn] [2023-07-05 10-07:11] [scrape:amazon]:       No invoices found. Order may be undelivered. Check again later.
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Processing "10" orders
[0] [info] [2023-07-05 10-07:11] [scrape:amazon]:       Page "1" done. Skipping to next page.
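Transient popover timeouts like the ones logged above can sometimes be mitigated by retrying with a growing timeout. A generic sketch of such a retry wrapper (an assumption for illustration, not docudigger's actual code):

```javascript
// Generic retry helper: runs an async action up to `attempts` times,
// passing a timeout that grows linearly with each attempt. If all
// attempts fail, the last error is rethrown.
async function retryWithBackoff(action, { attempts = 3, baseTimeoutMs = 2000 } = {}) {
    let lastError;
    for (let i = 0; i < attempts; i++) {
        const timeoutMs = baseTimeoutMs * (i + 1);
        try {
            return await action(timeoutMs);
        } catch (err) {
            lastError = err; // remember the failure and try again
        }
    }
    throw lastError;
}
```

With Puppeteer, `action` could be something like `(t) => page.waitForSelector('#a-popover-2', { timeout: t })`, so a popover that appears slowly gets 2000 ms, then 4000 ms, then 6000 ms before the order is skipped.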

The automated release is failing 🚨

🚨 The automated release from the dev branch failed. 🚨



Missing package.json file.

A package.json file at the root of your project is required to release on npm.

Please follow the npm guideline to create a valid package.json file.



Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Other Branches

These updates are pending.

  • chore(deps): replace dependency eslint-config-standard-with-typescript with eslint-config-love ^43.0.1

Open

These updates have all been created already.

Ignored or Blocked

These are blocked by an existing closed PR and will not be recreated.

Detected dependencies

dockerfile
dockerfile
  • satantime/puppeteer-node 20-slim
dockerfile.debug
  • satantime/puppeteer-node 20-slim
github-actions
.github/workflows/build-and-release.yaml
  • actions/checkout v3
  • actions/setup-node v3
  • actions/checkout v3
  • sigstore/cosign-installer v3.5.0@59acb6260d9c0ba8f4a2f9d9b48431a222b68e20
  • docker/setup-buildx-action a530e948adbeb357dbca95a7f8845d385edf4438
  • docker/login-action 5f4866a30a54f16a52d2ecb4a3898e9e424939cf
  • docker/metadata-action v4
  • docker/build-push-action eb539f44b153603ccbfbd98e2ab9d4d0dcaf23a4
npm
package.json
  • @oclif/core ^3.23.0
  • @oclif/plugin-commands ^3.2.0
  • @oclif/plugin-help ^6.0.17
  • @oclif/plugin-plugins ^4.3.2
  • luxon ^3.4.4
  • node-cron ^3.0.3
  • puppeteer ^22.4.0
  • winston ^3.12.0
  • @commitlint/config-conventional ^18.6.2
  • @oclif/test ^3.2.1
  • @saithodev/semantic-release-gitea ^2.1.0
  • @semantic-release/changelog ^6.0.3
  • @semantic-release/commit-analyzer ^11.1.0
  • @semantic-release/exec ^6.0.3
  • @semantic-release/git ^10.0.1
  • @semantic-release/github ^9.2.6
  • @semantic-release/npm ^11.0.3
  • @semantic-release/release-notes-generator ^12.1.0
  • @types/chai ^4.3.12
  • @types/luxon ^3.4.2
  • @types/mocha ^10.0.6
  • @types/node ^20.11.25
  • @types/node-cron ^3.0.11
  • @types/puppeteer ^7.0.4
  • @types/winston ^2.4.4
  • @typescript-eslint/eslint-plugin ^6.21.0
  • chai ^5.1.0
  • conventional-changelog-conventionalcommits ^7.0.2
  • conventional-changelog-eslint ^5.0.0
  • copyfiles ^2.4.1
  • cross-env ^7.0.3
  • cross-var ^1.1.0
  • env-cmd ^10.1.0
  • envfile ^7.1.0
  • eslint ^8.57.0
  • eslint-config-oclif ^5.0.4
  • eslint-config-oclif-typescript ^3.0.48
  • eslint-config-prettier ^9.1.0
  • eslint-config-standard-with-typescript ^43.0.1
  • eslint-plugin-import ^2.29.1
  • eslint-plugin-n ^16.6.2
  • eslint-plugin-prettier ^5.1.3
  • eslint-plugin-promise ^6.1.1
  • husky ^9.0.11
  • mocha ^10.3.0
  • oclif ^4.5.0
  • semantic-release ^23.0.2
  • semantic-release-github-pullrequest ^1.3.0
  • shx ^0.3.4
  • ts-node ^10.9.2
  • tslib ^2.6.2
  • typescript ^5.4.2
  • node >=12.0.0


Error: No element found for selector: select[name="orderFilter"]

2024-02-19 12:42:07 [0] [info] [2024-02-19 11-02:07] [scrape:all]: runAll
2024-02-19 12:42:08 [0] [warn] [2024-02-19 11-02:08] [scrape:amazon]: process.json not found. Full run needed. OnlyNew deactivated.
2024-02-19 12:42:15 [0] [info] [2024-02-19 11-02:15] [scrape:amazon]: Logged in
2024-02-19 12:42:18 [0] [info] [2024-02-19 11-02:18] [scrape:amazon]: First possible year: 2023
2024-02-19 12:42:18 [0] [info] [2024-02-19 11-02:18] [scrape:amazon]: Last possible year: 2023
2024-02-19 12:42:18 [0] [info] [2024-02-19 11-02:18] [scrape:amazon]: Selecting start year 2023
2024-02-19 12:42:18 [0] Error: No element found for selector: select[name="orderFilter"]
2024-02-19 12:42:18 [0] docudigger scrape all exited with code 1

I have set AMAZON_TLD to de

The automated release is failing 🚨

🚨 The automated release from the main branch failed. 🚨



No npm token specified.

An npm token must be created and set in the NPM_TOKEN environment variable on your CI environment.

Please make sure to create an npm token and to set it in the NPM_TOKEN environment variable on your CI environment. The token must allow to publish to the registry https://registry.npmjs.org/.



Include `./scripts/` in npm package

Remove the post-install.ts script from the package.json in the npm package. Currently this breaks npm install when used without --ignore-scripts.

The automated release is failing 🚨

🚨 The automated release from the dev branch failed. 🚨



Missing package.json file.

A package.json file at the root of your project is required to release on npm.

Please follow the npm guideline to create a valid package.json file.



Remove example hook messages

[0] example hook running scrape:all
[0] [info] [2023-07-05 10-07:06] [scrape:all]: runAll
[0] example hook running scrape:amazon

Login url contains some language specific country codes

It seems the login URL contains some sort of language-specific country code used for OpenID:
openid.assoc_handle=usflex

This link can't be used across other Amazon TLDs. The token may vary between countries and languages, so a mapping wouldn't be a good idea. Instead, we need to go to the base page and navigate to the login page from there.

2FA currently not supported on docker

Currently, 2FA on the scraped pages is not supported. It is detected (e.g. for Amazon), but there is no way to enter the second factor within a Docker container.

Error - Fresh Install - Docker process.json

root@docker:/videos/paperless-ngx_consumption# docker run -e AMAZON_USERNAME='{{REDACTED}}' -e AMAZON_PASSWORD='{{REDACTED}}' -e AMAZON_TLD='co.uk' -e AMAZON_YEAR_FILTER='2020' -e LOG_LEVEL='error' -v "/videos/paperless-ngx_consumption/:/home/node/docudigger" ghcr.io/disane87/docudigger
[0] [info] [2024-01-20 19-01:27] [scrape:all]:  runAll
[0] [warn] [2024-01-20 19-01:27] [scrape:amazon]:       process.json not found. Full run needed. OnlyNew deactivated. 
[0]     Error: No element found for selector: input[type=email]
[0] docudigger scrape all exited with code 1

Can't find much about this, so I'm looking for some guidance - thanks in advance 👍

rosetta error:

Hello there!

I am getting the following error when I try to run docudigger in Docker on my M1 Mac under macOS 14.3.
Has anyone seen this before? 🙃

Thanks a lot in advance, and greetings from Austria!

2024-02-18 21:51:35 [0] [info] [2024-02-18 20-02:35] [scrape:all]: runAll
2024-02-18 21:51:36 [0] Error: Failed to launch the browser process!
2024-02-18 21:51:36 [0] rosetta error: failed to open elf at /lib64/ld-linux-x86-64.so.2
2024-02-18 21:51:36 [0]
2024-02-18 21:51:36 [0]
2024-02-18 21:51:36 [0]
2024-02-18 21:51:36 [0] TROUBLESHOOTING: https://pptr.dev/troubleshooting
2024-02-18 21:51:36 [0]
2024-02-18 21:51:36 [0] docudigger scrape all exited with code 1

Failed to launch the browser process

2023-07-05 12:45:26 [info] [2023-07-05 10-07:26] [scrape:all]:  runAll
2023-07-05 12:45:26     Error: Failed to launch the browser process! undefined
2023-07-05 12:45:26     [31:46:0705/104526.419327:ERROR:bus.cc(399)] Failed to connect to the bus:
2023-07-05 12:45:26      Failed to connect to socket /run/dbus/system_bus_socket: No such file or 
2023-07-05 12:45:26     directory
2023-07-05 12:45:26     [31:31:0705/104526.424899:ERROR:ozone_platform_x11.cc(239)] Missing X 
2023-07-05 12:45:26     server or $DISPLAY
2023-07-05 12:45:26     [31:31:0705/104526.424935:ERROR:env.cc(255)] The platform failed to 
2023-07-05 12:45:26     initialize.  Exiting.
2023-07-05 12:45:26 
2023-07-05 12:45:26 
2023-07-05 12:45:26     TROUBLESHOOTING: https://pptr.dev/troubleshooting

Amazon renamed some selectors

Currently fails with (on US Amazon):

TimeoutError: Waiting for selector `.pagination-full ul.a-pagination li:nth-last-child(2) a` failed: Waiting failed: 30000ms exceeded
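Breakage like this usually comes from a single hard-coded selector. One way to make a scraper more resilient against renamed markup (a sketch under assumptions, not the project's actual approach) is to probe a list of candidate selectors and use the first one that matches:

```javascript
// Try a list of candidate selectors and return the first one the probe
// accepts. With Puppeteer, `probe` would be something like
// (sel) => page.waitForSelector(sel, { timeout: 5000 }).
// Sketch only; the helper name and behaviour are assumptions.
async function firstMatchingSelector(selectors, probe) {
    for (const selector of selectors) {
        try {
            await probe(selector);
            return selector; // this candidate matched
        } catch {
            // candidate did not match within the timeout; try the next one
        }
    }
    throw new Error(`None of the selectors matched: ${selectors.join(', ')}`);
}
```

The old and new Amazon pagination selectors could then both be listed as candidates, so a frontend rollout only breaks the scraper once all candidates are stale.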

Puppeteer error in docker

[0] [info] [2023-10-31 11-10:35] [scrape:all]: runAll
[0] Error: Failed to launch the browser process! undefined
[0] [34:50:1031/114056.228042:ERROR:bus.cc(407)] Failed to connect to the bus:
[0] Failed to connect to socket /run/dbus/system_bus_socket: No such file or
[0] directory
[0] [34:34:1031/114057.464120:ERROR:ozone_platform_x11.cc(239)] Missing X
[0] server or $DISPLAY
[0] [34:34:1031/114057.464132:ERROR:env.cc(255)] The platform failed to
[0] initialize. Exiting.
[0]
[0]
[0] TROUBLESHOOTING: https://pptr.dev/troubleshooting
[0]
[0] docudigger scrape all exited with code 1

[WR] meine.new-energie.de/portal

Which website do you want to crawl?

https://meine.new-energie.de/portal/

What services do they provide?

Contracts (with recurring calculation in a defined interval)

In which industry does the company operate?

Energy (i.e. your local energy company)

If you choose other...

No response

Describe the company

It's a local energy company in Germany which provides services for energy, gas, water, sewage, etc.

Does the webpage have/needs an authentication?

  • Yes, the website needs an authentication

Do they provide a two factor auth?

  • Yes, the website needs a second factor

Would you provide us the credentials (privately)?

  • Yes, I would share my credentials

Are you willing to collaborate to get this scraper up and running?

  • Yes, I would collaborate on this actively.

What color (hex) represents the company?

#9b004b

Your recorded code

import url from 'url';
import { createRunner } from '@puppeteer/replay';

export async function run(extension) {
    const runner = await createRunner(extension);

    await runner.runBeforeAllSteps();

    await runner.runStep({
        type: 'setViewport',
        width: 1134,
        height: 1227,
        deviceScaleFactor: 1,
        isMobile: false,
        hasTouch: false,
        isLandscape: false
    });
    await runner.runStep({
        type: 'navigate',
        url: 'https://www.vodafone.de/',
        assertedEvents: [
            {
                type: 'navigation',
                url: 'https://www.vodafone.de/',
                title: 'Vodafone.de | Mobilfunk, Handys & Internet-Anbieter'
            }
        ]
    });
    await runner.runStep({
        type: 'navigate',
        url: 'https://meine.new-energie.de/portal/dashboard',
        assertedEvents: [
            {
                type: 'navigation',
                url: 'https://meine.new-energie.de/portal/dashboard',
                title: ''
            }
        ]
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                'aria/ZUSTIMMEN'
            ],
            [
                '#login-new-dialog-privacy-btn-confirm'
            ],
            [
                'xpath///*[@id="login-new-dialog-privacy-btn-confirm"]'
            ],
            [
                'pierce/#login-new-dialog-privacy-btn-confirm'
            ]
        ],
        offsetY: 26.5,
        offsetX: 159.25,
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                'aria/E-Mail'
            ],
            [
                '#input-30'
            ],
            [
                'xpath///*[@id="input-30"]'
            ],
            [
                'pierce/#input-30'
            ]
        ],
        offsetY: 5.234375,
        offsetX: 163.5,
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                'aria/ANMELDEN',
                'aria/[role="generic"]'
            ],
            [
                'form > div.row span'
            ],
            [
                'xpath///*[@id="router-view"]/div[4]/div/div/div[2]/div/div[1]/div/form/div[3]/div/button/span'
            ],
            [
                'pierce/form > div.row span'
            ],
            [
                'text/Anmelden'
            ]
        ],
        offsetY: 12.734375,
        offsetX: 161.5,
        assertedEvents: [
            {
                type: 'navigation',
                url: 'https://login.new.de/post/?code=234cffe92b81d0c84d8ca88f2a5b3aff22b46871f233e41a90082dc738642fac-072c06e616474cb999fa&state=bpc-%2Fdashboard&redirect_uri=https%3A%2F%2Fmeine.new-energie.de%2Fportal%2Fapi%2Foidc',
                title: ''
            }
        ]
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                'aria/Rechnungen'
            ],
            [
                'li:nth-of-type(3) li:nth-of-type(1) > a'
            ],
            [
                'xpath///*[@id="mainMenu"]/div[1]/ul[2]/li[3]/ul/li[1]/a'
            ],
            [
                'pierce/li:nth-of-type(3) li:nth-of-type(1) > a'
            ]
        ],
        offsetY: 10,
        offsetX: 72.6875,
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                '#InviteLayerRealPersonname'
            ],
            [
                'xpath///*[@id="InviteLayerRealPersonname"]'
            ],
            [
                'pierce/#InviteLayerRealPersonname'
            ]
        ],
        offsetY: 38,
        offsetX: 585.5,
    });
    await runner.runStep({
        type: 'click',
        target: 'main',
        selectors: [
            [
                'tr:nth-of-type(1) span > span'
            ],
            [
                'xpath//html/body/div[2]/div/div/div/main/div[3]/div/div/ng-transclude/div/div[1]/div/div/ng-transclude/div/div/div/ng-include/div/div/div/div[1]/div/table/tbody/tr[1]/td[3]/bpc-document-download/bpc-download/a/ng-transclude/span/span'
            ],
            [
                'pierce/tr:nth-of-type(1) span > span'
            ]
        ],
        offsetY: 2.40625,
        offsetX: 6.5,
    });

    await runner.runAfterAllSteps();
}

if (process && import.meta.url === url.pathToFileURL(process.argv[1]).href) {
    run()
}


Amazon changed page classes in Germany

[0] docudigger scrape all exited with code 1
[0] [info] [2023-10-31 09-10:41] [scrape:all]: runAll
[0] [info] [2023-10-31 09-10:14] [scrape:amazon]: Only invoices since order 303-5389627-4477142 will be gathered.
[0] [info] [2023-10-31 09-10:25] [scrape:amazon]: Logged in
[0] [info] [2023-10-31 09-10:26] [scrape:amazon]: First possible year: 2023
[0] [info] [2023-10-31 09-10:26] [scrape:amazon]: Last possible year: 2023
[0] [info] [2023-10-31 09-10:26] [scrape:amazon]: Selecting start year 2023
[0] Error: No element found for selector: select[name="orderFilter"]
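A defensive way to handle this kind of selector churn could be to try a list of candidate selectors before failing. This is a minimal sketch, not docudigger's actual code: `firstMatching` and the alternate selector `select#time-filter` are hypothetical, and `query` stands in for Puppeteer's `page.$`.

```typescript
// Try several candidate selectors in order, since Amazon occasionally
// renames ids/classes. `Query` abstracts over Puppeteer's page.$.
type Query = (selector: string) => Promise<unknown | null>;

async function firstMatching(
    query: Query,
    selectors: string[]
): Promise<string | null> {
    for (const sel of selectors) {
        // Return the first selector that resolves to an element.
        if (await query(sel)) {
            return sel;
        }
    }
    return null; // nothing matched: the page layout changed again
}
```

With Puppeteer this would be called as `firstMatching(sel => page.$(sel), ['select[name="orderFilter"]', 'select#time-filter'])`, emitting a clear error message when `null` comes back instead of crashing on the first selector.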

When using `pageFilter` and `yearFilter` the debug logs are incorrect

[0] [info] [2023-07-06 09-07:00] [scrape:amazon]:       Last page "1" reached. Going to next year.
[0] [info] [2023-07-06 09-07:00] [scrape:amazon]:       Year "2023" drone. Skipping to next year

This should be instead:

[0] [info] [2023-07-06 09-07:00] [scrape:amazon]:       Last page "1" reached. Skipping next pages
[0] [info] [2023-07-06 09-07:00] [scrape:amazon]:       Year "2023" done. Skipping next years
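The intended wording above could be produced by small logging helpers that branch on whether a filter is active. This is a hypothetical sketch with illustrative names, not docudigger's actual code:

```typescript
// When a page or year filter is active, the remaining pages/years are
// skipped, so the log should say so instead of "Going to next year".
function lastPageMessage(page: number, pageFilterActive: boolean): string {
    return pageFilterActive
        ? `Last page "${page}" reached. Skipping next pages`
        : `Last page "${page}" reached. Going to next year.`;
}

function yearDoneMessage(year: number, yearFilterActive: boolean): string {
    return yearFilterActive
        ? `Year "${year}" done. Skipping next years`
        : `Year "${year}" done. Going to next year.`;
}
```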

Checking for already handled orders should not use orders with zero invoices

Currently the resume point for a scrape is determined by the latest order only, but it is possible that the latest order has not shipped yet and therefore has no invoice present.

Instead of checking only the latest order, we should check the latest order that has at least one invoice, and verify that all of its invoices were saved. If any invoices are missing, the run should be repeated.
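This fix could be sketched as follows. The `ScrapedOrder` type and both function names are hypothetical illustrations, not part of docudigger's codebase:

```typescript
// Hypothetical shape of an order after a scrape run.
interface ScrapedOrder {
    orderNumber: string;
    date: string;
    invoiceCount: number; // invoices actually saved for this order
}

// Pick the resume point: the most recent order that has at least one
// saved invoice, instead of blindly taking the most recent order.
function resumeOrder(orders: ScrapedOrder[]): ScrapedOrder | null {
    // Assumes `orders` is sorted newest-first, as on the order overview page.
    return orders.find(o => o.invoiceCount > 0) ?? null;
}

// Orders without any invoice should trigger a re-check on the next run.
function needsRerun(orders: ScrapedOrder[]): boolean {
    return orders.some(o => o.invoiceCount === 0);
}
```

`resumeOrder` would then feed the `ONLY_NEW` tracking, while `needsRerun` marks runs that must be repeated once undelivered orders get their invoices.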

not working anymore?

Would love to use this, but I'm getting some generic errors.

[0] [info] [2024-04-18 11-04:04] [scrape:all]: runAll
[0] [warn] [2024-04-18 11-04:05] [scrape:amazon]: process.json not found. Full run needed. OnlyNew deactivated.
[0] [info] [2024-04-18 11-04:08] [scrape:amazon]: Logged in
[0] [info] [2024-04-18 11-04:13] [scrape:amazon]: First possible year: 2024
[0] [info] [2024-04-18 11-04:13] [scrape:amazon]: Last possible year: 2018
[0] [info] [2024-04-18 11-04:13] [scrape:amazon]: Selecting start year 2024
[0] [info] [2024-04-18 11-04:17] [scrape:amazon]: Page count: 5
[0] [info] [2024-04-18 11-04:17] [scrape:amazon]: Got 10 orders. Processing...
[0] [info] [2024-04-18 11-04:17] [scrape:amazon]: Order date: Be-Mobile NV
[0] [error] [2024-04-18 11-04:19] [scrape:amazon]: Couldn't get popover #a-popover-1 within 2000ms. Skipping
[0] [warn] [2024-04-18 11-04:19] [scrape:amazon]: No invoices found. Order may be undelivered. Check again later.
[0] [info] [2024-04-18 11-04:19] [scrape:amazon]: Processing "1" orders
[0] [info] [2024-04-18 11-04:19] [scrape:amazon]: Order date: Be-Mobile NV
[0] [info] [2024-04-18 11-04:19] [scrape:amazon]: 1 invoices found πŸ“ƒ
[0] [info] [2024-04-18 11-04:19] [scrape:amazon]: Processing "2" orders
[0] [info] [2024-04-18 11-04:19] [scrape:amazon]: Checking if folder exists. If not, create: data
[0] [info] [2024-04-18 11-04:19] [scrape:amazon]: Writing file: /home/node/docudigger/data/null_AMZ_304-2609117-7652323_1.pdf
[0] [info] [2024-04-18 11-04:19] [scrape:amazon]: Order date: Be-Mobile NV
[0] [error] [2024-04-18 11-04:22] [scrape:amazon]: Couldn't get popover #a-popover-3 within 2000ms. Skipping
[0] [warn] [2024-04-18 11-04:22] [scrape:amazon]: No invoices found. Order may be undelivered. Check again later.
[0] [info] [2024-04-18 11-04:22] [scrape:amazon]: Processing "3" orders
[0] [info] [2024-04-18 11-04:22] [scrape:amazon]: Order date: Be-Mobile NV
[0] [info] [2024-04-18 11-04:22] [scrape:amazon]: 2 invoices found πŸ“ƒ
[0] [info] [2024-04-18 11-04:22] [scrape:amazon]: Processing "4" orders
[0] [info] [2024-04-18 11-04:22] [scrape:amazon]: Checking if folder exists. If not, create: data
[0] [info] [2024-04-18 11-04:22] [scrape:amazon]: Writing file: /home/node/docudigger/data/null_AMZ_304-3718149-6909133_1.pdf
[0] [info] [2024-04-18 11-04:23] [scrape:amazon]: Checking if folder exists. If not, create: data
[0] [info] [2024-04-18 11-04:23] [scrape:amazon]: Writing file: /home/node/docudigger/data/null_AMZ_304-3718149-6909133_2.pdf
[0] [info] [2024-04-18 11-04:23] [scrape:amazon]: Order date: Be-Mobile NV
[0] [info] [2024-04-18 11-04:23] [scrape:amazon]: 1 invoices found πŸ“ƒ
[0] [info] [2024-04-18 11-04:23] [scrape:amazon]: Processing "5" orders
[0] [info] [2024-04-18 11-04:23] [scrape:amazon]: Checking if folder exists. If not, create: data
[0] [info] [2024-04-18 11-04:23] [scrape:amazon]: Writing file: /home/node/docudigger/data/null_AMZ_304-6522887-3229134_1.pdf
[0] [info] [2024-04-18 11-04:23] [scrape:amazon]: Order date: Be-Mobile NV
[0] [error] [2024-04-18 11-04:25] [scrape:amazon]: Couldn't get popover #a-popover-6 within 2000ms. Skipping
[0] [warn] [2024-04-18 11-04:25] [scrape:amazon]: No invoices found. Order may be undelivered. Check again later.
[0] [info] [2024-04-18 11-04:25] [scrape:amazon]: Processing "6" orders
[0] [info] [2024-04-18 11-04:26] [scrape:amazon]: Order date: Be-Mobile NV
[0] [info] [2024-04-18 11-04:26] [scrape:amazon]: 1 invoices found πŸ“ƒ
[0] [info] [2024-04-18 11-04:26] [scrape:amazon]: Processing "7" orders
[0] [info] [2024-04-18 11-04:26] [scrape:amazon]: Checking if folder exists. If not, create: data
[0] [info] [2024-04-18 11-04:26] [scrape:amazon]: Writing file: /home/node/docudigger/data/null_AMZ_304-5507482-5813161_1.pdf
[0] [info] [2024-04-18 11-04:26] [scrape:amazon]: Order date: Be-Mobile NV
[0] [error] [2024-04-18 11-04:28] [scrape:amazon]: Couldn't get popover #a-popover-8 within 2000ms. Skipping
[0] [warn] [2024-04-18 11-04:28] [scrape:amazon]: No invoices found. Order may be undelivered. Check again later.
[0] [info] [2024-04-18 11-04:28] [scrape:amazon]: Processing "8" orders
[0] [info] [2024-04-18 11-04:28] [scrape:amazon]: Order date: Be-Mobile NV
[0] [info] [2024-04-18 11-04:28] [scrape:amazon]: 2 invoices found πŸ“ƒ
[0] [info] [2024-04-18 11-04:28] [scrape:amazon]: Processing "9" orders
[0] [info] [2024-04-18 11-04:29] [scrape:amazon]: Checking if folder exists. If not, create: data
[0] [info] [2024-04-18 11-04:29] [scrape:amazon]: Writing file: /home/node/docudigger/data/null_AMZ_304-6272390-9076308_1.pdf
[0] ProtocolError: Protocol error (Target.createTarget): Session with given id
[0] not found.
[0] docudigger scrape all exited with code 1
