sanqui / discard2

Discard2 is a high fidelity archival tool for the Discord chat platform

License: MIT License

TypeScript 95.16% JavaScript 0.72% Dockerfile 1.03% Lua 0.68% Python 2.41%

archiver discord

discard2's Issues

Voice channel text channels?

Voice channels can now have text channels baked in. Does discard2 take these into account?

UI: Handle opening servers in folders

Caught error while performing task: TypeError: Cannot read properties of undefined (reading 'click')

*** Task: ProfileDiscordTask (0 more)
Caught error while performing task: TypeError: Cannot read properties of undefined (reading 'click')
Saved screenshot to out/20230419T130738-profile/error.png
Closing dummy capture tool
/app/node_modules/brotli/build/encode.js:3
1<process.argv.length?process.argv[1].replace(/\\/g,"/"):"unknown-program");b.arguments=process.argv.slice(2);"undefined"!==typeof module&&(module.exports=b);process.on("uncaughtException",function(a){if(!(a instanceof y))throw a;});b.inspect=function(){return"[Emscripten Module object]"}}else if(x)b.print||(b.print=print),"undefined"!=typeof printErr&&(b.printErr=printErr),b.read="undefined"!=typeof read?read:function(){throw"no read() available (jsc?)";},b.readBinary=function(a){if("function"===

TypeError: Cannot read properties of undefined (reading 'click')
    at ProfileDiscordTask._getEmail (/app/src/crawler/projects/discord/profile.ts:43:37)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async ProfileDiscordTask.perform (/app/src/crawler/projects/discord/profile.ts:64:23)
    at async Crawler.run (/app/src/crawler/crawl.ts:312:28)
    at async crawler (/app/src/cli.ts:83:5)
    at async Command.<anonymous> (/app/src/cli.ts:89:9)

Huge channels sometimes have slow search results

Not sure if this is just my Internet or the channel being huge, but figured I'd report it either way (if it's my Internet, you should add a way to increase the timeout).

https://discord.gg/2cujRs9K : in the #bot channel, scraping fails with

*** Task: ChannelDiscordTask (7 more)
Channel 514225574051446795 opened
Caught error while performing task: Error: Did not get search results after 10 seconds.
Saved screenshot to out/20220607T020524-resume/error.png
Stopping mitmdump
/home/thetechrobo/discard2/node_modules/brotli/build/encode.js:3
1<process.argv.length?process.argv[1].replace(/\\/g,"/"):"unknown-program");b.arguments=process.argv.slice(2);"undefined"!==typeof module&&(module.exports=b);process.on("uncaughtException",function(a){if(!(a instanceof y))throw a;});b.inspect=function(){return"[Emscripten Module object]"}}else if(x)b.print||(b.print=print),"undefined"!=typeof printErr&&(b.printErr=printErr),b.read="undefined"!=typeof read?read:function(){throw"no read() available (jsc?)";},b.readBinary=function(a){if("function"===
                                                                                                                                                                                                                              ^

Error: Did not get search results after 10 seconds.
    at performAndWaitForSearchResults (/home/thetechrobo/discard2/src/crawler/projects/discord/channel.ts:131:27)
    at async ChannelDiscordTask._searchAndClickFirstResult (/home/thetechrobo/discard2/src/crawler/projects/discord/channel.ts:136:29)
    at async ChannelDiscordTask.perform (/home/thetechrobo/discard2/src/crawler/projects/discord/channel.ts:253:24)
    at async Crawler.run (/home/thetechrobo/discard2/src/crawler/crawl.ts:258:28)
    at async crawler (/home/thetechrobo/discard2/src/cli.ts:71:5)
    at async Command.<anonymous> (/home/thetechrobo/discard2/src/cli.ts:148:9)

Here's the error screenshot:

Timeouts are too strict on slow machines

As per #5 (comment).

I'll be testing on a slow machine to help with this.
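
One way the timeouts could be loosened on slow machines or slow connections is a single multiplier applied to every wait. The following is only a sketch: the environment variable name `DISCARD2_TIMEOUT_MULTIPLIER` and the helper `scaledTimeout` are hypothetical and are not existing discard2 options.

```typescript
// Hypothetical helper: scales a base timeout by a user-supplied multiplier.
// DISCARD2_TIMEOUT_MULTIPLIER is an illustrative name, not an existing setting.
function scaledTimeout(
  baseMs: number,
  env: Record<string, string | undefined>,
): number {
  const raw = env["DISCARD2_TIMEOUT_MULTIPLIER"];
  const multiplier = raw === undefined ? 1 : Number(raw);
  // Fall back to the base value when the multiplier is not a positive number.
  if (!Number.isFinite(multiplier) || multiplier <= 0) {
    return baseMs;
  }
  return Math.round(baseMs * multiplier);
}
```

Called as `scaledTimeout(10000, process.env)`, this would leave the 10 second search wait unchanged by default and stretch it to 30 seconds when the variable is set to `3`.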

Chrome uses too much RAM

RAM usage grows and grows on Chrome until the page crashes.

Related: #15

Getting tons of Trakt error messages on Kodi

I installed the latest version of Kodi. Since then I keep getting multiple Trakt error messages, although it still works. Trakt said it was a Kodi problem. The messages I'm getting are: Trakt error 502, Remote communication server failed to start, and Trakt wait limit reached 429. I am mainly using scrubs v2. Anyone?

I don't do builds and I'm a newbie, so please dumb it down for me. Thanks!

`Error: No node found for selector: div[aria-controls="oldest-tab"]`

Add a way to do "explain" à la ArchiveBot

There are three ways I can think of implementing this:

1. Add an explain for jobs
2. Add an explain for individual tasks
3. Add an "Explain" task (as suggested in #6 (comment))

Options 1 and 2 aren't mutually exclusive, and I think both are good ideas. The job explain could simply be provided using --explain on the command line. I can't think of a good reason for option 3, but maybe @TheTechRobo can give an example use case?

Create tests for job results

Create "result" structure for tasks

Individual tasks should report on their runtime results: in particular, names of servers and channels, IDs of first and last messages encountered, perhaps total number of messages seen.

This is a logical way of implementing #3, and it paves the way for incremental jobs ("get all messages up to the newest messages in this finished job").
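
To make the proposal concrete, here is a rough sketch of what such a result structure might look like. The interface name `TaskResult`, its field names, and the helper `recordMessage` are all hypothetical; nothing here is taken from the discard2 source.

```typescript
// Hypothetical shape for a per-task runtime result; field names are illustrative.
interface TaskResult {
  serverName?: string;
  channelName?: string;
  firstMessageId?: string; // ID of the first message encountered
  lastMessageId?: string;  // ID of the most recently encountered message
  messageCount: number;    // total number of messages seen so far
}

// Returns an updated copy of the result after one more message was seen.
function recordMessage(result: TaskResult, messageId: string): TaskResult {
  return {
    ...result,
    // Keep the first ID once it is set; always advance the last ID.
    firstMessageId: result.firstMessageId ?? messageId,
    lastMessageId: messageId,
    messageCount: result.messageCount + 1,
  };
}
```

A later incremental job could then read `lastMessageId` from a finished job's result to decide where to stop.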

Unable to login

Error: Captcha detected on login.  It is recommended you log into this account manually in a browser from the same IP.

No further option is suggested to work around this. I am already logged in from a browser, but that did not help.

Specify output directory

Add a reader to show all messages

Would be helpful for my URL extractor so I don't have to parse through the raw-print or the raw-jsonl.

Add a crawler to capture all DMs

When you have a lot of DMs, it's not practical to run them one at a time. I see DMs as channels rather than servers, so there should be something to load all of them.

mitmproxy reader: support multiple WS streams

Periodically save the message ID to the state.json

This would be a path to resuming the crawl without using the date filters (which are a bad way to resume, since the messages loaded when the channel is first clicked on are also included in the dump).

The message ID fetched should preferably be from the top of the loaded messages so it doesn't accidentally skip surrounding messages.

Unlike #6, this would be good for putting inside the task listed as "current" in the state.json so crashed jobs can be quickly and easily resumed.
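
A sketch of how the checkpoint could be chosen and stored. It assumes that "top of the loaded messages" means the oldest loaded message, and relies on Discord message IDs being time-ordered snowflakes. The `CurrentTaskState` shape and the field `resumeFromMessageId` are hypothetical and do not reflect the actual layout of state.json.

```typescript
// Hypothetical shape of the "current" task entry in state.json.
interface CurrentTaskState {
  type: string;
  resumeFromMessageId?: string;
}

// Compares two snowflake IDs given as decimal strings without leading zeros.
// A longer string is a larger number; equal lengths compare lexicographically.
function compareSnowflakes(a: string, b: string): number {
  if (a.length !== b.length) return a.length - b.length;
  return a < b ? -1 : a > b ? 1 : 0;
}

// Picks the oldest (numerically smallest) ID among the loaded messages.
function oldestLoadedId(loadedIds: string[]): string | undefined {
  let oldest: string | undefined;
  for (const id of loadedIds) {
    if (oldest === undefined || compareSnowflakes(id, oldest) < 0) {
      oldest = id;
    }
  }
  return oldest;
}

// Returns a copy of the current task state with the checkpoint recorded.
// The state is returned unchanged when no messages are loaded.
function withCheckpoint(
  current: CurrentTaskState,
  loadedIds: string[],
): CurrentTaskState {
  const id = oldestLoadedId(loadedIds);
  return id === undefined ? current : { ...current, resumeFromMessageId: id };
}
```

Writing the returned object back to state.json at intervals would let a crashed job restart from the recorded ID. Taking the oldest loaded ID errs on the side of re-fetching a few messages rather than skipping any.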

[SORRY, WRONG REPOSITORY]

It's used by Discord for emojis: https://cdn.discordapp.com/emojis/821174008284315718.webp?size=44&quality=lossless

Does it actually have better quality?

Measure Chrome RAM using better method

https://media-codings.com/articles/automatically-detect-memory-leaks-with-puppeteer

TimeoutError when loading a server if server is low in the server list

~/d/o/c/krita git:master ❯❯❯ npm run start -- crawler server 435123295550046218 -c mitmdump --headless

> start
> ts-node ./src/cli.ts "crawler" "server" "435123295550046218" "-c" "mitmdump" "--headless"

Using mitmdump at /usr/bin/mitmdump
Initiated new job state name 20220623T214047-server
Starting mitmdump
mitmdump stderr:  /home/thetechrobo/.local/lib/python3.9/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.14.0-unknown is an invalid version and will not be supported in a future release
  warnings.warn(

*** Task: LoginDiscordTask (2 more)
Filling in login form
Logged in: [redacted]
*** Task: ProfileDiscordTask (1 more)
Email read as  [edit: redacted]
*** Task: ServerDiscordTask (0 more)
Caught error while performing task: TimeoutError: waiting for selector `#channels ul li a[href^="/channels/435123295550046218"]` failed: timeout 30000ms exceeded
Saved screenshot to out/20220623T214047-server/error.png
Stopping mitmdump
/media/thetechrobo/2tb/discard2/node_modules/brotli/build/encode.js:3
1<process.argv.length?process.argv[1].replace(/\\/g,"/"):"unknown-program");b.arguments=process.argv.slice(2);"undefined"!==typeof module&&(module.exports=b);process.on("uncaughtException",function(a){if(!(a instanceof y))throw a;});b.inspect=function(){return"[Emscripten Module object]"}}else if(x)b.print||(b.print=print),"undefined"!=typeof printErr&&(b.printErr=printErr),b.read="undefined"!=typeof read?read:function(){throw"no read() available (jsc?)";},b.readBinary=function(a){if("function"===
                                                                                                                                                                                                                              ^

TimeoutError: waiting for selector `#channels ul li a[href^="/channels/435123295550046218"]` failed: timeout 30000ms exceeded
    at new WaitTask (/media/thetechrobo/2tb/discard2/node_modules/puppeteer/src/common/DOMWorld.ts:813:28)
    at DOMWorld.waitForSelectorInPage (/media/thetechrobo/2tb/discard2/node_modules/puppeteer/src/common/DOMWorld.ts:656:22)
    at Object.internalHandler.waitFor (/media/thetechrobo/2tb/discard2/node_modules/puppeteer/src/common/QueryHandler.ts:78:19)
    at DOMWorld.waitForSelector (/media/thetechrobo/2tb/discard2/node_modules/puppeteer/src/common/DOMWorld.ts:511:25)
    at Frame.waitForSelector (/media/thetechrobo/2tb/discard2/node_modules/puppeteer/src/common/FrameManager.ts:1290:47)
    at Page.waitForSelector (/media/thetechrobo/2tb/discard2/node_modules/puppeteer/src/common/Page.ts:3210:29)
    at openServer (/media/thetechrobo/2tb/discard2/src/crawler/projects/discord/server.ts:22:20)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async ServerDiscordTask.perform (/media/thetechrobo/2tb/discard2/src/crawler/projects/discord/server.ts:64:9)

I think I know why this is happening. This has happened multiple times in different servers, and the common factor is that the server is near the bottom of the server list. Indeed, when I moved the server to the top of the list, it worked.
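
If the cause is that entries far down the list are not present until the list is scrolled (an assumption, not something confirmed here), a generic scroll-and-retry loop could work around it. In this sketch, `scrollUntilFound` and both callbacks are hypothetical; how they would map onto actual Puppeteer calls in server.ts is left open.

```typescript
// Repeatedly checks for a target and scrolls one step until it appears.
// isPresent: resolves to true once the target element exists.
// scrollOnce: scrolls the list one step; resolves to false when it cannot scroll further.
async function scrollUntilFound(
  isPresent: () => Promise<boolean>,
  scrollOnce: () => Promise<boolean>,
  maxSteps: number = 100,
): Promise<boolean> {
  for (let step = 0; step < maxSteps; step++) {
    if (await isPresent()) return true;
    // Stop early when the end of the list has been reached.
    if (!(await scrollOnce())) break;
  }
  // One final check after the last scroll step.
  return isPresent();
}
```

The `maxSteps` bound keeps the loop from running forever if the server is genuinely missing from the list.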

ETA only takes first four digits into account

*** Task: ChannelDiscordTask (24 more)
Channel 500102940078505984 opened
Search results: 5,707,364 Results
Estimate to download 5707 messages:  1 minutes
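
The log above is consistent with only the first thousands separator being removed before the number is parsed. That is a guess about the cause, not something read from the source. A small sketch of a parser that removes every non-digit character instead, with the hypothetical name `parseResultCount`:

```typescript
// Extracts the message count from a search header such as "5,707,364 Results".
// Returns null when the text contains no digits at all.
function parseResultCount(text: string): number | null {
  // Remove every non-digit character, including all thousands separators.
  const digits = text.replace(/\D/g, "");
  return digits.length === 0 ? null : Number(digits);
}

// For comparison: replacing only the first comma reproduces the value in the log,
// because parseInt stops reading at the next comma.
const truncated = parseInt("5,707,364".replace(",", ""), 10); // 5707
const full = parseResultCount("5,707,364 Results");           // 5707364
```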

Strip quotes around strings in .env

The fact that they remain is surprising behavior according to #10 (comment).
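
A minimal sketch of the stripping step, assuming only a single pair of matching single or double quotes around the whole value should be removed. The helper name `stripQuotes` is hypothetical.

```typescript
// Removes one pair of matching surrounding quotes from a .env value.
// Values with unmatched or missing quotes are returned trimmed but otherwise unchanged.
function stripQuotes(value: string): string {
  const trimmed = value.trim();
  if (trimmed.length >= 2) {
    const first = trimmed[0];
    const last = trimmed[trimmed.length - 1];
    if ((first === '"' || first === "'") && first === last) {
      return trimmed.slice(1, -1);
    }
  }
  return trimmed;
}
```

Quotes inside the value are left alone, so a value like `"it's"` keeps its apostrophe.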

Can't add trailing slashes when resuming

npm run start -- crawler resume 20220618T213621-server/ -c mitmdump --headless results in:

[Error: ENOENT: no such file or directory, open '20220618T213621-server//state.json'] {
  errno: -2,
  code: 'ENOENT',
  syscall: 'open',
  path: '20220618T213621-server//state.json'
}
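
One way to tolerate a trailing slash is to normalize the job argument before building the path to state.json. This is only a sketch: `normalizeJobPath` and `statePathFor` are hypothetical names, and the actual path handling in discard2 may differ.

```typescript
// Strips trailing path separators from a job name or directory argument.
// An argument consisting only of separators is returned unchanged.
function normalizeJobPath(jobArg: string): string {
  const trimmed = jobArg.replace(/[\\/]+$/, "");
  return trimmed.length > 0 ? trimmed : jobArg;
}

// Builds the path to the job's state file from the normalized argument.
function statePathFor(jobArg: string): string {
  return `${normalizeJobPath(jobArg)}/state.json`;
}
```

With this, `20220618T213621-server/` and `20220618T213621-server` would both resolve to `20220618T213621-server/state.json`.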

~/discard2 git:master ❯❯❯ npm run --silent start -- reader -f raw-jsonl $JOB_DIRECTORY > $JOB_DIRECTORY/jsonl.jsonl
mitmproxy read stderr: Traceback (most recent call last):
  File "mitmproxy/read.py", line 9, in <module>
    from mitmproxy import io, http
ModuleNotFoundError: No module named 'mitmproxy'

/home/thetechrobo/discard2/node_modules/brotli/build/encode.js:3
1<process.argv.length?process.argv[1].replace(/\\/g,"/"):"unknown-program");b.arguments=process.argv.slice(2);"undefined"!==typeof module&&(module.exports=b);process.on("uncaughtException",function(a){if(!(a instanceof y))throw a;});b.inspect=function(){return"[Emscripten Module object]"}}else if(x)b.print||(b.print=print),"undefined"!=typeof printErr&&(b.printErr=printErr),b.read="undefined"!=typeof read?read:function(){throw"no read() available (jsc?)";},b.readBinary=function(a){if("function"===
                                                                                                                                                                                                                              ^

Error: mitmproxy read exited with code 1
    at ChildProcess.<anonymous> (/home/thetechrobo/discard2/src/reader/mitmproxy.ts:22:19)
    at ChildProcess.emit (node:events:394:28)
    at ChildProcess.emit (node:domain:475:12)
    at Process.ChildProcess._handle.onexit (node:internal/child_process:290:12)

I've got mitmproxy installed via both pip (--user) and apt.

Get server emojis

This is listed in the README, but I figured I'd report it as an issue here for discussion.

Maybe this would work: https://discord.com/developers/docs/resources/emoji#list-guild-emojis

But if you don't want to manually make requests to the Discord API, I think this system would work:

For every channel encountered, see if you can type in it.
If so, click the emoji button.
If not, check if you can add a reaction to the latest message. If so, click the button to add one.
This opens the emoji picker. Scroll until you've scrolled through all emojis for that server.
If you couldn't open the emoji picker, try again in the next channel. Chances are, you'll eventually find a channel that meets either criteria.

Suggested docker volume mount modes are invalid

I get

docker: Error response from daemon: invalid mode: Z,U.

when trying the suggested command