sanqui / discard2
Discard2 is a high fidelity archival tool for the Discord chat platform
License: MIT License
Voice channels can now have text channels baked in. Does discard2 take these into account?
*** Task: ProfileDiscordTask (0 more)
Caught error while performing task: TypeError: Cannot read properties of undefined (reading 'click')
Saved screenshot to out/20230419T130738-profile/error.png
Closing dummy capture tool
/app/node_modules/brotli/build/encode.js:3
1<process.argv.length?process.argv[1].replace(/\\/g,"/"):"unknown-program");b.arguments=process.argv.slice(2);"undefined"!==typeof module&&(module.exports=b);process.on("uncaughtException",function(a){if(!(a instanceof y))throw a;});b.inspect=function(){return"[Emscripten Module object]"}}else if(x)b.print||(b.print=print),"undefined"!=typeof printErr&&(b.printErr=printErr),b.read="undefined"!=typeof read?read:function(){throw"no read() available (jsc?)";},b.readBinary=function(a){if("function"===
^
TypeError: Cannot read properties of undefined (reading 'click')
at ProfileDiscordTask._getEmail (/app/src/crawler/projects/discord/profile.ts:43:37)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at async ProfileDiscordTask.perform (/app/src/crawler/projects/discord/profile.ts:64:23)
at async Crawler.run (/app/src/crawler/crawl.ts:312:28)
at async crawler (/app/src/cli.ts:83:5)
at async Command.<anonymous> (/app/src/cli.ts:89:9)
Not sure if this is just my Internet or the channel being huge, but figured I'd report it either way (if it's my Internet, you should add a way to increase the timeout).
https://discord.gg/2cujRs9K : in the #bot channel, scraping fails with
*** Task: ChannelDiscordTask (7 more)
Channel 514225574051446795 opened
Caught error while performing task: Error: Did not get search results after 10 seconds.
Saved screenshot to out/20220607T020524-resume/error.png
Stopping mitmdump
/home/thetechrobo/discard2/node_modules/brotli/build/encode.js:3
1<process.argv.length?process.argv[1].replace(/\\/g,"/"):"unknown-program");b.arguments=process.argv.slice(2);"undefined"!==typeof module&&(module.exports=b);process.on("uncaughtException",function(a){if(!(a instanceof y))throw a;});b.inspect=function(){return"[Emscripten Module object]"}}else if(x)b.print||(b.print=print),"undefined"!=typeof printErr&&(b.printErr=printErr),b.read="undefined"!=typeof read?read:function(){throw"no read() available (jsc?)";},b.readBinary=function(a){if("function"===
^
Error: Did not get search results after 10 seconds.
at performAndWaitForSearchResults (/home/thetechrobo/discard2/src/crawler/projects/discord/channel.ts:131:27)
at async ChannelDiscordTask._searchAndClickFirstResult (/home/thetechrobo/discard2/src/crawler/projects/discord/channel.ts:136:29)
at async ChannelDiscordTask.perform (/home/thetechrobo/discard2/src/crawler/projects/discord/channel.ts:253:24)
at async Crawler.run (/home/thetechrobo/discard2/src/crawler/crawl.ts:258:28)
at async crawler (/home/thetechrobo/discard2/src/cli.ts:71:5)
at async Command.<anonymous> (/home/thetechrobo/discard2/src/cli.ts:148:9)
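The request for an adjustable timeout could look roughly like the sketch below. It is only an illustration: `resolveTimeoutMs` and the idea of passing the value in seconds are assumptions, not part of discard2. The only value taken from the report is the 10 second default shown in the error message.

```typescript
// Hypothetical helper: this function does not exist in discard2.
// The default mirrors the "10 seconds" in the error message above.
const DEFAULT_SEARCH_TIMEOUT_MS = 10000;

function resolveTimeoutMs(
  raw: string | undefined,
  fallbackMs: number = DEFAULT_SEARCH_TIMEOUT_MS,
): number {
  // No value supplied: keep the default.
  if (raw === undefined || raw.trim() === "") return fallbackMs;
  const seconds = Number(raw);
  // Reject NaN, Infinity, zero and negative values instead of waiting forever.
  if (!isFinite(seconds) || seconds <= 0) return fallbackMs;
  return Math.round(seconds * 1000);
}

// Example: a user-supplied value of "30" (seconds) yields 30000 ms.
const searchTimeoutMs = resolveTimeoutMs("30");
```

The raw string could come from a command-line option or an environment variable; which one fits the existing CLI best is left open here.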
As per #5 (comment).
I'll be testing on a slow machine to help with this.
RAM usage grows and grows on Chrome until the page crashes.
Related: #15
I installed the latest version of Kodi. Since then I keep getting multiple Trakt error messages, although it still works. Trakt said it was a Kodi problem. The messages I'm getting are: Trakt error 502, Remote communication server failed to start, and Trakt wait limit reached 429. I am mainly using Scrubs v2. Anyone?
I don't do builds and I'm a newbie, so please dumb it down for me. Thanks!
There are three ways I can think of implementing this:
Options 1 and 2 aren't mutually exclusive, and I think both are good ideas. The job explanation could simply be provided using --explain on the command line. I can't think of a good reason for 3, but maybe @TheTechRobo can give an example use case?
Individual tasks should report on their runtime results: in particular, names of servers and channels, IDs of first and last messages encountered, perhaps total number of messages seen.
Logical way of implementing #3 as well as paving way for incremental jobs ("get all messages up to the newest messages in this finished job").
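A per-channel result record along these lines might be a starting point. This is a sketch under assumptions: the `ChannelResult` shape, its field names, and the helper functions are all hypothetical and not taken from discard2. Message IDs are kept as strings because Discord IDs are 64-bit integers that do not fit safely into a JavaScript number.

```typescript
// Hypothetical result shape; all field names are assumptions.
interface ChannelResult {
  serverName: string;
  channelName: string;
  firstMessageId: string | null; // lowest message ID seen so far
  lastMessageId: string | null; // highest message ID seen so far
  messageCount: number; // counts every call, duplicates included
}

// Compare two decimal ID strings without converting them to numbers.
// Assumes non-negative integers without leading zeros.
function compareIds(a: string, b: string): number {
  if (a.length !== b.length) return a.length - b.length;
  return a < b ? -1 : a > b ? 1 : 0;
}

function newChannelResult(serverName: string, channelName: string): ChannelResult {
  return { serverName, channelName, firstMessageId: null, lastMessageId: null, messageCount: 0 };
}

// Update the running result with one encountered message.
function recordMessage(result: ChannelResult, messageId: string): void {
  if (result.firstMessageId === null || compareIds(messageId, result.firstMessageId) < 0) {
    result.firstMessageId = messageId;
  }
  if (result.lastMessageId === null || compareIds(messageId, result.lastMessageId) > 0) {
    result.lastMessageId = messageId;
  }
  result.messageCount += 1;
}
```

An incremental job could then read `lastMessageId` from a finished job and only fetch messages with a higher ID.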
Error: Captcha detected on login. It is recommended you log into this account manually in a browser from the same IP.
No further option is suggested to work around this. I am already logged in from a browser, with no success.
Would be helpful for my URL extractor so I don't have to parse through the raw-print or the raw-jsonl.
When you have a lot of DMs, it's not practical to run them one at a time. I see DMs as channels rather than servers, so there should be something to load all of them.
This would be a path to resuming the crawl without using the date filters (which are a bad way to resume, since the messages loaded when the channel is first clicked on are also included in the dump).
The message ID fetched should preferably be from the top of the loaded messages so it doesn't accidentally skip surrounding messages.
Unlike #6, this would be good for putting inside the task listed as "current" in the state.json so crashed jobs can be quickly and easily resumed.
It's used by Discord for emojis: https://cdn.discordapp.com/emojis/821174008284315718.webp?size=44&quality=lossless
Does it actually have better quality?
~/d/o/c/krita git:master ❯❯❯ npm run start -- crawler server 435123295550046218 -c mitmdump --headless
> start
> ts-node ./src/cli.ts "crawler" "server" "435123295550046218" "-c" "mitmdump" "--headless"
Using mitmdump at /usr/bin/mitmdump
Initiated new job state name 20220623T214047-server
Starting mitmdump
mitmdump stderr: /home/thetechrobo/.local/lib/python3.9/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.14.0-unknown is an invalid version and will not be supported in a future release
warnings.warn(
*** Task: LoginDiscordTask (2 more)
Filling in login form
Logged in: [redacted]
*** Task: ProfileDiscordTask (1 more)
Email read as [edit: redacted]
*** Task: ServerDiscordTask (0 more)
Caught error while performing task: TimeoutError: waiting for selector `#channels ul li a[href^="/channels/435123295550046218"]` failed: timeout 30000ms exceeded
Saved screenshot to out/20220623T214047-server/error.png
Stopping mitmdump
/media/thetechrobo/2tb/discard2/node_modules/brotli/build/encode.js:3
1<process.argv.length?process.argv[1].replace(/\\/g,"/"):"unknown-program");b.arguments=process.argv.slice(2);"undefined"!==typeof module&&(module.exports=b);process.on("uncaughtException",function(a){if(!(a instanceof y))throw a;});b.inspect=function(){return"[Emscripten Module object]"}}else if(x)b.print||(b.print=print),"undefined"!=typeof printErr&&(b.printErr=printErr),b.read="undefined"!=typeof read?read:function(){throw"no read() available (jsc?)";},b.readBinary=function(a){if("function"===
^
TimeoutError: waiting for selector `#channels ul li a[href^="/channels/435123295550046218"]` failed: timeout 30000ms exceeded
at new WaitTask (/media/thetechrobo/2tb/discard2/node_modules/puppeteer/src/common/DOMWorld.ts:813:28)
at DOMWorld.waitForSelectorInPage (/media/thetechrobo/2tb/discard2/node_modules/puppeteer/src/common/DOMWorld.ts:656:22)
at Object.internalHandler.waitFor (/media/thetechrobo/2tb/discard2/node_modules/puppeteer/src/common/QueryHandler.ts:78:19)
at DOMWorld.waitForSelector (/media/thetechrobo/2tb/discard2/node_modules/puppeteer/src/common/DOMWorld.ts:511:25)
at Frame.waitForSelector (/media/thetechrobo/2tb/discard2/node_modules/puppeteer/src/common/FrameManager.ts:1290:47)
at Page.waitForSelector (/media/thetechrobo/2tb/discard2/node_modules/puppeteer/src/common/Page.ts:3210:29)
at openServer (/media/thetechrobo/2tb/discard2/src/crawler/projects/discord/server.ts:22:20)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at async ServerDiscordTask.perform (/media/thetechrobo/2tb/discard2/src/crawler/projects/discord/server.ts:64:9)
I think I know why this is happening. This has happened multiple times in different servers and the common factor is that it's down near the bottom of the server list. Indeed, when I moved the server to the top of the list, it worked.
*** Task: ChannelDiscordTask (24 more)
Channel 500102940078505984 opened
Search results: 5,707,364 Results
Estimate to download 5707 messages: 1 minutes
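The log above reports 5,707,364 results but an estimate for 5707 messages, which looks as if the count lost its last group of digits somewhere. The actual cause in discard2 is not confirmed here. As a minimal sketch, assuming the count is read from text like "5,707,364 Results" with commas as thousands separators, a hypothetical `parseResultCount` helper could strip every separator before parsing:

```typescript
// Hypothetical helper, not taken from discard2.
// Assumes an English-locale string such as "5,707,364 Results".
function parseResultCount(text: string): number | null {
  // Grab the first run of digits, allowing commas between digit groups.
  const match = text.match(/\d[\d,]*/);
  if (match === null) return null;
  // Remove all commas (global flag), not just the first one.
  return parseInt(match[0].replace(/,/g, ""), 10);
}
```

Returning `null` when no digits are present lets the caller tell "no results text" apart from a count of zero.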
The fact that they remain is surprising behavior according to #10 (comment).
npm run start -- crawler resume 20220618T213621-server/ -c mitmdump --headless
results in:
[Error: ENOENT: no such file or directory, open '20220618T213621-server//state.json'] {
errno: -2,
code: 'ENOENT',
syscall: 'open',
path: '20220618T213621-server//state.json'
}
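One plausible reading, which the report does not confirm, is that the job directory was looked up relative to the current directory while the job itself lives under `out/` (the screenshot paths in other logs point there). A sketch of a more forgiving lookup follows; `stateFileCandidates` and the `out` default are assumptions for illustration only.

```typescript
// Hypothetical helper, not part of discard2. The "out" root is an
// assumption based on the screenshot paths shown in other logs.
function stateFileCandidates(jobArg: string, outputRoot: string = "out"): string[] {
  // Drop trailing slashes so "job/" and "job" produce the same path.
  const trimmed = jobArg.replace(/\/+$/, "");
  const candidates = [trimmed + "/state.json"];
  const isAbsolute = trimmed.indexOf("/") === 0;
  const alreadyUnderRoot = trimmed.indexOf(outputRoot + "/") === 0;
  // A bare job name may refer to a directory under the output root.
  if (!isAbsolute && !alreadyUnderRoot) {
    candidates.push(outputRoot + "/" + trimmed + "/state.json");
  }
  return candidates;
}
```

The caller would try each candidate in order and report all attempted paths if none exists, which gives a clearer error than a bare ENOENT.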
Doing this would be really helpful for seeing what server I scraped without having to find its ID (I leave a server after I archive it, so it's really annoying).
I can reproduce in a freshly-made Discord.
Chrome's RAM usage steadily grows when using it with this. I wanted to see if Firefox did any better - Puppeteer does indeed support it. But it's broken. It can't get past the login page - it never ends up filling the form. I'm not sure why.
Instead of a timeout, report the error in question. See #5 (comment) for an example (quotes around email).
~/discard2 git:master ❯❯❯ npm run --silent start -- reader -f raw-jsonl $JOB_DIRECTORY > $JOB_DIRECTORY/jsonl.jsonl
mitmproxy read stderr: Traceback (most recent call last):
File "mitmproxy/read.py", line 9, in <module>
from mitmproxy import io, http
ModuleNotFoundError: No module named 'mitmproxy'
/home/thetechrobo/discard2/node_modules/brotli/build/encode.js:3
1<process.argv.length?process.argv[1].replace(/\\/g,"/"):"unknown-program");b.arguments=process.argv.slice(2);"undefined"!==typeof module&&(module.exports=b);process.on("uncaughtException",function(a){if(!(a instanceof y))throw a;});b.inspect=function(){return"[Emscripten Module object]"}}else if(x)b.print||(b.print=print),"undefined"!=typeof printErr&&(b.printErr=printErr),b.read="undefined"!=typeof read?read:function(){throw"no read() available (jsc?)";},b.readBinary=function(a){if("function"===
^
Error: mitmproxy read exited with code 1
at ChildProcess.<anonymous> (/home/thetechrobo/discard2/src/reader/mitmproxy.ts:22:19)
at ChildProcess.emit (node:events:394:28)
at ChildProcess.emit (node:domain:475:12)
at Process.ChildProcess._handle.onexit (node:internal/child_process:290:12)
I've got mitmproxy installed via both pip (--user) and apt.
This is listed in the README, but I figured I'd report it as an issue here for discussion.
Maybe this would work: https://discord.com/developers/docs/resources/emoji#list-guild-emojis
But if you don't want to manually make requests to the Discord API, I think this system would work:
I get
docker: Error response from daemon: invalid mode: Z,U.
when trying the suggested command