strinking / statbot

A data ingestion bot that reads records from Discord guilds into a SQL database.

License: MIT License
Right now statbot does not track any information about pins, because doing so is difficult. When a pin is created, a " pinned a message." system message appears, which records who pinned and when. It does not, however, say which message was pinned. Normally you work that out by looking at the list of pinned messages and seeing which one came last.
The problem is that those system messages can be deleted, and when messages are unpinned, the system messages remain. This makes matching up who pinned a message with the message that was pinned nontrivial, and in cases of deleted messages, sometimes impossible.
Add a configurable option to track users' past avatars and the time each change occurred.
See also #58
`sql.py` has gotten quite large. Statbot should adopt a system similar to futaba's that keeps SQL functionality isolated but spread among several submodules.
The `messages` table tracks the current state of all messages. However, any information in edits is lost. We should add a `message_history` table with the following schema:

| Column Name | Type | Other |
|---|---|---|
| message_id | BigInteger | Primary key, foreign key `messages.message_id` |
| date | DateTime | Primary key |
| content | UnicodeText | |
| embeds | JSON | |

This way we preserve the existing `messages` table and also don't duplicate tons of information over several rows. For instance, the `message_type` or `channel_id` of a message won't change over edits, and so can be fetched by `JOIN`ing on `messages`.
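A minimal sketch of this schema, using stdlib sqlite3 purely for illustration (statbot itself uses SQLAlchemy on Postgres, where `date` would be DateTime and `embeds` JSON). Sample IDs and content are made up:

```python
# message_history sketch: composite (message_id, date) primary key,
# with unchanging fields recovered by JOINing on messages.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE messages (
    message_id INTEGER PRIMARY KEY,
    channel_id INTEGER NOT NULL,
    content    TEXT
);
CREATE TABLE message_history (
    message_id INTEGER NOT NULL REFERENCES messages(message_id),
    date       TEXT NOT NULL,   -- DateTime on Postgres
    content    TEXT,
    embeds     TEXT,            -- JSON on Postgres
    PRIMARY KEY (message_id, date)
);
""")

db.execute("INSERT INTO messages VALUES (1, 42, 'final text')")
db.execute("INSERT INTO message_history VALUES (1, '2017-10-06 11:00:00', 'first draft', '[]')")
db.execute("INSERT INTO message_history VALUES (1, '2017-10-06 11:05:00', 'final text', '[]')")

# channel_id lives only in messages; each edit row stores just what changed.
rows = db.execute("""
    SELECT h.date, h.content, m.channel_id
    FROM message_history h JOIN messages m USING (message_id)
    ORDER BY h.date
""").fetchall()
```

The JOIN is the point: edit rows stay narrow, and anything immutable is looked up once from `messages`.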
Currently the bot only has an event handler that listens for new messages and other live events. This denies it the opportunity to ingest past data. The proposal is to have a separate async task or thread that keeps track of which messages it has already read, and scans through the ones it has not, adding them to the database.
This crawler will need to be fault-tolerant, allowing it to pick up where it left off in the event of an ungraceful exit.
It should be able to patch holes: knowing which sections of history it has already covered, it should fill in the areas it hasn't.
It should be modular and extensible, with hooks that it runs on each message, and it should modify tables indirectly through the `sql` handler object.
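The hole-patching bookkeeping can be sketched as pure interval arithmetic: the crawler records which message-ID ranges it has covered and, on restart, computes the gaps it still needs to fetch. This is a hypothetical helper, not statbot code; the Discord fetching and `sql`-handler hooks are omitted:

```python
def find_holes(covered, first_id, last_id):
    """Return inclusive (start, end) ID ranges not yet covered.

    covered:  iterable of inclusive (start, end) ranges already crawled
    first_id: ID of the earliest message in the channel
    last_id:  ID of the latest message in the channel
    """
    holes = []
    cursor = first_id
    for start, end in sorted(covered):
        if start > cursor:
            holes.append((cursor, start - 1))  # gap before this range
        cursor = max(cursor, end + 1)          # advance past this range
    if cursor <= last_id:
        holes.append((cursor, last_id))        # trailing gap to "now"
    return holes

# Two covered stretches leave three holes to crawl:
find_holes([(10, 20), (35, 40)], 1, 50)  # → [(1, 9), (21, 34), (41, 50)]
```

The crawler would then fetch each hole with the history API and merge the newly covered range back into its progress record, making an ungraceful exit cost at most one partially-fetched hole.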
Add a section to the configuration file that allows for certain features to be disabled. As of now, this should include:
See also #60
Striking has an issue where, on his machine, statbot parses only #memes and #general, and only partially reads through their history before giving up and reporting that all channels have been exhausted.
The `voice_channels` table has a `changed_roles` column (an `ARRAY(BigInteger)`) to store which roles have differing permissions for a given channel. Since this applies to normal (i.e. text) channels as well, the column should be added there too.
Similar to (but hopefully better than) mallard's deploy script.
Repro: point statbot at a guild containing channels that don't have Read Message History explicitly granted to the bot user. The crawler will ignore those channels.
Consider the conditional at line 159 in commit 8113710: replacing it with `True` will cause the crawler to start indexing those channels again (although it will then fail on channels it can't read).
If you look at the `audit_log` and `audit_log_changes` tables, you see that the entries listed don't seem to line up with what the client shows. Checking the `audit_log_crawl` table shows that the last entry crawled was ID `365824096744767488`, which corresponds to `2017-10-06 11:34:21.347000 UTC`; that seems a bit out of date.
The audit log and channel history crawlers both use the same logic, so something weird is happening here.
Some ideas:
The audit log does not appear to fetch all items, with entries stopping at some point in the past. Despite this, the `last_audit_entry_id` field used to track audit log crawling is up-to-date.
Right now we don't really have a good way of checking whether a person is a member of a guild.
You can look up rows in `role_membership` for the matching guild ID, joining on the `roles` table to find the `@everyone` role, but that's a huge pain, and since role information is preserved even for people who leave, it's not totally accurate.
Instead, a separate table for tracking guild-specific membership data should be made.
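One possible shape for such a table, sketched with stdlib sqlite3 for illustration (statbot uses SQLAlchemy on Postgres); the table name, columns, and IDs here are suggestions, not statbot's actual schema:

```python
# guild_membership sketch: one row per (user, guild), with a flag that
# flips when the user leaves, so membership checks are a direct lookup.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE guild_membership (
    user_id   INTEGER NOT NULL,
    guild_id  INTEGER NOT NULL,
    is_member BOOLEAN NOT NULL,  -- 0 once the user has left
    PRIMARY KEY (user_id, guild_id)
)
""")
db.execute("INSERT INTO guild_membership VALUES (111, 1, 1)")  # current member
db.execute("INSERT INTO guild_membership VALUES (222, 1, 0)")  # has left

# No role_membership/roles join against @everyone needed:
members = [row[0] for row in db.execute(
    "SELECT user_id FROM guild_membership"
    " WHERE guild_id = 1 AND is_member ORDER BY user_id"
)]
```

Because departed users keep their row with `is_member` cleared, historical role data stays intact while the membership question gets a single indexed answer.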
I get the following error when trying to do `upsert_emoji()`:

```
Traceback (most recent call last):
  File "/home/ammon/.local/lib/python3.6/site-packages/discord/client.py", line 197, in _run_event
    yield from coro(*args, **kwargs)
  File "/home/ammon/Git/statbot/statbot/client.py", line 174, in on_reaction_add
    async def on_reaction_clear(self, message, reactions):
  File "/home/ammon/Git/statbot/statbot/sql.py", line 344, in add_reaction
    'channel_id': message.channel.id,
  File "/home/ammon/Git/statbot/statbot/sql.py", line 577, in upsert_emoji
  File "/home/ammon/Git/statbot/statbot/sql.py", line 218, in _emoji_values
    self.db.execute(ups)
AttributeError: 'PartialReactionEmoji' object has no attribute 'guild'
```
Add a configuration option and the appropriate handling to allow a user to specify either themselves or a bot they control as the gateway to get the messages.
This should only require a relatively small change.
Be sure to use platform-independent Python functionality to accomplish this. Since different configuration files may describe different instances, it should be per-config.
As it is right now, the `users` table has no clean way to track nicknames. However, with the creation of a new table, it is possible.
I propose a new table called `nicknames` with the given schema:

| user_id | guild_id | nickname |
|---|---|---|
| BigInteger | BigInteger | Optional[Unicode(32)] |

(with foreign keys for `user_id` and `guild_id`)
This way we have a relational solution to storing nicknames despite each guild having a potentially different value.
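A sketch of this table using stdlib sqlite3 for illustration (statbot uses SQLAlchemy on Postgres). The key design point is that a NULL nickname means the user has no nickname in that guild, so display-name lookups can `COALESCE` against the base username; sample names and IDs are made up:

```python
# nicknames sketch: (user_id, guild_id) primary key, nullable nickname.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE nicknames (
    user_id  INTEGER NOT NULL REFERENCES users(user_id),
    guild_id INTEGER NOT NULL,
    nickname TEXT,  -- Optional[Unicode(32)] in the schema above
    PRIMARY KEY (user_id, guild_id)
);
""")
db.execute("INSERT INTO users VALUES (111, 'ammon')")
db.execute("INSERT INTO nicknames VALUES (111, 1, 'stats-guy')")
db.execute("INSERT INTO nicknames VALUES (111, 2, NULL)")  # no nickname here

# Per-guild display name: nickname if set, base username otherwise.
names = db.execute("""
    SELECT n.guild_id, COALESCE(n.nickname, u.name)
    FROM nicknames n JOIN users u USING (user_id)
    ORDER BY n.guild_id
""").fetchall()
```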
Would be interesting to correlate banned user IDs with other possibly pertinent information, like membership in other servers or post speed etc.
Animated emojis take the form `<a:[label]:[id]>`, but are otherwise the same as normal emojis.
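That format (and the non-animated `<:label:id>` variant) can be parsed with a small regex. This is a hypothetical helper for illustration, not statbot's parser:

```python
import re

# Optional leading "a" flags an animated emoji; label and snowflake follow.
EMOJI_RE = re.compile(r"<(a?):(\w+):(\d+)>")

def parse_custom_emoji(text):
    """Parse a custom-emoji token, or return None if it isn't one."""
    m = EMOJI_RE.fullmatch(text)
    if m is None:
        return None
    animated, label, emoji_id = m.groups()
    return {"animated": animated == "a", "label": label, "id": int(emoji_id)}
```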
There are members who are reported as having "false" in the `is_member` column, despite being members. I believe this has to do with checking membership on startup, as the member upserts may not be changing that column.
Right now roles are only updated when events are received.
Instead we should have all pre-existing roles from all guilds added on startup.
When I try to start the bot:

```
main.py: error: unrecognized arguments: [-d] config.yaml
```
Eventually all bots will standardize on reading certain details, like the bot token, from environment variables.
Currently, the `docker-compose.yml` stores postgres data in `/srv/docker/postgres/` on the host. It would be nice if this data were stored inside the repository instead, so that it cannot affect other applications or copies of the repository.
Right now the `messages` table has an `is_edited` column, but it does not store the time the message was edited.
My proposal is to add a datetime column called `timestamp`. If a post is new, this column is identical to the value extracted from the message snowflake. If the message has been edited or deleted, the column is updated to when that event occurred.
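The "value extracted from the message snowflake" can be computed directly: the upper bits of a Discord snowflake are milliseconds since Discord's epoch (2015-01-01T00:00:00 UTC). As a sanity check, the audit-log entry ID quoted elsewhere in these issues decodes to exactly the timestamp quoted there:

```python
from datetime import datetime, timedelta, timezone

DISCORD_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)  # snowflake epoch

def snowflake_timestamp(snowflake: int) -> datetime:
    # Bits 22+ of a snowflake hold milliseconds since DISCORD_EPOCH.
    return DISCORD_EPOCH + timedelta(milliseconds=snowflake >> 22)

snowflake_timestamp(365824096744767488)  # → 2017-10-06 11:34:21.347000 UTC
```

So for a new post, `timestamp` carries no extra information beyond the ID; it only becomes meaningful once an edit or deletion overwrites it.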
Hello! Having issues with the setup; it looks like the channel crawler is trying to reference a value from the uninitialized `channels` table. The only table that populates is `users`; the subsequent SQL insertions all fail due to the tables not being initialized correctly. Is there a setup step I missed?
Log:

```
[INFO] statbot.event: Initializing SQL lookup tables...
[INFO] statbot.event: Processing X users...
[INFO] statbot.event: Processing X guilds...
[INFO] statbot.event:
[INFO] statbot.event: Ready!
[INFO] statbot.sql: Looking up channel crawl progress for {server} #{channel}
[INFO] statbot.sql: Inserting new channel crawl progress for {server} #{channel}
[ERROR] statbot.sql: Exception occurred in 'with' scope!
Traceback (most recent call last):
  File ".../python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File ".../python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.ForeignKeyViolation: insert or update on table "channel_crawl" violates foreign key constraint "channel_crawl_channel_id_fkey"
DETAIL:  Key (channel_id)=(1002...) is not present in table "channels".
```
As with channel messages, getting data from the audit log amounts to crawling through its history, except there are no events.
What I propose is a dedicated async task that fetches any holes in the records. Then, once all the history has been fetched, it waits for a long period and fetches the newly created hole (present to last added).
This also involves designing a schema for how audit log information will be stored relationally.
Required for #4.
Preferably this will allow something nice and convenient like:

```python
with sql.lock():
    # do stuff
    # ...
```
Right now all code that supports emoji lookups and reactions is disabled. This issue will resolve the difficulties that caused them to be temporarily removed, as well as add new code for handling those events (e.g. adding to `reactions` when crawling messages).
One of the issues is IDs. There are two kinds of emojis: "normal" unicode ones, which are just unicode code points, and custom emojis. While the first are simply characters, the second are Discord objects with a snowflake, associated guild, etc.
Previously I just made the ID of a unicode emoji its `ord()` value. But this is problematic if you want to `get_emoji(id)` or perform other operations on the ID.
To resolve this, I propose making the emoji ID effectively an algebraic data type. It will have a primary key composed of two columns: a unicode character and an ID. Exactly one of the two will be null, and the one with a value shows which type it is. For a unicode emoji, its "ID" will be the unicode character itself.
Schema proposals:

`emojis`:

| emoji_id | emoji_unicode | name | category | guild_id |
|---|---|---|---|---|
| Optional[BigInteger] | Optional[Unicode(1)] | String | String | Optional[BigInteger] |

`reactions`:

| timestamp | message_id | emoji_id | emoji_unicode | user_id | channel_id | guild_id |
|---|---|---|---|---|---|---|
| DateTime | BigInteger | Optional[BigInteger] | Optional[Unicode(1)] | BigInteger | BigInteger | BigInteger |
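The "exactly one of the two is null" invariant can be enforced in the database itself with a CHECK constraint, so bad rows are rejected at insert time. A sketch using stdlib sqlite3 for illustration (on Postgres/SQLAlchemy this would be a `CheckConstraint`); sample IDs and names are made up:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE emojis (
    emoji_id      INTEGER,  -- custom-emoji snowflake, or NULL
    emoji_unicode TEXT,     -- unicode character, or NULL
    name          TEXT NOT NULL,
    category      TEXT,
    guild_id      INTEGER,  -- NULL for unicode emojis
    CHECK ((emoji_id IS NULL) != (emoji_unicode IS NULL))
)
""")
db.execute("INSERT INTO emojis VALUES (123456, NULL, 'custom_blob', NULL, 42)")
db.execute("INSERT INTO emojis VALUES (NULL, '\N{PARTY POPPER}', 'tada', 'celebration', NULL)")

# Neither variant set: the CHECK constraint rejects the row.
try:
    db.execute("INSERT INTO emojis VALUES (NULL, NULL, 'neither', NULL, NULL)")
    check_enforced = False
except sqlite3.IntegrityError:
    check_enforced = True

row_count = db.execute("SELECT count(*) FROM emojis").fetchone()[0]
```

The `reactions` table would carry the same (`emoji_id`, `emoji_unicode`) pair so a reaction can reference either kind of emoji through one composite foreign key.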
Since one event (e.g. `on_message`) may result in the need to change multiple rows, it is more efficient and atomic if we can group them into a transaction.
The way I am thinking of doing this is to make the `sql` object stateful, storing the current transaction. All SQLAlchemy queries will go through some private method, say `_execute()`. When this method is called, if there is an active transaction, the query is added to it; otherwise the method will query and commit like normal via `self.db.execute()`.
Then, when a client calls a `sql` object method, they will designate the creation and finish of a transaction. When signaled by another method, the transaction will attempt to commit its entire contents. If this fails, it will revert the entire transaction and print an error to the log.
It may look something like this:

```python
with sql.transaction():
    sql.add_message(message)
    sql.add_emoji(emoji)
    # etc...
```
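A minimal sketch of this stateful-transaction idea, with the database stubbed out as a plain list so it runs standalone; in statbot the commit step would go through SQLAlchemy's `self.db.execute()`. Class and method names are illustrative, not statbot's actual API, and nested transactions are not handled:

```python
from contextlib import contextmanager

class Sql:
    def __init__(self):
        self.committed = []  # stand-in for rows actually written to the DB
        self._trans = None   # buffer for the active transaction, if any

    def _execute(self, query):
        """All queries funnel through here, as proposed."""
        if self._trans is not None:
            self._trans.append(query)     # buffered until commit
        else:
            self.committed.append(query)  # autocommit, like self.db.execute()

    @contextmanager
    def transaction(self):
        self._trans = []
        try:
            yield
        except Exception:
            self._trans = None  # revert: discard every buffered query
            raise
        else:
            self.committed.extend(self._trans)  # commit everything at once
            self._trans = None

    def add_message(self, message):
        self._execute(("message", message))

    def add_emoji(self, emoji):
        self._execute(("emoji", emoji))

sql = Sql()
with sql.transaction():
    sql.add_message("hello")
    sql.add_emoji("wave")

# A failure inside the block reverts the whole transaction:
try:
    with sql.transaction():
        sql.add_message("dropped")
        raise RuntimeError("boom")
except RuntimeError:
    pass
```

Either every query from the event lands, or none of them do, which is exactly the atomicity the proposal is after.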
See the `status-changes` branch.
We can track who is a member of each role in a relational way.
I propose a new table, possibly called `role_membership`, with the given schema:

| role_id | guild_id | user_id |
|---|---|---|
| BigInteger | BigInteger | BigInteger |

(with foreign keys for all three columns)
Something is wrong with the logic of how `history()` is interpreted, causing `first_message_id`, which is supposed to be the very first message in a channel, to refer to an ID that is not actually the first message.
Possibly related to #25
The problem happens when a user has added some reaction, or has their ID in the audit log, but has left the server. In that case, we get a bunch of foreign key violation errors because the user doesn't exist in the `users` table. I'm not sure whether the better way is to remove the foreign key, or to create a dummy "Unknown user" row in the `users` table...

```
psycopg2.errors.ForeignKeyViolation: insert or update on table "audit_log" violates foreign key constraint "audit_log_int_user_id_fkey"
DETAIL:  Key (int_user_id)=([REDACTED]) is not present in table "users".
sqlalchemy.exc.IntegrityError: (psycopg2.errors.ForeignKeyViolation) insert or update on table "reactions" violates foreign key constraint "reactions_int_user_id_fkey"
DETAIL:  Key (int_user_id)=([REDACTED]) is not present in table "users".
```
There are several columns, like `user_id` and `guild_id`, that have meaning across multiple tables. Those should be foreign keys into their respective lookup tables.
For instance, the `reactions` table has the following schema:

```
message_id | emoji_id | user_id | guild_id
```

After implementing this issue, `message_id` would be a foreign key for the `messages` table, `emoji_id` would be a foreign key for `emoji_lookup`, `user_id` for `user_lookup`, etc.
It will also be necessary to validate that those lookup tables are sufficiently up-to-date when a sister table is updated.