strinking / statbot

A data ingestion bot that reads records from Discord guilds into a SQL database.

License: MIT License

Topics: sql, discord, sqlalchemy, python, postgresql, statbot

statbot's Introduction

statbot

A Discord bot that reads in posts from a given set of servers and stores them in a SQL database. This application has two parts: the listener, which ingests raw data as it arrives, and the crawler, which walks through the past history of Discord messages and adds them.

This bot is designed for use with Postgres, but in principle could be used with any database that SQLAlchemy supports.

Available under the terms of the MIT License.

Requirements

All Python dependencies are listed in requirements.txt. You can install them with pip by running:

pip3 install -r requirements.txt

Execution

After preparing a configuration file (see misc/config.yaml), you can invoke the program as follows:

python3 -m statbot [-q] [-d] your_config_file.yaml

A sample docker-compose.yaml configuration is also provided in misc/ in case you would like to host your PostgreSQL database via Docker.

Questions

How do I use statbot as a selfbot?

You shouldn't. Each person who you collect data from must explicitly agree to it. If you are running a server you have the ability to enforce this, but that also means you may as well just use an actual bot account. We will not support forks that add selfbot support to statbot, and we will not accept patches that do so either.

statbot's People

Contributors: alostsock, emmiegit, ma-wa-re, raylu, robstolarz

statbot's Issues

Relax foreign key constraints on Audit logs and reactions

The problem happens when a user has added some reaction, or has their ID in the audit log, but has since left the server. In that case we get a slew of foreign key violation errors because the user doesn't exist in the users table. I'm not sure whether the better fix is to remove the foreign key, or to create a dummy "Unknown user" row in the users table...

psycopg2.errors.ForeignKeyViolation: insert or update on table "audit_log" violates foreign key constraint "audit_log_int_user_id_fkey"
DETAIL:  Key (int_user_id)=([REDACTED]) is not present in table "users".

sqlalchemy.exc.IntegrityError: (psycopg2.errors.ForeignKeyViolation) insert or update on table "reactions" violates foreign key constraint "reactions_int_user_id_fkey"
DETAIL:  Key (int_user_id)=([REDACTED]) is not present in table "users".
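A sketch of the second option: upsert a placeholder row before any dependent insert, so the foreign key can never dangle. The table and column names below are assumptions following the issue text, not statbot's actual schema, and SQLite stands in for Postgres (which has an equivalent `on_conflict_do_nothing()`):

```python
# Hedged sketch: insert a stub "Unknown user" row so later foreign key
# references to user_id always resolve. Names are illustrative.
from sqlalchemy import (
    BigInteger, Column, MetaData, Table, Unicode, create_engine, select,
)
from sqlalchemy.dialects.sqlite import insert  # use dialects.postgresql in prod

metadata = MetaData()
users = Table(
    "users", metadata,
    Column("user_id", BigInteger, primary_key=True),
    Column("name", Unicode),
)

def ensure_user(conn, user_id, name="Unknown user"):
    """Insert a stub row if user_id is missing; no-op if it already exists."""
    stmt = insert(users).values(user_id=user_id, name=name)
    conn.execute(stmt.on_conflict_do_nothing(index_elements=["user_id"]))

engine = create_engine("sqlite://")
metadata.create_all(engine)
with engine.begin() as conn:
    ensure_user(conn, 1234)
    ensure_user(conn, 1234)  # idempotent: second call does nothing
    rows = conn.execute(select(users)).fetchall()
```

The alternative, dropping the foreign key entirely, avoids the stub rows but gives up integrity checking for those columns.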

Store information about pinned messages

Right now statbot does not track any information about pins because of some difficulty in doing so. When a pin is created, a " pinned a message." system message appears. This tells us who pinned and when the pinning occurred; it does not, however, say which message was pinned. That is normally something you determine by looking at the list of pinned messages and seeing which one came last.

But the problem is that you can delete those system messages, and when you unpin messages, those system messages remain. This makes matching up information about who pinned a message with the message that was pinned nontrivial, and in cases of deleted messages, sometimes impossible.

Crawling giving up early

Striking has an issue where, on his machine, statbot only parses #memes and #general, and even then only partially reads through their history before giving up and reporting that all channels have been exhausted.

Track role membership

We can track who is a member of each role in a relational way.

I propose a new table, possibly called role_membership, with the given schema:

role_id | guild_id | user_id
BigInteger | BigInteger | BigInteger

(with foreign keys for all three columns)
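The proposal above might look like this in SQLAlchemy Core. The lookup-table names (roles, guilds, users) are assumptions; all three columns form a composite primary key so each membership is recorded at most once:

```python
# Sketch of the proposed role_membership table; lookup tables are stand-ins.
from sqlalchemy import (
    BigInteger, Column, ForeignKey, MetaData, Table, create_engine,
)

metadata = MetaData()
for table_name, pk in (("roles", "role_id"),
                       ("guilds", "guild_id"),
                       ("users", "user_id")):
    Table(table_name, metadata, Column(pk, BigInteger, primary_key=True))

role_membership = Table(
    "role_membership", metadata,
    Column("role_id", BigInteger, ForeignKey("roles.role_id"), primary_key=True),
    Column("guild_id", BigInteger, ForeignKey("guilds.guild_id"), primary_key=True),
    Column("user_id", BigInteger, ForeignKey("users.user_id"), primary_key=True),
)

engine = create_engine("sqlite://")
metadata.create_all(engine)  # all four tables create without error
```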

Audit log live updates

The audit log does not appear to fetch all items, with entries stopping after some time in the past. Despite this, the last_audit_entry_id field used to track audit log crawling is up-to-date.

Track nicknames

As it is right now, the users table has no clean way to track nicknames. However with the creation of a new table, it is possible.

I propose a new table called nicknames with the given schema:

user_id | guild_id | nickname
BigInteger | BigInteger | Optional[Unicode(32)]

(with foreign keys for user_id and guild_id)

This way we have a relational solution to storing nicknames despite each guild having a potentially different value.

Fix emojis and reactions

Right now all code that supports emoji lookups and reactions is disabled. This issue will resolve the difficulties that caused them to be temporarily removed, as well as add new code for handling those events. (e.g. adding to reactions when crawling messages)

One of the issues is IDs. There are two kinds of emojis: "normal" Unicode ones, which are just Unicode code points, and custom emojis. While the first are simply characters, the second are Discord objects with a snowflake, an associated guild, etc.

Previously I just made the ID of a Unicode emoji its ord() value. But this is problematic if you want to get_emoji(id) or perform other operations on the ID.

To resolve this, I propose making the emoji ID effectively an algebraic data type. It will have a primary key composed of two columns, a Unicode character and an ID. Exactly one of the two will be null, and the one with a value shows which type it is. For a Unicode emoji, its "ID" will be the Unicode character itself.

Schema proposals:
emojis:

emoji_id | emoji_unicode | name | category | guild_id
Optional[BigInteger] | Optional[Unicode(1)] | String | String | Optional[BigInteger]

reactions:

timestamp | message_id | emoji_id | emoji_unicode | user_id | channel_id | guild_id
DateTime | BigInteger | Optional[BigInteger] | Optional[Unicode(1)] | BigInteger | BigInteger | BigInteger
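The "exactly one of the two is null" invariant can be enforced with a CHECK constraint. A sketch, where the column names follow the proposal above and everything else (constraint name, sample data) is illustrative:

```python
# Sketch: an emojis table where exactly one of emoji_id / emoji_unicode
# is set, enforced at the database level.
from sqlalchemy import (
    BigInteger, CheckConstraint, Column, MetaData, String, Table, Unicode,
    create_engine, func, select,
)
from sqlalchemy.exc import IntegrityError

metadata = MetaData()
emojis = Table(
    "emojis", metadata,
    Column("emoji_id", BigInteger, nullable=True),       # custom emoji snowflake
    Column("emoji_unicode", Unicode(1), nullable=True),  # plain unicode emoji
    Column("name", String),
    Column("category", String),
    Column("guild_id", BigInteger, nullable=True),
    CheckConstraint(
        "(emoji_id IS NULL) != (emoji_unicode IS NULL)",
        name="emoji_id_xor_unicode",
    ),
)

engine = create_engine("sqlite://")
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(emojis.insert().values(
        emoji_unicode="\N{THUMBS UP SIGN}", name="thumbsup", category="people"))

try:
    with engine.begin() as conn:
        # Both "ID" columns null: the CHECK constraint rejects this row.
        conn.execute(emojis.insert().values(name="bad", category="none"))
    violated = False
except IntegrityError:
    violated = True

with engine.connect() as conn:
    stored = conn.execute(select(func.count()).select_from(emojis)).scalar()
```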

Load up role info

Right now roles are only updated when events are received.
Instead we should have all pre-existing roles from all guilds added on startup.

Allow for optional data collection

Add a section to the configuration file that allows for certain features to be disabled. As of now, this should include:

  • activity changes
  • status changes
  • typing
  • reactions
  • mentions

See also #60
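The new section might look something like this; the key names below are hypothetical, not the actual misc/config.yaml schema:

```yaml
# Hypothetical sketch of a per-feature collection toggle section.
collection:
  activity-changes: true
  status-changes: true
  typing: false
  reactions: true
  mentions: true
```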

`docker-compose` file stores data outside of repo

Currently, the docker-compose.yml stores postgres data in /srv/docker/postgres/ on the host. It would be nice if this data were stored inside the repository instead, so that it cannot affect other applications or copies of the repository.
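One way to do this is to bind-mount a directory inside the repository instead. A sketch, assuming a standard postgres service (the actual service name and image in misc/ may differ):

```yaml
# Sketch: keep postgres data inside the working copy.
services:
  postgres:
    image: postgres
    volumes:
      # was: /srv/docker/postgres:/var/lib/postgresql/data
      - ./postgres-data:/var/lib/postgresql/data
```

The postgres-data directory should then be added to .gitignore so database files are never committed.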

Audit log entries not appearing

If you look at the audit_log and audit_log_changes table, you see that the entries listed don't seem to line up with what the client shows. Checking the audit_log_crawl table shows that the last entry crawled was ID 365824096744767488, which corresponds to 2017-10-06 11:34:21.347000 UTC which seems a bit out of date.

The audit log and channel history crawlers are both using the same logic, so something weird is happening here.

Some ideas:

  • something's wrong with the audit log part of the crawling
  • audit log history fetching has some sort of fundamental difference in how it behaves
  • the scheduling of the coroutines is preventing the audit log crawler from ever doing any work

SQL Lookup Tables will not initialize

Hello! I'm having issues with the setup; it looks like the channel crawler is trying to reference a value from the uninitialized channels table?

The only table that populates is users; the subsequent SQL insertions all fail because the tables are not being initialized correctly. Is there a setup step I missed?
Log:

[INFO] statbot.event: Initializing SQL lookup tables...
[INFO] statbot.event: Processing X users...
[INFO] statbot.event: Processing X guilds...
[INFO] statbot.event:
[INFO] statbot.event: Ready!
[INFO] statbot.sql: Looking up channel crawl progress for {server} #{channel}
[INFO] statbot.sql: Inserting new channel crawl progress for {server} #{channel}
[ERROR] statbot.sql: Exception occurred in 'with' scope!
Traceback (most recent call last):
File ".../python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
self.dialect.do_execute(
File ".../python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
cursor.execute(statement, parameters)
psycopg2.errors.ForeignKeyViolation: insert or update on table "channel_crawl" violates foreign key constraint "channel_crawl_channel_id_fkey"
DETAIL: Key (channel_id)=(1002...) is not present in table "channels".

Fix "is_member" column

There are members who are reported as having "false" in the is_member column, despite being members. I believe this has to do with checking membership on startup, as the member upserts may not be changing that column.

Crawling sets first_message_id incorrectly

Something is wrong with how the output of history() is interpreted, causing first_message_id, which is supposed to be the very first message in a channel, to refer to an ID that is not actually the first message.

Possibly related to #25

Add foreign keys into the "lookup" tables

There are several columns like user_id and guild_id that have meaning across multiple tables. Those should be foreign keys into their respective lookup tables.

For instance, the reactions table has the following schema:
message_id | emoji_id | user_id | guild_id
After implementing this issue, message_id would be a foreign key for the messages table, emoji_id would be a foreign key for emoji_lookup, user_id for user_lookup, etc.

It will also be necessary to validate that those lookup tables are sufficiently up-to-date when a sister table is updated.

Track guild membership

Right now we don't really have a good way of checking if a person is a member of a guild.
You can look it up in role_membership, matching on the guild ID and joining on the roles table to find the @everyone role, but that's a huge pain, and since role information is preserved even for people who leave, it's not totally accurate.

So instead a separate table for tracking guild-specific data should be made.

Group similar operations into transactions.

Since one event (e.g. on_message) may result in the need to change multiple rows, it is more efficient and atomic if we group them into a transaction.

The way I am thinking of doing this is to make the sql object stateful, storing the current transaction. All SQLAlchemy queries will go through some private method, say _execute(). When this method is called, if there is an active transaction, the query is added to it. Otherwise the method will execute and commit it like normal via self.db.execute().

Then, when a client calls a sql object method, they will designate the creation and finish of a transaction. When signaled by another method, the transaction will attempt to commit its entire contents. If this fails, it will revert the entire transaction and print an error to the log.

It may look something like this:

with sql.transaction():
    sql.add_message(message)
    sql.add_emoji(emoji)
    # etc...
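A minimal sketch of that design, assuming a SQLAlchemy engine behind self.db. Only _execute() and transaction() come from the proposal; the class name, demo table, and everything else are illustrative:

```python
# Sketch: route every statement through _execute(); inside a transaction()
# block, statements share one connection and commit (or roll back) together.
from contextlib import contextmanager

from sqlalchemy import create_engine, text

class Sql:
    def __init__(self, engine):
        self.db = engine
        self._trans_conn = None  # connection of the active transaction, if any

    @contextmanager
    def transaction(self):
        # engine.begin() commits on success, rolls back on exception.
        with self.db.begin() as conn:
            self._trans_conn = conn
            try:
                yield self
            finally:
                self._trans_conn = None

    def _execute(self, stmt):
        if self._trans_conn is not None:
            return self._trans_conn.execute(stmt)
        with self.db.begin() as conn:  # autocommit a lone statement
            return conn.execute(stmt)

engine = create_engine("sqlite://")
sql = Sql(engine)
sql._execute(text("CREATE TABLE t (x INTEGER)"))
with sql.transaction():
    sql._execute(text("INSERT INTO t VALUES (1)"))
    sql._execute(text("INSERT INTO t VALUES (2)"))
try:
    with sql.transaction():
        sql._execute(text("INSERT INTO t VALUES (3)"))
        raise RuntimeError("boom")  # whole transaction rolls back
except RuntimeError:
    pass
with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM t")).scalar()
```

Note that logging instead of re-raising on failure, as the proposal suggests, would only need a try/except around the commit in transaction().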

Crawler doesn't crawl unless Read Message History perms are explicitly granted

Repro: point statbot at a guild containing channels that don't have Read Message History explicitly granted to a bot user. The crawler will ignore those channels.

Consider

if channel.permissions_for(guild.me).read_message_history:

Replacing this conditional with True will cause the crawler to start indexing those channels again (although it will then fail on channels it can't read).
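The filter can be isolated so the workaround is a flag rather than a source edit. The stand-in classes below merely mimic discord.py's permissions_for() so the logic runs anywhere; they are not the real API:

```python
# Stand-in sketch of the crawler's channel filter. ignore_perms=True
# reproduces the "replace the conditional with True" workaround above.
from dataclasses import dataclass

@dataclass
class Perms:
    read_message_history: bool

@dataclass
class FakeChannel:
    name: str
    perms: Perms
    def permissions_for(self, member):
        return self.perms

def channels_to_crawl(channels, me, ignore_perms=False):
    """Select the channels the crawler should index."""
    return [c for c in channels
            if ignore_perms or c.permissions_for(me).read_message_history]

chans = [FakeChannel("general", Perms(True)),
         FakeChannel("private", Perms(False))]
```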

Add changed_roles column to channels

The voice_channels table has a changed_roles column (an ARRAY(BigInteger)) to store which roles have differing permissions for a given channel. Since this applies to normal (i.e. text) channels as well, the column should be added to the channels table too.

Add message_history table

The messages table tracks the current state of all messages. However any information in edits is lost. We should add a message_history table with the following schema:

Column Name | Type | Other
message_id | BigInteger | Primary key; foreign key messages.message_id
date | DateTime | Primary key
content | UnicodeText |
embeds | JSON |

This way we preserve the existing messages table and also don't duplicate tons of information over several rows. For instance, the message_type or channel_id of a message won't change over edits, and so can be fetched by JOINing on messages.
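In SQLAlchemy Core the proposed table might be declared as follows; only the column names and types come from the schema above, the rest (including the stand-in messages table) is illustrative:

```python
# Sketch of the proposed message_history table with its composite
# primary key (message_id, date): one row per edit of a message.
from sqlalchemy import (
    JSON, BigInteger, Column, DateTime, ForeignKey, MetaData, Table,
    UnicodeText, create_engine,
)

metadata = MetaData()
messages = Table(  # stand-in for the existing messages table
    "messages", metadata,
    Column("message_id", BigInteger, primary_key=True),
)
message_history = Table(
    "message_history", metadata,
    Column("message_id", BigInteger,
           ForeignKey("messages.message_id"), primary_key=True),
    Column("date", DateTime, primary_key=True),
    Column("content", UnicodeText),
    Column("embeds", JSON),
)

engine = create_engine("sqlite://")
metadata.create_all(engine)
```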

Add audit log information to statbot

Like channel messages, getting data from the audit log is like crawling through the history, except there are no events.
What I propose is a dedicated async task that fetches any holes in the records. Then when all the history is fetched, it waits for a long period, then does a fetch for the newly created hole (present to last added).

This also involves designing a schema for how audit log information will be stored relationally.

Record edit time in messages table

Right now the messages table has an is_edited column, but it does not store the time that it was edited.

My proposal is to add a datetime column called timestamp. If a post is new, this column is identical to the value extracted from the message snowflake. If the message has been edited or deleted, this column is updated to when that event occurred.
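For background on "the value extracted from the message snowflake": a Discord snowflake encodes its creation time in the top 42 bits, as milliseconds since the Discord epoch, 2015-01-01T00:00:00 UTC:

```python
# Decode the creation timestamp embedded in a Discord snowflake ID.
from datetime import datetime, timedelta, timezone

DISCORD_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)

def snowflake_time(snowflake: int) -> datetime:
    """Return the UTC creation time encoded in a snowflake."""
    return DISCORD_EPOCH + timedelta(milliseconds=snowflake >> 22)

# The audit-log entry ID quoted elsewhere in this tracker decodes to the
# timestamp reported there (2017-10-06 11:34:21.347 UTC).
when = snowflake_time(365824096744767488)
```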

Fails to get emoji guild

I get the following error when trying to do upsert_emoji():

Traceback (most recent call last):
  File "/home/ammon/.local/lib/python3.6/site-packages/discord/client.py", line 197, in _run_event
    yield from coro(*args, **kwargs)
  File "/home/ammon/Git/statbot/statbot/client.py", line 174, in on_reaction_add
    async def on_reaction_clear(self, message, reactions):
  File "/home/ammon/Git/statbot/statbot/sql.py", line 344, in add_reaction
    'channel_id': message.channel.id,
  File "/home/ammon/Git/statbot/statbot/sql.py", line 577, in upsert_emoji
  File "/home/ammon/Git/statbot/statbot/sql.py", line 218, in _emoji_values
    self.db.execute(ups)
AttributeError: 'PartialReactionEmoji' object has no attribute 'guild'

Design and implement a history crawler

Currently the bot only has an event handler that listens for new messages, etc. However this will deny it the opportunity to ingest past data. The proposal is to have a separate async task or thread that keeps track of which messages it has already read, and to scan through ones it has not to add them to the database.

This crawler will need to be fault-tolerant, allowing it to pick up where it left off in the event of an ungraceful exit.
It should be able to patch holes: knowing which sections of history it has already covered, it should fill in the areas it hasn't.
It should be modular and extensible, with hooks that run on each message, modifying tables indirectly through the sql handler object.
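The hole-patching bookkeeping could be as simple as tracking covered message-ID intervals per channel and computing the gaps that remain. This helper is an assumption about one possible design, not statbot's actual implementation:

```python
# Sketch: given intervals of message IDs already crawled, find the holes.
def missing_ranges(covered, first_id, last_id):
    """Return the gaps between first_id and last_id not spanned by the
    sorted, disjoint, inclusive (start, end) intervals in `covered`."""
    gaps, cursor = [], first_id
    for start, end in sorted(covered):
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= last_id:
        gaps.append((cursor, last_id))
    return gaps
```

The crawler would then fetch history only for the returned ranges, making a crash-and-restart cycle converge on full coverage.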
