hivemind's Introduction

Hivemind [BETA]

Developer-friendly microservice powering social networks on the Steem blockchain.

Hive is a "consensus interpretation" layer for the Steem blockchain, maintaining the state of social features such as post feeds, follows, and communities. Written in Python, it synchronizes an SQL database with chain state, providing developers with a more flexible/extensible alternative to the raw steemd API.

Development Environment

  • Python 3.6 required
  • Postgres 10+ recommended

Dependencies:

  • OSX: $ brew install python3 postgresql
  • Ubuntu: $ sudo apt-get install python3 python3-pip

Installation:

$ createdb hive
$ export DATABASE_URL=postgresql://user:pass@localhost:5432/hive
$ git clone https://github.com/steemit/hivemind.git
$ cd hivemind
$ pip3 install -e .[test]

Start the indexer:

$ hive sync
$ hive status
{'db_head_block': 19930833, 'db_head_time': '2018-02-16 21:37:36', 'db_head_age': 10}

Start the server:

$ hive server
$ curl --data '{"jsonrpc":"2.0","id":0,"method":"hive.db_head_state","params":{}}' http://localhost:8080
{"jsonrpc": "2.0", "result": {"db_head_block": 19930795, "db_head_time": "2018-02-16 21:35:42", "db_head_age": 10}, "id": 0}

Run tests:

$ make test

Production Environment

Hivemind is deployed as a Docker container.

Here is an example command that will initialize the DB schema and start the syncing process:

docker run -d --name hivemind \
  --env DATABASE_URL=postgresql://user:pass@hostname:5432/databasename \
  --env STEEMD_URL=https://yoursteemnode \
  --env SYNC_SERVICE=1 \
  -p 8080:8080 steemit/hivemind:latest

Be sure to set DATABASE_URL to point to your postgres database and STEEMD_URL to point to your steemd node to sync from.

Once the database is synced, Hivemind will be available for serving requests.

To follow the logs:

docker logs -f hivemind

Configuration

Environment             CLI argument              Default
LOG_LEVEL               --log-level               INFO
HTTP_SERVER_PORT        --http-server-port        8080
DATABASE_URL            --database-url            postgresql://user:pass@localhost:5432/hive
STEEMD_URL              --steemd-url              https://api.steemit.com
REDIS_URL               --redis-url               redis://localhost:6379/
MAX_BATCH               --max-batch               50
MAX_WORKERS             --max-workers             4
TRAIL_BLOCKS            --trail-blocks            2
RECOMMEND_COMMUNITIES   --recommend-communities   hive-108451,hive-172186,hive-187187

Precedence: CLI over ENV over hive.conf. Check hive --help for details.

Requirements

Hardware

  • Focus on Postgres performance
  • 2.5GB of memory for hive sync process
  • 250GB storage for database

Steem config

Build flags

  • LOW_MEMORY_NODE=OFF - need post content
  • CLEAR_VOTES=OFF - need all vote data
  • SKIP_BY_TX=ON - tx lookup not used

Plugins

  • Required: reputation reputation_api database_api condenser_api block_api
  • Not required: follow*, tags*, market_history, account_history, witness

Postgres Performance

For a system with 16G of memory, here's a good start:

effective_cache_size = 12GB # 50-75% of avail memory
maintenance_work_mem = 2GB
random_page_cost = 1.0      # assuming SSD storage
shared_buffers = 4GB        # 25% of memory
work_mem = 512MB
synchronous_commit = off
checkpoint_completion_target = 0.9
checkpoint_timeout = 30min
max_wal_size = 4GB

JSON-RPC API

The minimum viable API set aims to remove the backend node's dependency on the follow and tags plugins (now rolled into condenser_api) while still powering condenser's non-wallet features. This is the core API set (an example call follows the list):

condenser_api.get_followers
condenser_api.get_following
condenser_api.get_followers_by_page
condenser_api.get_following_by_page
condenser_api.get_follow_count

condenser_api.get_content
condenser_api.get_content_replies

condenser_api.get_state

condenser_api.get_trending_tags

condenser_api.get_discussions_by_trending
condenser_api.get_discussions_by_hot
condenser_api.get_discussions_by_promoted
condenser_api.get_discussions_by_created

condenser_api.get_discussions_by_blog
condenser_api.get_discussions_by_feed
condenser_api.get_discussions_by_comments
condenser_api.get_replies_by_last_update

condenser_api.get_blog
condenser_api.get_blog_entries
condenser_api.get_discussions_by_author_before_date
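
For illustration, calling one of these methods through hive's JSON-RPC endpoint might look like the following minimal Python sketch (it assumes the requests package and the positional-parameter convention of steemd's condenser_api; verify the parameter shape against the deployed version):

import requests

# Ask the local hive server (default port 8080) for an account's
# follower/following counts via the condenser-compatible API.
payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "condenser_api.get_follow_count",
    "params": ["steemit"],  # positional params, per condenser_api convention
}
resp = requests.post("http://localhost:8080", json=payload)
print(resp.json()["result"])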

Overview

History

Initially, the steemit.com app was powered exclusively by steemd nodes. It was purely a client-side app without any backend other than a public and permissionless API node. As powerful as this model is, there are two issues: (a) maintaining UI-specific indices/APIs becomes expensive when tightly coupled to critical consensus nodes; and (b) frontend developers must be able to iterate quickly and access data in flexible and creative ways without writing C++.

To relieve backend and frontend pressure, non-consensus and frontend-oriented concerns can be decoupled from steemd itself. This (a) allows the consensus node to focus on scalability and reliability, and (b) allows the frontend to maintain its own state layer, allowing for flexibility not feasible otherwise.

Specifically, the goal is to completely remove the follow and tags plugins, as well as get_state from the backend node itself, and re-implement them in hive. In doing so, we form the foundational infrastructure on which to implement communities and more.

Purpose

Hive tracks posts, relationships, social actions, custom operations, and derived states.
  • discussions: by blog, trending, hot, created, etc
  • communities: mod roles/actions, members, feeds (in 1.5; spec)
  • accounts: normalized profile data, reputation
  • feeds: un/follows and un/reblogs
Hive does not track most blockchain operations.

For anything to do with wallets, orders, escrow, keys, recovery, or account history, query SBDS or steemd.

Hive can be extended or leveraged to create:
  • reactions, bookmarks
  • comment on reblogs
  • indexing custom profile data
  • reorganize old posts (categorize, filter, hide/show)
  • voting/polls (democratic or burn/send to vote)
  • modlists: (e.g. spammy, abuse, badtaste)
  • crowdsourced metadata
  • mentions indexing
  • full-text search
  • follow lists
  • bot tracking
  • mini-games
  • community bots

Core indexer

Ingests blocks sequentially, processing operations relevant to accounts, post creations/edits/deletes, and custom_json ops for follows, reblogs, and communities. From these we build account and post lookup tables, follow/reblog state, and communities/members data. Built exclusively from raw blocks, it becomes the ground truth for internal state. Hive does not reimplement the logic required to derive payout values, reputation, and other statistics, which are much more easily obtained from steemd itself; those belong to the cache layer.
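
As a rough sketch of that flow (the helpers here are illustrative stubs, not hive's actual internals):

import json

def upsert_post(op):
    print('post  ', op['author'], op['permlink'])   # create/edit

def delete_post(op):
    print('delete', op['author'], op['permlink'])

def process_follow(account, payload):
    print('follow-op by', account, payload)         # follows and reblogs

def process_block(block):
    """Dispatch only the operations hive tracks; ignore everything else."""
    for tx in block['transactions']:
        for op_type, op in tx['operations']:
            if op_type == 'comment':
                upsert_post(op)
            elif op_type == 'delete_comment':
                delete_post(op)
            elif op_type == 'custom_json' and op['id'] == 'follow':
                process_follow(op['required_posting_auths'][0],
                               json.loads(op['json']))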

Cache layer

Synchronizes the latest state of posts and users, allowing us to serve discussions and lists of posts with all expected information (title, preview, image, payout, votes, etc.) without needing steemd. This layer is first built once the initial core indexing is complete. Incoming blocks trigger cache updates (including recalculation of trending score) for any posts referenced in comment or vote operations. A sweep over paid-out posts ensures they are updated in full with their final state.
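
A schematic of that update cycle, with hypothetical names (hive's real cache code differs):

dirty = set()  # (author, permlink) pairs touched since the last flush

def referenced_posts(block):
    """Yield posts referenced by comment or vote operations."""
    for tx in block['transactions']:
        for op_type, op in tx['operations']:
            if op_type in ('comment', 'vote'):
                yield (op['author'], op['permlink'])

def on_block(block):
    dirty.update(referenced_posts(block))
    # after each block (or small batch), dirty posts are re-fetched from
    # steemd and their cache rows rewritten, recomputing trending scores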

API layer

Performs queries against the core and cache tables, merging them into a response in such a way that the frontend will not need to perform any additional calls to steemd itself. The initial API simply mimics steemd's condenser_api for backwards compatibility, but will be extended to leverage new opportunities and simplify application development.

Fork Resolution

Latency vs. consistency vs. complexity

The easiest way to avoid forks is to index only up to the last irreversible block, but the delay is too long where users expect quick feedback, e.g. votes and live discussions. Instead, we apply the following approach (sketched in code after the list):

  1. Follow the chain as closely to head_block as possible
  2. Indexer trails a few blocks behind, by no more than 6-9s
  3. If missed blocks detected, back off from head_block
  4. Database constraints on block linking to detect failure asap
  5. If a fork is encountered between hive_head and steem_head, trivial recovery
  6. Otherwise, pop blocks until in sync. Inconsistent state possible but rare for TRAIL_BLOCKS > 1.
  7. A separate service with a greater follow distance creates periodic snapshots
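
A stripped-down version of this trailing loop might look as follows (a sketch only: client and db stand for assumed steemd-client and database interfaces; batching, retries, and snapshots are omitted):

import time

TRAIL_BLOCKS = 2    # stay this many blocks behind head (see Configuration)
BLOCK_INTERVAL = 3  # Steem targets one block every 3 seconds

def sync_loop(client, db):
    """Trail head_block, verifying block linkage to detect forks early."""
    while True:
        target = client.head_block_num() - TRAIL_BLOCKS
        while db.head_block_num() < target:
            block = client.get_block(db.head_block_num() + 1)
            if block['previous'] != db.head_block_hash():
                db.pop_block()  # fork detected: unwind and retry the parent
                continue
            db.push_block(block)
        time.sleep(BLOCK_INTERVAL)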

Documentation

$ make docs && open docs/hive/index.html

License

MIT

hivemind's Issues

some deleted posts not updating

21 specific records on 2018-01-22 from 19:33:36 to 22:03:51 are refusing to update.

SELECT * FROM hive_posts_cache WHERE is_paidout = '0' AND payout_at < '2018-01-24 16:33:42'

blacklist spec

Spec out the classes of blacklists and how to read/write

stats tables

in the future we may need a summary table to assist certain queries.

potentially

  • avg rshares
  • payouts
  • posts
  • accounts (active)
  • comments
  • txs/ops?
  • posts_per_account
  • comments_per_account

tables

  • stat_hour
  • stat_day

Exception: found cache gap: 28009430 --> 28009433 (1)

Within a period of 29 blocks, post 28009432 was created, deleted, and re-created. Sync detected that it was about to skip indexing this post and aborted.

The solution is likely to sort the cache list by id before writing, as sketched below. Deleted posts may be breaking the assumption of sequential inserts when we process blocks in batch.
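
A sketch of that fix (write_cache_row is a hypothetical stand-in for the actual cache write):

def write_cache_batch(posts):
    # Deleted/re-created posts can arrive out of order within a batch;
    # sorting by id before writing restores the sequential-insert
    # assumption that the gap check relies on.
    for post in sorted(posts, key=lambda p: p['id']):
        write_cache_row(post)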

batching workers

For bulk requests, add the ability to retrieve batches in parallel, and determine optimal batch/worker sizes for jussi/steemd (sketched after the list below).

  • implement
  • balance parameters
  • test resync
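
A possible shape for the parallel retrieval, assuming a fetch_batch(block_nums) helper that issues one batched call through jussi (the batch/worker sizes below are placeholders to be tuned):

from concurrent.futures import ThreadPoolExecutor

MAX_BATCH = 50    # blocks per request (cf. MAX_BATCH in Configuration)
MAX_WORKERS = 4   # concurrent requests (cf. MAX_WORKERS)

def fetch_blocks(fetch_batch, start, count):
    """Fetch `count` blocks beginning at `start`, batched and in parallel."""
    batches = [list(range(n, min(n + MAX_BATCH, start + count)))
               for n in range(start, start + count, MAX_BATCH)]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        results = pool.map(fetch_batch, batches)  # map preserves order
    return [block for batch in results for block in batch]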

finalize db schema

todo:

  • bool fields

consider:

  • use INT ids instead of varchar(16) account names
  • use block_num instead of timestamp

hive_posts_cache:

  • few missing fields: depth, get_post_stats vals
  • bonus: distinguish simple vote updates (payout/ranking fields) from body/thread updates

fast block sync w/ jussi

Add new sync strategy: sync from jussi, if it's configured. It allows us to request blocks in large batches.

OPTIONS call support

The issue is triggered by a slight (and appropriate) tightening of the RPC request in steem-js. The following was added, which triggers Chrome to send an OPTIONS preflight call before the actual POST:

   headers: {
     Accept: 'application/json, text/plain, */*',
     'Content-Type': 'application/json',
   },

https://github.com/aio-libs/aiohttp-cors looks promising.

OPTIONS call support is part of the HTTP standard. If our services properly support HTTP, development going forward will be more predictable and stable.

It also eases development when using tools that assume full HTTP support, without needing to set up and run a local jussi instance.
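
If aiohttp-cors is adopted, the wiring might look roughly like this (a sketch based on that library's documented usage; the handler is illustrative, not hive's actual server code):

from aiohttp import web
import aiohttp_cors

async def jsonrpc_handler(request):
    return web.json_response({"jsonrpc": "2.0", "result": "ok", "id": 0})

app = web.Application()
app.router.add_post("/", jsonrpc_handler)

# Answer the OPTIONS preflight and allow cross-origin POSTs that carry
# the Accept/Content-Type headers steem-js now sends.
cors = aiohttp_cors.setup(app, defaults={
    "*": aiohttp_cors.ResourceOptions(
        allow_headers=("Content-Type", "Accept"),
    )
})
for route in list(app.router.routes()):
    cors.add(route)

# web.run_app(app, port=8080)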

fork recovery

in case of fork:

  • determine fork block
  • pop forked block
    • need to delete associated post/account records?
  • sync to head
    • filling in and/or overwriting fork data

temporary workarounds

revert this commit once steem-python is updated on pypi:
2159f6a

revert this commit when steem-python stops writing to disk (or if removed from server completely)
e2b48a9

re-evaluate this change to healthcheck (done to remove reliance on steem-python)
41d8ca6

update readme / docs

  • revamp readme
  • include community spec
  • include hive spec (see also #19 (comment))
  • update install instructions
  • determine docs framework (pdoc)
  • doc generator
  • fill in missing code docs
  • document mysql support (branch)
  • fix cli/click daemon

maybe:

  • update hive_api and associated schema (check w/ relativityboy); out of scope, cont. in #92

community implementation

  • creation op
  • custom ops, verify auths
  • API methods
    • get_community(name): admins, mods, descriptions, settings
    • list_communities(start, sort) (name/trending/subs)
    • list_tags(start, sort) (trending)
    • list_blogs(start, sort)
    • get_user_subscriptions(account, start, sort)
    • search: tag/user/community/title
    • (more)
  • internal methods
    • top authors in community (trending)
    • top curators in community
    • top blogs
    • next 24h reward pool
    • community rank
  • frontend
  • test/profile
  • steemit.com "global" acct (subs, follows, mutes?)

bonus

  • daily stats: subs, rank
    • top authors (d/w/m)

evaluate:

  • self as null-comm? allows sub to trending blogs
    • does this impact blog-follow?

reblog comments

Hive could accept a comment along with a reblog. How would we handle multi-reblogs?

db tunings: test, apply, document

Evaluate:

  • postgres config tuning (mem usage, autovac)
  • deferring of constraints (moved to #95: initial sync perf)
  • disabling indexes during initial sync (moved to #95 -- initial sync perf)
  • sync index tuning (done; read-side to be handled in #93)
  • table partitioning

Top-level dropdowns do not accept theming.

Condenser's current top-level dropdowns exist outside the theming context. They cannot respond to 'night mode' theme changes (or any others, for that matter).

Though theme-dark is set, the dropdown exists outside the regular tree (screenshot omitted).

Two possible solutions (of many): add the theme class to either the <body> tag or the root of the tree that holds the dropdown (screenshots of both approaches omitted).

performance profiling

This issue is a container for various test results.

  • increase content fetch speed or decrease per-block post updates
  • get live block processing time below 500ms

use is_valid flag for community posts

Provides more flexibility in how we interpret data, and leaves possibilities open for UI and reports.

It doesn't make sense to force-override the community field for comments, since they would not be accessible anyway. They should still be part of the tree, just flagged automatically.

For root posts, it's not yet clear which is the ideal approach.

missing post cache records

Encountered a case where some rows were missing from the posts cache table. This should not be possible because rows are always written in sequence, within a transaction. It's possible this was a side-effect of dev testing.

frontend implementation

In condenser we need to replace:

  • existing discussion APIs, including blog/feeds (get_discussions_by_X, get_blog_feed, get_user_feed)
  • follows APIs (get_followers/get_following)
  • discussion threads APIs (get_state)

block eta

Currently there is a loop which polls dgp (dynamic global properties) until the expected block is detected. This makes a lot of calls to jussi (and the returned data is imprecise because calls are cached for 3s). Each call can take 100-200ms, which is a significant waste. Since blocks arrive at regular 3s intervals, hive should just request blocks at their ETA (sketched below).

Also, once we have a known amount of idle time between blocks, it can be used for maintenance.
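
A sketch of the ETA-based approach (get_block stands for an assumed client call that returns None until the block exists):

import time

BLOCK_INTERVAL = 3  # Steem targets one block every 3 seconds

def wait_for_block(get_block, num, prev_block_time):
    """Sleep until the block is due, then poll briefly rather than spin."""
    delay = (prev_block_time + BLOCK_INTERVAL) - time.time()
    if delay > 0:
        time.sleep(delay)   # known idle window, usable for maintenance
    block = get_block(num)
    while block is None:    # not yet produced/propagated
        time.sleep(0.25)
        block = get_block(num)
    return block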

account cache

store:

  • rep (and remove from posts.votes? or preprocess?)
  • followers
  • following
  • proxy weight
  • voting weight
  • join date
  • profile fields (json?)

Evaluate replacing steem-python

Currently, the software relies extensively on 3 specific API calls: get_content, get_block, and get_dynamic_global_properties. It calls these through steem-python, with which there have been issues that may affect HA. It also doesn't support websockets. Evaluate whether it's worth running the whole stack vs. purpose-specific API routines (a sketch follows).
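
For comparison, a purpose-specific client for those three calls can be very small. A standard-library sketch (the condenser_api method names assume an appbase-era node and may need adjusting):

import json
import urllib.request
from itertools import count

_ids = count()

def steemd_call(url, method, params):
    """Issue a single JSON-RPC call to a steemd/jussi endpoint."""
    body = json.dumps({"jsonrpc": "2.0", "id": next(_ids),
                       "method": method, "params": params}).encode()
    req = urllib.request.Request(url, body,
                                 {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())["result"]

# The three calls hive depends on:
#   steemd_call(url, "condenser_api.get_block", [19930833])
#   steemd_call(url, "condenser_api.get_dynamic_global_properties", [])
#   steemd_call(url, "condenser_api.get_content", ["author", "permlink"])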

Add Migrations

I tried updating hive to the latest version, but the schema is out of date.

2017-09-28T10:24:00.298529363Z sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1054, "Unknown column 'is_valid' in 'field list'")

Could you please add db migrations, or at least publish a simple text file with the SQL required to upgrade, w/ dates or commit hashes?

deploy hivemindsync to aws

We need a process which will periodically save db snapshots to be used for quickly launching new instances.

  • hivesync service
  • hivesync full sync
  • get init time < 30 mins #108
  • dev
  • stage
  • prod
  • hive listeners/API servers in dev/stage/prod #109

env vars

Where to use env vars vs. args?

  • appbase flag
  • consistent use of db_url
  • steemd vs jussi flag

steemd-compatible APIs

Responses should be as close as possible to steemd's, making it easy for devs to switch to hive APIs.

mysql tuning

table stats: (image omitted)

hive_feed_cache works well with the MEMORY engine. Had to set:

tmp_table_size=2G
max_heap_table_size=2G

pinned posts

As a user of sufficient privilege within a community, I want to be able to pin 1 or more posts to the top of that community's feed, in an order chosen by me.

encoding issue on a strange (invalid?) username in follow history

I guess the follow plugin doesn't necessarily enforce constraints on what is a valid username, and there is one containing the invisible character U+202C (POP DIRECTIONAL FORMATTING).

It breaks the hive indexer like this:

INFO:sqlalchemy.engine.base.Engine:
        INSERT IGNORE INTO hive_follows (follower, following, created_at, state)
        VALUES (%s, %s, %s, %s) ON DUPLICATE KEY UPDATE state = %s
        
INFO:sqlalchemy.engine.base.Engine:('tuakanamorgan', 'najem\u202c', '2017-06-27T20:42:03', 1, 1)
INFO:sqlalchemy.engine.base.Engine:ROLLBACK
Traceback (most recent call last):
  File "/usr/local/bin/hive", line 9, in <module>
    load_entry_point('hivemind', 'console_scripts', 'hive')()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/app/hive/indexer/cli.py", line 22, in index_from_steemd
    run()
  File "/app/hive/indexer/core.py", line 421, in run
    sync_from_steemd(is_initial_sync)
  File "/app/hive/indexer/core.py", line 335, in sync_from_steemd
    dirty |= process_blocks(blocks, is_initial_sync)
  File "/app/hive/indexer/core.py", line 278, in process_blocks
    dirty |= process_block(block, is_initial_sync)
  File "/app/hive/indexer/core.py", line 264, in process_block
    process_json_follow_op(account, op_json, date)
  File "/app/hive/indexer/core.py", line 173, in process_json_follow_op
    query(sql, fr=follower, fg=following, at=block_date, state=state)
  File "/app/hive/db/methods.py", line 17, in query
    res = conn.execute(query, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
    context)
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1405, in _handle_dbapi_exception
    util.reraise(*exc_info)
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/util/compat.py", line 187, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.5/dist-packages/MySQLdb/cursors.py", line 234, in execute
    args = tuple(map(db.literal, args))
  File "/usr/local/lib/python3.5/dist-packages/MySQLdb/connections.py", line 318, in literal
    s = self.escape(o, self.encoders)
  File "/usr/local/lib/python3.5/dist-packages/MySQLdb/connections.py", line 225, in unicode_literal
    return db.string_literal(str(u).encode(db.encoding))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u202c' in position 5: ordinal not in range(256)
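
One candidate mitigation (an assumption, not a verified fix) is to force a full-Unicode connection charset instead of MySQLdb's latin-1 default, e.g. via the SQLAlchemy URL:

from sqlalchemy import create_engine

# utf8mb4 covers the full Unicode range, including characters such as
# U+202C that a latin-1 connection encoding rejects.
engine = create_engine(
    "mysql+mysqldb://user:pass@localhost/hive?charset=utf8mb4")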

index state management

  • skip feed,post,account cache during initial sync
  • account id map
  • post id map (LRU cache)
  • dirty account queue
  • dirty post queue
  • awareness of sync status (initial, normal, listen)
  • move dirty/flush methods to respective classes (incl Block)
  • accounts: flush over an n-block period during listen
  • posts: flush inserts asap, edits over an n-block period during listen (half-implemented; see #83)
  • events which affect payout/votes vs. content of posts
  • events which affect rep(/+?) vs. stats/profile of accounts (ignoring this; solved w/ slow-flush)
  • in-memory payout queue (vs. using vops) (moved to #76)

evaluate:

  • use vops rather than SQL to detect payouts
    • vops could also be used to track top comm curators
    • call overhead probably prohibitive
  • how to quickly & durably mark cached post dirty
  • fetch post state on requests (not a good idea; reads > writes)
  • track steemd post_id if batch fetching is an option (store in raw_json for now)
    • makes root_comment useful

sbds jsonrpc lib: "Attempt to overwrite %r in LogRecord"

Saw several of these occurring in dev, not sure what the cause is:

Traceback (most recent call last):  
  File "/usr/local/lib/python3.5/dist-packages/bottle.py", line 862, in _handle  
    return route.call(**args)  
  File "/usr/local/lib/python3.5/dist-packages/bottle.py", line 1740, in wrapper  
    rv = callback(*a, **ka)  
  File "/app/hive/sbds/jsonrpc.py", line 64, in rpc  
    self.logger.error('Parse Error, Not JSON', extra=request.body.read())  
  File "/usr/lib/python3.5/logging/__init__.py", line 1308, in error  
    self._log(ERROR, msg, args, **kwargs)  
  File "/usr/lib/python3.5/logging/__init__.py", line 1414, in _log  
    exc_info, func, extra, sinfo)  
  File "/usr/lib/python3.5/logging/__init__.py", line 1388, in makeRecord  
    raise KeyError("Attempt to overwrite %r in LogRecord" % key)  
KeyError: 'Attempt to overwrite 109 in LogRecord' 
KeyError: 'Attempt to overwrite 51 in LogRecord'

etc

api must return resteem status

Currently in condenser, resteem state is stored client-side. This makes it minimally usable but unreliable if it is to support un-resteeming.

The naive solution would be to return all resteeming accounts for each requested post, just as votes are currently. A better solution would be to specify an account context and return user-specific state from hive.

thread fetching API

need an API that works similarly to get_state for fetching full discussions along with all relevant commenters' metadata

cursor-style pagination?

Currently with steemd, all "posts" queries take a start (author, permlink) pair as a cursor from which to load successive pages; hive just uses offset/limit. If we want to replicate existing APIs (nearly) 100%, we would need to add the cursor option. This involves a bit more complexity: we'll need to look up the value of the ordering column for the provided author/permlink to know at which value to start. Aside from the extra queries, the upsides are that seeking can be more efficient than a simple offset, it's more infinite-scroll friendly, and it may be more consistent when result rows are 'moving' around. A sketch follows the reference struct below.

for reference:

struct discussion_query {
   void validate()const{
      FC_ASSERT( filter_tags.find(tag) == filter_tags.end() );
      FC_ASSERT( limit <= 100 );
   }

   string           tag;
   uint32_t         limit = 0;
   set<string>      filter_tags;
   set<string>      select_authors; ///< list of authors to include, posts not by this author are filtered
   set<string>      select_tags; ///< list of tags to include, posts without these tags are filtered
   uint32_t         truncate_body = 0; ///< the number of bytes of the post body to return, 0 for all
   optional<string> start_author;
   optional<string> start_permlink;
   optional<string> parent_author;
   optional<string> parent_permlink;
};
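
A keyset-pagination sketch of the lookup described above (column and helper names here are hypothetical, and tie-breaking on equal sort values is omitted for brevity):

def trending_page(db, start_author=None, start_permlink=None, limit=20):
    """Seek from the cursor row's sort value instead of using OFFSET."""
    seek, params = "", {"limit": limit}
    if start_author:
        # extra lookup: find the ordering value of the cursor post
        seek = ("AND sc_trend <= (SELECT sc_trend FROM hive_posts_cache "
                "WHERE author = :author AND permlink = :permlink) ")
        params.update(author=start_author, permlink=start_permlink)
    sql = ("SELECT post_id FROM hive_posts_cache "
           "WHERE is_paidout = '0' " + seek +
           "ORDER BY sc_trend DESC LIMIT :limit")
    return db.query_all(sql, **params)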

dockerfile is divergent

Why is the Python base image being used in the Dockerfile instead of the same Ubuntu-based baseimage we're using in sbds? Reusing infrastructure wherever possible is preferred.

resync on wake

If the process falls behind on blocks by a certain threshold, it needs to go back to the sync routine.

structured logging

Currently, hive indexer logging is basic console output, though tuned to produce useful output at the INFO level at a reasonable volume. Hive server logging could use some work. Generally there's a mix of prints and logging.getLogger, with hacks for other packages' noisy loggers (sa & jsonrpcserver). Hive's logging would benefit from refactoring and polishing.

  • standard logging solution -- untangle logger config, or use other package?
  • tune hive.server logging -- each request is logged; need to track errors but control spam
  • json-based logging? (jg)

yo uses http://www.structlog.org/en/stable/

schema changes

  • cached_posts.user_agent
  • cached_posts.lang
  • cached_posts.canonical_url
  • url field instead of split (keep uniq constraint)
    • author -> author_id
    • author/permlink -> url
    • compressed to max length of 256 for mysql? (approx 191 in base-122 possible)
  • accounts.created_by
  • use block_num over created_at where possible
    • and consider posts.deleted_at_block, etc

Leftover items from #242 to evaluate:

Tier 1

  • hpc.author_id [smaller idxs, faster lookups]

Tier 2

  • legacy mutes migration (iredeemables, rep<0)
  • catchall cid for blogs or legacy
  • comm stat: # of unique authors (pending)
  • hpc.parent_id / parent_author / parent_permlink
  • hpc.parent_author_id (only needed for replies)
  • hpc.root_author_id (blog stats)
  • hpc.root_post_id [disc retrieval, stats]
  • hpc.tags (list)

jussi-awareness

Hivemind needs to know if jussi is configured so it can upgrade to batch requests for certain calls.

evaluate rep score replacement

One way to approximate the current rep score is to sum all of a user's posts' net_rshares. Where this differs from the follow plugin is that users with lower rep can bring down those with higher rep.

In the short term it may be ideal to keep it in steemd (see steemit/steem#1425). Long term, we may want to implement some other algorithm entirely.

docker out-of-data-space bug

This happens sometimes. Possibly due to db error log?

Oct 11 08:14:11 docker/cd60a907a168[7347]: [SYNC] Got block 16231286 (68.7/s, 1096rps 73wps) -- 0.0m remaining
Oct 11 08:14:11 docker/cd60a907a168[7347]: [INIT] *** Initial sync complete. Rebuilding cache. ***
Oct 11 08:14:11 docker/cd60a907a168[7347]: [INIT] Found 15034490 missing post cache entries
Oct 11 08:14:56 kernel: [1692646.139464] device-mapper: thin: 253:2: switching pool to out-of-data-space (error IO) mode
Oct 11 08:14:56 kernel: [1692646.144088] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1048576000 size 8388608 starting block 2519552)
Oct 11 08:14:56 kernel: [1692646.150472] Buffer I/O error on device dm-4, logical block 2519552
Oct 11 08:14:56 kernel: [1692646.153929] Buffer I/O error on device dm-4, logical block 2519553
Oct 11 08:14:56 kernel: [1692646.157063] Buffer I/O error on device dm-4, logical block 2519554
Oct 11 08:14:56 kernel: [1692646.160150] Buffer I/O error on device dm-4, logical block 2519555
Oct 11 08:14:56 kernel: [1692646.163072] Buffer I/O error on device dm-4, logical block 2519556
Oct 11 08:14:56 kernel: [1692646.165959] Buffer I/O error on device dm-4, logical block 2519557
Oct 11 08:14:56 kernel: [1692646.168900] Buffer I/O error on device dm-4, logical block 2519558
Oct 11 08:14:56 kernel: [1692646.171721] Buffer I/O error on device dm-4, logical block 2519559
Oct 11 08:14:56 kernel: [1692646.174681] Buffer I/O error on device dm-4, logical block 2519560
Oct 11 08:14:56 kernel: [1692646.177909] Buffer I/O error on device dm-4, logical block 2519561
Oct 11 08:14:56 kernel: [1692646.181019] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1048576000 size 8388608 starting block 2519808)
Oct 11 08:14:56 kernel: [1692646.187836] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1048576000 size 8388608 starting block 2520064)
Oct 11 08:14:56 kernel: [1692646.194356] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1048576000 size 8388608 starting block 2520320)
Oct 11 08:14:56 kernel: [1692646.200836] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1048576000 size 8388608 starting block 2520576)
Oct 11 08:14:56 kernel: [1692646.207192] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1048576000 size 8388608 starting block 2520832)
Oct 11 08:14:56 kernel: [1692646.214593] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1056964608 size 8388608 starting block 2521088)
Oct 11 08:14:56 kernel: [1692646.221018] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1056964608 size 8388608 starting block 2521344)
Oct 11 08:14:56 kernel: [1692646.227704] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1056964608 size 8388608 starting block 2521600)
Oct 11 08:14:56 kernel: [1692646.234261] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1056964608 size 8388608 starting block 2521856)
Oct 11 08:14:56 kernel: [1692646.271736] JBD2: Detected IO errors while flushing file data on dm-4-8
Oct 11 08:15:05 kernel: [1692655.338808] EXT4-fs warning: 287 callbacks suppressed
Oct 11 08:15:05 kernel: [1692655.342246] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 0 size 0 starting block 2519552)
Oct 11 08:15:05 kernel: [1692655.351639] buffer_io_error: 75798 callbacks suppressed
Oct 11 08:15:05 kernel: [1692655.354860] Buffer I/O error on device dm-4, logical block 2519552
Oct 11 08:15:05 kernel: [1692655.358579] Buffer I/O error on device dm-4, logical block 2519553
Oct 11 08:15:05 kernel: [1692655.362446] Buffer I/O error on device dm-4, logical block 2519554
Oct 11 08:15:05 kernel: [1692655.366161] Buffer I/O error on device dm-4, logical block 2519555
Oct 11 08:15:05 kernel: [1692655.370238] Buffer I/O error on device dm-4, logical block 2519556
Oct 11 08:15:05 kernel: [1692655.373862] Buffer I/O error on device dm-4, logical block 2519557
Oct 11 08:15:05 kernel: [1692655.377500] Buffer I/O error on device dm-4, logical block 2519558
Oct 11 08:15:05 kernel: [1692655.381224] Buffer I/O error on device dm-4, logical block 2519559
Oct 11 08:15:05 kernel: [1692655.384238] Buffer I/O error on device dm-4, logical block 2519560
Oct 11 08:15:05 kernel: [1692655.387449] Buffer I/O error on device dm-4, logical block 2519561
Oct 11 08:15:05 kernel: [1692655.390647] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1361182720 size 6160384 starting block 2519808)
Oct 11 08:15:05 kernel: [1692655.402506] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1361182720 size 6160384 starting block 2520064)
Oct 11 08:15:05 kernel: [1692655.411143] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1361182720 size 6160384 starting block 2520320)
Oct 11 08:15:05 kernel: [1692655.420756] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1361182720 size 6160384 starting block 2520576)
Oct 11 08:15:05 kernel: [1692655.429972] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1361182720 size 6160384 starting block 2520832)
Oct 11 08:15:05 kernel: [1692655.439005] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1361182720 size 6160384 starting block 2521088)
Oct 11 08:15:05 kernel: [1692655.448575] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1361182720 size 6160384 starting block 2521344)
Oct 11 08:15:05 kernel: [1692655.457532] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1361182720 size 6160384 starting block 2521600)
Oct 11 08:15:05 kernel: [1692655.466855] EXT4-fs warning (device dm-4): ext4_end_bio:314: I/O error -28 writing to inode 528049 (offset 1361182720 size 6160384 starting block 2521856)
Oct 11 08:15:06 kernel: [1692655.516649] JBD2: Detected IO errors while flushing file data on dm-4-8
Oct 11 08:15:09 docker/cd60a907a168[7347]: #033[93m[SQL][58385ms] SELECT id, author, permlink FROM hive_posts WHERE is_deleted = 0 AND id > (SELECT IFNULL(MAX(post_id), 0) FROM hive_posts_cache) ORDER BY id LIMIT 1000000#033[0m
Oct 11 08:15:15 docker/cd60a907a168[7347]: Traceback (most recent call last):
Oct 11 08:15:15 docker/cd60a907a168[7347]:   File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
Oct 11 08:15:15 docker/cd60a907a168[7347]:     context)
Oct 11 08:15:15 docker/cd60a907a168[7347]:   File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/default.py", line 470, in do_execute
Oct 11 08:15:15 docker/cd60a907a168[7347]:     cursor.execute(statement, parameters)
Oct 11 08:15:15 docker/cd60a907a168[7347]:   File "/usr/local/lib/python3.5/dist-packages/MySQLdb/cursors.py", line 250, in execute
Oct 11 08:15:15 docker/cd60a907a168[7347]:     self.errorhandler(self, exc, value)
Oct 11 08:15:15 docker/cd60a907a168[7347]:   File "/usr/local/lib/python3.5/dist-packages/MySQLdb/connections.py", line 50, in defaulterrorhandler
Oct 11 08:15:15 docker/cd60a907a168[7347]:     raise errorvalue
Oct 11 08:15:15 docker/cd60a907a168[7347]:   File "/usr/local/lib/python3.5/dist-packages/MySQLdb/cursors.py", line 247, in execute
Oct 11 08:15:15 docker/cd60a907a168[7347]:     res = self._query(query)
Oct 11 08:15:15 docker/cd60a907a168[7347]:   File "/usr/local/lib/python3.5/dist-packages/MySQLdb/cursors.py", line 411, in _query
Oct 11 08:15:15 docker/cd60a907a168[7347]:     rowcount = self._do_query(q)
Oct 11 08:15:15 docker/cd60a907a168[7347]:   File "/usr/local/lib/python3.5/dist-packages/MySQLdb/cursors.py", line 374, in _do_query
Oct 11 08:15:15 docker/cd60a907a168[7347]:     db.query(q)
Oct 11 08:15:15 docker/cd60a907a168[7347]:   File "/usr/local/lib/python3.5/dist-packages/MySQLdb/connections.py", line 277, in query
Oct 11 08:15:15 docker/cd60a907a168[7347]:     _mysql.connection.query(self, query)
Oct 11 08:15:15 docker/cd60a907a168[7347]: _mysql_exceptions.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1")
Oct 11 08:15:15 docker/cd60a907a168[7347]:
Oct 11 08:15:15 docker/cd60a907a168[7347]: The above exception was the direct cause of the following exception:
Oct 11 08:15:15 docker/cd60a907a168[7347]:
