graphops / poi-radio
POI Radio monitors subgraph data integrity in real time using the Graphcast SDK.
Home Page: https://docs.graphops.xyz/graphcast/radios/poi-radio
License: Apache License 2.0
Describe the bug
Both Suntzu Indexer and Data Nexus have encountered this issue:
thread 'main' panicked at 'Could not validate the supplied configurations: Validate the input: Graph node endpoint must be able to serve indexing statuses query: error decoding response body: invalid type: null, expected a string at line 1 column 2868', src/main.rs:45:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
This happens right after the Graphcast ID resolves to a valid Indexer address.
It's happening both when running the Radio as a binary and when running it inside a Docker image.
These were the config variables used:
/usr/local/bin/poi-radio \
--graphcast-network "mainnet" \
--registry-subgraph "https://api.thegraph.com/subgraphs/name/hopeyen/graphcast-registry-mainnet" \
--network-subgraph "https://gateway.thegraph.com/network" \
--private-key "priv_key" \
--graph-node-endpoint "http://graph-node-0:8030/graphql"
Additional context: the graph node instance was not on the same machine as the POI Radio instance.
Describe the bug
Registry contract does not resolve Indexer address from Graphcast ID
To Reproduce
When running POI Radio with a valid private key for a Graphcast ID address, the Radio properly derives the address
DEBUG graphcast_sdk: Wallet address: 0xd8b0a336a27e57dd163d19e49bb153c631c49697
But then does not resolve that to an Indexer address
INFO poi_radio: Acting on behave of indexer None with stake 0
This happens even though the address is registered in the registry contract.
Environment variables used:
PRIVATE_KEY="GRAPHCAST_ID_PRIVATE_KEY"
GRAPH_NODE_STATUS_ENDPOINT="http://host.docker.internal:8030/graphql"
REGISTRY_SUBGRAPH="https://api.thegraph.com/subgraphs/name/hopeyen/graphcast-registry-goerli"
NETWORK_SUBGRAPH="https://gateway.testnet.thegraph.com/network"
GRAPHCAST_NETWORK=testnet
RUST_LOG="off,hyper=off,graphcast_sdk=debug,poi_radio=debug,integration_tests=debug"
Additional context
This is the same issue that was encountered during the IndexerDAO workshop on Mar 13
Describe the bug
From indexer Suntzu's report, we found that the radio can fail or get stuck silently.
At the time of the report, the most recent logs were from 2 days earlier, yet the radio had not exited. The most recent 100 log lines were all from the gowaku node, and they stopped when the last peer disconnected.
output:
2023-04-25T16:59:09.106Z INFO gowaku.node2.filter filter/waku_filter.go:137 received a message push {"fullNode": false, "peer": "16Uiu2HAm5uqfdh7z2YTEps2MhvsXTk3uvSHZ9AtVkzipZZGbKJEL", "messages": 1}
2023-04-25T16:59:16.307Z INFO gowaku.node2 node/connectedness.go:68 peer disconnected {"peer": "16Uiu2HAm5uqfdh7z2YTEps2MhvsXTk3uvSHZ9AtVkzipZZGbKJEL"}
In theory, the main loop should periodically update the state of the network and subgraphs (with logs printed), and should reconnect to peers that went offline or came back online. We don't see that here, so the event loop might be stuck or have failed silently.
Expected behavior
Radio should keep its main loop running (periodically updating network and subgraph state and reconnecting to peers), or exit with a clear error instead of hanging silently.
Describe the bug
When building the newest go-waku as part of the bindings compilation on M1, we see linker issues:
/Users/petko/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/aarch64-apple-darwin/lib" "-o" "/Users/petko/work/poi-radio-e2e-tests/target/debug/deps/poi_radio_e2e_tests-8b198dc3a9aab04f" "-Wl,-dead_strip" "-nodefaultlibs"
= note: Undefined symbols for architecture arm64:
"_FSEventStreamCreate", referenced from:
__cgo_9d3be7b4e652_Cfunc_EventStreamCreate in libwaku_sys-9f01d778247c290b.rlib(000039.o)
"_FSEventStreamInvalidate", referenced from:
__cgo_9d3be7b4e652_Cfunc_FSEventStreamInvalidate in libwaku_sys-9f01d778247c290b.rlib(000039.o)
...
This shouldn't affect dev or our Docker images, since the Cargo.lock there is on an older version of go-waku.
We've reported this to the waku team and are waiting for fixes/suggestions.
Problem statement
Make the docs discoverable when starting from this GitHub repo.
Expectation proposal
Add a link to https://docs.graphops.xyz/graphcast/radios/poi-radio
Alternative considerations
Do nothing. Make me fumble around.
Additional context
none
Describe the bug
@stake-machine has reported seeing this topic generation error:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: []', /poi-radio/src/lib.rs:120:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2023-04-22T09:00:04.005612Z ERROR poi_radio: Topic generation error: HTTP status server error (502 Bad Gateway) for url (https://gateway.thegraph.com/network)
at src/lib.rs:117
Expected behavior
Perhaps a retry mechanism or soft failure would be more suitable
Problem statement
Currently a block provider is required to trigger a query to the graph node and construct the POI message. For multichain support, a different block provider is needed for each indexing network. This creates overhead in matching providers with the correct subgraph indexing networks.
Expectation proposal
Introduce a block_interval configuration.
Query indexingStatuses for all active allocations (later this can be updated to all indexing deployments), and loop over each allocation's results.
The chainhead_block information is the same within a NETWORK, thus the message_block calculation would be consistent across allocations: message_block = chainhead_block - chainhead_block % block_interval
If latest_block >= message_block, query the POI using the message_block number and hash, and send the message.
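A minimal sketch of the calculation above (hypothetical free functions, not the Radio's actual code):
// Sketch only: choose the next block to gossip on, per the proposal above.
fn message_block(chainhead_block: u64, block_interval: u64) -> u64 {
    // Round the chainhead down to an interval boundary so that every
    // allocation on the same network picks the same block.
    chainhead_block - chainhead_block % block_interval
}

fn should_send(latest_block: u64, msg_block: u64) -> bool {
    // Only send once the graph node has indexed up to the chosen block.
    latest_block >= msg_block
}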
Alternative considerations
If chainhead_block has not progressed by at least block_interval since the last polling period, the same messages would be sent again. I'm ignoring this case for now because the logic to track the previous message block sent across allocations on the same network needs to be thought through a bit more.
Ideas for later on:
On the Network struct, add a prev_message_block field to track the last determined block to send a message on, initialized to 0.
Update prev_message_block after sending. This cannot happen within the message-send loop, since one update will affect the subsequent allocations on the same network. (We could get away by threading, or by grouping the allocations by network and nesting the query in a function that takes the network's message_block as input.) ...
Additional context
improvement to graphops/graphcast-sdk#88
Describe the bug
Right now, the git workflow that builds Docker images is triggered on branch pushes that match v*.*.*; however, Docker image tags are pushed without the v prefix.
Expected behavior
The tag should be the same across docker and git
Describe the bug
Currently attestations are triggered after the message collection duration, and the messages for the deployment at a specific block get cleared. Messages for the old block can still be received after the collection duration, so another attestation gets generated.
Expected behavior
Messages of a block should not be stored after the message collection duration.
Add a conditional check to message validity to only accept and store messages within the collection duration.
Update the collection duration to start from the local attestation time, instead of from the first message for the deployment and block.
Implement Discord notifications after they've been added to the SDK (graphops/graphcast-sdk#50)
Problem statement
There's manual work required to add the POI Radio to Launchpad
Expectation proposal
We should submit a pull request to launchpad to include the POI Radio as part of the default indexing stack, alongside any doc updates that are required
Alternative considerations
None
Additional context
None
Use a self-defined payload once the SDK issue allows it.
Additional context
Generate an ephemeral test topic so that the test instances only send/receive messages on it, in order for the tests to be fully deterministic and not get mixed up with other test instances that might be running at the same time somewhere else. Currently the test topic is just "poi-radio-test".
Problem statement
Diverged subgraphs currently trigger a notification at every comparison interval. The later notifications serve little purpose for indexers and can be viewed as spam.
Expectation proposal
Temporarily disable notifications for the diverged subgraphs. The messages should still be sent and compared, but only send notifications if the status of the comparison result has changed from the previous interval (matched <-> diverged).
Alternative considerations
Potentially the agent could be configured so that notifications are only sent at a fixed interval, but that is less "smart" than notifying when the status has changed significantly.
Problem statement
Currently, all state for the Radio is kept in-memory. While this is all that is essential for the POI Radio to operate (i.e. detect real-time divergences), persisting state across process restarts results in slightly better quality data/performance, and also unlocks other functionality like smarter notifications (e.g. subgraph divergence state changes, rather than individual POI mismatch notifications) and better dashboards and reporting.
Expectation proposal
Move to a persisted state model. Consider a SQL database backend. Consider using https://github.com/launchbadge/sqlx, with SQLite as the (simple) recommended backend for Indexers.
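A minimal sketch of that direction with sqlx and SQLite (the table layout and function are hypothetical, not a committed design):
use sqlx::sqlite::SqlitePool;

// Sketch: persist a local attestation so it survives process restarts.
// The local_attestations table here is hypothetical.
async fn save_attestation(
    pool: &SqlitePool,
    deployment: &str,
    block: i64,
    npoi: &str,
) -> Result<(), sqlx::Error> {
    sqlx::query("INSERT INTO local_attestations (deployment, block, npoi) VALUES (?1, ?2, ?3)")
        .bind(deployment)
        .bind(block)
        .bind(npoi)
        .execute(pool)
        .await?;
    Ok(())
}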
Alternative considerations
Alternatively we can keep the Radio stateless, and somehow export the data to an external sink for persistence, however this added complexity likely doesn't make sense. We could also leave the Radio stateless and accept the constraints that come with that.
Problem statement
Upon start-up, the radio queries the network subgraph for the GraphcastID's indexer active allocations and subscribes to the corresponding content topics. This means that later changes to the indexer's active allocations are not reflected in the topics the radio is listening to.
Expectation proposal
Create a separate polling loop on the network subgraph to monitor the indexer's active allocations, and observe the changes to keep the radio's content-topic subscriptions up to date.
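A minimal sketch of such a loop (both helper functions are hypothetical stand-ins for Radio internals):
use std::time::Duration;

// Sketch: periodically re-fetch active allocations and refresh topics.
async fn topic_update_loop() {
    let mut interval = tokio::time::interval(Duration::from_secs(600));
    loop {
        interval.tick().await;
        match fetch_active_allocations().await {
            Ok(allocations) => update_content_topics(allocations).await,
            Err(e) => eprintln!("Failed to refresh allocations: {e}"),
        }
    }
}

// Hypothetical stubs for the Radio's existing internals.
async fn fetch_active_allocations() -> Result<Vec<String>, String> { unimplemented!() }
async fn update_content_topics(_allocations: Vec<String>) { unimplemented!() }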
This is how we currently get the subgraph ipfs hashes inside compare_attestations:
let (ipfs_hash, blocks) = match local.iter().next() {
    Some(pair) => pair,
    None => {
        return Ok(ComparisonResult::NotFound(String::from(
            "No local attestation found",
        )))
    }
};
This is incorrect, because that .next() will just take the first element in local (the problem was caused by my lack of oversight when implementing #48).
The fix is implemented in #75 .
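For reference, the shape of the fix is to iterate over every entry instead of taking only the first (a sketch of the direction, not the exact #75 diff):
// Compare every deployment in `local`, not just the first pair.
for (ipfs_hash, blocks) in local.iter() {
    // ... run the comparison for this deployment and its blocks ...
}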
Currently the e2e tests get stuck on Linux, despite working fine on MacOS
Problem statement
Currently there are notification features for Slack and Discord. We have received feedback from the indexer community that many prefer notifications from Telegram bots.
Expectation proposal
Alternative considerations
Can refactor notification mechanism - curry the 3 types of token accesses upon config initialization
Consider two program states:
Describe the bug
Today, there are a subset of startup error conditions (e.g. indexer address resolving to None) that the Radio will swallow in order to stay alive. This is confusing behaviour because misconfiguration should be very clear to the user.
You could say that the radio is "focused on operational resilience" through both preoperational and operational stages.
Expected behavior
Any checks that fail or conditions required for healthy operation that are not met should cause the process to fail. HOWEVER, once the program has reached an "operational state" (after validating all input configuration and resolving deps) and is successfully operating, resiliency becomes important.
Pre-operational priority: Surface any dependency or configuration issues to the user! If it starts up and continues to run, the user will assume everything is working correctly!
Operational priority: Operational resilience. If someone needs to restart our graph-node dependency and it goes away for 2 mins, that should not cause a failure in the Radio.
This is probably one for the SDK, but it's easy to reproduce and test here so that's why I'm putting it under the POI Radio repo.
Steps to reproduce:
cargo run
Add a dummy topic to topics, replacing the code with:
let mut topics = topics_query().await;
topics.push("my-other-sg".to_string());
then, force a message to be sent on that dummy topic by adding a block after a normal send message (code):
let radio_message = RadioPayloadMessage::new("my-other-sg".to_string(), content.clone());
match GRAPHCAST_AGENT
    .get()
    .unwrap()
    .send_message(
        "my-other-sg".to_string(),
        network_name,
        message_block,
        Some(radio_message),
    )
    .await
Expected behaviour:
The first instance should not receive this message, therefore should not treat it as a valid message.
Problem statement
We ran into situations in which the radio gets stuck or is unable to respond to query requests. For instance,
We might have somehow introduced data races (this should be nearly impossible in Rust; very curious to find out the exact cause). To start with, our current practice of using global variables wrapped in OnceCell and Arc with std::sync::Mutex or tokio::sync::Mutex might not be effectively preventing deadlock possibilities.
Expectation proposal
In theory, using a Mutex to access a global variable before passing it into async functions can be a valid approach to ensure safe concurrent access.
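One classic hazard worth auditing for (illustrative, not a confirmed root cause): holding a std::sync::Mutex guard across an .await point can block an executor thread indefinitely, while tokio::sync::Mutex yields to the runtime while waiting:
use std::sync::Arc;

// Sketch: tokio's async Mutex may be held across .await points;
// a std::sync::MutexGuard held across .await risks deadlocking the runtime.
async fn update_state(state: Arc<tokio::sync::Mutex<Vec<String>>>) {
    let mut guard = state.lock().await; // yields instead of blocking the thread
    guard.push("new-entry".to_string());
    refresh().await; // holding an async guard across .await is safe
}

async fn refresh() {}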
Alternative considerations
Add benchmarking to the repo for measuring performance.
After the initial work to add basic e2e tests to the repo (#90), we need to convert the tests to actual cargo tests (under a #[cfg(test)] annotation) instead of a normal binary crate. This will allow us to sandbox everything and even use conditional compilation in the SDK and main POI Radio code to help us with mocks and test data in general.
Other minor improvement ideas:
scripts folder
Problem statement
There's manual work required to add the POI Radio to StakeSquid's docker-compose stack
Expectation proposal
We should submit a pull request to their repo to include the POI Radio as part of the default indexing stack, alongside any doc updates that are required
Alternative considerations
None
Additional context
None
Problem statement
Provide a clean GraphQL query to show the radio comparison statuses
Expectation proposal
Implement solutions developed for the SDK for graphops/graphcast-sdk#38 and graphops/graphcast-sdk#37 into the POI Radio also.
Problem statement
The radio should host an HTTP server and expose an endpoint to serve either REST or GraphQL APIs. It should provide useful services accessible by other clients or apps.
For the minimum implementation, we will only allow queries, so there should be no mutations or subscriptions.
Expectation proposal
axum can be used, as it was already added as a dependency to serve Prometheus metrics at /metrics (consider keeping this endpoint secure). async-graphql seems to have better performance on async functions, supports subscriptions, and is runtime agnostic.
tentative examples
basics
GET /health: returns the health status of the radio service.
GET /metrics: returns statistics about the network, including the total number of received messages, the number of active peers, and the number of attestations by comparison results.
GET/POST api/v1/graphql to convert to GraphQL
messages/:deployment_hash/:block: returns the messages stored for the given deployment and block, including the message data, message count, unique sender count, and the list of attestations.
attestations/:deployment_hash/:block: returns a list of attestations for a given deployment and block.
local-attestations/:deployment_hash/:block: returns the local attestation stored for the given deployment and block, essentially just the nPOI.
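A minimal axum sketch for the basic routes above (handler bodies are placeholders):
use axum::{routing::get, Router};

// Sketch: wire up the basic endpoints; real handlers would read Radio state.
fn app() -> Router {
    Router::new()
        .route("/health", get(|| async { "ok" }))
        .route("/metrics", get(|| async { String::new() /* render Prometheus text here */ }))
}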
example graphql macro
use async_graphql::Object;

// Fields follow the constructor used elsewhere in this section:
// RadioPayloadMessage::new(identifier, content).
#[derive(Clone)]
pub struct RadioPayloadMessage {
    identifier: String,
    content: String,
}

#[Object]
impl RadioPayloadMessage {
    // Each async method becomes a GraphQL field resolver.
    async fn identifier(&self) -> &str {
        &self.identifier
    }
}
Query based: Allow clients to search for a particular cached message or local attestations. These can be GraphQL queries, with traits derived from pre-defined structs.
Alternative considerations
Ensure that the server is properly configured to be secure and prevent unauthorized access or attacks.
async-graphql deals with this more extensively than juniper.
Additional context
Perhaps start off with the basic ones, like reading the cached objects.
We should add better structure to the Slack notification messages that are sent when the comparison function in the main loop doesn't return Ok(). This should include as much useful information as possible; here are a few points:
Considerations:
I think it would be good to do this issue after we've figured out the one for improving error handling - #10
Problem statement
There should be a process in place to build new Rust binaries for the most popular platforms (MacOS and a few Linux distros), as well as Docker images, so that users can just use those instead of having to build the POI Radio locally.
Expectation proposal
There should be an automatically generated latest release binary and Docker image available on the Github page of the repo.
Alternative considerations
A few possible alternatives:
Additional context
The best way to achieve this is to use Github Actions.
Problem statement
ATM only active allocations are being gossiped and validated, but offchain syncing subgraphs' POIs hold future value for indexers, who have an interest in learning about POI health before allocating.
Expectation proposal
Update the generate_topics function to include offchain syncing subgraphs. We can do so by querying Graph node indexingStatuses with a filter for the ones with node != removed. Provide a summary of the counts for synced=true, health="healthy", nonFatalErrors != null, and fatalError != null. Include synced subgraphs as part of the content topics.
Update the process_message function to check that the message identifier is among the sender's allocations. The allocation query has already been using the status=Active filter, so use the indexer's stake if active, 0 if not.
Attestation should potentially use two different sender arrays: active_indexers and offchain_syncing_indexers. During comparison, the offchain_syncing_indexers' POIs should count when searching for the number of unique attestations; notifications and logs should show the two arrays separately, but they will not officially result in POI divergence due to the 0 stake.
Problem statement
Currently the POI generation is sequential, using a for loop over the identifiers. One POI query could block the rest, such that the blocked queries are gossiped late.
Expectation proposal
Perhaps batches of 20-50 topics could be processed in parallel (best to include both sending and comparing messages for each topic); see the sketch after this list.
Include better summary on the reason a deployment is not sending or comparing messages
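A minimal sketch of the bounded-parallelism idea with the futures crate (handle_identifier is a hypothetical stand-in for the per-topic send/compare work):
use futures::stream::{self, StreamExt};

// Sketch: process up to 20 identifiers concurrently so that one slow POI
// query no longer delays the rest of the batch.
async fn gossip_all(identifiers: Vec<String>) {
    stream::iter(identifiers)
        .for_each_concurrent(20, |id| async move {
            handle_identifier(id).await;
        })
        .await;
}

// Hypothetical stand-in for the existing per-identifier work.
async fn handle_identifier(_id: String) {}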
Problem statement
Graph nodes configured with different block providers may have different chainhead blocks. The current delay is 5 blocks after the chainhead to send message and 2 blocks to gather remote messages before attesting for POI divergence. We should provide a more reasonable and configurable period of waiting for messages.
Expectation proposal
Replace the static constant let wait_block_duration = 2; with a time-based environment variable, and set the fallback to 30 minutes.
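A minimal sketch of reading such a variable with a 30-minute fallback (the variable name is hypothetical):
use std::time::Duration;

// Sketch: read the wait duration (in seconds) from the environment,
// falling back to 30 minutes when unset or unparsable.
fn wait_duration() -> Duration {
    std::env::var("COLLECT_MESSAGE_DURATION")
        .ok()
        .and_then(|s| s.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or_else(|| Duration::from_secs(30 * 60))
}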
In order to keep test messages isolated between parallel test runs, we need a unique Radio name for each test run. This can be in the format of an appended integer or UUID, or really whatever keeps it unique.
Problem statement
We want the Radio to export metrics via Prometheus in order to make visualisation of POI Radio data inside Grafana easy
Expectation proposal
As a user, I want to see:
Specific metric ideas (pinch of salt pls)
The POI Radio should export a variety of metrics indicating the state of the system:
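For illustration, a minimal sketch of registering a couple of plausible metrics with the prometheus crate (metric names are hypothetical):
use prometheus::{IntCounter, IntGauge, Registry};

// Sketch: register example counters/gauges the Radio could export.
fn register_metrics(registry: &Registry) -> prometheus::Result<(IntCounter, IntGauge)> {
    let messages_received =
        IntCounter::new("radio_messages_received_total", "Total gossip messages received")?;
    let active_peers = IntGauge::new("radio_active_peers", "Number of connected gossip peers")?;
    registry.register(Box::new(messages_received.clone()))?;
    registry.register(Box::new(active_peers.clone()))?;
    Ok((messages_received, active_peers))
}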
Problem statement
It is useful to be able to know which dependencies are attached to a given POI attestation. This would help us root cause divergence issues.
Expectation proposal
Imagine an Indexer could pass any number of supported dependencies as configuration. If this were a flag, it could take the form:
--dep <type>:<id>=<uri>
poi-radio --dep postgresql:primary=postgresql://host:5432 --dep chain:mainnet=http://geth:8545
For each provided flag, handler logic could be defined for the type that allows POI Radio to extract the version. For example, for the chain type, the POI Radio could call web3_clientVersion at the provided uri to get the client version. A SQL statement could similarly be executed to get the PostgreSQL version.
The resulting dependency information could be attached to POI messages.
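A minimal sketch of parsing such a flag value (illustrative; error handling kept minimal):
// Sketch: split "<type>:<id>=<uri>" into its three parts, e.g.
// parse_dep("chain:mainnet=http://geth:8545")
//     -> Some(("chain", "mainnet", "http://geth:8545"))
fn parse_dep(raw: &str) -> Option<(&str, &str, &str)> {
    let (dep_type, rest) = raw.split_once(':')?;
    let (id, uri) = rest.split_once('=')?;
    Some((dep_type, id, uri))
}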
Alternative considerations
None, but please comment with ideas.
Problem statement
From nPOI to nPOI, we can see the amount of aggregate attesting stake, but we can't see, at a glance, if the same group of indexers have attested from one nPOI to the next. This would be super helpful when charting nPOIs as it allows us to visualise groups of indexers across nPOI.
Expectation proposal
For each nPOI entry, it would be good to have a single identifier for the group of attesting indexers. We could generate this by deterministically ordering the indexer addresses attesting to a nPOI, and then hashing that list. We can then track "groups" of attesting stake from nPOI to nPOI.
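A minimal sketch of generating such a group identifier (using the sha2 and hex crates here; the exact hash function is an open choice):
use sha2::{Digest, Sha256};

// Sketch: sort the attesting addresses for determinism, then hash the
// joined list to get a stable group identifier.
fn attester_group_id(mut addresses: Vec<String>) -> String {
    addresses.sort();
    hex::encode(Sha256::digest(addresses.join(",").as_bytes()))
}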
Expectation proposal
Release Docker images with semver tagging
Alternative considerations
Also check out Docker's official GitHub Actions for managing tags
We should add more integration tests to the POI Radio, this can include checks for:
Please feel free to suggest more integration test scenarios.
Problem statement
We would like to expand the POI Radio to gossip messages on a subgraph level, instead of only the nPOI for each subgraph. We assume that Indexers can simply run a single instance of Subgraph Radio and be able to track all gossip, such as nPOIs and deployment health, and potentially the versioning of setups.
Expectation proposal
Rough outline
RadioPayloadMessage to be specifically ProofOfIndexingMessage
ProofOfIndexingMessage and DeploymentHealthMessage
DeploymentHealthMessage
radio_msg_handler to match generalized message types
message_send for generalized message types
message_compare for generalized message types
Alternative considerations
open to suggestions and ideas
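One possible shape for the generalized payload (a sketch only, per the open questions above):
// Sketch: a single payload enum the message handler can match on.
pub struct ProofOfIndexingMessage; // fields elided
pub struct DeploymentHealthMessage; // fields elided

pub enum SubgraphRadioPayload {
    ProofOfIndexing(ProofOfIndexingMessage),
    DeploymentHealth(DeploymentHealthMessage),
}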
Describe the bug
Currently both SERVER_HOST:SERVER_PORT/metrics and METRICS_HOST:METRICS_PORT/metrics provide metrics for Prometheus scraping, with different data. There should not be two different endpoints for metrics; this would require users to configure Prometheus to scrape two endpoints for a single service.
Expected behavior
Rather than the HTTP server exposing a /metrics endpoint, move HTTP metrics to the existing metrics interface if the HTTP server is enabled, and remove the current path at SERVER_HOST:SERVER_PORT/metrics.
Problem statement
Today, the POI Radio only gossips and subscribes to gossip about deployments that the current indexer has allocated to. Sometimes the Indexer is indexing many more subgraphs, and POI determinism continues to be relevant even when they are not actively allocated.
Expectation proposal
Allow the indexer to opt into participating in gossip for all deployments that they are currently indexing, instead of just those that they are allocated to.
Practically, this should be a change from querying the core network subgraph for allocated deployments to calling the indexingStatuses endpoint to get back all indexing deployments.
We may want to implement a lookup on the receiving end of a gossiped nPOI to check whether the sending indexer has actively allocated to it or not, IF we want to treat that case differently.
Update after #62 merges
Describe the bug
@stake-machine has a Fantom subgraph on testnet and the Radio is showing this warning:
2023-04-21T09:22:24.166410Z WARN graphcast_sdk: err_msg: "Subgraph is indexing an unsupported network fantom, please report an issue on https://github.com/graphops/graphcast-rs"
at /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/graphcast-sdk-0.1.0/src/lib.rs:159
Expected behavior
The Radio should support Fantom subgraphs, since Fantom is in our supported networks list here.
Problem statement
Deployment health is a different message type than nPOI messages. We need separate logic around the new message type.
Expectation proposal
DeploymentHealthMessage struct should contain fields:
deployment: String
health: Health // Enum
errors: Vec<SubgraphError>
error_type: ErrorType // Enum
message: String
block: Optional<BlockPointer>
handler: Optional<String>
deterministic: Bool
DeploymentHealthMessage from indexing statuses query
DeploymentHealthMessage ...
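A hedged Rust rendering of these fields (nesting the error fields under SubgraphError is one plausible reading of the list above; the enum variants are illustrative):
// Sketch: the proposed message type, with supporting types stubbed out.
pub enum Health { Healthy, Unhealthy, Failed }
pub enum ErrorType { Deterministic, NonDeterministic }
pub struct BlockPointer { pub number: u64, pub hash: String }

pub struct SubgraphError {
    pub error_type: ErrorType,
    pub message: String,
    pub block: Option<BlockPointer>,
    pub handler: Option<String>,
    pub deterministic: bool,
}

pub struct DeploymentHealthMessage {
    pub deployment: String,
    pub health: Health,
    pub errors: Vec<SubgraphError>,
}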
Alternative considerations
Generalize message handlers