
graphops / poi-radio

POI Radio monitors subgraph data integrity in real time using Graphcast SDK

Home Page: https://docs.graphops.xyz/graphcast/radios/poi-radio

License: Apache License 2.0

Rust 99.14% Shell 0.31% Dockerfile 0.56%
graph-protocol graphcast indexers radio the-graph

poi-radio's People

Contributors

chriswessels, hopeyen, petkodes, stake-machine

poi-radio's Issues

"invalid type: null, expected a string" validation error

Describe the bug
Both Suntzu Indexer and Data Nexus have encountered this issue:

thread 'main' panicked at 'Could not validate the supplied configurations: Validate the input: Graph node endpoint must be able to serve indexing statuses query: error decoding response body: invalid type: null, expected a string at line 1 column 2868', src/main.rs:45:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This happens right after the Graphcast ID resolves to a valid Indexer address.
It's happening both when running the Radio as a binary and when running it inside a Docker image.
These were the config variables used:

/usr/local/bin/poi-radio \
  --graphcast-network "mainnet" \
  --registry-subgraph "https://api.thegraph.com/subgraphs/name/hopeyen/graphcast-registry-mainnet" \
  --network-subgraph "https://gateway.thegraph.com/network" \
  --private-key "priv_key" \
  --graph-node-endpoint "http://graph-node-0:8030/graphql"

Additional context: the graph node instance was not on the same machine as the POI Radio instance.

POI Radio does not resolve Indexer address from valid Graphcast ID

Describe the bug
Registry contract does not resolve Indexer address from Graphcast ID

To Reproduce
When running POI Radio with a valid private key for a Graphcast ID address, the Radio properly derives the address

DEBUG graphcast_sdk: Wallet address: 0xd8b0a336a27e57dd163d19e49bb153c631c49697

But then does not resolve that to an Indexer address

INFO poi_radio: Acting on behave of indexer None with stake 0

This happens even though the address is registered in the registry contract.

Environment variables used:

PRIVATE_KEY="GRAPHCAST_ID_PRIVATE_KEY"
GRAPH_NODE_STATUS_ENDPOINT="http://host.docker.internal:8030/graphql"
REGISTRY_SUBGRAPH="https://api.thegraph.com/subgraphs/name/hopeyen/graphcast-registry-goerli"
NETWORK_SUBGRAPH="https://gateway.testnet.thegraph.com/network"
GRAPHCAST_NETWORK=testnet
RUST_LOG="off,hyper=off,graphcast_sdk=debug,poi_radio=debug,integration_tests=debug"

Additional context
This is the same issue that was encountered during the IndexerDAO workshop on Mar 13

Main event silent failure

Describe the bug
From indexer Suntzu's report, we found that the radio can fail or get stuck silently.

At the time of the report, the most recent logs were from 2 days earlier, yet the radio had not exited. The most recent 100 log lines were all from the gowaku node, and logging stopped when the last peer disconnected.

output:
2023-04-25T16:59:09.106Z        INFO    gowaku.node2.filter     filter/waku_filter.go:137       received a message push {"fullNode": false, "peer": "16Uiu2HAm5uqfdh7z2YTEps2MhvsXTk3uvSHZ9AtVkzipZZGbKJEL", "messages": 1}
2023-04-25T16:59:16.307Z        INFO    gowaku.node2    node/connectedness.go:68        peer disconnected       {"peer": "16Uiu2HAm5uqfdh7z2YTEps2MhvsXTk3uvSHZ9AtVkzipZZGbKJEL"}

In theory, the main loop should periodically update the network and subgraph states (with logs printed) and reconnect to peers that went offline or came back online. We don't see that here, so the event loop may be stuck or may have failed silently.

Expected behavior
Radio should

  • Fail or retry later if an operational process panics
  • Timeout and retry later if a process is stuck

Screenshots
(screenshot attached in the original issue: Screenshot 2023-04-27 at 13.48.25)

go-waku node compilation issues

Describe the bug
When building the newest go-waku as part of the bindings compilation on M1, we see linker issues:

/Users/petko/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/aarch64-apple-darwin/lib" "-o" "/Users/petko/work/poi-radio-e2e-tests/target/debug/deps/poi_radio_e2e_tests-8b198dc3a9aab04f" "-Wl,-dead_strip" "-nodefaultlibs"
  = note: Undefined symbols for architecture arm64:
            "_FSEventStreamCreate", referenced from:
                __cgo_9d3be7b4e652_Cfunc_EventStreamCreate in libwaku_sys-9f01d778247c290b.rlib(000039.o)
            "_FSEventStreamInvalidate", referenced from:
                __cgo_9d3be7b4e652_Cfunc_FSEventStreamInvalidate in libwaku_sys-9f01d778247c290b.rlib(000039.o)
                ...

This shouldn't affect dev or our Docker images since the Cargo.lock is on an older version of go-waku there.
We've reported this to the waku team and are waiting for fixes/suggestions.

502 Topic generation error

Describe the bug
@stake-machine has reported seeing this topic generation error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: []', /poi-radio/src/lib.rs:120:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
  2023-04-22T09:00:04.005612Z ERROR poi_radio: Topic generation error: HTTP status server error (502 Bad Gateway) for url (https://gateway.thegraph.com/network)
    at src/lib.rs:117

Expected behavior
Perhaps a retry mechanism or a soft failure would be more suitable than panicking.
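
As a rough illustration of the soft-failure idea, a generic retry helper could wrap the topic generation call instead of unwrap() (a minimal sketch assuming tokio for the delay; this is not existing Radio code):

    use std::time::Duration;

    // Minimal retry helper sketch; `attempts` is assumed to be >= 1.
    async fn retry<T, E, Fut, F>(mut op: F, attempts: u32, delay: Duration) -> Result<T, E>
    where
        F: FnMut() -> Fut,
        Fut: std::future::Future<Output = Result<T, E>>,
    {
        let mut last_err = None;
        for _ in 0..attempts {
            match op().await {
                Ok(value) => return Ok(value),
                Err(e) => {
                    last_err = Some(e);
                    tokio::time::sleep(delay).await;
                }
            }
        }
        Err(last_err.expect("attempts must be >= 1"))
    }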

[Feat.Req] Utilize indexingStatuses for block information and message trigger

Problem statement
Currently a block provider is required to trigger a query to the graph node and construct the POI message. For multichain support, a different block provider is needed for each indexing network, which creates overhead in matching providers to the correct subgraph indexing networks.

Expectation proposal

  • Upon start-up, query all indexingStatuses and populate the NETWORKS list with a statically configured block_interval
  • Query indexingStatuses for all active allocations (later this can be updated to all indexing deployments), and loop over each allocation's results
    - For subgraphs on the same indexing network, we can assume their chainhead_block information is the same, so the message_block calculation is consistent across allocations within a NETWORK
    - message_block = chainhead_block - chainhead_block % block_interval
    - if latest_block >= message_block, query the POI using the message_block number and hash, and send the message (see the sketch below)
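
A minimal sketch of the message_block calculation and send condition described above (names are illustrative, not the Radio's actual code):

    // message_block rounds the chain head down to the nearest block_interval boundary.
    fn message_block(chainhead_block: u64, block_interval: u64) -> u64 {
        chainhead_block - chainhead_block % block_interval
    }

    // Send (and query the POI at message_block) once the locally indexed
    // latest_block has caught up to message_block.
    fn should_send_message(latest_block: u64, message_block: u64) -> bool {
        latest_block >= message_block
    }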

Alternative considerations
If chainhead_block has not progressed by at least block_interval since the last polling period, the same messages would be sent again. I'm ignoring this case for now because the logic to track the previous message block sent across allocations on the same network needs to be thought through a bit more.
Ideas for later on:

  • In the Network struct, add fields for prev_message_block to track the last determined block to send message on.
  • Upon start-up, query all IndexingStatuses and populate the NETWORKS list with prev_message_block initialized to 0.
  • Use a separate loop after sending messages to update prev_message_block. This cannot happen within the message send loop, since one update would affect the subsequent allocations on the same network. (We could get around this by threading, or by grouping the allocations by network and nesting the query in a function that takes the network's message_block as input.)

Additional context
improvement to graphops/graphcast-sdk#88

Match git tag with docker tag

Describe the bug
Right now, the git workflow that builds Docker images is triggered on branch pushes that match v*.*.*, however the Docker image tags are pushed without the v prefix.

Expected behavior
The tag should be the same across docker and git

Messages still get stored after collection duration

Describe the bug

Currently attestations are triggered after the message collection duration, and the messages for a deployment at a specific block get cleared. Messages for the old block can still be received after the collection duration, and another attestation gets generated.

Expected behavior

Messages for a block should not be stored after the message collection duration.
Add a conditional check on message validity to only accept and store messages within the collection duration.
Update the collection duration to start from the local attestation time, instead of from the first message for the deployment and block.

Add POI Radio support to Launchpad

Problem statement
There's manual work required to add the POI Radio to Launchpad

Expectation proposal
We should submit a pull request to launchpad to include the POI Radio as part of the default indexing stack, alongside any doc updates that are required

Alternative considerations
None

Additional context
None

[Feat.Req] Generate an ephemeral test topic

Problem statement
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Expectation proposal
A clear and concise description of what you want to happen.

Alternative considerations
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Generate an ephemeral test topic so that the test instances only send/receive messages on it, in order for the tests to be fully deterministic and not get mixed up with other test instances that might be running at the same time somewhere else. Currently the test topic is just "poi-radio-test".

[FRQ] Disable notifications for already diverged subgraphs

Problem statement

Diverged subgraphs currently trigger a notification every comparison interval. After the first one, these notifications serve little purpose to indexers and can be perceived as spam.

Expectation proposal

Temporarily disable notifications for already diverged subgraphs. The messages should still be sent and compared, but a notification should only be sent when the comparison result's status has changed from the previous interval (matched <-> diverged).
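
A minimal sketch of tracking per-deployment comparison status so notifications only fire on transitions (the type and function names are illustrative, not existing Radio code):

    use std::collections::HashMap;

    #[derive(Clone, Copy, PartialEq, Eq)]
    enum ComparisonStatus {
        Matched,
        Diverged,
    }

    /// Returns true only when the status differs from the previous interval,
    /// so repeated "still diverged" results stay silent. The first result seen
    /// for a deployment also counts as a change here.
    fn should_notify(
        previous: &mut HashMap<String, ComparisonStatus>,
        deployment: &str,
        current: ComparisonStatus,
    ) -> bool {
        let changed = previous.get(deployment) != Some(&current);
        previous.insert(deployment.to_string(), current);
        changed
    }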

Alternative considerations
Alternatively, the agent could be configured to send notifications only at a fixed interval, but that is less "smart" than notifying when the status has changed significantly.

[FRQ] Add Persistence

Problem statement
Currently, all state for the Radio is kept in-memory. While this is all that is essential for the POI Radio to operate (i.e. detect real-time divergences), persisting state across process restarts results in slightly better quality data/performance, and also unlocks other functionality like smarter notifications (e.g. subgraph divergence state changes, rather than individual POI mismatch notifications) and better dashboards and reporting.

Expectation proposal
Move to a persisted state model. Consider a SQL database backend. Consider using https://github.com/launchbadge/sqlx, with SQLite as the (simple) recommended backend for Indexers.
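
A minimal persistence sketch using sqlx with SQLite, as suggested above; the table layout and values are purely illustrative, not a proposed schema:

    use sqlx::sqlite::SqlitePoolOptions;

    #[tokio::main]
    async fn main() -> Result<(), sqlx::Error> {
        // In-memory database for the sketch; a real deployment would point at a file.
        let pool = SqlitePoolOptions::new()
            .max_connections(5)
            .connect("sqlite::memory:")
            .await?;

        sqlx::query(
            "CREATE TABLE IF NOT EXISTS local_attestations (
                deployment   TEXT    NOT NULL,
                block_number INTEGER NOT NULL,
                npoi         TEXT    NOT NULL,
                PRIMARY KEY (deployment, block_number)
            )",
        )
        .execute(&pool)
        .await?;

        // Upsert a local attestation so it survives process restarts.
        sqlx::query(
            "INSERT OR REPLACE INTO local_attestations (deployment, block_number, npoi)
             VALUES (?, ?, ?)",
        )
        .bind("QmExampleDeploymentHash") // hypothetical deployment hash
        .bind(17_000_000_i64)
        .bind("0xexamplenpoi") // hypothetical nPOI
        .execute(&pool)
        .await?;

        Ok(())
    }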

Alternative considerations
Alternatively we can keep the Radio stateless, and somehow export the data to an external sink for persistence, however this added complexity likely doesn't make sense. We could also leave the Radio stateless and accept the constraints that come with that.

[Feat.Req] Periodic polling of indexer active allocations for content topics

Problem statement
Upon start-up, the radio queries the network subgraph for the Graphcast ID's indexer's active allocations and subscribes to the corresponding content topics. This means that later changes to the indexer's active allocations are not reflected in the topics the radio is listening to.

Expectation proposal

Create a separate polling loop on the network subgraph to monitor the indexer's active allocations, and apply the observed changes to keep the radio's content topic subscriptions up to date.
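
A rough sketch of such a polling loop; fetch_active_allocations and update_content_topics are hypothetical placeholders, not existing SDK calls:

    use std::time::Duration;

    async fn content_topic_poll_loop() {
        // The poll interval is arbitrary here; it would be configurable in practice.
        let mut ticker = tokio::time::interval(Duration::from_secs(600));
        loop {
            ticker.tick().await;
            // Re-query the network subgraph for the indexer's active allocations...
            let topics = fetch_active_allocations().await;
            // ...and refresh the radio's content topic subscriptions accordingly.
            update_content_topics(topics).await;
        }
    }

    async fn fetch_active_allocations() -> Vec<String> {
        // Placeholder: would query the network subgraph here.
        vec![]
    }

    async fn update_content_topics(_topics: Vec<String>) {
        // Placeholder: would update the Graphcast agent's subscriptions here.
    }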

Incorrect logic in compare_attestations

This is how we currently get the subgraph ipfs hashes inside compare_attestations:

    let (ipfs_hash, blocks) = match local.iter().next() {
        Some(pair) => pair,
        None => {
            return Ok(ComparisonResult::NotFound(String::from(
                "No local attestation found",
            )))
        }
    };

This is incorrect because that .next() just takes the first element in local (the problem was caused by my lack of oversight when implementing #48).

The fix is implemented in #75 .

[DRAFT] [Feat.Req] Add integration test for topic updates

Problem statement
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Expectation proposal
A clear and concise description of what you want to happen.

Alternative considerations
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

[FRQ] Telegram notifications

Problem statement
Currently there are notification features for Slack and Discord. We have received feedback from the indexer community that many prefer notifications from Telegram bots.

Expectation proposal

  • Add configurations supplying Telegram bot access.
  • Send notifications to Telegram similar to the other bots, if access provided.

Alternative considerations
We could refactor the notification mechanism to curry the three types of token access at config initialization.

Failure conditions before operational success should stop the process

Consider two program states:

  • Preoperational: All the stages through startup, up until regular operation
  • Operational: regular operation, polling dependencies, tracking blocks, comparisons, etc

Describe the bug
Today, there is a subset of startup error conditions (e.g. the indexer address resolving to None) that the Radio will swallow in order to stay alive. This is confusing behaviour, because misconfiguration should be made very clear to the user.

You could say that the radio is "focused on operational resilience" through both preoperational and operational stages.

Expected behavior
Any checks that fail or conditions required for healthy operation that are not met should cause the process to fail. HOWEVER, once the program has reached an "operational state" (after validating all input configuration and resolving deps) and is successfully operating, resiliency becomes important.

Pre-operational priority: Surface any dependency or configuration issues to the user! If it starts up and continues to run, the user will assume everything is working correctly!
Operational priority: Operational resilience. If someone needs to restart our graph-node dependency and it goes away for 2 mins, that should not cause a failure in the Radio.

Topic filtering not working properly

This is probably one for the SDK, but it's easy to reproduce and test here so that's why I'm putting it under the POI Radio repo.

Steps to reproduce:

  • Start a normal POI Radio instance with cargo run
  • Push a dummy topic to topics, replacing this code:
let mut topics = topics_query().await;
topics.push("my-other-sg".to_string());

then, force a message to be sent on that dummy topic by adding a block after a normal send message (code):

  let radio_message = RadioPayloadMessage::new("my-other-sg".to_string(), content.clone());
  match GRAPHCAST_AGENT
      .get()
      .unwrap()
      .send_message(
          "my-other-sg".to_string(),
          network_name,
          message_block,
          Some(radio_message),
      )
      .await
  {
      // Illustrative result handling only; mirror whatever the surrounding send code does.
      Ok(sent) => println!("Sent message on dummy topic: {:?}", sent),
      Err(e) => eprintln!("Failed to send message on dummy topic: {:?}", e),
  };

Expected behaviour:
The first instance should not receive this message, therefore should not treat it as a valid message.

[Bug] Unsafe threads

Problem statement

We ran into situations in which the radio gets stuck or is unable to respond to query requests. For instance:

  • content topic update (described in #115)
  • comparison result GraphQL query
  • logging of summary

We might have somehow introduced data races (this should be nearly impossible in Rust, so I'm very curious to find out the exact cause). To start with, our current practice of using global variables wrapped in OnceCell and Arc with std::sync::Mutex or tokio::sync::Mutex might not be effectively preventing deadlocks.

Expectation proposal

In theory, using a Mutex to access a global variable before passing it into async functions can be a valid approach to ensure safe concurrent access.

  • Review how global variables are accessed.
  • Redesign the concurrent aspects; potentially create a global struct to contain all the variables
  • Make sure that async tasks acquire and release locks asynchronously and do not access global variables directly.
  • Add timeouts to async functions to avoid tasks sitting around forever (see the sketch below)
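
A minimal sketch of acquiring a shared lock asynchronously with a timeout, so a stuck task surfaces an error instead of blocking forever (illustrative only, not existing Radio code):

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Mutex;
    use tokio::time::timeout;

    async fn with_timed_lock(state: Arc<Mutex<Vec<String>>>) -> Result<usize, &'static str> {
        match timeout(Duration::from_secs(5), state.lock()).await {
            // Lock acquired within 5 seconds; the guard is released when it drops.
            Ok(guard) => Ok(guard.len()),
            Err(_) => Err("timed out waiting for the lock"),
        }
    }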

Alternative Considerations

  • Use RwLock: for global variables that are read-heavy with infrequent writes, as RwLock allows multiple readers or a single writer.
  • Atomic types: for simple primitive values.
  • Message passing: communicate and share data between async tasks via channels (tokio::sync::mpsc); MPSC ensures only one task owns the data at a time, preventing concurrent access issues.
  • Set up tokio-console

Convert integration tests to #[cfg(test)] blocks instead of a binary crate

After the initial work to add basic e2e tests to the repo (#90), we need to convert the tests to actual cargo tests (under a #[cfg(test)] annotation) instead of a normal binary crate. This will allow us to sandbox everything and even use conditional compilation in the SDK and main POI Radio code to help us with mocks and test data in general.

Other minor improvement ideas:

  • Move the test script into the scripts folder
  • Switch to using ephemeral Radio name
  • Remove Docker and just use shell script

[Feat.Req] Add integration test for topic updates

Problem statement
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Expectation proposal
A clear and concise description of what you want to happen.

Alternative considerations
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Add POI Radio support to StakeSquid docker-compose stack

Problem statement
There's manual work required to add the POI Radio to StakeSquid's docker-compose stack

Expectation proposal
We should submit a pull request to their repo to include the POI Radio as part of the default indexing stack, alongside any doc updates that are required

Alternative considerations
None

Additional context
None

[Feat.Req] GraphQL Object and Resolver for comparison results

Problem statement
Provide a clean GraphQL query to show the radio comparison statuses

Expectation proposal

  • ComparisonResult to be GraphQL objects
  • Resolver to provide the comparison status of deployments: deployment hash, display name, NetworkEpoch (?), number of indexers attesting, stake-weighted consensus nPOI, number of indexers for each nPOI
  • Example query documented and imported in Grafana for a tabular view

[Feat.Req] Host http server and expose endpoint

Problem statement

The radio should host an HTTP server and expose an endpoint to serve either REST or GraphQL APIs. It should provide useful services accessible by other clients or app.

For the minimum implementation, we will only allow queries, so there should be no mutations or subscriptions

Expectation proposal

  • Exposes a well-defined API in both GraphQL and RESTful formats.
    • crate axum can be used, as it was already added as a dependency to serve Prometheus metrics
    • optional when running the Radio app (kept separate from /metrics so that endpoint stays secure).
  • Builds GraphiQLSource and defines the GraphQL schema and resolvers for objects with pre-defined structs
    • crate async-graphql seems to have better performance on async functions, supports subscriptions, and is runtime agnostic.

tentative examples

  • basics

      GET /health: returns the health status of the radio service.
      GET /metrics: returns statistics about the network, including the total number of received messages, the number of active peers, and the number of attestations by comparison results.
      GET/POST api/v1/graphql
    

    to convert to GraphQL

      messages/:deployment_hash/:block: returns the messages stored for the given deployment and block, include the message data, message count, unique sender count, and the list of attestations.
      attestations/:deployment_hash/:block: returns a list of attestations for a given deployment and block.
      local-attestations/:deployment_hash/:block: returns the local attestation stored for the given deployment and block, essentially just the nPOI.
    
  • example graphql macro

    #[derive(Clone)]
    pub struct RadioPayloadMessage {
       ...
    }
    
    #[Object]
    impl RadioPayloadMessage {
       fn {resolver_fn} ...
    }
    
  • Query based: Allow clients to search for a particular cached message or local attestations. These can be GraphQL queries, with traits derived from pre-defined structs.

Alternative considerations
Ensure that the server is properly configured to be secure and to prevent unauthorized access or attacks.
async-graphql covers more ground than juniper.

Additional context
Perhaps start off with the basics, like reading the cached objects.

Improve Slack notifications

We should add better structure to the Slack notification messages that are sent when the comparison function in the main loop doesn't return Ok(). They should include as much useful information as possible; here are a few points:

  • If the POI is diverging, send the list of senders who back that POI along with their stake. Also send info about other peers that have the same POI as the current Radio instance.
  • Don't overdo it with notifications, maybe some errors/warnings are not important enough to trigger notifications

Considerations:
I think it would be good to do this issue after we've figured out the one for improving error handling - #10

Automate building release binaries and Docker images

Problem statement
There should be a process in place to build new Rust binaries for the most popular platforms (MacOS and a few Linux distros), as well as Docker images, so that users can just use those instead of having to build the POI Radio locally.

Expectation proposal
There should be an automatically generated latest release binary and Docker image available on the Github page of the repo.

Alternative considerations
A few possible alternatives:

  • Using something like Neon and distributing the POI Radio via npm
  • Distributing the POI Radio as a binary in crates.io (this would still require the user to have Rust installed)
  • Requiring users to clone the repo, then build & run it locally

Additional context
The best way to achieve this is to use Github Actions.

[Feat.Req] Option to include syncing deployments in content topics

Problem statement

At the moment only active allocations are gossiped about and validated, but offchain-syncing subgraphs' POIs hold future value for indexers, who may want to learn about POI health before allocating.

Expectation proposal

Update the generate_topics function to include offchain-syncing subgraphs. We can do so by querying graph node indexingStatuses with a filter for the ones with node != removed. Provide a summary of the counts for synced=true, health=healthy, nonFatalErrors != null, and fatalError != null. Include the synced subgraphs as part of the content topics.

Update the process_message function to check that the message identifier is among the sender's allocations. The allocation query already uses the status=Active filter, so use the indexer's stake if active, and 0 if not.

Attestations should potentially use two different sender arrays: active_indexers and offchain_syncing_indexers. During comparison, offchain_syncing_indexers' POIs should count when searching for the number of unique attestations, and notifications and logs should show the two arrays separately, but they will not officially result in POI divergence due to the 0 stake.

[Feat.Req] Content topic Parallelization

Problem statement
Currently POI generation is sequential, using a for loop over the identifiers. One POI query can block the rest, so the blocked queries end up being gossiped late.

Expectation proposal
Perhaps batches of 20-50 topics could be processed in parallel (best to include both sending and comparing messages for each topic).

Also include a better summary of the reason a deployment is not sending or comparing messages.
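
A minimal sketch of the batched parallel processing described above, assuming the futures crate is available; process_topic stands in for the real send and compare logic:

    use futures::future::join_all;

    async fn process_all_topics(identifiers: Vec<String>) {
        // Run up to 20 topics concurrently, then move on to the next batch.
        for batch in identifiers.chunks(20) {
            let tasks = batch.iter().cloned().map(process_topic);
            join_all(tasks).await;
        }
    }

    async fn process_topic(_identifier: String) {
        // Placeholder: query the POI, send the message, and compare attestations here.
    }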

[Feat.Req] Configurable delayed message period

Problem statement

Graph nodes configured with different block providers may have different chain head blocks. The current delay is 5 blocks after the chain head to send a message, and 2 blocks to gather remote messages before attesting to POI divergence. We should provide a more reasonable and configurable period to wait for messages.

Expectation proposal

Replace the static constant let wait_block_duration = 2; with a time-based environment variable, and set the fallback to 30 minutes.
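
A minimal sketch of reading such a variable with the proposed 30 minute fallback; the variable name COLLECT_MESSAGE_DURATION is hypothetical:

    use std::env;
    use std::time::Duration;

    fn collect_message_duration() -> Duration {
        env::var("COLLECT_MESSAGE_DURATION")
            .ok()
            .and_then(|secs| secs.parse::<u64>().ok())
            .map(Duration::from_secs)
            // Fall back to 30 minutes when the variable is unset or invalid.
            .unwrap_or_else(|| Duration::from_secs(30 * 60))
    }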

Unique test Radio name

In order to keep test messages isolated between parallel test runs, we need a unique Radio name for each test run. This can be an appended integer, a UUID, or really anything that keeps it unique.

[EPIC] Improve e2e tests

Improvement ideas:

  • One-off test topic to keep test messages isolated between parallel test runs - #154
  • Make sure tests work on Ubuntu - #155
  • Create a bash script to help with running tests while introducing changes to POI Radio - #156
  • Add e2e tests in POI Radio repo - #166

Add Prometheus metrics to POI Radio

Problem statement
We want the Radio to export metrics via Prometheus in order to make visualisation of POI Radio data inside Grafana easy

Expectation proposal
As a user, I want to see:

  • About my own deployments/indexers
    • How many of my deployments are being actively crosschecked with other indexers (remotePOIs != 0)
    • How many and which of my local deployments have non-consensus POIs
    • How many indexers are participating across the deployments that the indexer is currently allocated to
  • About data integrity in the wider network
    • For subgraphs that I am allocated to(?), how many nPOI groups exist, what the total attesting stake per nPOI is for the deployment, and the total attesting stake per "indexer group hash" (#60)

Specific metric ideas (pinch of salt pls; a sketch follows the list below)
The POI Radio should export a variety of metrics indicating the state of the system:

  • aggregate_attesting_stake_npoi
    • labels: deployment hash, npoi, network, blockNumber
    • value: aggregate attesting stake in GRT
    • use cases: visualising total attesting stake in the network, visualising number of distinct nPOIs per subgraph deployment, visualising attesting stake per deployment or per network
  • aggregate_attesting_stake_indexer_group
    • labels: deployment hash, indexerGroupHash, network, blockNumber
    • value: aggregate attesting stake in GRT
    • use cases: visualising total attesting stake in the network, visualising number of attesting groups per subgraph deployment, visualising attesting stake per deployment or per network
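
A sketch of how the first metric could be defined and recorded, assuming the prometheus and once_cell crates; the metric name and labels mirror the idea above and are not final definitions:

    use once_cell::sync::Lazy;
    use prometheus::{register_gauge_vec, GaugeVec};

    static AGGREGATE_ATTESTING_STAKE_NPOI: Lazy<GaugeVec> = Lazy::new(|| {
        register_gauge_vec!(
            "aggregate_attesting_stake_npoi",
            "Aggregate attesting stake in GRT per deployment, nPOI, network and block",
            &["deployment_hash", "npoi", "network", "block_number"]
        )
        .expect("metric can be registered")
    });

    fn record_attesting_stake(deployment: &str, npoi: &str, network: &str, block: u64, stake_grt: f64) {
        AGGREGATE_ATTESTING_STAKE_NPOI
            .with_label_values(&[deployment, npoi, network, &block.to_string()])
            .set(stake_grt);
    }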

[FRQ] Optionally pass in dependencies

Problem statement
It is useful to be able to know which dependencies are attached to a given POI attestation. This would help us root cause divergence issues.

Expectation proposal
Imagine an Indexer could pass any number of supported dependencies as configuration. If this were a flag, it could take the form:

--dep <type>:<id>=<uri>

poi-radio --dep postgresql:primary=postgresql://host:5432 --dep chain:mainnet=http://geth:8545

For each provided flag, handler logic could be defined for the type that allows POI Radio to extract the version. For example, for the chain type, the POI radio could call web3_clientVersion at the provided uri to get the client version. A SQL statement could similarly be executed to get PostgreSQL version.

The resulting dependency information could be attached to POI messages.
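
A minimal sketch of parsing one --dep value into its parts; the function is illustrative, and the real flag handling would live in the Radio's config layer:

    /// Parse a `--dep <type>:<id>=<uri>` value, e.g. "chain:mainnet=http://geth:8545".
    fn parse_dep(raw: &str) -> Option<(String, String, String)> {
        let (type_and_id, uri) = raw.split_once('=')?;
        let (dep_type, id) = type_and_id.split_once(':')?;
        Some((dep_type.to_string(), id.to_string(), uri.to_string()))
    }

Splitting on '=' before ':' keeps URIs that themselves contain colons (e.g. postgresql://host:5432) intact.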

Alternative considerations
None, but please comment with ideas.

Track attestation groups across nPOIs

Problem statement
From nPOI to nPOI, we can see the amount of aggregate attesting stake, but we can't see at a glance whether the same group of indexers has attested from one nPOI to the next. This would be super helpful when charting nPOIs, as it allows us to visualise groups of indexers across nPOIs.

Expectation proposal
For each nPOI entry, it would be good to have a single identifier for the group of attesting indexers. We could generate this by deterministically ordering the indexer addresses attesting to an nPOI and then hashing that list. We can then track "groups" of attesting stake from nPOI to nPOI.
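
A minimal sketch of deriving such a group identifier, assuming the sha3 and hex crates purely for illustration:

    use sha3::{Digest, Keccak256};

    /// Deterministically order the attesting indexer addresses, then hash the
    /// concatenation to get a single "indexer group hash".
    fn indexer_group_hash(mut addresses: Vec<String>) -> String {
        addresses.sort();
        let mut hasher = Keccak256::new();
        for address in &addresses {
            hasher.update(address.to_lowercase().as_bytes());
        }
        format!("0x{}", hex::encode(hasher.finalize()))
    }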

Integration tests

We should add more integration tests to the POI Radio, this can include checks for:

  • first time sender
  • attestation conflicts
  • ...

Please feel free to suggest more integration test scenarios.

[FRQ] Rescope POI radio to Subgraph radio

Problem statement
We would like to expand the POI Radio to gossip messages at the subgraph level, instead of only nPOIs for each subgraph. We assume that indexers can then run a single instance of Subgraph Radio and be able to track all gossip such as nPOIs, deployment health, and potentially setup versioning.

Expectation proposal
Rough outline

  • Rename relevant namespaces
  • Update RadioPayloadMessage to be specifically ProofOfIndexingMessage
  • Configurable message types to turn on with possible values. Start off with ProofOfIndexingMessage and DeploymentHealthMessage
  • Add struct, construction, parsing, and comparison mechanism for DeploymentHealthMessage
  • New radio_msg_handler to match generalized message types
    • compare benefits: 1. generalized storage into 1 array, same validation processes, sorted at attestation; 2. separate storage for each message type, modular validation process (not necessary as indexer identity requirement is the same), no sorting at attestation
  • New local_attestations struct to contain fields for all message types
  • New message_send for generalized message types
  • New message_compare for generalized message types

Alternative considerations
open to suggestions and ideas

Remove http server `/metrics` path

Describe the bug
Currently both SERVER_HOST:SERVER_PORT/metrics and METRICS_HOST:METRICS_PORT/metrics provide metrics for Prometheus scraping, with different data. There should not be two different metrics endpoints for a single service, as that would require users to configure Prometheus to scrape two endpoints.

Expected behavior
Rather than the HTTP server exposing its own /metrics endpoint, expose those metrics via the existing metrics interface when the HTTP server is enabled, and remove the current SERVER_HOST:SERVER_PORT/metrics path.

Add toggle for gossiping about non-allocated deployments

Problem statement
Today, the POI Radio only gossips and subscribes to gossip about deployments that the current indexer has allocated to. Sometimes the Indexer is indexing many more subgraphs, and POI determinism continues to be relevant even when they are not actively allocated.

Expectation proposal
Allow the indexer to opt into participating in gossip for all deployments that they are currently indexing, instead of just those that they are allocated to.

Practically, this should be a change from querying the core network subgraph for allocated deployments to calling the indexingStatuses endpoint to get back all indexing deployments.

On the receiving end, we may want to implement a lookup to check whether the sender of a gossiped nPOI has actively allocated to the deployment or not, IF we want to treat that case differently.

[DRAFT] [Refactor] Switch to AsyncMutex where possible + add timeouts to .await calls

Problem statement
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Expectation proposal
A clear and concise description of what you want to happen.

Alternative considerations
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Fantom subgraphs throw an "unsupported network" warning

Describe the bug
@stake-machine has a Fantom subgraph on testnet and the Radio is showing this warning:

  2023-04-21T09:22:24.166410Z  WARN graphcast_sdk: err_msg: "Subgraph is indexing an unsupported network fantom, please report an issue on https://github.com/graphops/graphcast-rs"
    at /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/graphcast-sdk-0.1.0/src/lib.rs:159

Expected behavior
The Radio should support Fantom subgraphs, since Fantom is in our supported networks list here.

[FRQ] add DeploymentHealthMessage

Problem statement
Deployment health is a different message type from nPOI messages, so we need separate logic around the new message type.

Expectation proposal

  • New DeploymentHealthMessage struct should contain the fields below (a Rust sketch of these structs follows this list)
    deployment: String
    health: Health // Enum
    errors: Vec<SubgraphError>

    and each SubgraphError is either a nonFatal or a fatal error returned by the graph node status endpoint, with the fields
    error_type: ErrorType // Enum
    message: String
    block: Optional<BlockPointer>
    handler: Optional<String>
    deterministic: Bool
    
  • Construct DeploymentHealthMessage from indexing statuses query
  • Comparison mechanism for DeploymentHealthMessage
    • Event driven instead of periodic messages like nPOI
    • Send notifications for health discrepancy
    • Explore other use cases like unicast channels for automatic debug info sharing
  • Update the local_attestations struct to contain fields for deployment health
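
A rough Rust sketch of the structs outlined above; the enum variants are illustrative, and BlockPointer is shown as a placeholder for the SDK's block pointer type:

    pub enum Health {
        Healthy,
        Unhealthy,
        Failed,
    }

    pub enum ErrorType {
        NonFatal,
        Fatal,
    }

    // Placeholder for the SDK's block pointer type (block number + hash).
    pub struct BlockPointer {
        pub number: u64,
        pub hash: String,
    }

    pub struct SubgraphError {
        pub error_type: ErrorType,
        pub message: String,
        pub block: Option<BlockPointer>,
        pub handler: Option<String>,
        pub deterministic: bool,
    }

    pub struct DeploymentHealthMessage {
        pub deployment: String,
        pub health: Health,
        pub errors: Vec<SubgraphError>,
    }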

Alternative considerations
Generalize message handlers
