informalsystems / tendermint

This project forked from tendermint/tendermint

A temporary fork of the original Tendermint Core repository (please use CometBFT instead)

Home Page: https://github.com/cometbft/cometbft

License: Apache License 2.0

Shell 0.62% Python 0.61% Go 84.10% TeX 7.93% Makefile 0.44% HTML 0.02% HCL 0.05% TLA 5.75% Dockerfile 0.16% Jinja 0.32%
bft consensus tendermint

tendermint's Issues

Gossip data to a peer without valid channel increases cpu usage

Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source):
0.34.23

ABCI app (name for built-in, URL for self-written if it's publicly available):
https://github.com/public-awesome/stargaze

Environment:

  • OS ubuntu 20.04+

What happened:
The Stargaze mainnet currently has multiple reports of increased CPU usage without any meaningful change in our stack.

After digging a bit, we found that gossipDataRoutine, and specifically the gossipDataForCatchup method, was causing this increase.

In the following snippet, if SendEnvelopeShim fails, the routine immediately retries gossiping the same block part until the peer state changes (a different round, etc.). Each retry generates more work because it loads the block meta and block part from disk again.

    if p2p.SendEnvelopeShim(peer, p2p.Envelope{ //nolint: staticcheck
        ChannelID: DataChannel,
        Message: &tmcons.BlockPart{
            Height: prs.Height, // Not our height, so it doesn't matter.
            Round:  prs.Round,  // Not our height, so it doesn't matter.
            Part:   *pp,
        },
    }, logger) {
        ps.SetHasProposalBlockPart(prs.Height, prs.Round, index)
    } else {
        logger.Debug("Sending block part for catchup failed")
    }
    return

Adding a small sleep, as the other error checks in this routine do, fixes the problem. Our fork does this in public-awesome@da5a32f, which seemed to reduce the CPU usage:

    time.Sleep(conR.conS.config.PeerGossipSleepDuration)

Currently there is no way to know from this method whether the peer is valid for sending the packet, since hasChannel is a private method. Ideally we could avoid loading from disk by first checking something like peer.IsValid() and only then executing the remaining logic.

What you expected to happen:
Add a delay or a check that prevents sending data to a peer with an invalid state.

Have you tried the latest version: yes/no
Yes

How to reproduce it (as minimally and precisely as possible):
It is hard to replicate the current network conditions, as there seem to be some invalid peers in the network causing this issue, but joining the network with a new node will reproduce it.

Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file):

Config (you can paste only the changes you've made):

node command runtime flags:

Please provide the output from the http://<ip>:<port>/dump_consensus_state RPC endpoint for consensus bugs

Anything else we need to know:

http_json_handler: Don't have the "failed to write response" error log the entire response

Summary

There are many causes of "failed to write response" errors in this file: https://github.com/tendermint/tendermint/blob/v0.34.24/rpc/jsonrpc/server/http_json_handler.go#L19-L129

One of these frequently leads to massive "log bombs" that span thousands of characters, crowd out all other content in the CLI logs, and look quite concerning:

 Nov 17 01:49:48 <node_name> cosmovisor[1541013]: 1:49AM ERR failed to write responses err="write tcp 127.0.0.1:26657->127.0.0.1:34266: i/o timeout" module=rpc-server res=[{"id":998254319713,"jsonrpc":"2.0","result":{"total_count":"329048","txs":$MASSIVE_AMOUNT_OF_TEXT

There are also some \n characters in this output that split it across multiple lines and break log parsing formats.

Can we make a smaller result serialization here, that just has a truncated / trimmed amount of data, that gets output to the CLI?

This would significantly reduce the size of the logs and increase the amount of information that can be easily gleaned by scrolling through them.
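
As a rough illustration of the request above, the following sketch trims a serialized response before it is logged and escapes embedded newlines so each entry stays on one line. The helper name truncateForLog and the character limit are assumptions made for this example; nothing here is taken from the Tendermint RPC server code.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateForLog is a hypothetical helper: it escapes newlines so the value
// stays on one log line, then keeps at most maxRunes characters and notes
// how many were dropped.
func truncateForLog(s string, maxRunes int) string {
	if maxRunes < 0 {
		maxRunes = 0
	}
	// Replace real newlines with the two-character sequence `\n`.
	s = strings.ReplaceAll(s, "\n", `\n`)
	// Work on runes so a multi-byte UTF-8 character is never cut in half.
	r := []rune(s)
	if len(r) <= maxRunes {
		return s
	}
	return fmt.Sprintf("%s... (%d more characters)", string(r[:maxRunes]), len(r)-maxRunes)
}

func main() {
	// Simulate a very large JSON-RPC response body.
	res := `{"id":1,"jsonrpc":"2.0","result":{"txs":"` + strings.Repeat("A", 5000) + "\n" + `"}}`
	fmt.Println("res=" + truncateForLog(res, 80))
}
```

A handler could pass the shortened string to the logger in place of the full response, keeping the err field intact so the underlying i/o timeout is still visible.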


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned

Consensus error reported from validator using Gaia v9.0.x

A validator reported the following consensus error. Currently waiting for additional information/logs.

Tendermint version:
0.34.26

Gaia version:
9.0.x (waiting for the exact version; it is either v9.0.0 or v9.0.1).

Environment:

  • OS (e.g. from /etc/os-release):
  • Install tools:
  • Others:

What happened:
Application stopped validating blocks. Validator jailed.

What you expected to happen:
No error.

Have you tried the latest version: yes/no
no

How to reproduce it (as minimally and precisely as possible):
Unknown

Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file):
https://pastebin.com/AjJZUrVP

Config (you can paste only the changes you've made):
Waiting for more information.

node command runtime flags:
Waiting for more information.

Please provide the output from the http://<ip>:<port>/dump_consensus_state RPC endpoint for consensus bugs
Asked, waiting for more information.

Anything else we need to know:
