
taraxa-project / taraxa-node

37 stars, 11 watchers, 18 forks, 922.66 MB

Taraxa blockchain full node implementation

License: MIT License

Languages: C++ 93.99%, Dockerfile 0.19%, Shell 0.26%, CMake 2.56%, Python 2.66%, C 0.04%, Mustache 0.29%
Topics: taraxa, blockchain, pbft, dag

taraxa-node's Introduction

Introducing Taraxa

Taraxa is a Practical Byzantine Fault Tolerance (PBFT) blockchain.

Whitepaper

You can read the Taraxa Whitepaper at https://www.taraxa.io/whitepaper.

Quickstart

Just want to get up and running quickly? We have pre-built docker images for your convenience. More details are in our quickstart guide.

Downloading

There are two ways to run the latest version of taraxa-node:

Docker image

Download and run the Taraxa Docker image with the pre-installed taraxad binary here.

Ubuntu binary

Download and run the statically linked taraxad binary here.

Building

If you would like to build from source, we have build instructions for Linux (Ubuntu LTS) and macOS.

Running

Inside docker image

taraxad --conf_taraxa /etc/taraxad/taraxad.conf

Pre-built binary or manual build:

./taraxad --conf_taraxa /path/to/config/file

Contributing

Want to contribute to the Taraxa repository? We highly appreciate community work, so if you would like to participate you are more than welcome. You can start by reading the contributing tutorial.

Useful doc

System Requirements

For a full web node, you need at least ...GB of disk space available. The block log of the blockchain itself is a little over ...GB. It's highly recommended to run taraxad on a fast disk such as an SSD. At least ...GB of memory is required for a full web node. Any CPU with decent single-core performance should be sufficient. Taraxa is constantly growing, so you may find you need more disk space to run a full node. We are also constantly working on optimizing Taraxa's use of disk space.



taraxa-node's Issues

chore(sandbox): compile issue

Description

CMakeLists allows creating a sandbox for the Taraxa node:
https://github.com/Taraxa-project/taraxa-node/blob/develop/CMakeLists.txt#L195

Under the local directory, you can create CMakeLists_ext.cmake:
add_executable(sandbox local/sandbox.cpp)
target_link_libraries(
    sandbox
    app_base
    gtest
)

Then create sandbox.cpp with helper code for the Taraxa node. For example:
#include <chrono>    // original header names were lost in formatting;
#include <iostream>  // these are the ones this code needs
#include <thread>
#include "../tests/util_test/util.hpp"
using namespace taraxa;
using namespace std;

int main() {
  auto node_cfgs = core_tests::make_node_cfgs(5);
  auto nodes = core_tests::launch_nodes(node_cfgs);
  for (uint i = 0;; ++i) {
    auto node_index = i % nodes.size();
    auto &node = nodes[node_index];
    cout << "Node " << node_index << " num blocks: " << node->getFinalChain()->last_block_number() << endl;
    node->getTransactionManager()->insertTransaction(
        Transaction(i, 1, 0, 0, {}, nodes[0]->getSecretKey(), node->getAddress()), true);
    this_thread::sleep_for(500ms);
  }
  return 0;
}

Environment

Node version

Use this docker command to find the node image version:

docker images --format '{{.ID}}' 'taraxa/taraxa-node:latest'

Operating System details.
CPU, memory, disk details.

NOTE: In many cases log files are also useful to include. Since these files may be large, a Taraxa developer may request them later. These files may include public addresses that you're participating with. If that is a concern please be sure to scrub that data.

Steps to reproduce

Expected behaviour

Actual behaviour

The sandbox no longer compiles; compile errors:

ld: library not found for -lgtest
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [bin/sandbox] Error 1
make[1]: *** [CMakeFiles/sandbox.dir/all] Error 2
make: *** [all] Error 2

Optimize BlocksPacket and PbftBlockPacket packets

Is your feature request related to a problem? Please describe.

These packets are used for syncing. We are not checking whether we have already sent DAG/PBFT blocks and transactions to the peers, which means we send them over and over, creating:

  • huge packet sizes
  • useless overhead because of repeated processing
  • spam in logs

Describe the solution you'd like.

Check whether DAG/PBFT blocks and transactions were already sent to the peer. If yes, send only hashes; if not, send the full data.

Do you have ideas regarding the implementation of this feature?

It will require some changes to the RLP we send so we can differentiate, for example, how many full transactions (vs. just hashes) we sent with a DAG block...
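
A minimal sketch of the per-peer bookkeeping this implies (the Hash alias and PeerState names are illustrative, not the actual peer class):

#include <string>
#include <unordered_set>

using Hash = std::string;  // stand-in for the real hash type

struct PeerState {
  std::unordered_set<Hash> known_dag_blocks;
  std::unordered_set<Hash> known_transactions;

  // insert().second is true only the first time a hash is recorded,
  // i.e. the full object still has to be sent to this peer; afterwards
  // only the hash would be included in sync packets.
  bool needsFullDagBlock(const Hash& h) { return known_dag_blocks.insert(h).second; }
  bool needsFullTransaction(const Hash& h) { return known_transactions.insert(h).second; }
};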

Get rid of "new" memory allocations

Is your feature request related to a problem? Please describe.

There are many places where new is used to allocate memory for smart pointers, e.g.:
return u_ptr(new DatabaseImpl(move(db), move(column)));

All such allocations might lead to memory leaks in some edge-case scenarios...

Describe the solution you'd like.

Use make_shared, make_unique instead...
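
For illustration, a minimal before/after sketch (Db, Column, and DatabaseImpl are stand-ins for the real types):

#include <memory>
#include <utility>

struct Db {};
struct Column {};
struct DatabaseImpl {
  DatabaseImpl(Db, Column) {}
};

// Before: a raw `new` handed to the smart pointer afterwards; in some
// edge cases (e.g. a throwing expression evaluated in between) the
// allocation can leak.
std::unique_ptr<DatabaseImpl> make_before(Db db, Column column) {
  return std::unique_ptr<DatabaseImpl>(new DatabaseImpl(std::move(db), std::move(column)));
}

// After: allocation and construction happen in a single expression,
// so there is no window in which the raw pointer is unowned.
std::unique_ptr<DatabaseImpl> make_after(Db db, Column column) {
  return std::make_unique<DatabaseImpl>(std::move(db), std::move(column));
}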

Do you have ideas regarding the implementation of this feature?
Are you willing to implement this feature?

Additional context.

Save logs in separate folder

Is your feature request related to a problem? Please describe.

Save logs in the data/logs folder, not in the data/db folder...

Devnet nodes and testnet nodes connect to each other

Description

Found a devnet node syncing with testnet nodes.

Environment

Node version

Use this docker command to find the node image version:

docker images --format '{{.ID}}' 'taraxa/taraxa-node:latest'

Operating System details.
CPU, memory, disk details.

NOTE: In many cases log files are also useful to include. Since these files may be large, a Taraxa developer may request them later. These files may include public addresses that you're participating with. If that is a concern please be sure to scrub that data.

Steps to reproduce

Expected behaviour

Actual behaviour

2021-05-18 17:02:42.185822 NEXTVOTESSYNC INFO Received 11 next votes from peer ##9768ec83… node current round 35657, peer pbft round 87923

Node crashes due to pure virtual method being called

Description

Caught signal 11 (SIGSEGV) Stack Trace: pure virtual method called terminate called without an active exception Caught signal 6 (SIGABRT)

Environment

Docker image

Node version

docker-pullable://gcr.io/jovial-meridian-249123/taraxa-node-develop@sha256:fbf392ebfe2e538c8c033f0c9d247dec470eadc912bcf73cf96384095dd969b9

docker images --format '{{.ID}}' 'taraxa/taraxa-node:latest'

Operating System details.
CPU, memory, disk details.

NOTE: In many cases log files are also useful to include. Since these files may be large, a Taraxa developer may request them later. These files may include public addresses that you're participating with. If that is a concern please be sure to scrub that data.

Steps to reproduce

  1. Run node for a while
  2. Wait for crash :-)

Limit packets size to 16MB

Is your feature request related to a problem? Please describe.

There is a 16MB maximum packet size limit on the networking layer. There are multiple places in our code where we might send a bigger packet than that...

Describe the solution you'd like.

  • if a packet is bigger than 16MB, split it into multiple packets (see the sketch below)
  • make those huge packets more efficient, e.g. "BlocksPacket" sends all unfinalized DAG blocks together with all transactions; there is a strong chance that most of the transactions were already sent to the peer, so we can send only hashes instead of full transactions... There are probably multiple places in the code suitable for optimizations like this
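
A minimal sketch of the splitting idea (raw byte chunks for brevity; an actual fix would split on RLP item boundaries so each chunk stays independently decodable):

#include <algorithm>
#include <cstddef>
#include <vector>

using bytes = std::vector<unsigned char>;
constexpr std::size_t kMaxPacketSize = 16 * 1024 * 1024;  // 16MB network limit

std::vector<bytes> splitPacket(const bytes& payload) {
  std::vector<bytes> chunks;
  for (std::size_t off = 0; off < payload.size(); off += kMaxPacketSize) {
    const std::size_t len = std::min(kMaxPacketSize, payload.size() - off);
    chunks.emplace_back(payload.begin() + off, payload.begin() + off + len);
  }
  return chunks;
}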

Node's starting value not updated to reflect change in prior round's next voted value

Description

In the first finishing step, the last conditional value to vote on is the node's own starting value. This value is determined at round start based on the previous round's next voted value. When, during the round, we update our understanding of the previous round's next voted value, we need to update our own starting value to the previous round's next voted block if it is currently NULL BLOCK HASH.
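
A hedged sketch of the proposed rule (NULL_BLOCK_HASH and the field names are illustrative, not the actual PbftManager members):

#include <string>

using blk_hash_t = std::string;                 // stand-in for the real hash type
const blk_hash_t NULL_BLOCK_HASH = blk_hash_t();

struct RoundState {
  blk_hash_t own_starting_value;                // today: fixed at round start
  blk_hash_t prev_round_next_voted_value;       // may be learned mid-round
};

// Called whenever we learn an updated next-voted value from the previous round.
void onPrevRoundNextVoteUpdate(RoundState& s, const blk_hash_t& next_voted) {
  s.prev_round_next_voted_value = next_voted;
  // Proposed fix: if our own starting value is still the NULL placeholder,
  // adopt the previous round's next-voted block instead of keeping it.
  if (s.own_starting_value == NULL_BLOCK_HASH && next_voted != NULL_BLOCK_HASH) {
    s.own_starting_value = next_voted;
  }
}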

Print incoming requests

In order to figure out the communication-related issue, more logs about the requests (HTTP so far) would be useful: all the information about the source, and whatever result comes from the server.

Split PBFT block when syncing

Description

The fix of sendPbftBlocks uses one PBFT block as a unit, which includes all DAG blocks and related transactions.
In the current config setup, the max number of DAG blocks in one PBFT block is 100, and the max number of transactions in one DAG block is 250.

  1. Since the max size of one network packet is 16MB, I am not sure that is enough for one PBFT block that includes 100 DAG blocks plus 25,000 transactions.
  2. In the current PBFT lambda setup, one PBFT finalization takes around 4s on average, which limits TPS to about 6,250 (25,000 transactions / 4s). In order to include more transactions in one PBFT block and improve TPS, we should be able to split a PBFT block when syncing.

Environment

Node version

Use this docker command to find the node image version:

docker images --format '{{.ID}}' 'taraxa/taraxa-node:latest'

Operating System details.
CPU, memory, disk details.

NOTE: In many cases log files are also useful to include. Since these files may be large, a Taraxa developer may request them later. These files may include public addresses that you're participating with. If that is a concern please be sure to scrub that data.

Steps to reproduce

Expected behaviour

Should be able to split a PBFT block with its DAG blocks and transactions when syncing.

Actual behaviour

In PBFT block syncing, one PBFT block must include all its DAG blocks and all transactions.

Taraxa capability refactor proposal

Is your feature request related to a problem? Please describe.

In light of our recent findings about the network getting stuck due to delays in processing consensus-related packets, I have a proposal.

It is assumed that consensus is not able to proceed because the votes are not processed fast enough. That is a consequence of our current design, in which we basically process all packets in a single thread; if processing some packets takes too long (for various possible reasons), consensus gets stuck.

Existing solution

We have a separate thread for low-level networking work like receiving packets. Once a packet is received, it is moved to a single-threaded thread pool in taraxa capability, which means the networking thread is no longer blocked. Although this separates the networking thread from the processing thread, it does not really solve our problem: all packets now sit in a single processing thread and block each other because they are processed synchronously. If some of them take too long to process, this again delays the processing of votes, which ultimately leads to the network getting stuck.

Proposed solution

Priority queue

There could be a concurrent priority queue in which we save each packet's RLP representation together with its priority and packet type. Once the networking thread receives a new packet, all it does is parse the packet's type, assign it a priority, and save this data (packet RLP + type + priority) to the priority queue.

  • Consensus-related packets - highest priority
  • New txs/blocks packets - intermediate priority
  • StatusPacket, syncing packets - lowest priority

Real thread pool

Once we have such a priority queue, there would be a thread pool with multiple threads concurrently processing packets from the queue. Note that not all packets can be processed in a truly concurrent way (e.g. syncing packets), but that kind of synchronization can be part of the polling logic on the priority queue... Not all code is ready for such processing, but we can start step by step. For example, StatusPacket/GetPbftBlockPacket/... can be processed concurrently with only a few changes even now...

Polling from queue

As the name "priority queue" suggests, we would assign different priorities to different packets. For example, consensus-related packets would have the highest priority, which means they would always be polled first, even if other packets with lower priority (e.g. syncing packets) arrived earlier...

There would also be another type of synchronization in polling, which would take into account dependencies between packets, or the requirement that some specific packets be processed synchronously (e.g. syncing)...

Processing of polled data

There would be a registered handler function for each packet type. Ultimately, each current "case" in taraxa_capability would become a separate handler. Once we poll a packet from the queue, we know we can start processing it immediately, as all the synchronization logic is done during queue polling: it serves only packets that are not blocked by the ongoing processing of other packets...

This way we would also break up that gigantic packet-processing function into smaller functions (handlers).

Pseudo code

// Networking thread
packet          = async_read()
packet_type     = getPacketType(packet)
packet_priority = getPacketPriority(packet_type)

priority_queue.push(packet_type, packet_priority, packet)


// Thread pool processing threads
[packet_type, packet_rlp] = priority_queue.pop()

handler = getHandler(packet_type)
handler(packet_rlp)
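
A hedged C++ sketch of the queue and dispatch described above (packet types, priorities, and the handler map are illustrative; the dependency-aware polling is omitted for brevity):

#include <condition_variable>
#include <functional>
#include <map>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

enum class PacketType { Vote, NewBlock, Status };
using bytes = std::vector<unsigned char>;

struct QueuedPacket {
  int priority;  // higher = more urgent (consensus packets first)
  PacketType type;
  bytes rlp;
  bool operator<(const QueuedPacket& o) const { return priority < o.priority; }
};

class PacketQueue {
 public:
  // Called by the networking thread: parse type, assign priority, push.
  void push(QueuedPacket p) {
    { std::lock_guard<std::mutex> lock(mu_); q_.push(std::move(p)); }
    cv_.notify_one();
  }
  // Called by worker threads: blocks until a packet is available and
  // always returns the highest-priority one.
  QueuedPacket pop() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !q_.empty(); });
    QueuedPacket p = q_.top();
    q_.pop();
    return p;
  }
 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::priority_queue<QueuedPacket> q_;
};

// Worker loop run by each thread in the pool: poll a packet and dispatch
// it to the handler registered for its type (one handler per former "case").
void workerLoop(PacketQueue& queue,
                const std::map<PacketType, std::function<void(const bytes&)>>& handlers) {
  for (;;) {
    QueuedPacket p = queue.pop();
    handlers.at(p.type)(p.rlp);
  }
}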

Ultimate goal

We could ultimately even get rid of all those separate threads we have at the moment (e.g. threads for validating DAG blocks, transactions, etc.). Threads are in general expensive, and most of our threads are usually idle. For example, even the threads that validate new DAG blocks are, in terms of processor time, mostly just waiting...

We would not refactor everything right away, but having a thread pool that could process packets as they come, without the need for intermediate queues, might be an interesting idea, and we would use those threads much more efficiently than we do now.

Pros

  • We can easily prioritize certain packets, so for example consensus-related packets would be processed immediately - no delays for consensus
  • much more efficient thread usage (in case we start removing some of the intermediate queues plus the threads that poll from them)
  • real concurrent processing of packets

Cons

  • To really take advantage of this design, we would have to adjust some parts of the code so they can be processed in a truly concurrent way. Some of the changes would be trivial (e.g. for StatusPacket or GetPbftBlockPacket), others might be more difficult. If we made processing of all consensus packets concurrency-friendly, that should ultimately fix all the delay-related problems we are facing at the moment.

Ability to run a local network of nodes

Is your feature request related to a problem? Please describe.

It's pretty hard right now to set up a local network of boot nodes and RPC nodes in order to test changes you make locally.
You need to create config files for each node, change the chain ID and genesis hash, find the boot node public addresses, add them to all the config files in the DPOS section, and so on.

Describe the solution you'd like.

It would be nice to have a python script that you can run locally and it can set everything up and start the nodes. Similar to docker-compose but it should work with taraxad binaries.

You can, for example, call it like this:

./cli/local-net --boot-nodes 3 --rpc-nodes 3 <binary>

And it should configure and start 6 nodes locally and print the logs.

Enable/disable log channels and levels via RPC calls

Problem to be solved

When encountering issues, it's desirable to be able to selectively enable logs that wouldn't normally be left on due to the large volume of output they produce. Stopping the node and restarting it with a changed config file causes a change of state that makes it hard to capture logs during the original event of interest.

Describe the solution you'd like.

Be able to modify the logging channels and levels via RPC calls.

Do you have ideas regarding the implementation of this feature?
Are you willing to implement this feature?

Additional context.

Node attempts to sync with a peer running wrong node version

Description

The node will at times try to sync from node 19313afe even though it can clearly see, when processing the initial status packet, that it's running the wrong node version...

2021-05-21 02:22:04.783228 SUMMARY INFO Currently syncing from node ##19313afe…
2021-05-21 02:22:04.783274 SUMMARY INFO Max peer PBFT chain size:      80705 (peer ##19313afe…)
2021-05-21 02:22:04.783323 SUMMARY INFO Max peer PBFT consensus round: 87923 (peer ##19313afe…)
2021-05-21 02:22:04.783363 SUMMARY INFO Max peer DAG level:            294866 (peer ##19313afe…)
2021-05-21 02:22:04.783768 TARCAP INFO Node ##19313afe… disconnected
2021-05-21 02:22:07.328666 TARCAP INFO Node ##19313afe… connected
2021-05-21 02:22:07.447971 TARCAP ERROR Incorrect node version: 0.01, our node major version1.06, host ##19313afe… will be disconnected
2021-05-21 02:22:07.465740 TARCAP INFO Node ##19313afe… disconnected

Environment

Node version

v1.6

Expected behavior

The status packet would be the first packet processed, and we would disconnect immediately if the node doesn't pass the initial status check.

Actual behavior

Either we don't disconnect immediately, or more likely the initial status packet is NOT always the first packet processed for the node.

feat(pbft): Improve PBFT blocks and previous round next votes syncing

Is your feature request related to a problem? Please describe.

PBFT syncs blocks and the previous round's next votes every 100 steps.

Describe the solution you'd like.

It should use exponential backoff: every 2 steps, then 4 steps, 8 steps, etc.
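
A minimal sketch of the backoff check (the names and the cap are illustrative):

#include <cstdint>

// The caller starts next_threshold at 2 and resets steps_since_sync to 0
// after every sync request; the threshold then grows 2 -> 4 -> 8 -> ...
bool shouldRequestSync(uint64_t steps_since_sync, uint64_t& next_threshold) {
  constexpr uint64_t kMaxBackoffSteps = 1024;  // illustrative upper bound
  if (steps_since_sync < next_threshold) return false;
  if (next_threshold < kMaxBackoffSteps) next_threshold *= 2;
  return true;
}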

Do you have ideas regarding the implementation of this feature?
Are you willing to implement this feature?

Additional context.

Node crash with stack trace

Description

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f136b491859 in __GI_abort () at abort.c:79
#2  0x00007f136b86e951 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f136b87a47c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f136b87a4e7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f136b87a799 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00005604781f750a in boost::throw_exception<dev::UndersizeRLP> (e=...) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/throw_exception.hpp:63
#7  boost::exception_detail::throw_exception_<dev::UndersizeRLP> (x=..., current_function=current_function@entry=0x56047878b130 "dev::RLP::RLP(dev::bytesConstRef, dev::RLP::Strictness)", 
    file=file@entry=0x56047878b0b0 "/opt/taraxa/submodules/taraxa-aleth/libdevcore/RLP.cpp", line=line@entry=32)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/throw_exception.hpp:87
#8  0x00005604781f491b in dev::RLP::RLP (this=<optimized out>, _d=..., _s=<optimized out>)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/exception/info.hpp:267
#9  0x0000560478165e3a in dev::p2p::Session::checkPacket (_msg=...) at /opt/taraxa/submodules/taraxa-aleth/libdevcore/vector_ref.h:124
#10 0x000056047816b7cd in dev::p2p::Session::<lambda(boost::system::error_code, std::size_t)>::<lambda(boost::system::error_code, std::size_t)>::operator()(boost::system::error_code, std::size_t) const (
    __closure=0x7f12e27fb038, ec=..., length=<optimized out>) at /opt/taraxa/submodules/taraxa-aleth/libp2p/Session.cpp:306
#11 0x000056047816bbc9 in boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor>, boost::asio::mutable_buffers_1, const boost::asio::mutable_buffer*, boost::asio::detail::transfer_all_t, dev::p2p::Session::doRead()::<lambda(boost::system::error_code, std::size_t)>::<lambda(boost::system::error_code, std::size_t)> >::operator()(const boost::system::error_code &, std::size_t, int) (this=0x7f12e27fb010, ec=..., bytes_transferred=<optimized out>, start=<optimized out>)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/system/error_code.hpp:655
#12 0x000056047816d6fc in boost::asio::detail::binder2<boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor>, boost::asio::mutable_buffers_1, const boost::asio::mutable_buffer*, boost::asio::detail::transfer_all_t, dev::p2p::Session::doRead()::<lambda(boost::system::error_code, std::size_t)>::<lambda(boost::system::error_code, std::size_t)> >, boost::system::error_code, long unsigned int>::operator() (this=0x7f12e27fb010) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/bind_handler.hpp:162
#13 boost::asio::detail::executor_function<boost::asio::detail::binder2<boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor>, boost::asio::mutable_buffers_1, const boost::asio::mutable_buffer*, boost::asio::detail::transfer_all_t, dev::p2p::Session::doRead()::<lambda(boost::system::error_code, std::size_t)>::<lambda(boost::system::error_code, std::size_t)> >, boost::system::error_code, long unsigned int>, std::allocator<void> >::do_complete(boost::asio::detail::executor_function_base *, bool) (base=0x7f12ac0ffa80, call=<optimized out>)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/executor_function.hpp:91
#14 0x0000560478135423 in boost::asio::detail::executor_function_base::complete (this=<optimized out>)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/executor_function.hpp:32
#15 boost::asio::executor::function::operator() (this=0x7f12e27fb0a8) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/impl/executor.hpp:69
#16 boost::asio::asio_handler_invoke<boost::asio::executor::function> (function=...)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/handler_invoke_hook.hpp:69
#17 boost_asio_handler_invoke_helpers::invoke<boost::asio::executor::function, boost::asio::executor::function> (context=..., function=...)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#18 boost::asio::detail::handler_work<boost::asio::executor::function, boost::asio::system_executor, boost::asio::system_executor>::complete<boost::asio::executor::function> (this=<synthetic pointer>, 
    handler=..., function=...) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/handler_work.hpp:100
#19 boost::asio::detail::completion_handler<boost::asio::executor::function>::do_complete (owner=0x56047ccd2b30, base=<optimized out>)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/completion_handler.hpp:70
#20 0x0000560478135690 in boost::asio::detail::strand_service::dispatch<boost::asio::executor::function> (this=0x56047ccd7b30, impl=@0x7f12a0106730: 0x7f12a0005b20, handler=...)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/system/error_code.hpp:579
#21 0x00005604781357b7 in boost::asio::io_context::strand::dispatch<boost::asio::executor::function, std::allocator<void> > (f=..., a=..., this=0x7f12a0106728)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/impl/executor.hpp:54
#22 boost::asio::executor::impl<boost::asio::io_context::strand, std::allocator<void> >::dispatch (this=this@entry=0x7f12a0106710, f=...)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/impl/executor.hpp:184
#23 0x000056047816d295 in boost::asio::executor::dispatch<boost::asio::detail::binder2<boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp>, boost::asio::mutable_buffers_1, const boost::asio::mutable_buffer*, boost::asio::detail::transfer_all_t, dev::p2p::Session::doRead()::<lambda(boost::system::error_code, std::size_t)>::<lambda(boost::system::error_code, std::size_t)> >, boost::system::error_code, long unsigned int>, std::allocator<void> > (a=..., f=..., this=<synthetic pointer>)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/executor_function.hpp:61
#24 boost::asio::detail::io_object_executor<boost::asio::executor>::dispatch<boost::asio::detail::binder2<boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp>, boost::asio::mutable_buffers_1, const boost::asio::mutable_buffer*, boost::asio::detail::transfer_all_t, dev::p2p::Session::doRead()::<lambda(boost::system::error_code, std::size_t)>::<lambda(boost::system::error_code, std::size_t)> >, boost::system::error_code, long unsigned int>, std::allocator<void> > (a=..., f=..., this=<synthetic pointer>)
    at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/io_object_executor.hpp:128
#25 boost::asio::detail::handler_work<boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor>, boost::asio::mutable_buffers_1, const boost::asio::mutable_buffer*, boost::asio::detail::transfer_all_t, dev::p2p::Session::doRead()::<lambda(boost::system::error_code, std::size_t)>::<lambda(boost::system::error_code, std::size_t)> >, boost::asio::detail::io_object_executor<boost::asio::executor>, boost::asio::detail::io_object_executor<boost::asio::executor> >::complete<boost::asio::detail::binder2<boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp>, boost::asio::mutable_buffers_1, const boost::asio::mutable_buffer*, boost::asio::detail::transfer_all_t, dev::p2p::Session::doRead()::<lambda(boost::system::error_code, std::size_t)>::<lambda(boost::system::error_code, std::size_t)> >, boost::system::error_code, long unsigned int> > (handler=..., function=..., this=<synthetic pointer>)
/root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/handler_work.hpp:72
#26 boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::executor>, boost::asio::mutable_buffers_1, const boost::asio::mutable_buffer*, boost::asio::detail::transfer_all_t, dev::p2p::Session::doRead()::<lambda(boost::system::error_code, std::size_t)>::<lambda(boost::system::error_code, std::size_t)> >, boost::asio::detail::io_object_executor<boost::asio::executor> >::do_complete(void *, boost::asio::detail::operation *, const boost::system::error_code &, std::size_t) (owner=<optimized out>, base=<optimized out>) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/reactive_socket_recv_op.hpp:123
#27 0x000056047812c39f in boost::asio::detail::scheduler_operation::complete (bytes_transferred=0, ec=..., owner=0x56047ccd2b30, this=<optimized out>) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/scheduler_operation.hpp:40
#28 boost::asio::detail::scheduler::do_poll_one (ec=..., this_thread=..., lock=..., this=0x56047ccd2b30) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/impl/scheduler.ipp:581
#29 boost::asio::detail::scheduler::poll (this=0x56047ccd2b30, ec=...) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/impl/scheduler.ipp:267
#30 0x000056047811cc3a in boost::asio::io_context::poll (this=0x56047ccce2d8) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/impl/io_context.ipp:93
#31 dev::p2p::Host::do_work (this=0x56047ccce2b0) at /opt/taraxa/submodules/taraxa-aleth/libp2p/Host.cpp:146
#32 0x000056047774162f in taraxa::Network::<lambda()>::operator() (__closure=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1020
#33 std::_Function_handler<void(), taraxa::Network::Network(const taraxa::NetworkConfig&, const std::filesystem::__cxx11::path&, const dev::KeyPair&, std::shared_ptr<taraxa::DbStorage>, std::shared_ptr<taraxa::PbftManager>, std::shared_ptr<taraxa::PbftChain>, std::shared_ptr<taraxa::VoteManager>, std::shared_ptr<taraxa::NextVotesForPreviousRound>, std::shared_ptr<taraxa::DagManager>, std::shared_ptr<taraxa::DagBlockManager>, std::shared_ptr<taraxa::TransactionManager>)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#34 0x000056047775f2f5 in std::function<void ()>::operator()() const (this=0x7f12a02fd1f0) at /usr/include/c++/9/bits/std_function.h:683
#35 taraxa::util::ThreadPool::<lambda()>::operator() (__closure=0x7f12a02fd1f0) at /opt/taraxa/src/util/thread_pool.cpp:68
#36 std::_Function_handler<void(), taraxa::util::ThreadPool::post_loop(const taraxa::util::ThreadPool::Periodicity&, std::function<void()>)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#37 0x000056047775f76a in std::function<void (boost::system::error_code const&)>::operator()(boost::system::error_code const&) const (__args#0=..., this=0x7f12e27fb518) at /usr/include/c++/9/bits/std_function.h:683
#38 taraxa::util::ThreadPool::<lambda(const auto:2&)>::operator()<boost::system::error_code> (err_code=..., __closure=0x7f12e27fb510) at /opt/taraxa/src/util/thread_pool.cpp:51
#39 boost::asio::detail::binder1<taraxa::util::ThreadPool::post(uint64_t, taraxa::util::ThreadPool::asio_callback)::<lambda(const auto:2&)>, boost::system::error_code>::operator() (this=0x7f12e27fb510) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/bind_handler.hpp:65
#40 boost::asio::asio_handler_invoke<boost::asio::detail::binder1<taraxa::util::ThreadPool::post(uint64_t, taraxa::util::ThreadPool::asio_callback)::<lambda(const auto:2&)>, boost::system::error_code> > (function=...) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/handler_invoke_hook.hpp:69
#41 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder1<taraxa::util::ThreadPool::post(uint64_t, taraxa::util::ThreadPool::asio_callback)::<lambda(const auto:2&)>, boost::system::error_code>, taraxa::util::ThreadPool::post(uint64_t, taraxa::util::ThreadPool::asio_callback)::<lambda(const auto:2&)> > (context=..., function=...) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#42 boost::asio::detail::asio_handler_invoke<boost::asio::detail::binder1<taraxa::util::ThreadPool::post(uint64_t, taraxa::util::ThreadPool::asio_callback)::<lambda(const auto:2&)>, boost::system::error_code>, taraxa::util::ThreadPool::post(uint64_t, taraxa::util::ThreadPool::asio_callback)::<lambda(const auto:2&)>, boost::system::error_code> (this_handler=0x7f12e27fb510, function=...) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/bind_handler.hpp:106
#43 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::binder1<taraxa::util::ThreadPool::post(uint64_t, taraxa::util::ThreadPool::asio_callback)::<lambda(const auto:2&)>, boost::system::error_code>, boost::asio::detail::binder1<taraxa::util::ThreadPool::post(uint64_t, taraxa::util::ThreadPool::asio_callback)::<lambda(const auto:2&)>, boost::system::error_code> > (context=..., function=...) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#44 boost::asio::detail::io_object_executor<boost::asio::executor>::dispatch<boost::asio::detail::binder1<taraxa::util::ThreadPool::post(uint64_t, taraxa::util::ThreadPool::asio_callback)::<lambda(const auto:2&)>, boost::system::error_code>, std::allocator<void> > (a=..., f=..., this=<synthetic pointer>) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/io_object_executor.hpp:119
#45 boost::asio::detail::handler_work<taraxa::util::ThreadPool::post(uint64_t, taraxa::util::ThreadPool::asio_callback)::<lambda(const auto:2&)>, boost::asio::detail::io_object_executor<boost::asio::executor>, boost::asio::detail::io_object_executor<boost::asio::executor> >::complete<boost::asio::detail::binder1<taraxa::util::ThreadPool::post(uint64_t, taraxa::util::ThreadPool::asio_callback)::<lambda(const auto:2&)>, boost::system::error_code> > (handler=..., function=..., this=<synthetic pointer>) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/handler_work.hpp:72
#46 boost::asio::detail::wait_handler<taraxa::util::ThreadPool::post(uint64_t, taraxa::util::ThreadPool::asio_callback)::<lambda(const auto:2&)>, boost::asio::detail::io_object_executor<boost::asio::executor> >::do_complete(void *, boost::asio::detail::operation *, const boost::system::error_code &, std::size_t) (owner=0x56047ad79350, base=<optimized out>) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/wait_handler.hpp:73
#47 0x00005604775cd08d in boost::asio::detail::scheduler_operation::complete (bytes_transferred=<optimized out>, ec=..., owner=0x56047ad79350, this=<optimized out>) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/scheduler_operation.hpp:40
#48 boost::asio::detail::scheduler::do_run_one (ec=..., this_thread=..., lock=..., this=<optimized out>) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/impl/scheduler.ipp:447
#49 boost::asio::detail::scheduler::run (this=0x56047ad79350, ec=...) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/detail/impl/scheduler.ipp:200
#50 0x000056047775e246 in boost::asio::io_context::run (this=<optimized out>) at /root/.conan/data/boost/1.71.0/_/_/package/713919c5dcae28d8b809f1460df68ed236264c47/include/boost/asio/impl/io_context.ipp:63
#51 taraxa::util::ThreadPool::<lambda()>::operator() (__closure=<optimized out>) at /opt/taraxa/src/util/thread_pool.cpp:20
#52 std::__invoke_impl<void, taraxa::util::ThreadPool::start()::<lambda()> > (__f=...) at /usr/include/c++/9/bits/invoke.h:60
#53 std::__invoke<taraxa::util::ThreadPool::start()::<lambda()> > (__fn=...) at /usr/include/c++/9/bits/invoke.h:95
#54 std::thread::_Invoker<std::tuple<taraxa::util::ThreadPool::start()::<lambda()> > >::_M_invoke<0> (this=<optimized out>) at /usr/include/c++/9/thread:244
#55 std::thread::_Invoker<std::tuple<taraxa::util::ThreadPool::start()::<lambda()> > >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
#56 std::thread::_State_impl<std::thread::_Invoker<std::tuple<taraxa::util::ThreadPool::start()::<lambda()> > > >::_M_run(void) (this=<optimized out>) at /usr/include/c++/9/thread:195
#57 0x00007f136b8a6d84 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#58 0x00007f136b7b6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#59 0x00007f136b58e293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Environment

Devnet, boot node 2

Node version

taraxa-node:pr-792-build-3

Implement protection from "invalid dag block" possible attack

Description

During syncing we do validate DAG blocks in terms of their tips/pivot. However, for a newly propagated block we do not perform this check. As a result, if someone sends a DAG block with a non-existent tip, it gets into the unverified_queue and blocks the "block_worker" thread, which cannot build the in-memory DAG when it encounters an invalid DAG block... There is no mechanism to resolve such situations: the block_worker thread either keeps trying to process the same invalid block over and over again (original approach) or waits indefinitely for its missing tip to arrive (Miho's fix). In both cases the node gets stuck... If we receive a block that does not pass this check, we should probably not add it to the unverified_queue at all.

Check to be performed on a newly propagated block:
https://github.com/Taraxa-project/taraxa-node/blob/develop/src/network/taraxa_capability.cpp#L125

Miho's fix:
https://github.com/Taraxa-project/taraxa-node/pull/931/files

What to think about:

  • Is Miho's fix reasonable? It might be in terms of saving processor time, but in terms of solving the bug it is not, I guess. If we want to keep it, let's at least add a log message so we know we are stuck there; if not, let's revert it.
  • Can we receive a new DAG block without having received all of its tips/pivot? If yes, it is an invalid block but not maliciously invalid, so dropping it might not be the best solution in this case. Can we perhaps save it and process it later?

Status message race condition

Recently a bug was fixed that was caused by transactions being sent before the status message, which internally made us ignore the transactions. Investigation is needed into whether the same issue can happen for other types of packets like DAG blocks, PBFT blocks, votes, ...

Logs default behaviour - warning + error logs enabled for all channels

Is your feature request related to a problem? Please describe.

At the moment we don't see error and warning logs unless they are explicitly enabled by the channel config...

Describe the solution you'd like.

Enable error and warning logs for all channels as the default behaviour. If a channel config is provided, it overrides the default...
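
A minimal sketch of the proposed lookup (the Verbosity enum and channel map are illustrative):

#include <map>
#include <string>

enum class Verbosity { Silent, Error, Warning, Info, Debug };

// Any channel not mentioned in the config falls back to Warning, so error
// and warning logs are always visible; an explicit entry overrides this.
Verbosity channelVerbosity(const std::map<std::string, Verbosity>& config,
                           const std::string& channel) {
  const auto it = config.find(channel);
  return it != config.end() ? it->second : Verbosity::Warning;
}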

Restrict branches access

I want to propose new changes. I created a set of rules (settings) for different branches in the taraxa-node GitHub repo. There are different settings for:

  • master,
  • develop,
  • feature/*,
  • release/*,
  • hotfix/*
  • / (all other branches)

I would also propose giving everyone here "Write" access, with only one person having "Admin" access in each of the EU and US timezones. Based on the new roles and branch settings, it would be easy and straightforward to keep our repo clean, structured, and without hacks like direct commits to develop or master; I can't think of any situation where that should be needed...

What we can achieve by this:

  • only admins can do any sort of hacks like direct commits to protected branches, etc. (but they really should not, even if they can...)
  • master will have restricted write access - we will specify the set of people who can write (even merge PRs) to master - let's say admins
  • people with write access can create and merge PRs to develop, but only PRs that have 2 approvals and all checks passed; no direct commits will be technically possible for them
  • it won't be (easily) possible even for admins to merge PRs without at least 2 approvals and all checks passed - I believe we should really start doing proper reviews instead of merging stuff without approvals. This way people will be required to ask another team member to review and approve their PR...

I understand that this should only be strictly required once we have a truly stable version in master, as our testnet uses images from master and at the moment we are simply hot-fixing things... But can't we restart the testnet with a provided custom image tag, so we don't need to merge all these "fix tries" into master as we did a few days ago? @rattrap @agrebin that is a question for you guys, I guess...

Refactor stack/back trace printing

Is your feature request related to a problem? Please describe.

We are currently getting backtraces without function names and line numbers; see below:

Caught signal 11 (SIGSEGV)
Stack Trace: 
false
  taraxad                        (                                           + 0xe53849)  [0x55632545b849]
  taraxad                        (                                           + 0xe425b8)  [0x55632544a5b8]
  taraxad                        (                                           + 0xe43162)  [0x55632544b162]
  taraxad                        (                                           + 0xe5a543)  [0x556325462543]
  taraxad                        (                                           + 0xe5a7b1)  [0x5563254627b1]
  taraxad                        (                                           + 0xe5a997)  [0x556325462997]
  taraxad                        (                                           + 0xe42fbb)  [0x55632544afbb]
  taraxad                        (                                           + 0x348593)  [0x556324950593]
  taraxad                        (                                           + 0xe5149f)  [0x55632545949f]
  taraxad                        (                                           + 0xe423da)  [0x55632544a3da]
  taraxad                        (                                           + 0x46896f)  [0x556324a7096f]
  taraxad                        (                                           + 0x486395)  [0x556324a8e395]
  taraxad                        (                                           + 0x48680a)  [0x556324a8e80a]
  taraxad                        (                                           + 0x2f2ced)  [0x5563248faced]
  taraxad                        (                                           + 0x4852e6)  [0x556324a8d2e6]
  /lib/x86_64-linux-gnu/libstdc++.so.6 (                                           + 0xd6d84)  [0x7f8580075d84]
  /lib/x86_64-linux-gnu/libpthread.so.0 (                                           + 0x9609)  [0x7f857ff85609]
  /lib/x86_64-linux-gnu/libc.so.6 ( clone                                     + 0x43  )  [0x7f857fd5d293]

Describe the solution you'd like.

Display the function name and line number.

Do you have ideas regarding the implementation of this feature?

Are you willing to implement this feature?

Votes from the "distant future" are not discarded by the node

Description

If a node "maliciously" places PBFT votes in rounds way ahead of the actual network PBFT round, those votes will be kept forever in the unverified queue.

Expected behaviour

A node that is confident it has been in sync should be able to place a bound on what counts as a reasonable vote and discard unreasonably futuristic votes.

Actual behaviour

These malicious votes remain in the unverified queue and are validated over and over without success.
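
A minimal sketch of such a bound (kMaxRoundsAhead is an illustrative tuning constant, not an existing config value):

#include <cstdint>

constexpr uint64_t kMaxRoundsAhead = 100;

// A node that is confident it is in sync rejects votes too far beyond its
// own PBFT round instead of queueing them; without that confidence it
// cannot bound rounds and keeps the current behaviour.
bool acceptVoteRound(uint64_t vote_round, uint64_t current_round, bool in_sync) {
  if (!in_sync) return true;
  return vote_round <= current_round + kMaxRoundsAhead;
}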

Inconsistent summary info

Description

Status is inconsistent: first it prints actively syncing, then it prints synced, and then syncing again...

Node version

develop

Actual behaviour

2021-05-15 10:58:53.223093 SUMMARY SILENT STATUS: GOOD. ACTIVELY SYNCING
2021-05-15 10:59:13.815418 SUMMARY INFO Connected to 2 peers
2021-05-15 10:59:13.815505 SUMMARY INFO Syncing for 59 seconds, 68% synced
2021-05-15 10:59:13.815562 SUMMARY INFO Currently syncing from node ##009f72d5…
2021-05-15 10:59:13.815609 SUMMARY INFO Max peer PBFT chain size:      7122 (peer ##009f72d5…)
2021-05-15 10:59:13.815658 SUMMARY INFO Max peer PBFT consensus round: 12271 (peer ##009f72d5…)
2021-05-15 10:59:13.815706 SUMMARY INFO Max peer DAG level:            23583 (peer ##009f72d5…)
2021-05-15 10:59:13.815756 SUMMARY INFO In the last 19 seconds...
2021-05-15 10:59:13.815809 SUMMARY INFO PBFT sync period progress:     40
2021-05-15 10:59:13.815858 SUMMARY INFO PBFT chain blocks added:       40
2021-05-15 10:59:13.815909 SUMMARY INFO PBFT rounds advanced:          0
2021-05-15 10:59:13.815957 SUMMARY INFO DAG level growth:              137
2021-05-15 10:59:13.816008 SUMMARY INFO ##################################
2021-05-15 10:59:13.816057 SUMMARY SILENT STATUS: GOOD. ACTIVELY SYNCING
2021-05-15 10:59:33.796199 SUMMARY INFO Connected to 2 peers
2021-05-15 10:59:33.796292 SUMMARY INFO In sync since launch for 30% of the time
2021-05-15 10:59:33.796338 SUMMARY INFO Queued unverified transaction: 0
2021-05-15 10:59:33.796391 SUMMARY INFO Queued verified transaction:   32107
2021-05-15 10:59:33.796435 SUMMARY INFO Max DAG block level in DAG:    14102
2021-05-15 10:59:33.796479 SUMMARY INFO Max DAG block level in queue:  14138
2021-05-15 10:59:33.796527 SUMMARY INFO PBFT chain size:               4970
2021-05-15 10:59:33.796578 SUMMARY INFO Current PBFT round:            5290
2021-05-15 10:59:33.796632 SUMMARY INFO DPOS total votes count:        3
2021-05-15 10:59:33.796666 SUMMARY INFO PBFT consensus 2t+1 threshold: 3
2021-05-15 10:59:33.796703 SUMMARY INFO Node elligible vote count:     0
2021-05-15 10:59:33.796736 SUMMARY INFO In the last 19 seconds...
2021-05-15 10:59:33.796773 SUMMARY INFO PBFT chain blocks added:       60
2021-05-15 10:59:33.796804 SUMMARY INFO PBFT rounds advanced:          0
2021-05-15 10:59:33.796840 SUMMARY INFO DAG level growth:              186
2021-05-15 10:59:33.796871 SUMMARY INFO ##################################
2021-05-15 10:59:33.796899 SUMMARY SILENT STATUS: GOOD. NODE SYNCED
2021-05-15 10:59:53.777009 SUMMARY INFO Connected to 2 peers
2021-05-15 10:59:53.777130 SUMMARY INFO Syncing for 19 seconds, 70% synced
2021-05-15 10:59:53.777175 SUMMARY INFO Currently syncing from node ##009f72d5…
2021-05-15 10:59:53.777214 SUMMARY INFO Max peer PBFT chain size:      7122 (peer ##009f72d5…)
2021-05-15 10:59:53.777256 SUMMARY INFO Max peer PBFT consensus round: 12271 (peer ##009f72d5…)
2021-05-15 10:59:53.777303 SUMMARY INFO Max peer DAG level:            23583 (peer ##009f72d5…)
2021-05-15 10:59:53.777358 SUMMARY INFO In the last 19 seconds...
2021-05-15 10:59:53.777406 SUMMARY INFO PBFT sync period progress:     70
2021-05-15 10:59:53.777443 SUMMARY INFO PBFT chain blocks added:       71
2021-05-15 10:59:53.777496 SUMMARY INFO PBFT rounds advanced:          0
2021-05-15 10:59:53.777548 SUMMARY INFO DAG level growth:              221
2021-05-15 10:59:53.777594 SUMMARY INFO ##################################
2021-05-15 10:59:53.777639 SUMMARY SILENT STATUS: GOOD. ACTIVELY SYNCING

Transactions(+ statutes) db storage optimization

Is your feature request related to a problem? Please describe.

Current simplified flow when received broadcasted transactions:

  • a transaction is received and saved into the local unverified queue and the database (+ the tx status "unverified" is saved in the db as well)
  • once it is verified in a separate thread, it is moved from the local unverified queue to the verified queue and its status in the db is set to "verified", or "invalid" in case verification did not pass
  • once it is included in a dag block, the tx status is set to "in_block"

Cons

  • a lot of database handling (locking) due to the separate storage of transaction status

Describe the solution you'd like.

Proposed new flow when received broadcasted transactions:

  • a transaction is received and saved into the local unverified queue
  • once it is verified in a separate thread, it is moved from the local unverified queue to the verified queue and the transaction is also saved in the db -> this means that only valid, verified txs are saved in the db

Pros

  • we get rid of transaction status handling, which is redundant information: the "unverified/verified" status is known from which queue the tx is currently stored in, and only verified txs are stored in the db. The "in_block" status is used only as a synchronization mechanism while proposing a new dag block, so we don't include txs that were received in another dag block in the meantime -> this can easily be solved without involving rocksdb
  • as we store only verified transactions in the db, we can also store the verified signature (public key), so during execution of a pbft block we don't need to verify these signatures again as we do now (see the sketch below)
  • some smaller optimizations would also be possible in dag block handling...
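
A hedged sketch of that flow (all types and helpers here are illustrative stand-ins):

#include <string>

struct Transaction { std::string rlp; };
struct VerifiedTransaction {
  Transaction tx;
  std::string sender;  // public key / address recovered during verification
};

// Stand-in for real signature verification plus sender recovery.
bool verify(const Transaction& tx, std::string& sender_out) {
  sender_out = "recovered-sender";  // real code would recover it from the signature
  return !tx.rlp.empty();
}

// Verification thread: pop from the unverified queue; only on success push
// to the verified queue and write to the db - no separate status column.
void onTransaction(const Transaction& tx) {
  std::string sender;
  if (!verify(tx, sender)) return;  // invalid txs never touch the db
  VerifiedTransaction vt{tx, sender};
  (void)vt;  // handed off to the verified queue and db in real code:
  // verified_queue.push(vt);
  // db.put(hash(tx), vt);  // sender stored alongside, reused at pbft execution
}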

Rare heap corruption in the p2p test

In #656 there's a very rare error that I saw in p2p_test. The error reads:

malloc(): mismatching next->prev_size (unsorted)
Signal: SIGABRT (Aborted)

Once I saw it, I tried running that test in a loop and catching the crash in a debugger. After 30 minutes to 2 hours of repeatedly running the test, the error shows itself, always in different spots. For example, I saw it here https://github.com/Taraxa-project/taraxa-aleth/blob/050ddac66c4e76563f125ebc04afab942bd602b4/libp2p/RLPxHandshake.cpp#L140
or here

boost::asio::post(ioc_, [this, action = std::move(action)] {

Googling tells me that this is the kind of error that can be planted in one place as a "time bomb" and reveal itself completely unexpectedly somewhere else later. Apparently, the heap gets corrupted somewhere, and then on any subsequent allocation there's some probability of hitting this error.

Debugging such an error post-mortem hardly seems possible. Instead, I suggest using memory sanitizers or similar tools to eliminate any possible heap problems.

Unverified votes queue and DB not consistent issue

Description

When the network receives votes, they are found in the DB but not in the unverified queue.

Environment

Node version

Use this docker command to find the node image version:

docker images --format '{{.ID}}' 'taraxa/taraxa-node:latest'

Operating System details.
CPU, memory, disk details.

NOTE: In many cases log files are also useful to include. Since these files may be large, a Taraxa developer may request them later. These files may include public addresses that you're participating with. If that is a concern please be sure to scrub that data.

Steps to reproduce

Expected behaviour

Actual behaviour

[Stack trace] Node crash in networking

Description

Node crashed and produced the following stack trace:

Caught signal 11 (SIGSEGV)
2021-05-06 19:20:56.346448 DAGSYNC DEBUG Received status message from ##2eb98532… peer DAG max level:1931
2021-05-06 19:20:56.346656 PBFTSYNC DEBUG Received status message from ##2eb98532…, peer sycning: false, peer PBFT chain size:781
2021-05-06 19:20:56.732125 DAGSYNC DEBUG Received status message from ##d8c2b208… peer DAG max level:1932
2021-05-06 19:20:56.732302 PBFTSYNC DEBUG Received status message from ##d8c2b208…, peer sycning: false, peer PBFT chain size:781
 0# abortHandler(int) in taraxad
 1# 0x00007F9466206210 in /lib/x86_64-linux-gnu/libc.so.6
 2# std::_Rb_tree<dev::p2p::Peer*, dev::p2p::Peer*, std::_Identity<dev::p2p::Peer*>, std::less<dev::p2p::Peer*>, std::allocator<dev::p2p::Peer*> >::_M_lower_bound(std::_Rb_tree_node<dev::p2p::Peer*> const*, std::_Rb_tree_node_base const*, dev::p2p::Peer* const&) const in taraxad
 3# std::_Rb_tree<dev::p2p::Peer*, dev::p2p::Peer*, std::_Identity<dev::p2p::Peer*>, std::less<dev::p2p::Peer*>, std::allocator<dev::p2p::Peer*> >::find(dev::p2p::Peer* const&) const in taraxad
 4# std::set<dev::p2p::Peer*, std::less<dev::p2p::Peer*>, std::allocator<dev::p2p::Peer*> >::count(dev::p2p::Peer* const&) const in taraxad
 5# dev::p2p::Host::connect(std::shared_ptr<dev::p2p::Peer> const&) in taraxad
 6# dev::p2p::Host::main_loop_body() in taraxad
 7# 0x0000561AD83BA84B in taraxad
 8# 0x0000561AD83C2777 in taraxad
 9# 0x0000561AD83C249A in taraxad
10# boost::asio::detail::executor_function_base::complete() in taraxad
11# boost::asio::executor::function::operator()() in taraxad
12# void boost::asio::asio_handler_invoke<boost::asio::executor::function>(boost::asio::executor::function&, ...) in taraxad
13# void boost_asio_handler_invoke_helpers::invoke<boost::asio::executor::function, boost::asio::executor::function>(boost::asio::executor::function&, boost::asio::executor::function&) in taraxad
14# void boost::asio::detail::handler_work<boost::asio::executor::function, boost::asio::system_executor, boost::asio::system_executor>::complete<boost::asio::executor::function>(boost::asio::executor::function&, boost::asio::executor::function&) in taraxad
15# boost::asio::detail::completion_handler<boost::asio::executor::function>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) in taraxad
16# void boost::asio::detail::strand_service::dispatch<boost::asio::executor::function>(boost::asio::detail::strand_service::strand_impl*&, boost::asio::executor::function&) in taraxad
17# void boost::asio::io_context::strand::dispatch<boost::asio::executor::function, std::allocator<void> >(boost::asio::executor::function&&, std::allocator<void> const&) const in taraxad
18# boost::asio::executor::impl<boost::asio::io_context::strand, std::allocator<void> >::dispatch(boost::asio::executor::function&&) in taraxad
19# 0x0000561AD83C08FF in taraxad
20# 0x0000561AD83BFFC3 in taraxad
21# 0x0000561AD83BF58F in taraxad
22# 0x0000561AD83BDA85 in taraxad
23# boost::asio::detail::scheduler_operation::complete(void*, boost::system::error_code const&, unsigned long) in taraxad
24# boost::asio::detail::scheduler::do_poll_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) in taraxad
25# boost::asio::detail::scheduler::poll(boost::system::error_code&) in taraxad
26# boost::asio::io_context::poll() in taraxad
27# dev::p2p::Host::do_work() in taraxad
28# 0x0000561AD799FA39 in taraxad
29# 0x0000561AD79A4349 in taraxad
30# std::function<void ()>::operator()() const in taraxad
31# 0x0000561AD79C92FC in taraxad
32# 0x0000561AD79C9CC2 in taraxad
33# std::function<void ()>::operator()() const in taraxad
34# 0x0000561AD79C9196 in taraxad
35# 0x0000561AD79C9B62 in taraxad
36# std::function<void (boost::system::error_code const&)>::operator()(boost::system::error_code const&) const in taraxad
37# 0x0000561AD79CA2EA in taraxad
38# 0x0000561AD79CC1DD in taraxad
39# 0x0000561AD79CBEF5 in taraxad
40# 0x0000561AD79CBD95 in taraxad
41# 0x0000561AD79CBB6A in taraxad
42# 0x0000561AD79CB91C in taraxad
43# 0x0000561AD79CB7E6 in taraxad
44# 0x0000561AD79CB645 in taraxad
45# 0x0000561AD79CB025 in taraxad
46# boost::asio::detail::scheduler_operation::complete(void*, boost::system::error_code const&, unsigned long) in taraxad
47# boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) in taraxad
48# boost::asio::detail::scheduler::run(boost::system::error_code&) in taraxad
49# boost::asio::io_context::run() in taraxad
50# 0x0000561AD79C8C21 in taraxad
51# 0x0000561AD79CC5E2 in taraxad
52# 0x0000561AD79CC597 in taraxad
53# 0x0000561AD79CC544 in taraxad
54# 0x0000561AD79CC51A in taraxad
55# 0x0000561AD79CC4FE in taraxad
56# 0x00007F94665FAD84 in /lib/x86_64-linux-gnu/libstdc++.so.6
57# 0x00007F946650A609 in /lib/x86_64-linux-gnu/libpthread.so.0
58# clone in /lib/x86_64-linux-gnu/libc.so.6

Environment

k8s/docker

Node version

Docker image for commit 4dab740

Include node’s public address in SUMMARY logs

Is your feature request related to a problem? Please describe.

This makes it even easier for users to find the node's address and makes random log snippets more likely to be useful during testnet.

Describe the solution you'd like.

Do you have ideas regarding the implementation of this feature?
Are you willing to implement this feature?

Additional context.

Some RPC requests error out

Description

Some RPC requests (normal transactions and delegation transactions) error out with the following message:

Error: Returned error: Unhandled API exception: /opt/taraxa/submodules/taraxa-aleth/./libdevcore/RLP.h(379): Throw in function _N dev::RLP::toHash(int) const [with _N = dev::FixedHash<32>]
Dynamic exception type: boost::wrapexcept<dev::BadCast>

2021-06-18 01:30:27.877050 RPC ERROR POST internal error. Request: {"jsonrpc":"2.0","id":789653,"method":"eth_sendRawTransaction","params":["0xf86d83038c67843b9aca0082520894f5fed8338cb9a4a8d3e158f84b6e12c04a586bef88016345785d8a00008027a098da56a286b8f79813d5784be23bfe024225cb6fd6560e67b05773f3827b0e9b9f50fe3c81006a77bf4aa5de63cd1ee73270bf142015b8cfe349704014b8be9b"]}. Message: Unhandled API exception: /opt/taraxa/submodules/taraxa-aleth/./libdevcore/RLP.h(379): Throw in function _N dev::RLP::toHash(int) const [with _N = dev::FixedHash<32>]
Dynamic exception type: boost::wrapexcept<dev::BadCast>
. Extra data: {}
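
Decoding the raw transaction from the request above suggests a likely trigger (the field breakdown is my own reading of the hex; treat the interpretation as an assumption, not a confirmed root cause):

f86d                      list header, 109-byte payload
  83 038c67               nonce
  84 3b9aca00             gasPrice (1 Gwei)
  82 5208                 gas (21000)
  94 f5fed833...4a586bef  to (20 bytes)
  88 016345785d8a0000     value
  80                      data (empty)
  27                      v
  a0 98da56a2...827b0e9b  r (32 bytes)
  9f 50fe3c81...14b8be9b  s (31 bytes, leading zero stripped)

RLP strips leading zero bytes from integers, so a signature component can legitimately be shorter than 32 bytes. If the decoder calls RLP::toHash<32>() in a strict mode that requires exactly 32 bytes, any transaction whose r or s starts with a zero byte would throw BadCast, which would explain why only some requests fail.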

Environment

Node version

develop latest

Nodes crash due to memory consumption

Description

Once the network gets stuck, nodes consume more and more memory until usage reaches 8 GB, which causes a crash.

Environment

Node version

Use docker command to find node image version:

docker images --format '{{.ID}}' 'taraxa/taraxa-node:latest'

Operating System details.
CPU, memory, disk details.

NOTE: In many cases log files are also useful to include. Since these files may be large, a Taraxa developer may request them later. These files may include public addresses that you're participating with. If that is a concern please be sure to scrub that data.

Steps to reproduce

Expected behaviour

Actual behaviour

Extraneous guards and checks within taraxa capability should be cleaned up

Is your feature request related to a problem? Please describe.

See comments on PR #941 for example:

if ((!peer) || (dag_mgr_ && !peer->received_initial_status_ && packet_type != StatusPacket))

In general, we have a lot of checks in the code that exist only because of tests. They should not be there, or at least they should be compiled in only for test builds, e.g. behind an #ifdef TESTS guard or something similar.

Excessive invalid future votes bogs down PBFT execution

Description

A node sending (invalid) votes for distant future rounds leads to unbounded accumulation of votes in the unverified queue, and repeatedly trying to validate these votes brings PBFT execution to a crawl!

Expected behaviour

Remove invalid/unnecessary future votes.
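
A minimal sketch of one possible mitigation (the bound and all names are illustrative assumptions, not the actual implementation):

#include <cstdint>

// Votes more than kMaxFutureRounds ahead of the current PBFT round are
// dropped on arrival instead of accumulating in the unverified queue and
// being re-validated over and over.
constexpr uint64_t kMaxFutureRounds = 10;  // assumed value, to be tuned

bool shouldQueueVote(uint64_t vote_round, uint64_t current_round) {
  return vote_round <= current_round + kMaxFutureRounds;
}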

Use clang for linux builds

Clang for linux build

Clang performs stricter compile-time checks and we currently have no clang build, so we can end up in a situation where the project does not build with clang or on macOS (for example, for releases). The best solution is therefore to use the clang compiler in CI.
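
For example, the CI build step could be switched to something like this (package naming and flags are illustrative and may differ per distro):

sudo apt-get install -y clang
cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release ..
cmake --build . -- -j$(nproc)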

Additional context

The build docs should also be updated to use the clang compiler.

ethash compilation issues

When the application is compiled on Linux (Docker), the build fails because the ethash library requires an additional flag (-fPIC). I changed line 224 of the Makefile to try to solve this issue (cd submodules/ethash/build; cmake CXXFLAGS="-fPIC -pthread -pipe -c" ..; cmake CXXFLAGS="-fPIC -pthread -pipe -c" --build .), but I got a new issue:

g++ -std=c++17 obj/rocks_db.o obj/state_block.o obj/util.o obj/udp_buffer.o obj/network.o obj/full_node.o obj/types.o obj/visitor.o obj/dag.o obj/block_proposer.o obj/rpc.o obj/grpc_client.o obj/grpc_server.o obj/grpc_util.o obj/transaction.o obj/taraxa_grpc.pb.o obj/taraxa_grpc.grpc.pb.o  `pkg-config --cflags protobuf grpc++ --libs protobuf grpc++` -lgrpc++_reflection -ldl -I./grpc obj/libdevcore/Address.o obj/libdevcore/Base64.o obj/libdevcore/Common.o obj/libdevcore/CommonData.o obj/libdevcore/CommonIO.o obj/libdevcore/CommonJS.o obj/libdevcore/DBFactory.o obj/libdevcore/FileSystem.o obj/libdevcore/FixedHash.o obj/libdevcore/Guards.o obj/libdevcore/JsonUtils.o obj/libdevcore/LevelDB.o obj/libdevcore/Log.o obj/libdevcore/LoggingProgramOptions.o obj/libdevcore/MemoryDB.o obj/libdevcore/OverlayDB.o obj/libdevcore/RLP.o obj/libdevcore/RocksDB.o obj/libdevcore/SHA3.o obj/libdevcore/StateCacheDB.o obj/libdevcore/TransientDirectory.o obj/libdevcore/TrieCommon.o obj/libdevcore/TrieHash.o obj/libdevcore/Worker.o obj/libdevcrypto/AES.o obj/libdevcrypto/Common.o obj/libdevcrypto/CryptoPP.o obj/libdevcrypto/Hash.o obj/libdevcrypto/LibSnark.o obj/libdevcrypto/SecretStore.o obj/libp2p/CapabilityHost.o obj/libp2p/Common.o obj/libp2p/Host.o obj/libp2p/Network.o obj/libp2p/NodeTable.o obj/libp2p/Peer.o obj/libp2p/RLPXFrameCoder.o obj/libp2p/RLPxHandshake.o obj/libp2p/Session.o obj/libp2p/UDP.o obj/libp2p/UPnP.o obj/main.o -o build/main -L submodules/cryptopp -L submodules/ethash/build/lib/ethash -L submodules/libff/build/libff -L submodules/secp256k1/.libs -DBOOST_LOG_DYN_LINK -lboost_log -lleveldb -lrocksdb -lsecp256k1 -lgmp -lscrypt -lpthread -lboost_program_options -lboost_filesystem -lboost_system -lcryptopp -lethash -lff -lgtest -lboost_thread-mt -lrocksdb
/usr/bin/ld: cannot find -lethash
collect2: error: ld returned 1 exit status
make: *** [Makefile:239: build/main] Error 1
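
Two observations that may help (suggestions, not a confirmed fix): passing CXXFLAGS as a positional argument to cmake does not set an environment variable or cache entry, so -fPIC most likely never reached the compiler; and "cannot find -lethash" means the linker found no libethash in submodules/ethash/build/lib/ethash (the -L path in the link line above), so the library probably failed to build or landed elsewhere. The conventional way to request -fPIC through CMake would be:

cd submodules/ethash/build
cmake -DCMAKE_POSITION_INDEPENDENT_CODE=ON ..   # or -DCMAKE_CXX_FLAGS="-fPIC"
cmake --build .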

Introduce git flow (branches, commits, ...) conventions

Is your feature request related to a problem? Please describe.

Introduce git flow conventions that should be used by all devs when working with taraxa-node repo.

Some of the pros:

  • create a unified set of rules/conventions for the taraxa-node repo -> better readability
  • automatic changelog generation when creating releases
  • automatic GitHub issue linking in PRs and vice versa

Describe the solution you'd like.

There will be a doc with these conventions and explanations of how to use them. All devs should get familiar with it.
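
For illustration, commit messages in the Conventional Commits style are one common choice that supports automatic changelog generation (the examples are made up; the actual convention is to be agreed in the doc):

feat(dag): add vertex ordering cache
fix(network): disconnect peers with mismatched genesis hash
chore(ci): run clang-format check on PRs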

Next vote sync fails

Description

Next-vote syncing fails when one of the votes points to the same previous block hash as the value being next-voted.

Environment

Devnet

Node version

78fce80bfa8b99fce60ce22b4fe9c4c37e2a7e04

Steps to reproduce

  1. Launch devnet
  2. Delegate to nodes
  3. Wait until next votes can no longer be synced successfully

Distinguish between build for tests and normal one

Is your feature request related to a problem? Please describe.

We have a lot of test-related code in our main library. We should somehow distinguish between hacks that exist for testing purposes only and the normal flow.

Describe the solution you'd like.

I think introducing #ifdef and #ifndef directives could be useful, e.g.:

#ifdef TEST_ONLY
some test related code
#endif

Merged with issue-958:
See comments on PR #941 for example:

if ((!peer) || (dag_mgr_ && !peer->received_initial_status_ && packet_type != StatusPacket))

In general, we have a lot of checks in the code that exist only because of tests. They should not be there, or at least they should be compiled in only for test builds, e.g. behind an #ifdef TESTS guard or something similar.
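
Combining the two snippets, a minimal sketch of how that guard could be split (identifiers are taken from the snippet above; TEST_ONLY is the proposed macro, not existing code):

bool relax_status_check = false;
#ifdef TEST_ONLY
// Unit tests may construct the capability without a DagManager.
relax_status_check = (dag_mgr_ == nullptr);
#endif
if (!peer) return;
if (!relax_status_check && !peer->received_initial_status_ && packet_type != StatusPacket) return;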

Separate transactions processing from network thread

Is your feature request related to a problem? Please describe.

Other packets are processed with significant delays because TransactionPacket processing occasionally takes a long time.

Describe the solution you'd like.

  • make the locking mechanism used when accessing transactions in the db more efficient
  • move incoming transaction processing out of the network thread (see the sketch below)
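
A minimal sketch of the second point: a hand-off queue drained by a dedicated worker, so large TransactionPackets no longer stall other packet types (all names are illustrative; this is not the actual node code):

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TxWorker {
 public:
  TxWorker() : worker_([this] { run(); }) {}
  ~TxWorker() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }
  // Called from the network thread: O(1) hand-off, no tx validation here.
  void enqueue(std::vector<uint8_t> packet) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(packet));
    }
    cv_.notify_one();
  }

 private:
  void run() {
    for (;;) {
      std::unique_lock<std::mutex> lock(mutex_);
      cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
      if (stop_ && queue_.empty()) return;
      std::vector<uint8_t> packet = std::move(queue_.front());
      queue_.pop();
      lock.unlock();
      // processTransactionPacket(packet);  // hypothetical: validate and insert into pool/db
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::vector<uint8_t>> queue_;
  bool stop_ = false;
  std::thread worker_;  // declared last so the other members are initialized first
};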

Incorrectly handle peers with wrong network version or dag genesis hash

Description

When we connect to a peer with an invalid network version or DAG genesis hash, we log that the peer "will be disconnected" but do not actually disconnect before evaluating whether we should sync from it!

Node version

7b8cea0f05387ffb2d97dac9a1a3c8633b2b2d92

Steps to reproduce

  1. Have a node with an incorrect DAG genesis hash join a network with a larger DAG and/or final chain
  2. Observe that other nodes will attempt to sync from it!

Expected behavior

We should break out of status packet handling as soon as we detect an incorrect network version or genesis hash, BEFORE we consider whether to sync from the peer.

Actual behavior

We do not break out of the packet handler and go on to decide whether we should sync from the peer.
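
A sketch of the expected control flow in the StatusPacket handler (all identifiers below are assumptions, not the actual code):

if (peer_network_id != conf_network_id || peer_dag_genesis != our_dag_genesis) {
  // Actually drop the peer instead of only logging the intent...
  host_->disconnect(peer_id, dev::p2p::UserReason);
  return;  // ...and never fall through to the sync-candidate evaluation below.
}
// Only now evaluate whether we should sync from this peer.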
