Let's consider a cluster with 3 nodes and the following scenario:

As <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Cluster cannot start if some nodes lost their files and log has been truncated in remaining nodes about aeron HOT 5 CLOSED

pcdv commented on June 3, 2024

Cluster cannot start if some nodes lost their files and log has been truncated in remaining nodes

from aeron.

Comments (5)

mjpt777 commented on June 3, 2024

This is deliberate corruption of the state and not supported. If this happens you would need to recover from backup or use our warm standby premium feature.

from aeron.

pcdv commented on June 3, 2024

Not sure I would call that corruption since the state fully exists in one node: it just needs to be replicated to the other nodes that don't have any state at all. It feels like not much is missing to make it work (but I don't know enough the cluster internals to know for sure).

you would need to recover from backup

Actually, this problem occurred while recovering from backup, but it boils down to the test above.

The initial scenario was:

start 3 nodes
start one ClusterBackup node, replicating the state
stop everything
forget about the 3 initial nodes
set up a new 3-node cluster where one node reuses the directories generated by ClusterBackup and 2 other nodes start empty

This scenario works only if the ingress log is never truncated.

An alternative would be start 3 ClusterBackup nodes but it seems overkill to replicate the state 3 times.

Is ClusterBackup an actually viable solution or only the premium features allow to have a reliable backup?

from aeron.

mikeb01 commented on June 3, 2024

Not sure I would call that corruption since the state fully exists in one node

This is where the example breaks down. Raft and other similar consensus algorithms that handle fail-stop type faults can only handle 1 failure in a 3 node cluster. You have a scenario where the 3 node cluster has 2 failed nodes. This is beyond what the algorithm has the ability to correctly and automatically recover from. Therefore you would need to fall back to manually fixing the system.

An alternative would be start 3 ClusterBackup nodes but it seems overkill to replicate the state 3 times.

The premium Cluster Standby would create 3 replicated copies of the data for the scenarios where the user wants to have another cluster that can they can fail over to. It has some functionality to support daisy chain style replication to reduce load on the primary cluster and potential WAN bandwidth consumption. However, we would not see having 3 replicated copies of the state as overkill.

from aeron.

pcdv commented on June 3, 2024

Thank you for your detailed answer.

This is beyond what the algorithm has the ability to correctly and automatically recover from.

In my initial tests with ClusterBackup, this scenario worked perfectly (until I truncated the log). My mistake was then probably to think it was a supported use-case. There is not much litterature around ClusterBackup.

You have a scenario where the 3 node cluster has 2 failed nodes.

Actually, if I modify the test to have only one failed node (i.e. replace one true by false), it fails just the same.

Therefore you would need to fall back to manually fixing the system.

Yes, I could detect the absence of archive + cluster dirs in one node, wait a bit for other nodes to be ready to start, and automate the download of the state from another node before starting the cluster. Not trivial, but doable. Probably easier to update the fail over procedure so that data is copied manually into the extra nodes :)

we would not see having 3 replicated copies of the state as overkill.

Agree. I only meant that transmitting the same data 3 times over the network would not be optimal.

The premium Cluster Standby sure looks interesting!

from aeron.

mjpt777 commented on June 3, 2024

As @mikeb01 has pointed out this goes beyond the Raft algorithm. The reason the purge causes issues is that the leader must be log complete under the spec. Also consider without the purge the others nodes have to recovery the whole log. For a long running system this would not be practical as an alternative, even if it works in a simple test.

Snapshots are an optimisation but again there is very little explanation to how they are implemented in the Raft paper or PhD thesis. To purge a old log it needs to be coordinated and you need to know other nodes are up to date so you cannot introduce more than one failure in a 3 node system as Mike points out.

You can use a combination of Cluster Backup and some scripting to make this work. We have to make a living so we provide commercial support and a premium offering that makes this much easier. Many open core offerings do not even provide basic replication or fault tolerance in the open offering. We think we have gone pretty far with with we offer openly given the years of engineering effort that as gone into Aeron.

from aeron.

Cluster cannot start if some nodes lost their files and log has been truncated in remaining nodes about aeron HOT 5 CLOSED

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent