If you try to add a new member to a Raft cluster that has a down member, and the curre


Adding a new member requires all original members to be up and in sync instead of a consensus (logcabin issue, closed)

mmarod commented on June 9, 2024

Adding a new member requires all original members to be up and in sync instead of a consensus

from logcabin.

Comments (3)

ongardie commented on June 9, 2024

Hi @mmarod, it's been a long time since I've looked at this, but I think that's true. I think the idea was that you probably wouldn't want to reconfigure into an already degraded cluster, as that might be a mistake in specifying the new configuration.
If you had additional unexpected failures during the reconfiguration, that could also get pretty messy. Theoretically, only a bare majority of old servers and a bare majority of new servers need to be available during a reconfiguration, but in practice you probably do want a bit more wiggle room to tolerate failures during reconfiguration.
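
To make the "bare majority of old servers and a bare majority of new servers" condition concrete, here is a minimal Python sketch. The function names, the set-based representation of configurations, and the three-node example cluster are assumptions for illustration only; this is not LogCabin's code.

```python
def has_majority(config, available):
    """Return True if a strict majority of `config` is in `available`."""
    config = set(config)
    return len(config & set(available)) > len(config) // 2


def joint_quorum_available(old_config, new_config, available):
    """Joint consensus needs separate majorities of the old and new configurations."""
    return has_majority(old_config, available) and has_majority(new_config, available)


# Assumed scenario: node1 is down while node4 is being added.
old = {"node1", "node2", "node3"}
new = {"node1", "node2", "node3", "node4"}
up = {"node2", "node3", "node4"}

# 2 of 3 old servers and 3 of 4 new servers are up, so both majorities hold.
print(joint_quorum_available(old, new, up))  # True
```

This is only the theoretical minimum; a stricter check leaves room for further failures during the reconfiguration.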

In your example, if node1 isn't functioning and you want to reconfigure the cluster, you should probably remove node1 during that reconfiguration. You can replace it with a new node that is available, like node4. Of course, I can imagine wanting to have different behavior depending on your operational requirements. Are you just asking out of curiosity or are you actually using LogCabin for something? (The project isn't actively maintained these days.)

Since you linked to the dissertation, I should mention that LogCabin uses the joint consensus membership change algorithm described in section 4.3 there. Most of the rest of the concepts of the chapter do still apply or transfer over, but I just wanted to clarify.

> Additionally, it also looks like step 2 of the AddServer RPC is not being enforced. The routine is checking that all of _newServers are caught up but not the candidate specifically. This means that if we simply changed the check to be a quorum of staging servers, it would not guarantee that the new server is caught up.

I'm a little confused because the AddServer RPC is described in the dissertation for the single-server membership change algorithm (not the joint consensus algorithm). What is "the candidate" in your question? A "quorum of staging servers" also seems a little sloppy (at least in an imaginary world where servers can be staging for various reasons). You probably mean a quorum of the new servers (a majority of the servers in the new target configuration).

Perhaps you'd want to change the check so that all new servers are up, but if a server was already part of the cluster, it doesn't need to be caught up. I'm not convinced this is better than the current approach, though (at least without a real-world use case).
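
A rough sketch of that alternative check, in Python for brevity. The function name `proposed_check` and its parameters are hypothetical and do not correspond to anything in LogCabin; `up` and `caught_up` are assumed to be sets of server names that the leader already knows.

```python
def proposed_check(old_config, new_config, up, caught_up):
    """All new-configuration servers must be up; only servers that were not
    already members must also be caught up."""
    old_config, new_config = set(old_config), set(new_config)
    if not new_config <= set(up):
        return False  # some server in the target configuration is down
    joining = new_config - old_config
    return joining <= set(caught_up)


old = {"node1", "node2", "node3"}
new = old | {"node4"}

# node1 is up but lagging: accepted, because only node4 must be caught up.
print(proposed_check(old, new, up=new, caught_up={"node2", "node3", "node4"}))  # True

# node1 is down: rejected, because every new-configuration server must be up.
print(proposed_check(old, new, up={"node2", "node3", "node4"},
                     caught_up={"node2", "node3", "node4"}))  # False
```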


mmarod commented on June 9, 2024

First off, thanks for the quick and thorough response to a question on an "unmaintained" project!

Taking the example -- I think you are right that it would make sense to remove node1 when node4 is added. My company's software, which uses LogCabin, only exposes initial bootstrap, add, and remove APIs. Perhaps it should also have a "replace" API, which would seemingly get around this issue, or it could do a remove followed by an add. The problem is that the software was initially designed with the assumption that if a member goes down, it will come back up at some point (i.e., node1 will eventually come back online). This assumption does not necessarily hold in a cloud environment, however, as node1 could be gone forever (depending on the implementation, of course) with node4 coming up as a replacement. So, when node1 goes down forever, adding node4 becomes impossible without removing node1 first.
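
As a sketch of what such a "replace" operation might compute, the hypothetical helper below builds a single target configuration that drops the failed member and adds its replacement, so the change can be submitted as one reconfiguration. The name `replace_member` and its signature are invented for illustration and are not part of LogCabin or of our software.

```python
def replace_member(current_config, failed, replacement):
    """Return the target configuration with `failed` swapped for `replacement`."""
    current = set(current_config)
    if failed not in current:
        raise ValueError(f"{failed} is not a member")
    if replacement in current:
        raise ValueError(f"{replacement} is already a member")
    return (current - {failed}) | {replacement}


target = replace_member({"node1", "node2", "node3"}, "node1", "node4")
print(sorted(target))  # ['node2', 'node3', 'node4']
```

Because node1 is absent from the target configuration, a check that requires every new server to be up and caught up no longer depends on it.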

> A "quorum of staging servers" also seems a little sloppy (at least in an imaginary world where servers can be staging for various reasons).

Indeed -- I meant new servers specifically here.

> Perhaps you'd want to change the check so that all new servers are up, but if a server was already part of the cluster, it doesn't need to be caught up. I'm not convinced this is better than the current approach, though (at least without a real-world use case).

In our specific case, node1 is never coming back so this wouldn't work. The workaround I was able to come up with checks:

- That every server in the new configuration but not in the old one is caught up.
- That a majority of the new servers are caught up and online.
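
Roughly, the two conditions look like the Python sketch below. This is a simplified model rather than the actual patch: the function name is made up, and `caught_up` is assumed to contain only servers that are both online and caught up.

```python
def workaround_check(old_config, new_config, caught_up):
    """Model of the workaround: every joining server is caught up, and a
    majority of the new configuration is caught up (and therefore online)."""
    old_config, new_config = set(old_config), set(new_config)
    caught_up = set(caught_up)

    joining = new_config - old_config
    if not joining <= caught_up:
        return False  # a brand-new server is still behind

    return len(new_config & caught_up) > len(new_config) // 2


old = {"node1", "node2", "node3"}
new = {"node1", "node2", "node3", "node4"}

# node1 is gone for good, but node4 is caught up and 3 of 4 new servers are healthy.
print(workaround_check(old, new, caught_up={"node2", "node3", "node4"}))  # True

# node4 has not caught up yet, so the reconfiguration is refused.
print(workaround_check(old, new, caught_up={"node2", "node3"}))  # False
```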

I also verified that if I brought node1 back online it caught up and membership was accurate.


mmarod commented on June 9, 2024

Going to close this out -- thanks for the help

