
Comments (3)

ongardie commented on June 9, 2024

Hi @mmarod, it's been a long time since I've looked at this, but I think that's true. I think the idea was that you probably wouldn't want to reconfigure into an already degraded cluster, as that might be a mistake in specifying the new configuration.
If you had additional unexpected failures during the reconfiguration, that could also get pretty messy. Theoretically only a bare majority of old servers and a bare majority of new servers needs to be available during a reconfiguration, but in practice you probably do want a bit more wiggle room to tolerate failures during reconfiguration.
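
As a rough illustration of the availability requirement described above, the following Python sketch checks whether a set of reachable servers forms a majority of both the old and the new configuration. The function names and the node1 through node4 example are assumptions made for illustration; this is not LogCabin's code.

```python
def has_majority(config, available):
    """Return True if strictly more than half of `config` is in `available`."""
    return len(config & available) * 2 > len(config)


def joint_majority_available(old_config, new_config, available):
    """Joint consensus needs a majority of the old AND of the new configuration."""
    return has_majority(old_config, available) and has_majority(new_config, available)


# Hypothetical scenario: node1 is down and node4 is being added.
old_config = {"node1", "node2", "node3"}
new_config = {"node1", "node2", "node3", "node4"}
available = {"node2", "node3", "node4"}

print(joint_majority_available(old_config, new_config, available))
```

In this scenario the bare-majority condition holds (2 of 3 old servers, 3 of 4 new servers), but one more failure among node2 or node3 would break the old-configuration majority, which is the lack of wiggle room mentioned above.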

In your example, if node1 isn't functioning and you want to reconfigure the cluster, you should probably remove node1 during that reconfiguration. You can replace it with a new node that is available, like node4. Of course, I can imagine wanting to have different behavior depending on your operational requirements. Are you just asking out of curiosity or are you actually using LogCabin for something? (The project isn't actively maintained these days.)

Since you linked to the dissertation, I should mention that LogCabin uses the joint consensus membership change algorithm described in section 4.3 there. Most of the rest of the concepts of the chapter do still apply or transfer over, but I just wanted to clarify.

> Additionally, it also looks like step 2 of the AddServer RPC is not being enforced. The routine is checking that all of _newServers are caught up but not the candidate specifically. This means that if we simply changed the check to be a quorum of staging servers, it would not guarantee that the new server is caught up.

I'm a little confused because the AddServer RPC is described in the dissertation for the single-server membership change algorithm (not the joint consensus algorithm). What is "the candidate" in your question? A "quorum of staging servers" also seems a little sloppy (at least in an imaginary world where servers can be staging for various reasons). You probably mean a quorum of the new servers (a majority of the servers in the new target configuration).

Perhaps you'd want to change the check so that all new servers are up, but if a server was already part of the cluster, it doesn't need to be caught up. I'm not convinced this is better than the current approach, though (at least without a real-world use case).
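
To make the difference concrete, here is a small Python sketch contrasting the check as it is described in this thread (every server in the new configuration must be caught up) with the relaxed variant suggested above. Both functions and their parameters are hypothetical paraphrases of the discussion, not LogCabin's actual implementation.

```python
def strict_check(new_servers, caught_up):
    """Check as described in the thread: every new-configuration server is caught up."""
    return new_servers <= caught_up


def relaxed_check(old_servers, new_servers, online, caught_up):
    """Suggested variant: all new-configuration servers must be up, but only
    servers that were not already cluster members must also be caught up."""
    if not new_servers <= online:
        return False
    newly_added = new_servers - old_servers
    return newly_added <= caught_up


# Hypothetical scenario: node1 is reachable but lagging; node4 is new and caught up.
old_servers = {"node1", "node2", "node3"}
new_servers = {"node1", "node2", "node3", "node4"}
online = {"node1", "node2", "node3", "node4"}
caught_up = {"node2", "node3", "node4"}

print(strict_check(new_servers, caught_up))
print(relaxed_check(old_servers, new_servers, online, caught_up))
```

Note that the relaxed variant still requires every server to be online, so it would not help when an existing member is permanently gone.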

from logcabin.

mmarod commented on June 9, 2024

First off, thanks for the quick and thorough response to a question on an "unmaintained" project!

Taking the example: I think you are right that it would make sense to remove node1 when node4 is added. My company's software that uses LogCabin only supplies initial bootstrap, add, and remove APIs. Perhaps it should also have a "replace" API, which would seemingly get around this issue, or it could do a remove and then an add. The problem is that the software was initially designed with the assumption that if a member goes down, it will come back up at some point (i.e., node1 will eventually come back online). This assumption does not necessarily hold in a cloud environment, however, as node1 could be gone forever (depending on the implementation, of course) with node4 coming up as a replacement. So, when node1 goes down forever, adding node4 becomes impossible without removing node1 first.

> A "quorum of staging servers" also seems a little sloppy (at least in an imaginary world where servers can be staging for various reasons).

Indeed -- I meant new servers specifically here.

> Perhaps you'd want to change the check so that all new servers are up, but if a server was already part of the cluster, it doesn't need to be caught up. I'm not convinced this is better than the current approach, though (at least without a real-world use case).

In our specific case, node1 is never coming back so this wouldn't work. The workaround I was able to come up with checks:

  1. That every server in the new configuration that was not in the old configuration is caught up.
  2. That a majority of the new servers are caught up and online.

I also verified that if I brought node1 back online it caught up and membership was accurate.
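
The two conditions in the workaround above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function name is hypothetical, and `caught_up` is assumed to contain exactly the servers that are both online and caught up.

```python
def workaround_check(old_servers, new_servers, caught_up):
    """Hypothetical sketch of the two-part workaround check.

    `caught_up` is assumed to hold servers that are online AND caught up.
    """
    # 1. Every server that is new to the cluster must be caught up.
    newly_added = new_servers - old_servers
    if not newly_added <= caught_up:
        return False
    # 2. A strict majority of the new configuration must be caught up and online.
    return len(new_servers & caught_up) * 2 > len(new_servers)


# Hypothetical scenario: node1 is gone for good and node4 replaces it.
old_servers = {"node1", "node2", "node3"}
new_servers = {"node1", "node2", "node3", "node4"}

print(workaround_check(old_servers, new_servers, {"node2", "node3", "node4"}))
print(workaround_check(old_servers, new_servers, {"node2", "node3"}))
```

With node1 down, the first call passes because node4 is caught up and 3 of the 4 new servers are healthy; the second fails because node4 itself has not caught up.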


mmarod commented on June 9, 2024

Going to close this out -- thanks for the help

