
vaLski commented on June 2, 2024

I managed to reproduce this on a test cluster as follows:

3 machines in dc1
3 machines in dc2

consul-replicate, running on a Consul server in dc2, replicates 5 different prefixes from dc1 to dc2:

source = "pub@dc1"
source = "shared@dc1"
source = "priv@dc1"
source = "consul-template@dc1"
source = "event/input@dc1"

The pub prefix holds ~120k records.

On a server in dc1, every 10 seconds I "touch" 9 random keys and erase 1 random key:

while true; do
  cnt=0
  consul kv get -recurse pub | shuf | head  -n 10  | sed  's/:/ /'  | while read key value;  do
    if [[ ${cnt} == 9 ]];  then
      consul kv delete "${key}"
    else
      consul kv put "${key}" "${value}"
    fi
    cnt=$((${cnt}+1))
  done
  sleep 10
done

In a separate console I send SIGKILL to all Consul servers in dc1, which are immediately restarted by the "perp" supervisor:

while true; do
  for i in sof1 sof2 sof3; do
    ssh ${i} 'perpctl k consul' &
  done
  sleep 30
done

After several passes, I see all keys under the pub/ prefix being deleted:

2018-03-19 12:12:55.432404 2018/03/19 12:12:55.432397 [INFO] (runner) replicated 0 updates, 127739 deletes
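
For anyone who wants to catch this symptom automatically, here is a minimal sketch that parses the runner summary line shown above and flags a pass with zero updates and a large number of deletes. The names parseSummary and suspicious and the threshold of 1000 are assumptions for illustration; nothing here is part of consul-replicate.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// summaryRe matches the "(runner) replicated N updates, M deletes" summary.
var summaryRe = regexp.MustCompile(`replicated (\d+) updates, (\d+) deletes`)

// parseSummary extracts the update and delete counts from one log line.
// ok is false when the line is not a summary line.
func parseSummary(line string) (updates, deletes int, ok bool) {
	m := summaryRe.FindStringSubmatch(line)
	if m == nil {
		return 0, 0, false
	}
	u, errU := strconv.Atoi(m[1])
	d, errD := strconv.Atoi(m[2])
	if errU != nil || errD != nil {
		return 0, 0, false
	}
	return u, d, true
}

// suspicious reports a pass that deleted at least threshold keys while
// updating none. The threshold is a hypothetical tuning knob.
func suspicious(updates, deletes, threshold int) bool {
	return updates == 0 && deletes >= threshold
}

func main() {
	line := `2018/03/19 12:12:55.432397 [INFO] (runner) replicated 0 updates, 127739 deletes`
	if u, d, ok := parseSummary(line); ok {
		fmt.Printf("updates=%d deletes=%d suspicious=%v\n", u, d, suspicious(u, d, 1000))
	}
}
```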

The interesting thing is that data from the other prefixes is received and reported as already replicated, so it is not marked for deletion:

2018-03-19 12:08:33.180186 2018/03/19 12:08:33.180182 [DEBUG] (runner) skipping because "event/input/cc77a17c-5e48-4614-995f-470dd5f4ed0b" is already replicated

However, it appears that consul-replicate receives blank data from the "pub" prefix watch and concludes that all records were erased in the master DC. As a result, they are removed from the follower DC as well.

I am not sure whether this is a problem with consul-replicate not waiting long enough for the master DC cluster to reach a consistent state, and therefore acting on incomplete keyprefix watch data, or a problem with Consul itself, which under certain unknown conditions sends blank data in response to keyprefix watches.
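
Whichever side is at fault, the failure mode described above is an empty source listing being treated as authoritative. A minimal sketch of a defensive check follows: it computes which destination keys are missing from the source and refuses to proceed when that would remove more than a configured fraction of the destination. The names planDeletes and maxFraction are hypothetical, and this is not consul-replicate's actual logic.

```go
package main

import "fmt"

// planDeletes returns the destination keys that are absent from the source
// listing. It returns an error instead of a plan when the deletes would
// exceed maxFraction of the destination keys (for example, when the source
// listing unexpectedly comes back empty).
func planDeletes(source, dest []string, maxFraction float64) ([]string, error) {
	inSource := make(map[string]struct{}, len(source))
	for _, k := range source {
		inSource[k] = struct{}{}
	}

	var deletes []string
	for _, k := range dest {
		if _, ok := inSource[k]; !ok {
			deletes = append(deletes, k)
		}
	}

	if len(dest) > 0 && float64(len(deletes)) > maxFraction*float64(len(dest)) {
		return nil, fmt.Errorf("refusing to delete %d of %d keys", len(deletes), len(dest))
	}
	return deletes, nil
}

func main() {
	dest := []string{"pub/a", "pub/b", "pub/c", "pub/d"}

	// A normal pass: one key was removed upstream.
	plan, err := planDeletes([]string{"pub/a", "pub/b", "pub/c"}, dest, 0.5)
	fmt.Println(plan, err)

	// A blank source listing: the guard rejects the pass.
	plan, err = planDeletes(nil, dest, 0.5)
	fmt.Println(plan, err)
}
```

The trade-off is that a legitimate bulk delete in the source DC would also be blocked until an operator confirms it, so the fraction has to be chosen per prefix.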

I am still not sure how blank data for our pub prefix got past this check, as it appears to be per-prefix:

// Get the last status
status, err := r.getStatus(prefix)
if err != nil {
    errCh <- fmt.Errorf("failed to read replication status: %s", err)
    return
}

Any assistance will be highly appreciated.

from consul-replicate.

vaLski commented on June 2, 2024

I am still not sure whether this is a Consul issue or a consul-replicate issue.

I checked the Consul changelog and found the following:

hashicorp/consul#2644

Do you think it might be related?

I am currently upgrading my test cluster to the latest Consul version to see whether that makes the issue go away.

Any assistance will be highly appreciated.


vaLski commented on June 2, 2024

Quick update on the issue.

The issue is reproducible even with the latest available Consul version, 1.0.6, and raft version 1.

Steps to reproduce

  • create 3 servers in dc1
  • create 3 servers in dc2
  • start consul-replicate for a selected prefix from dc1 to dc2
  • push ~130000 to ~150000 KV records into the prefix
  • every 10 seconds, read 7 random records from the prefix and erase 3
  • at the same time, forcefully restart Consul on all 3 nodes in the master dc1 every 30 seconds
  • test started at 1521549485 (Unix time)
  • data loss in dc2 at 1521553967 (Unix time)
    2018-03-20 13:52:44.040452 2018/03/20 13:52:44.040433 [INFO] (runner) replicated 0 updates, 127118 deletes


pierresouchay commented on June 2, 2024

Fixed by hashicorp/consul#4554


nitsh commented on June 2, 2024

@pierresouchay We are facing a similar issue with Consul 1.4.0. After the upgrade, consul-replicate started deleting keys once in a while. We run consul-replicate alongside one of the Consul servers. We are currently trying to reproduce the issue.


arecker commented on June 2, 2024

@vaLski, just wanted to say 'thanks' for this thoroughly documented issue. It really helped us out.


vaLski commented on June 2, 2024

@arecker Have you experienced the same issue again? It is supposed to be fixed in hashicorp/consul#4554. Since we upgraded our Consul servers to the 1.2.3 release, which carries that patch, the problem has gone away. Thanks to @pierresouchay for the fix and for the assistance provided while tracking this down.

@nitsh Did you find the cause of the problem and a solution for it? Kindly share further details if so. If you are still facing this issue, kindly run consul-replicate in trace mode and attach log snippets showing the errors and the debugging output around the data deletion. In my case I was seeing some 5xx-class errors in the consul-replicate logs during master-DC leader election, followed shortly afterwards by the deletion of XXXXXXX keys, because consul-replicate was configured with the stale flag set and was syncing data from a raft follower whose raft log (data?) was empty. That is supposed to be fixed already. However, since then, consul-replicate in our setup is configured to NEVER use the stale flag and to always query the raft leader in the parent DC.
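
To illustrate the stale versus leader distinction above: Consul's HTTP API selects the consistency mode with query parameters on the KV endpoint, where `stale` lets any server answer and `consistent` makes the leader verify its leadership before answering. A minimal sketch follows, assuming a hypothetical helper named kvListURL and a local agent at 127.0.0.1:8500. It only builds the request URL and does not show how consul-replicate itself is configured.

```go
package main

import (
	"fmt"
	"net/url"
)

// kvListURL builds a recursive KV listing URL for Consul's HTTP API.
// With consistent=true it asks for a strongly consistent read; otherwise
// it asks for a stale read, which any server (including a follower) may serve.
func kvListURL(base, prefix, dc string, consistent bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/v1/kv/" + prefix

	q := url.Values{}
	q.Set("recurse", "") // list every key under the prefix
	q.Set("dc", dc)      // query the remote (source) datacenter
	if consistent {
		q.Set("consistent", "")
	} else {
		q.Set("stale", "")
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	for _, consistent := range []bool{true, false} {
		s, err := kvListURL("http://127.0.0.1:8500", "pub", "dc1", consistent)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(s)
	}
}
```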


