Comments (8)
I managed to reproduce this on a test cluster as follows:
- 3 machines in dc1
- 3 machines in dc2
- consul-replicate, running on a consul server in dc2, replicates 5 different prefixes from dc1 to dc2:
    source = "pub@dc1"
    source = "shared@dc1"
    source = "priv@dc1"
    source = "consul-template@dc1"
    source = "event/input@dc1"
The pub prefix holds ~120k records.
On a server in dc1, every 10 seconds I "touch" 9 keys and erase 1 random key:
    while true; do
      cnt=0
      # Pick 10 random keys; rewrite the first 9 in place and delete the 10th.
      consul kv get -recurse pub | shuf | head -n 10 | sed 's/:/ /' | while read -r key value; do
        if [[ ${cnt} == 9 ]]; then
          consul kv delete "${key}"
        else
          consul kv put "${key}" "${value}"
        fi
        cnt=$((cnt + 1))
      done
      sleep 10
    done
In a separate console I send SIGKILL to all consul servers in dc1, which are immediately started again by the "perp" supervisor:
    while true; do
      # Kill consul on all three dc1 servers at once; perp restarts it.
      for i in sof1 sof2 sof3; do
        ssh "${i}" 'perpctl k consul' &
      done
      sleep 30
    done
After several passes, all keys under the pub/ prefix are deleted:
2018-03-19 12:12:55.432404 2018/03/19 12:12:55.432397 [INFO] (runner) replicated 0 updates, 127739 deletes
The interesting thing is that the data from the other prefixes is received and shown as already replicated, and is therefore not marked for deletion:
2018-03-19 12:08:33.180186 2018/03/19 12:08:33.180182 [DEBUG] (runner) skipping because "event/input/cc77a17c-5e48-4614-995f-470dd5f4ed0b" is already replicated
However, consul-replicate appears to receive blank data from the watch on the "pub" prefix and concludes that all records were erased from the master DC. As a result, they are removed from the follower DC as well.
I am not sure whether the problem is in consul-replicate, which may not wait long enough for the master DC cluster to reach a consistent state and so receives incomplete keyprefix watch data, or in Consul itself, which under some unknown conditions sends blank data in response to keyprefix watches.
Still not sure how blank data for our pub prefix got past this check as it appears to be per-prefix:
    // Get the last status
    status, err := r.getStatus(prefix)
    if err != nil {
        errCh <- fmt.Errorf("failed to read replication status: %s", err)
        return
    }
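To make the suspected failure mode concrete: if the source snapshot comes back empty while the destination still holds many keys, a replicator that trusts the snapshot will plan a delete for every destination key. The following is only a sketch of a possible guard, not consul-replicate's actual code. The names `planDeletes`, `maxDeleteRatio` and `errSuspiciousDelete` are hypothetical and exist only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errSuspiciousDelete (hypothetical) signals that a source snapshot would
// remove more of the destination than the configured ratio allows.
var errSuspiciousDelete = errors.New("refusing mass delete: source snapshot looks incomplete")

// planDeletes returns the destination keys that are absent from the source
// snapshot. maxDeleteRatio is a hypothetical knob: the largest fraction of
// destination keys that may be deleted in a single pass.
func planDeletes(source, dest []string, maxDeleteRatio float64) ([]string, error) {
	inSource := make(map[string]struct{}, len(source))
	for _, k := range source {
		inSource[k] = struct{}{}
	}

	var deletes []string
	for _, k := range dest {
		if _, ok := inSource[k]; !ok {
			deletes = append(deletes, k)
		}
	}

	// An empty or truncated source snapshot shows up as an unusually
	// large delete set; refuse to act on it.
	if len(dest) > 0 && float64(len(deletes)) > maxDeleteRatio*float64(len(dest)) {
		return nil, errSuspiciousDelete
	}
	return deletes, nil
}

func main() {
	dest := []string{"pub/a", "pub/b", "pub/c", "pub/d"}

	// Normal pass: one key was really deleted in the source DC.
	fmt.Println(planDeletes([]string{"pub/a", "pub/b", "pub/c"}, dest, 0.5))

	// Blank snapshot: every destination key would be deleted, so the plan is rejected.
	fmt.Println(planDeletes(nil, dest, 0.5))
}
```

A ratio-based guard like this would also block a legitimate bulk cleanup, so a real implementation would need a way to override it.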
Any assistance will be highly appreciated.
from consul-replicate.
Still not sure if this is a consul issue or a consul-replicate issue.
I checked the consul changelog and found the following:
Do you think it might be related?
I am currently upgrading my test cluster to the latest consul version to see whether that makes the issue go away.
Any assistance will be highly appreciated.
Quick update on the issue.
The problem is still present with the latest available consul version, 1.0.6, and raft version 1.
Steps to reproduce
- create 3 servers in dc1
- create 3 servers in dc2
- start consul-replicate for the selected prefix from dc1 to dc2
- push ~130000 to ~150000 kv records into the prefix
- every 10 seconds, randomly read 7 records and erase 3 records from the prefix
- at the same time, forcefully restart consul on all 3 nodes in the master dc1 every 30 seconds
- test started at 1521549485
- data loss in dc2 at 1521553967
2018-03-20 13:52:44.040452 2018/03/20 13:52:44.040433 [INFO] (runner) replicated 0 updates, 127118 deletes
Fixed by hashicorp/consul#4554
@pierresouchay We are facing a similar issue with consul 1.4.0. After the upgrade, consul-replicate started deleting keys once in a while. We run consul-replicate alongside one of the consul servers. We are currently trying to reproduce the issue.
@vaLski, just wanted to say 'thanks' for this thoroughly documented issue. It really helped us out.
@arecker Have you experienced the same issue again? It is supposed to be fixed in hashicorp/consul#4554. Since we upgraded our consul servers to the 1.2.3 release, which carries the mentioned patch, the problem has gone away. Thanks to @pierresouchay for the fix and for the assistance provided while tracking this down.
@nitsh Did you find the reason for the problem and a solution for it? Kindly share further details if so. If you are still facing this issue, kindly run consul-replicate in trace mode and attach log snippets showing the errors and the debugging output around the point where data is deleted. In my case I was seeing some 5xx class errors in the consul-replicate logs during master-dc leader election, followed shortly afterwards by the deletion of XXXXXXX keys, because consul-replicate was configured with the stale flag set and was syncing data from a raft follower whose raft log? (data) was empty. That is supposed to be fixed already. However, since then consul-replicate in our setup is configured to NEVER use the stale flag, but rather to always query the raft leader in the parent DC.