Comments (18)
Hmm, I suppose that per #12 you don't want all 20 nodes fighting for leadership?
This feature will already help a lot.
And actually, nothing stops us from running all connects in parallel. In that case the "slowest" node determines the maximum execution time, which should not exceed connect_timeout=3s + statement_timeout=2s ≈ 5s.
from patroni.
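The parallel-connect idea above can be sketched with a thread pool: wall time then tracks the slowest member rather than the sum of all checks. This is a minimal illustration with a stub `check_member` standing in for a real PostgreSQL (or API) probe; the member names and delays are made up.

```python
import concurrent.futures
import time

# Hypothetical per-node check; a real probe would open a PostgreSQL or API
# connection with connect_timeout/statement_timeout set.
def check_member(name, delay):
    time.sleep(delay)          # stand-in for connect + query latency
    return name, delay

members = {"node1": 0.1, "node2": 0.3, "node3": 0.2}

start = time.monotonic()
with concurrent.futures.ThreadPoolExecutor(max_workers=len(members)) as pool:
    futures = [pool.submit(check_member, n, d) for n, d in members.items()]
    # A real caller would cap this at ~5s (connect_timeout + statement_timeout).
    results = [f.result(timeout=5)
               for f in concurrent.futures.as_completed(futures)]
elapsed = time.monotonic() - start

# elapsed is ~0.3s (the slowest member), not ~0.6s (the sum).
```

The same bound holds with 20 members: total time stays at the slowest single probe, at the cost of 20 concurrent connections.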
Connecting in parallel in an undifferentiated 20-node cluster would not be a great solution, IMHO. That's 400 connections being formed between members of the cluster in the space of a few seconds.
Yeah, per #12 in the 20-node cluster I have there's only going to actually be three designated failover replicas, so this won't be an actual operational issue once I implement #12. However, I can completely see other users setting up an undifferentiated pool of 20+ servers.
The "let's connect to all of the other replicas directly" approach also seems to me to be against the general design of patroni/governor. It bypasses the SIS and the REST API. If we're going to check each of the other replicas individually, at the least we should use the API instead of making a PostgreSQL connection. But that's beside the point, because I want the current recovery point pushed to the SIS for each replica regardless, because it's useful for monitoring. And if it's there, why not use it?
from patroni.
Also, there is nothing Patroni does which is more time-critical than failover. So attempting to connect to all of the other PostgreSQL servers directly is a delay we don't want for large clusters. We'll have to connect to some, but I'd like to have a limit on the number regardless of the cluster size.
from patroni.
Here's the logic I'm picturing, assuming that check_all is false:
sort members.recovery_locations
if am-highest-member:
    connect to 2 next-highest members (API)
    check that they are not masters
    check that they are behind:
        yes: seize master key
        no: defer
if not-highest:
    connect to highest (API), check healthy:
        if yes: defer
        if no: remove from list, pull fresh data, and re-sort
The not-highest chain of logic is required in case timing issues cause the server which is actually ahead to think it's not ahead, or if the ahead-most replica is actually down as well as the master.
from patroni.
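The decision flow above can be sketched in a few lines of Python. This is only an illustration of the logic, not Patroni code: `api_status` stands in for the REST API call, and the member names and status fields are hypothetical.

```python
# Sketch of the check_all=False flow: members maps name -> recovery location,
# api_status(name) stands in for the REST API call to that member.
def decide(me, members, api_status):
    # Rank members by reported recovery location, highest first.
    ranked = sorted(members, key=lambda m: members[m], reverse=True)
    if ranked[0] == me:
        # Verify against the two next-highest members.
        for peer in ranked[1:3]:
            status = api_status(peer)
            if status["role"] == "master" or status["location"] > members[me]:
                return "defer"
        return "seize_leader_key"
    # Not highest: is the highest member healthy?
    if api_status(ranked[0])["healthy"]:
        return "defer"
    return "refresh_and_resort"   # drop it, pull fresh data, try again

members = {"a": 300, "b": 200, "c": 100}
statuses = {"a": {"role": "replica", "location": 300, "healthy": True},
            "b": {"role": "replica", "location": 200, "healthy": True},
            "c": {"role": "replica", "location": 100, "healthy": True}}
print(decide("a", members, statuses.__getitem__))   # -> seize_leader_key
print(decide("b", members, statuses.__getitem__))   # -> defer
```

Note that only the apparent leader contacts peers (at most two), which is what keeps the connection count bounded regardless of cluster size.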
Hmmm. Still a problem; in that chain of logic, we still potentially have 18 replicas trying to connect to one highest replica in the same couple seconds. If the replicas are heavily loaded and near their connection limits, this could cause the highest to not respond.
This is a reason to go through the API, I think; we can have the API differentiate between "connection limit" and "down". If it returns "connection limit", then the other replicas retry in a loop.
The alternative is to have the API have some kind of data caching, but that seems like more of a pain.
from patroni.
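The "retry on connection limit" behavior suggested above might look like this. Purely a sketch under assumed response strings; `poll_peer` stands in for the REST API call, and the status values are hypothetical.

```python
import time

# Retry while the peer reports "connection limit" (alive but saturated);
# any other answer ("ok", "down") is decided immediately.
def wait_for_peer(poll_peer, retries=5, backoff=0.1):
    for _ in range(retries):
        status = poll_peer()
        if status == "connection limit":
            time.sleep(backoff)        # peer is alive but busy: retry
            continue
        return status
    return "down"                      # treat persistent saturation as down

# Simulated peer that is saturated for the first two polls.
responses = iter(["connection limit", "connection limit", "ok"])
print(wait_for_peer(lambda: next(responses)))  # -> ok
```

The point of distinguishing the two states is exactly the one made above: a saturated replica should not be mistaken for a dead one during a failover race.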
Actually, we have a permanent connection to Postgres from Patroni, and the API just uses this connection.
So we should not have a "connection limit" problem if we go through the API.
Personally, I think we should just go through the API in parallel with a timeout of 2-3 seconds.
I understand your wish to write as much data into the SIS as possible, and as often as possible, but...
In my experience, a 3-member etcd cluster (2.0.13) was able to write only ~4-5 requests/sec with one thread. Sure, it scales if you have more threads, but I have no idea how well. There are definitely some performance improvements in 2.1; I haven't tested it yet.
BTW, should we really mix different tasks, detailed monitoring and HA?
from patroni.
Well, the etcd write speed is easy to test. Also, we don't really need more writes than a few per second anyway; I was thinking of updating every 5-10s.
If we publish as-complete-as-possible replication data to the SIS, then we enable writing more sophisticated failover logic if required in the future, while adhering to the "autonomous nodes" design. This is largely the same information we need for monitoring, so that's convenient.
Having all of that information in one place then makes it really simple to write a monitoring client for general replication health, and gives us hooks for both a CLI and a Web UI without needing separate engines. We could even write a JavaScript-only client, no webserver required. I probably won't, just because my JS isn't that great, but who knows? Certainly, simply polling the SIS is going to be MUCH faster for a client than having to poll the API of each of the individual servers.
Also, certain actions need to be taken partly via the SIS. Manual promotion, for one, because we need to both shut down the master AND notify all of the replicas that there is a chosen new master.
from patroni.
We did some simple testing using ApacheBench against a 5-node etcd cluster running 2.1.3. On small documents (< 1KB) we were getting > 2000 writes per second.
from patroni.
@feikesteenbergen, you forgot to mention that we are using tmpfs as the data storage for etcd.
tmpfs performs way faster than regular hard drives (even SSDs). For us it's OK, because we are running the etcd cluster on AWS in an autoscaling group across 3 AZs. But if somebody runs the cluster in just one datacenter, I wouldn't recommend our approach.
from patroni.
All this is beside the point. If someone is having performance issues on etcd, there are ways to fix that. Working around theoretically possible perf issues in our code, and in the process creating predictable failure issues, is just dumb.
from patroni.
@jberkus, you are right, it's not important.
Let's return to the original topic.
Sure, we can push replay_location for every node into the SIS. It's not a big problem and it's very useful.
But the master can generate a lot of WAL segments. For example, some of our databases generate ~10GB/hour (roughly 3MB/sec). Keeping in mind that each node pushes data into the SIS at a different time, do you really want to make predictions in this case? And how good would they be?
Actually, I did some performance testing of our API module. On my machine it can handle ~250 requests/sec. Here I am talking about real HTTP/1.0 requests, i.e. without keep-alive.
So connecting to all nodes in parallel and fetching status will, in the best case, take less than 0.1 sec. In the worst case, some requests will hit the connection timeout.
And one more important question: what if somebody just kills patroni/governor with "-9"? In this case Postgres will stay alive. Should we consider this scenario when checking that the former "master/leader" (according to the SIS) is not reachable? I mean not patroni/governor, but the real Postgres.
from patroni.
The above is why it's critical that the receive point write cycle be at least 3X as frequent as the checks for master timeout. That way, there will be a lag of at least a few seconds (and up to 30s) between the last master push and a replica polling the SIS for last receive point, so the location should be relatively accurate.
However, per my algo above, we can't eliminate all API checks based on SIS information. So if you feel strongly that we should poll the API instead ... and you believe the API will hold up under a burst of polling ... then I don't feel strongly enough about it to argue it right now (I might later if I have a larger cluster).
The API webserver is threaded though, no? Given that, won't it actually use a DB connection per concurrent request?
The kill -9 issue seems like a separate issue to me. Unless you believe there's some reason why the OOM killer would target Patroni, that falls under the heading of "protecting stupid users from themselves", something we really don't want to get into.
from patroni.
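The "3X as frequent" rule above amounts to a simple staleness bound, sketched below. The 30-second check interval is an assumption taken from the discussion, not a Patroni default.

```python
# Back-of-envelope check of the 3X rule: if each member pushes its receive
# point every push_interval seconds, then the pushed location a checker
# reads from the SIS is at most push_interval seconds old.
check_interval = 30                   # seconds between master-timeout checks (assumed)
push_interval = check_interval / 3    # "at least 3X as frequent"

worst_case_staleness = push_interval  # a push just missed the poll
print(worst_case_staleness)           # 10.0 seconds at most
```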
Also ... if Patroni is down, do we care if the DB server is still up? In any reasonable HAProxy config, if the API isn't responding to requests, it will route traffic away from that node.
from patroni.
And ... if the API is not threaded, then what happens if it gets a burst of requests too fast to answer? Do they queue, or do they return an error? If the latter, we could have problems.
from patroni.
> Also ... if Patroni is down, do we care if the DB server is still up? In any reasonable HAProxy config, if the API isn't responding to requests, it will route traffic away from that node.

Got it, less work for us.

> That way, there will be a lag of at least a few seconds (and up to 30s) between the last master push and a replica polling the SIS for last receive point, so the location should be relatively accurate.

You forgot about the case when the master has been shut down cleanly and the leader lock has been released voluntarily...

> And ... if the API is not threaded, then what happens if it gets a burst of requests too fast to answer? Do they queue, or do they return an error? If the latter, we could have problems.

The API is threaded, but it uses a single connection to Postgres. In a way, requests are queued because of that, but we only run simple queries, and the overhead of Python "threads" is usually way higher...
20 parallel requests:
$ ab -n 20 -c 20 http://127.0.0.1:8008/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient).....done
Server Software: BaseHTTP/0.6
Server Hostname: 127.0.0.1
Server Port: 8008
Document Path: /
Document Length: 127 bytes
Concurrency Level: 20
Time taken for tests: 0.219 seconds
Complete requests: 20
Failed requests: 0
Write errors: 0
Total transferred: 5000 bytes
HTML transferred: 2540 bytes
Requests per second: 91.29 [#/sec] (mean)
Time per request: 219.088 [ms] (mean)
Time per request: 10.954 [ms] (mean, across all concurrent requests)
Transfer rate: 22.29 [Kbytes/sec] received
And 100 parallel requests
$ ab -n 100 -c 100 http://127.0.0.1:8008/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient).....done
Server Software: BaseHTTP/0.6
Server Hostname: 127.0.0.1
Server Port: 8008
Document Path: /
Document Length: 127 bytes
Concurrency Level: 100
Time taken for tests: 0.206 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Total transferred: 25000 bytes
HTML transferred: 12700 bytes
Requests per second: 486.52 [#/sec] (mean)
Time per request: 205.543 [ms] (mean)
Time per request: 2.055 [ms] (mean, across all concurrent requests)
Transfer rate: 118.78 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.5 0 2
Processing: 4 11 27.7 7 204
Waiting: 4 10 27.7 6 203
Total: 4 11 27.9 7 205
Percentage of the requests served within a certain time (ms)
50% 7
66% 7
75% 8
80% 8
90% 10
95% 13
98% 205
99% 205
100% 205 (longest request)
Results are almost the same.
from patroni.
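The pattern described above (many handler threads, one shared Postgres connection) can be illustrated with a lock serializing access. This is only a sketch of the idea: `FakeConnection` stands in for the real database connection, and a real server would hold the lock only around the query, exactly as here.

```python
import threading

# One shared "connection"; the assertion inside query() would fire if two
# threads ever used it concurrently.
class FakeConnection:
    def __init__(self):
        self.in_use = False
    def query(self):
        assert not self.in_use, "connection used concurrently!"
        self.in_use = True
        result = "replica status"
        self.in_use = False
        return result

conn = FakeConnection()
lock = threading.Lock()
results = []

def handle_request():
    with lock:                 # serialize queries on the shared connection
        results.append(conn.query())

threads = [threading.Thread(target=handle_request) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))            # all 20 requests served, one at a time
```

Because each query is short, the queueing behind the lock adds little latency, which matches the ab numbers above: the slow tail comes from thread scheduling, not from the database.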
Did some tests with SSL:
20 parallel requests:
$ ab -n 20 -c 20 https://127.0.0.1:8008/
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient).....done
Server Software: BaseHTTP/0.3
Server Hostname: 127.0.0.1
Server Port: 8008
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
Document Path: /
Document Length: 184 bytes
Concurrency Level: 20
Time taken for tests: 0.050 seconds
Complete requests: 20
Failed requests: 0
Total transferred: 6140 bytes
HTML transferred: 3680 bytes
Requests per second: 397.68 [#/sec] (mean)
Time per request: 50.292 [ms] (mean)
Time per request: 2.515 [ms] (mean, across all concurrent requests)
Transfer rate: 119.23 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 3 14 4.4 16 18
Processing: 1 1 0.3 1 2
Waiting: 0 1 0.3 1 2
Total: 4 15 4.4 17 19
Percentage of the requests served within a certain time (ms)
50% 17
66% 17
75% 18
80% 18
90% 18
95% 19
98% 19
99% 19
100% 19 (longest request)
And 100 parallel requests
$ ab -n 100 -c 100 https://127.0.0.1:8008/
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient).....done
Server Software: BaseHTTP/0.3
Server Hostname: 127.0.0.1
Server Port: 8008
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
Document Path: /
Document Length: 184 bytes
Concurrency Level: 100
Time taken for tests: 0.288 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 30700 bytes
HTML transferred: 18400 bytes
Requests per second: 347.22 [#/sec] (mean)
Time per request: 288.000 [ms] (mean)
Time per request: 2.880 [ms] (mean, across all concurrent requests)
Transfer rate: 104.10 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 5 18 3.1 19 28
Processing: 1 1 0.6 1 6
Waiting: 0 1 0.6 1 5
Total: 7 20 3.1 20 29
Percentage of the requests served within a certain time (ms)
50% 20
66% 21
75% 22
80% 22
90% 22
95% 24
98% 26
99% 29
100% 29 (longest request)
from patroni.
OK, seems like a plan then. Great!
from patroni.
Implemented by PR #32
from patroni.