Comments (18)
Hmm, I suppose that per #12 you don't want all 20 nodes fighting for leadership?
This feature will already help a lot.
And actually, nothing stops us from running all connects in parallel. In that case the "slowest" node determines the maximum execution time, which should not exceed connect_timeout=3s + statement_timeout=2s ≈ 5s.
from patroni.
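The parallel-connect idea above can be sketched with a thread pool: wall time then tracks the slowest member rather than the sum of all checks. This is a minimal illustration with a stub `check_member` standing in for a real PostgreSQL (or API) probe; the member names and delays are made up.

```python
import concurrent.futures
import time

# Hypothetical per-node check; a real probe would open a PostgreSQL or API
# connection with connect_timeout/statement_timeout set.
def check_member(name, delay):
    time.sleep(delay)          # stand-in for connect + query latency
    return name, delay

members = {"node1": 0.1, "node2": 0.3, "node3": 0.2}

start = time.monotonic()
with concurrent.futures.ThreadPoolExecutor(max_workers=len(members)) as pool:
    futures = [pool.submit(check_member, n, d) for n, d in members.items()]
    # A real caller would cap this at ~5s (connect_timeout + statement_timeout).
    results = [f.result(timeout=5)
               for f in concurrent.futures.as_completed(futures)]
elapsed = time.monotonic() - start

# elapsed is ~0.3s (the slowest member), not ~0.6s (the sum).
```

The same bound holds with 20 members: total time stays at the slowest single probe, at the cost of 20 concurrent connections.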
Connecting in parallel in an undifferentiated 20-node cluster would not be a great solution, IMHO. That's 400 connections being formed between members of the cluster in the space of a few seconds.
Yeah, per #12 in the 20-node cluster I have there's only going to actually be three designated failover replicas, so this won't be an actual operational issue once I implement #12. However, I can completely see other users setting up an undifferentiated pool of 20+ servers.
The "let's connect to all of the other replicas directly" approach also seems to me to be against the general design of patroni/governor. It bypasses the SIS and the REST API. If we're going to check each of the other replicas individually, at the least we should use the API instead of making a PostgreSQL connection. But that's beside the point, because I want the current recovery point pushed to the SIS for each replica regardless, because it's useful for monitoring. And if it's there, why not use it?
from patroni.
Also, there is nothing Patroni does which is more time-critical than failover. So attempting to connect to all of the other PostgreSQL servers directly is a delay we don't want for large clusters. We'll have to connect to some, but I'd like to have a limit on the number regardless of the cluster size.
from patroni.
Here's the logic I'm picturing, assuming that check_all is false:
sort members.recovery_locations
if am-highest-member:
    connect to 2 next-highest members (API)
    check that they are not masters
    check that they are behind:
        yes: seize master key
        no: defer
if not-highest:
    connect to highest (API), check healthy:
        if yes: defer
        if no: remove from list, pull fresh data, and re-sort
The not-highest chain of logic is required in case timing issues cause the server which is actually ahead to think it's not ahead, or if the ahead-most replica is actually down as well as the master.
from patroni.
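The decision flow above can be sketched in a few lines of Python. This is only an illustration of the logic, not Patroni code: `api_status` stands in for the REST API call, and the member names and status fields are hypothetical.

```python
# Sketch of the check_all=False flow: members maps name -> recovery location,
# api_status(name) stands in for the REST API call to that member.
def decide(me, members, api_status):
    # Rank members by reported recovery location, highest first.
    ranked = sorted(members, key=lambda m: members[m], reverse=True)
    if ranked[0] == me:
        # Verify against the two next-highest members.
        for peer in ranked[1:3]:
            status = api_status(peer)
            if status["role"] == "master" or status["location"] > members[me]:
                return "defer"
        return "seize_leader_key"
    # Not highest: is the highest member healthy?
    if api_status(ranked[0])["healthy"]:
        return "defer"
    return "refresh_and_resort"   # drop it, pull fresh data, try again

members = {"a": 300, "b": 200, "c": 100}
statuses = {"a": {"role": "replica", "location": 300, "healthy": True},
            "b": {"role": "replica", "location": 200, "healthy": True},
            "c": {"role": "replica", "location": 100, "healthy": True}}
print(decide("a", members, statuses.__getitem__))   # -> seize_leader_key
print(decide("b", members, statuses.__getitem__))   # -> defer
```

Note that only the apparent leader contacts peers (at most two), which is what keeps the connection count bounded regardless of cluster size.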
Hmmm. Still a problem; in that chain of logic, we still potentially have 18 replicas trying to connect to one highest replica in the same couple seconds. If the replicas are heavily loaded and near their connection limits, this could cause the highest to not respond.
This is a reason to go through the API, I think; we can have the API differentiate between "connection limit" and "down". If it returns "connection limit", then the other replicas retry in a loop.
The alternative is to have the API have some kind of data caching, but that seems like more of a pain.
from patroni.
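The "retry on connection limit" behavior suggested above might look like this. Purely a sketch under assumed response strings; `poll_peer` stands in for the REST API call, and the status values are hypothetical.

```python
import time

# Retry while the peer reports "connection limit" (alive but saturated);
# any other answer ("ok", "down") is decided immediately.
def wait_for_peer(poll_peer, retries=5, backoff=0.1):
    for _ in range(retries):
        status = poll_peer()
        if status == "connection limit":
            time.sleep(backoff)        # peer is alive but busy: retry
            continue
        return status
    return "down"                      # treat persistent saturation as down

# Simulated peer that is saturated for the first two polls.
responses = iter(["connection limit", "connection limit", "ok"])
print(wait_for_peer(lambda: next(responses)))  # -> ok
```

The point of distinguishing the two states is exactly the one made above: a saturated replica should not be mistaken for a dead one during a failover race.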
Actually, we have a permanent connection to Postgres from Patroni, and the API just uses this connection.
So we should not have a "connection limit" problem if we go through the API.
Personally, I think we should just go through the API in parallel with a timeout of 2-3 seconds.
I understand your wish to write as much data into the SIS as possible, and as often as possible, but...
In my experience, a 3-member etcd cluster (2.0.13) was able to write only ~4-5 requests/sec with one thread. Sure, it scales if you have more threads, but I have no idea how well. There are definitely some performance improvements in 2.1; I haven't tested it yet.
BTW, should we really mix different tasks, detailed monitoring and HA?
from patroni.
Well, the etcd write speed is easy to test. Also, we don't really need more writes than a few per second anyway; I was thinking of updating every 5-10s.
If we publish as-complete-as-possible replication data to the SIS, then we enable writing more sophisticated failover logic if required in the future, while adhering to the "autonomous nodes" design. This is largely the same information we need for monitoring, so that's convenient.
Having all of that information in one place then makes it really simple to write a monitoring client for general replication health, and gives us hooks for both a CLI and a Web UI without needing separate engines. We could even write a JavaScript-only client, no webserver required. I probably won't, just because my JS isn't that great, but who knows? Certainly, simply polling the SIS is going to be MUCH faster for a client than having to poll the API of each of the individual servers.
Also, certain actions need to be taken partly via the SIS. Manual promotion, for one, because we need to both shut down the master AND notify all of the replicas that there is a chosen new master.
from patroni.
We did some simple testing using ApacheBench against a 5-node etcd cluster running 2.1.3. On small documents (< 1KB) we were getting > 2000 writes per second.
from patroni.
@feikesteenbergen, you forgot to mention that we are using tmpfs as the data storage for etcd.
tmpfs performs way faster than regular hard drives (even SSDs). For us it's OK, because we are running the etcd cluster on AWS in an autoscaling group across 3 AZs. But if somebody runs the cluster in just one datacenter, I wouldn't recommend our approach.
from patroni.
All this is beside the point. If someone is having performance issues on etcd, there are ways to fix that. Working around theoretically possible perf issues in our code, and in the process creating predictable failure issues, is just dumb.
from patroni.
@jberkus, you are right, it's not important.
Let's return to the original topic.
Sure, we can push replay_location for every node into the SIS. It's not a big problem and it's very useful.
But the master can generate a lot of WAL segments. For example, some of our databases generate ~10GB/hour (roughly 3MB/sec). Keeping in mind that each node pushes data into the SIS at a different time, do you really want to make predictions in this case? And how good would they be?
Actually, I did some performance testing of our API module. On my machine it can handle ~250 requests/sec. Here I am talking about real HTTP/1.0 requests, i.e. without keep-alive.
So connecting to all nodes in parallel and fetching status will, in the best case, take less than 0.1 sec. In the worst case, some requests will hit the connection timeout.
And one more important question: what if somebody just kills patroni/governor with "-9"? In this case Postgres will stay alive. Should we consider this scenario when checking that the former "master/leader" (according to the SIS) is not reachable? I mean not patroni/governor, but the real Postgres.
from patroni.
The above is why it's critical that the receive point write cycle be at least 3X as frequent as the checks for master timeout. That way, there will be a lag of at least a few seconds (and up to 30s) between the last master push and a replica polling the SIS for last receive point, so the location should be relatively accurate.
However, per my algo above, we can't eliminate all API checks based on SIS information. So if you feel strongly that we should poll the API instead ... and you believe the API will hold up under a burst of polling ... then I don't feel strongly enough about it to argue it right now (I might later if I have a larger cluster).
The API webserver is threaded though, no? Given that, won't it actually use a DB connection per concurrent request?
The kill -9 issue seems like a separate issue to me. Unless you believe there's some reason why the OOM killer would target Patroni, that falls under the heading of "protecting stupid users from themselves", something we really don't want to get into.
from patroni.
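The "3X as frequent" rule above amounts to a simple staleness bound, sketched below. The 30-second check interval is an assumption taken from the discussion, not a Patroni default.

```python
# Back-of-envelope check of the 3X rule: if each member pushes its receive
# point every push_interval seconds, then the pushed location a checker
# reads from the SIS is at most push_interval seconds old.
check_interval = 30                   # seconds between master-timeout checks (assumed)
push_interval = check_interval / 3    # "at least 3X as frequent"

worst_case_staleness = push_interval  # a push just missed the poll
print(worst_case_staleness)           # 10.0 seconds at most
```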
Also ... if Patroni is down, do we care if the DB server is still up? In any reasonable HAProxy config, if the API isn't responding to requests, it will route traffic away from that node.
from patroni.
And ... if the API is not threaded, then what happens if it gets a burst of requests too fast to answer? Do they queue, or do they return an error? If the latter, we could have problems.
from patroni.
> Also ... if Patroni is down, do we care if the DB server is still up? In any reasonable HAProxy config, if the API isn't responding to requests, it will route traffic away from that node.

Got it, less work for us.

> That way, there will be a lag of at least a few seconds (and up to 30s) between the last master push and a replica polling the SIS for last receive point, so the location should be relatively accurate.

You forgot about the case when the master has been shut down cleanly and the leader lock has been released voluntarily...

> And ... if the API is not threaded, then what happens if it gets a burst of requests too fast to answer? Do they queue, or do they return an error? If the latter, we could have problems.

The API is threaded, but it uses a single connection to Postgres. In a way, requests are queued because of that, but we only run simple queries, and the overhead of Python "threads" is usually way higher...
20 parallel requests:
$ ab -n 20 -c 20 http://127.0.0.1:8008/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient).....done
Server Software: BaseHTTP/0.6
Server Hostname: 127.0.0.1
Server Port: 8008
Document Path: /
Document Length: 127 bytes
Concurrency Level: 20
Time taken for tests: 0.219 seconds
Complete requests: 20
Failed requests: 0
Write errors: 0
Total transferred: 5000 bytes
HTML transferred: 2540 bytes
Requests per second: 91.29 [#/sec] (mean)
Time per request: 219.088 [ms] (mean)
Time per request: 10.954 [ms] (mean, across all concurrent requests)
Transfer rate: 22.29 [Kbytes/sec] received
And 100 parallel requests
$ ab -n 100 -c 100 http://127.0.0.1:8008/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient).....done
Server Software: BaseHTTP/0.6
Server Hostname: 127.0.0.1
Server Port: 8008
Document Path: /
Document Length: 127 bytes
Concurrency Level: 100
Time taken for tests: 0.206 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Total transferred: 25000 bytes
HTML transferred: 12700 bytes
Requests per second: 486.52 [#/sec] (mean)
Time per request: 205.543 [ms] (mean)
Time per request: 2.055 [ms] (mean, across all concurrent requests)
Transfer rate: 118.78 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.5 0 2
Processing: 4 11 27.7 7 204
Waiting: 4 10 27.7 6 203
Total: 4 11 27.9 7 205
Percentage of the requests served within a certain time (ms)
50% 7
66% 7
75% 8
80% 8
90% 10
95% 13
98% 205
99% 205
100% 205 (longest request)
Results are almost the same.
from patroni.
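The pattern described above (many handler threads, one shared Postgres connection) can be illustrated with a lock serializing access. This is only a sketch of the idea: `FakeConnection` stands in for the real database connection, and a real server would hold the lock only around the query, exactly as here.

```python
import threading

# One shared "connection"; the assertion inside query() would fire if two
# threads ever used it concurrently.
class FakeConnection:
    def __init__(self):
        self.in_use = False
    def query(self):
        assert not self.in_use, "connection used concurrently!"
        self.in_use = True
        result = "replica status"
        self.in_use = False
        return result

conn = FakeConnection()
lock = threading.Lock()
results = []

def handle_request():
    with lock:                 # serialize queries on the shared connection
        results.append(conn.query())

threads = [threading.Thread(target=handle_request) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))            # all 20 requests served, one at a time
```

Because each query is short, the queueing behind the lock adds little latency, which matches the ab numbers above: the slow tail comes from thread scheduling, not from the database.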
Did some tests with SSL:
20 parallel requests:
$ ab -n 20 -c 20 https://127.0.0.1:8008/
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient).....done
Server Software: BaseHTTP/0.3
Server Hostname: 127.0.0.1
Server Port: 8008
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
Document Path: /
Document Length: 184 bytes
Concurrency Level: 20
Time taken for tests: 0.050 seconds
Complete requests: 20
Failed requests: 0
Total transferred: 6140 bytes
HTML transferred: 3680 bytes
Requests per second: 397.68 [#/sec] (mean)
Time per request: 50.292 [ms] (mean)
Time per request: 2.515 [ms] (mean, across all concurrent requests)
Transfer rate: 119.23 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 3 14 4.4 16 18
Processing: 1 1 0.3 1 2
Waiting: 0 1 0.3 1 2
Total: 4 15 4.4 17 19
Percentage of the requests served within a certain time (ms)
50% 17
66% 17
75% 18
80% 18
90% 18
95% 19
98% 19
99% 19
100% 19 (longest request)
And 100 parallel requests
$ ab -n 100 -c 100 https://127.0.0.1:8008/
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 127.0.0.1 (be patient).....done
Server Software: BaseHTTP/0.3
Server Hostname: 127.0.0.1
Server Port: 8008
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
Document Path: /
Document Length: 184 bytes
Concurrency Level: 100
Time taken for tests: 0.288 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 30700 bytes
HTML transferred: 18400 bytes
Requests per second: 347.22 [#/sec] (mean)
Time per request: 288.000 [ms] (mean)
Time per request: 2.880 [ms] (mean, across all concurrent requests)
Transfer rate: 104.10 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 5 18 3.1 19 28
Processing: 1 1 0.6 1 6
Waiting: 0 1 0.6 1 5
Total: 7 20 3.1 20 29
Percentage of the requests served within a certain time (ms)
50% 20
66% 21
75% 22
80% 22
90% 22
95% 24
98% 26
99% 29
100% 29 (longest request)
from patroni.
OK, seems like a plan then. Great!
from patroni.
Implemented by PR #32
from patroni.