trivadis / pgoperate Goto Github PK
View Code? Open in Web Editor NEWpgOperate - PostgreSQL Operation Tool
Home Page: https://github.com/Trivadis/pgoperate
License: Apache License 2.0
pgOperate - PostgreSQL Operation Tool
Home Page: https://github.com/Trivadis/pgoperate
License: Apache License 2.0
The command "pgoperate --standbymgr --check" works only from the current primary host. If I start it on the standby host it ends with an error even if ssh with keys works (see below). I think the tool wrongly mixes up the hostnames internally...
Execution on current primary host (works):
vmpg153rz43:/home/pg000235 [pg000235]$ pgoperate --standbymgr --status
Node_number Uniq_name Host Role Mode State WAL_receiver Apply_lag_MB Transfer_lag_MB Transfer_lag_Min
1 olt vmpg031rz44ha STANDBY async UP streaming 0 0 0
2 zhh vmpg153rz43ha MASTER UP
vmpg153rz43:/home/pg000235 [pg000235]$ pgoperate --standbymgr --check --host vmpg031rz44ha --master-host vmpg153rz43ha
Success.
Execution from current standby does not work:
vmpg031rz44:/home/pg000235 [pg000235]$ pgoperate --standbymgr --status
Node_number Uniq_name Host Role Mode State WAL_receiver Apply_lag_MB Transfer_lag_MB Transfer_lag_Min
1 olt vmpg031rz44ha STANDBY async UP streaming 0 0 0
2 zhh vmpg153rz43ha MASTER UP
vmpg031rz44:/home/pg000235 [pg000235]$ pgoperate --standbymgr --check --host vmpg153rz43ha --master-host vmpg153rz43ha
Check failed.
SSH works from standby to primary and vice versa.
Primary -> Standby:
vmpg153rz43:/home/pg000235 [pg000235]$ ssh vmpg031rz44ha 'hostname'
vmpg031rz44
Standby -> Primary:
vmpg031rz44:/home/pg000235 [pg000235]$ ssh vmpg153rz43ha 'hostname'
vmpg153rz43
Currently the pgoperate check module allows to include specific thresholds with the following parameter:
PG_CHECK_LOGFILES_THRESHOLD="ERROR|FATAL|PANIC"
However, it would be helpful to have an additional parameter to exclude specific patterns based on a regexp. Example:
PG_CHECK_LOGFILES_EXCLUDE_PATTERN="<regexp>"
This means the check searches first for the given thresholds and applies then the exclude pattern. Everything that remains afterwards will be treaten as issue.
Could you please add a hint in the documentation that pgoperated-postgres daemon might not start with SELinux enabled
My server is a RHEL 8.3 and uses SELinux mode "Enforcing". After installing pgoperate and executing root.sh script the pgoperated-postgres daemon was not able to start. I found out that it works after disabling SELinux.
However, this is not the best option and therefore I changed the context for the bin directories of pgbasenv and pgoperate.
Please add this information to the README.md.
# install semanage utility
yum install policycoreutils-python-utils
# Set and restore the context
semanage fcontext -a -t bin_t "/var/lib/pgsql/tvdtoolbox/pgbasenv/bin(/.*)?"
semanage fcontext -a -t bin_t "/var/lib/pgsql/tvdtoolbox/pgoperate/bin(/.*)?"
restorecon -R -v /var/lib/pgsql/tvdtoolbox/pgbasenv/bin
# Start of the daemon works now.
systemctl start pgoperated-postgres.service
# The new context is shown with -Z option of ls command
ls -lZ /var/lib/pgsql/tvdtoolbox/pgbasenv/bin
In version v4.2 the path for the file pgoperate-deamon.pid has changed from bin to run folder ($PGOPERATE_BASE/run). As a consequence the PIDFile path in /etc/systemd/system/pgoperated-.service file must be changed as well in case an upgrade from version v4.2 to a newer version is executed.
Otherwise, the pgoperated deamon cannot find the pid file and will restart the postgresql instance every 300 seconds as defined in the standard service file, added by root.sh.
PIDFile=/home/pg600047/tvdtoolbox/pgoperate/run/pgoperate-deamon.pid
Restart=on-failure
TimeoutSec=300
pgoperate --check
distinguishes currently between severities ok and critical. Additionally, it should support the severity warning.
Hi,
If in a cluster template in the PG_DEF_PARAMS section the archive_command parameter is set with the syntax of pgBackRest, the generated postgresql.conf is incorrect.
e.g. the template lines
PG_DEF_PARAMS="
....
archive_command = 'pgbackrest --stanza=pg15-db538c01 archive-push %p'
...
"
results in the postgresql.conf line
archive_command = 'pgbackrest stanza # Modified by PgOperate
So it seems, that special characters like minus char are "parsed away" and everything after the equal sign is also "gone with the wind".
Is there a way to fix this?
Best regards,
Marian
The pgoperate installer install_pgoperate.sh
must not overwrite the file lib/custom_check.lib during an upgrade since the user has there its custom check functions.
Therefore the installer must check if this file already exists and avoid overwriting it during an upgrade.
In case we try to prepare the master it fails if the hostname contains a dash, e.g. vmpg025rz44-ha.
pgoperate --standbymgr --prepare-master
Failed to create replication slot slot_vmpg025rz44-ha.
ERROR: Failed to create replication slot. Check the error message from psql. Fix the issue and try again.
The instance log shows the problem regarding the replication slot.
2021-09-23 13:27:35.966 UTC [9379] LOG: connection authorized: user=pg000130 database=postgres application_name=psql
2021-09-23 13:27:35.967 UTC [9379] ERROR: replication slot name "slot_vmpg025rz44-ha" contains invalid character
2021-09-23 13:27:35.967 UTC [9379] HINT: Replication slot names may only contain lower case letters, numbers, and the underscore character.
2021-09-23 13:27:35.967 UTC [9379] STATEMENT: SELECT * FROM pg_create_physical_replication_slot('slot_vmpg025rz44-ha')
Please remove dashes automatically from replication slot names.
extend input parameters to accept also "y" and "n" in addition to "yes" and "no"
e.g. remove cluster function
pgoperate --remove-cluster -a pgtest
Cluster pgtest will be deleted. Cluster base directory including $PGDATA will be removed. Continue? [yes/no]: yes
add a feature to optionally configure a VIP(virtual IP) for the standby configuration
VIP should be started automatically on the respective Master node/instance
I have a new server and the directory db/repconf_<cluster_name> does not exist. In this case the following error message appears. However, the standby setup succeeds but it is not so nice.
vmpg025rz44ha:/home/pg000130 [pg000130]$ pgoperate --standbymgr --add-standby --target vmpg156rz43ha
mkdir: cannot create directory โ/home/pg000130/tvdtoolbox/pgoperate/db/repconf_pg000130โ: No such file or directory
ls: cannot access /home/pg000130/tvdtoolbox/pgoperate/db/repconf_pg000130: No such file or directory
Standby created. Checking status ...
Replication status is 'streaming'. Successful.
Synchronizing config with all standbys.
Once the same standby is created a second time the error is not there anymore because the directory is there.
The argument --sync_config
of standbymgr.sh
is wrong (typo - instead of _) and must be named --sync-config
in $PGOPERATE_BASE/bin/pgoperated
.
Therefore, corresponding error messages can occur in syslog related to the wrongly spelled command.
I observed that after a switchover with pgoperate --standbymgr --switchover
that the obsolete files in $PGDATA/pg_wal are not deleted anymore and we can potentially run into a space issue.
The main issue is that the replication slot directory on the new standby is not removed and therefore the wal files are kept.
We have a fresh deployment with a primary and a standby where the primary has one replication slot.
### Primary
vmpg025rz44:/home/pg000247 [pg000247]$ echo $TVD_PGIS_STANDBY
NO
vmpg025rz44:/home/pg000247 [pg000247]$ ll $PGDATA/pg_replslot
total 4
drwx------. 3 pg000247 postgres 22 Nov 17 15:43 .
drwx------. 19 pg000247 postgres 4096 Nov 17 15:46 ..
drwx------. 2 pg000247 postgres 19 Nov 17 15:47 slot_zhh
### Standby
vmpg155rz43:/home/pg000247 [pg000247]$ echo $TVD_PGIS_STANDBY
YES
vmpg155rz43:/home/pg000247 [pg000247]$ ll $PGDATA/pg_replslot
total 4
drwx------. 2 pg000247 postgres 6 Nov 17 15:43 .
drwx------. 19 pg000247 postgres 4096 Nov 17 15:46 ..
Then we do a switchover.
pgoperate --standbymgr --switchover --target zhh
Stopping master.
Getting master last checkpoint location and next XID.
waiting for server to promote.... done
server promoted
Replication slot slot_zhh created on vmpg155rz43.
Replication slot slot_olt created on vmpg155rz43.
Starting cluster pg000247.
WAL receiver in streaming mode.
Reinstation complete.
Executing sync on master zhh (vmpg155rz43ha)
Afterwards, the directory for the replication slot remains on the new standby. This prevents the standby from deleting the obsolete wal files!
### Standby
vmpg025rz44:/home/pg000247 [pg000247]$ echo $TVD_PGIS_STANDBY
YES
vmpg025rz44:/home/pg000247 [pg000247]$ ll $PGDATA/pg_replslot
total 4
drwx------. 3 pg000247 postgres 22 Nov 17 17:01 .
drwx------. 19 pg000247 postgres 4096 Nov 17 17:07 ..
drwx------. 2 pg000247 postgres 19 Nov 17 17:06 slot_zhh
The new primary gots the new slot (slot_olt) as expected and kept the old one (slot_zhh). Here is the question if it is smart to keep the "old" replication slot. Not sure whether it should be deleted or not. We should discuss that.
### Primary
vmpg155rz43:/home/pg000247 [pg000247]$ echo $TVD_PGIS_STANDBY
NO
vmpg155rz43:/home/pg000247 [pg000247]$ ll $PGDATA/pg_replslot
total 4
drwx------. 4 pg000247 postgres 38 Nov 17 17:07 .
drwx------. 19 pg000247 postgres 4096 Nov 17 17:07 ..
drwx------. 2 pg000247 postgres 19 Nov 17 17:07 slot_olt
drwx------. 2 pg000247 postgres 19 Nov 17 17:06 slot_zhh
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
slot_zhh | | physical | | | f | f | | | | | | | | f
slot_olt | | physical | | | f | t | 15914 | | | 0/170011E8 | | reserved | | f
(2 rows)
Lets have a look at the wal logs.
### Primary
vmpg155rz43:/home/pg000247 [pg000247]$ ll $PGDATA/pg_wal/
total 131096
drwx------. 3 pg000247 postgres 4096 Nov 17 17:10 .
drwx------. 19 pg000247 postgres 4096 Nov 17 17:07 ..
-rw-------. 1 pg000247 postgres 41 Nov 17 15:48 00000002.history
-rw-------. 1 pg000247 postgres 83 Nov 17 15:54 00000003.history
-rw-------. 1 pg000247 postgres 126 Nov 17 17:06 00000004.history
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001A
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001B
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001C
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001D
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001E
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001F
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000020
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:06 000000040000000000000021
drwx------. 2 pg000247 postgres 4096 Nov 17 17:10 archive_status
### Standby
ll $PGDATA/pg_wal/
total 163868
drwx------. 3 pg000247 postgres 4096 Nov 17 17:10 .
drwx------. 19 pg000247 postgres 4096 Nov 17 17:07 ..
-rw-------. 1 pg000247 postgres 371 Nov 17 15:48 000000010000000000000008.00000028.backup
-rw-------. 1 pg000247 postgres 41 Nov 17 15:48 00000002.history
-rw-------. 1 pg000247 postgres 83 Nov 17 15:54 00000003.history
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:06 000000030000000000000016
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:07 000000030000000000000017
-rw-------. 1 pg000247 postgres 126 Nov 17 17:07 00000004.history
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000017
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000018
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000019
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001A
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001B
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001C
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001D
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001E
drwx------. 2 pg000247 postgres 4096 Nov 17 17:10 archive_status
As we see the standby has wal logs which are actually not required anymore like:
000000030000000000000016
000000030000000000000017
...
000000040000000000000019
I did a manual checkpoint on both sites but the logs remain.
Manual solution/workaround
I removed the directory on the standby site, restarted it and executed a manual checkpoint on both sites.
### Standby
ls -l $PGDATA/pg_replslot
rm -rf $PGDATA/pg_replslot/slot_zhh
pgoperate --stop
pgoperate --start
psql -c "checkpoint;"
### Primary
psql -c "checkpoint;"
Finally, we can see that the obsolete logs have been deleted on the standby site as well.
### Primary
ll $PGDATA/pg_wal/
total 131092
drwx------. 3 pg000247 postgres 4096 Nov 17 17:14 .
drwx------. 19 pg000247 postgres 4096 Nov 17 17:07 ..
-rw-------. 1 pg000247 postgres 41 Nov 17 15:48 00000002.history
-rw-------. 1 pg000247 postgres 83 Nov 17 15:54 00000003.history
-rw-------. 1 pg000247 postgres 126 Nov 17 17:06 00000004.history
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:14 00000004000000000000001E
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001F
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000020
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:06 000000040000000000000021
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000022
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000023
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000024
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000025
drwx------. 2 pg000247 postgres 93 Nov 17 17:14 archive_status
### Standby
ll $PGDATA/pg_wal/
total 163868
drwx------. 3 pg000247 postgres 4096 Nov 17 17:14 .
drwx------. 19 pg000247 postgres 4096 Nov 17 17:13 ..
-rw-------. 1 pg000247 postgres 371 Nov 17 15:48 000000010000000000000008.00000028.backup
-rw-------. 1 pg000247 postgres 41 Nov 17 15:48 00000002.history
-rw-------. 1 pg000247 postgres 83 Nov 17 15:54 00000003.history
-rw-------. 1 pg000247 postgres 126 Nov 17 17:07 00000004.history
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:14 00000004000000000000001E
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:06 00000004000000000000001F
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:07 000000040000000000000020
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000021
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000022
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000023
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000024
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000025
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000026
-rw-------. 1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000027
drwx------. 2 pg000247 postgres 4096 Nov 17 17:14 archive_status
add FAQ for pgoperate
with common questions , issues and so on regarding the configuration of pgoperate
Add a new parameter to the parameter file (template: parameters_mycls.conf.tpl) to define the WAL size used by initdb --wal-segsize=16.
Parameter name could be: PG_WAL_SEGSIZE.
Hello.
If the log_line_prefix parameter is defined in the cluster template file as a parameter to be set, this has no effect on the generated postgresql.conf file.
PG_DEF_PARAMS="
...
log_line_prefix = '%t [%p-%l]: user:%u,db:%d,app:%a,client:%h '
...
"
Results in postgresql.conf:
log_line_prefix = '%m [%p] ' # special values:
At first moment I thought of a similar problem as in Issue#27, however even simple assignments (e.g. log_line_prefix = 'easy ') are not taken over.
Best regards,
Marian
pgoperate adds/changes the parameter restore_command within various sub-commands. It uses currently the following value which cannot be changed:
restore_command = 'cp /u01/app/postgres/pg000235/backup/*/wal/%f %p || cp /u01/app/postgres/pg000235/arch/%f %p'
In case the pgoperate --backup/--restore is in use this works fine. However, if we use a third party backup tool and disable the pgoperate sub-commands --backup/--restore in the parameters file (DISABLE_BACKUP_SCRIPTS=yes) this static value will most likely not work anymore.
Therefore, the restore_command must become configurable in pgoperate. I propose to add a parameter in the parameter file (parameters_mycls.conf.tpl) called RESTORE_COMMAND. The default remains as it is now with the static value. But for example with pgbackrest we must set this parameter to:
restore_command = 'pgbackrest --stanza=pg000235 archive-get %f "%p"'
A quick grep showed that the following sub-commands are affected.
grep restore_command pgoperate/bin/* | awk '{ print $1 }' | awk -F : '{ print $1 }' | uniq
pgoperate/bin/create_slave.sh
pgoperate/bin/restore.sh
pgoperate/bin/standbymgr.sh
However, the restore.sh must probably not be changed because this sub-command cannot be used if DISABLE_BACKUP_SCRIPTS=yes.
pgoperate --restore
Current cluster: pg000235
INFO: Backup/recovery disabled by DISABLE_BACKUP_SCRIPTS variable.
move pid file path from $PGOPERATE_BASE/bin to $PGOPERATE_BASE/run
During the switchover an psql error message is shown on the STDOUT but the operation itself succeeds.
vmpg156rz43ha:/home/pg000131 [pg000131]$ pgoperate --standbymgr --switchover
Stopping master.
Getting master last checkpoint location and next XID.
waiting for server to promote.... done
server promoted
Replication slot slot_vmpg025rz44ha created on vmpg156rz43ha.
psql: error: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.49157"?
Starting cluster pg000131.
WAL receiver in streaming mode.
Reinstation complete.
Executing sync on master vmpg156rz43ha
vmpg156rz43ha:/home/pg000131 [pg000131]$
The following command seems to end up in /dev/null. No reload is executed at all and no log entry is visible.
pgoperate --reload
If I execute a reload with the PostgreSQL command itself I see it in the log.
SELECT pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)
# Log entry
2021-05-27 09:04:27.275 CEST [28806] LOG: received SIGHUP, reloading configuration files
Currently, the --standbymgr subcommand uses the $HOSTNAME for the communication between master and standby. This must be configurable. Reason:
I have a host with three network interfaces (eth0/1/2). The hostname refers to eth0 but I want to communicate between master and standby via eth2. This cannot be configured properly at the moment, except changing the hostname.
would be great to have a quite clean postgresql.conf
remove all not needed entries from postgresql.conf and keep only
the needed parameters and settings (e.g have only 70 parameters in the file instead of 600)
keep the original file as postgressql.orig for reference
If the parameter file of an instance contains comments within the parameter PG_DEF_PARAMS the operation --create-cluster fails respectively the cluster cannot be started.
Example:
vi ~/tvdtoolbox/pgoperate/etc/parameters_pg000400.conf
PG_DEF_PARAMS="
<...>
# Some comment
#log_statement='ddl'
<...>
"
Create cluster fails:
pgoperate --create-cluster -a pg000400.conf
<...>
waiting for server to start.... stopped waiting
pg_ctl: could not start server
Examine the log output.
Reason: The following line can be found at the end of postgresql.conf.
= # Added by PgOperate
If I set the standby configuration to sync or async with the option --bidirectional it works only when I execute the command on the master.
pgoperate --standbymgr --set-sync --target vmpg155rz43ha --bidirectional
In case the command is executed on the standby the parameter synchronous_standby_names is only set on the master side but not on the standby. Means after a switchover operation the log transport is not sync anymore.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.