trivadis / pgoperate Goto Github PK

View Code? Open in Web Editor NEW

9.0 7.0 3.0 499 KB

pgOperate - PostgreSQL Operation Tool

Home Page: https://github.com/Trivadis/pgoperate

License: Apache License 2.0

Shell 96.30% Smarty 3.70%

postgresql enterprisedb postgresql-tool

pgoperate's People

Contributors

Stargazers

Watchers

Forkers

cyrmue redneckdba moench-tegeder

pgoperate's Issues

"pgoperate --standbymgr --check" command does not work from standby host

The command "pgoperate --standbymgr --check" works only from the current primary host. If I start it on the standby host it ends with an error even if ssh with keys works (see below). I think the tool wrongly mixes up the hostnames internally...

Execution on current primary host (works):

vmpg153rz43:/home/pg000235 [pg000235]$ pgoperate --standbymgr --status
Node_number Uniq_name  Host                 Role                 Mode State WAL_receiver Apply_lag_MB Transfer_lag_MB Transfer_lag_Min
1           olt        vmpg031rz44ha        STANDBY             async UP    streaming             0               0                0
2           zhh        vmpg153rz43ha        MASTER                    UP
vmpg153rz43:/home/pg000235 [pg000235]$ pgoperate --standbymgr --check --host vmpg031rz44ha --master-host vmpg153rz43ha
Success.

Execution from current standby does not work:

vmpg031rz44:/home/pg000235 [pg000235]$ pgoperate --standbymgr --status
Node_number Uniq_name  Host                 Role                 Mode State WAL_receiver Apply_lag_MB Transfer_lag_MB Transfer_lag_Min
1           olt        vmpg031rz44ha        STANDBY             async UP    streaming             0               0                0
2           zhh        vmpg153rz43ha        MASTER                    UP
vmpg031rz44:/home/pg000235 [pg000235]$ pgoperate --standbymgr --check --host vmpg153rz43ha --master-host vmpg153rz43ha
Check failed.

SSH works from standby to primary and vice versa.
Primary -> Standby:

vmpg153rz43:/home/pg000235 [pg000235]$ ssh vmpg031rz44ha 'hostname'
vmpg031rz44

Standby -> Primary:

vmpg031rz44:/home/pg000235 [pg000235]$ ssh vmpg153rz43ha 'hostname'
vmpg153rz43

Add exclude regexp filters for pgoperate check PG_CHECK_LOGFILES

Currently the pgoperate check module allows to include specific thresholds with the following parameter:

PG_CHECK_LOGFILES_THRESHOLD="ERROR|FATAL|PANIC"

However, it would be helpful to have an additional parameter to exclude specific patterns based on a regexp. Example:
PG_CHECK_LOGFILES_EXCLUDE_PATTERN="<regexp>"

This means the check searches first for the given thresholds and applies then the exclude pattern. Everything that remains afterwards will be treaten as issue.

Add a hint in the documentation regarding pgoperated-postgres daemon and SELinux

Could you please add a hint in the documentation that pgoperated-postgres daemon might not start with SELinux enabled
My server is a RHEL 8.3 and uses SELinux mode "Enforcing". After installing pgoperate and executing root.sh script the pgoperated-postgres daemon was not able to start. I found out that it works after disabling SELinux.
However, this is not the best option and therefore I changed the context for the bin directories of pgbasenv and pgoperate.
Please add this information to the README.md.

# install semanage utility
yum install policycoreutils-python-utils

# Set and restore the context
semanage fcontext -a -t bin_t "/var/lib/pgsql/tvdtoolbox/pgbasenv/bin(/.*)?"
semanage fcontext -a -t bin_t "/var/lib/pgsql/tvdtoolbox/pgoperate/bin(/.*)?"

restorecon -R -v /var/lib/pgsql/tvdtoolbox/pgbasenv/bin

# Start of the daemon works now.
systemctl start pgoperated-postgres.service

# The new context is shown with -Z option of ls command
ls -lZ /var/lib/pgsql/tvdtoolbox/pgbasenv/bin

Change PIDFile path in systemd service file when upgrading from v4.2 to a newer version

In version v4.2 the path for the file pgoperate-deamon.pid has changed from bin to run folder ($PGOPERATE_BASE/run). As a consequence the PIDFile path in /etc/systemd/system/pgoperated-.service file must be changed as well in case an upgrade from version v4.2 to a newer version is executed.

Otherwise, the pgoperated deamon cannot find the pid file and will restart the postgresql instance every 300 seconds as defined in the standard service file, added by root.sh.

PIDFile=/home/pg600047/tvdtoolbox/pgoperate/run/pgoperate-deamon.pid
Restart=on-failure
TimeoutSec=300

Add severity WARNING to pgoperate --check module

pgoperate --check distinguishes currently between severities ok and critical. Additionally, it should support the severity warning.

Defining archive_command for pgBackRest within PG_DEF_PARAMS lead to wrong postgresql.conf entry at cluster creation

Hi,

If in a cluster template in the PG_DEF_PARAMS section the archive_command parameter is set with the syntax of pgBackRest, the generated postgresql.conf is incorrect.

e.g. the template lines

PG_DEF_PARAMS="
....
archive_command = 'pgbackrest --stanza=pg15-db538c01 archive-push %p'
...
"

results in the postgresql.conf line
archive_command = 'pgbackrest stanza # Modified by PgOperate

So it seems, that special characters like minus char are "parsed away" and everything after the equal sign is also "gone with the wind".

Is there a way to fix this?

Best regards,
Marian

Installer should not overwrite custom_check.lib during an upgrade

The pgoperate installer install_pgoperate.sh must not overwrite the file lib/custom_check.lib during an upgrade since the user has there its custom check functions.
Therefore the installer must check if this file already exists and avoid overwriting it during an upgrade.

--add-standby does not work with hostnames with a dash (-) - replication slot name cannot contain a dash

In case we try to prepare the master it fails if the hostname contains a dash, e.g. vmpg025rz44-ha.

pgoperate --standbymgr --prepare-master
Failed to create replication slot slot_vmpg025rz44-ha.
ERROR: Failed to create replication slot. Check the error message from psql. Fix the issue and try again.

The instance log shows the problem regarding the replication slot.

2021-09-23 13:27:35.966 UTC [9379] LOG: connection authorized: user=pg000130 database=postgres application_name=psql
2021-09-23 13:27:35.967 UTC [9379] ERROR: replication slot name "slot_vmpg025rz44-ha" contains invalid character
2021-09-23 13:27:35.967 UTC [9379] HINT: Replication slot names may only contain lower case letters, numbers, and the underscore character.
2021-09-23 13:27:35.967 UTC [9379] STATEMENT: SELECT * FROM pg_create_physical_replication_slot('slot_vmpg025rz44-ha')

Please remove dashes automatically from replication slot names.

extend input parameters

extend input parameters to accept also "y" and "n" in addition to "yes" and "no"

e.g. remove cluster function

 pgoperate --remove-cluster -a pgtest
Cluster pgtest will be deleted. Cluster base directory including $PGDATA will be removed. Continue? [yes/no]: yes

add possibility to configure VIP

add a feature to optionally configure a VIP(virtual IP) for the standby configuration

VIP should be started automatically on the respective Master node/instance

mkdir and ls error on stdout while adding a standby

I have a new server and the directory db/repconf_<cluster_name> does not exist. In this case the following error message appears. However, the standby setup succeeds but it is not so nice.

vmpg025rz44ha:/home/pg000130 [pg000130]$ pgoperate --standbymgr --add-standby --target vmpg156rz43ha
mkdir: cannot create directory ‘/home/pg000130/tvdtoolbox/pgoperate/db/repconf_pg000130’: No such file or directory
ls: cannot access /home/pg000130/tvdtoolbox/pgoperate/db/repconf_pg000130: No such file or directory
Standby created. Checking status ...
Replication status is 'streaming'. Successful.
Synchronizing config with all standbys.

Once the same standby is created a second time the error is not there anymore because the directory is there.

Wrong argument --sync_config for standbymgr.sh call in pgoperated deamon script

The argument --sync_config of standbymgr.sh is wrong (typo - instead of _) and must be named --sync-config in $PGOPERATE_BASE/bin/pgoperated.
Therefore, corresponding error messages can occur in syslog related to the wrongly spelled command.

pg_wal is not cleaned up on standby site - improve the replication slot handling during switchover and reinstate

I observed that after a switchover with pgoperate --standbymgr --switchover that the obsolete files in $PGDATA/pg_wal are not deleted anymore and we can potentially run into a space issue.
The main issue is that the replication slot directory on the new standby is not removed and therefore the wal files are kept.

We have a fresh deployment with a primary and a standby where the primary has one replication slot.

### Primary
vmpg025rz44:/home/pg000247 [pg000247]$ echo $TVD_PGIS_STANDBY
NO
vmpg025rz44:/home/pg000247 [pg000247]$ ll $PGDATA/pg_replslot
total 4
drwx------.  3 pg000247 postgres   22 Nov 17 15:43 .
drwx------. 19 pg000247 postgres 4096 Nov 17 15:46 ..
drwx------.  2 pg000247 postgres   19 Nov 17 15:47 slot_zhh

### Standby
vmpg155rz43:/home/pg000247 [pg000247]$ echo $TVD_PGIS_STANDBY
YES
vmpg155rz43:/home/pg000247 [pg000247]$ ll $PGDATA/pg_replslot
total 4
drwx------.  2 pg000247 postgres    6 Nov 17 15:43 .
drwx------. 19 pg000247 postgres 4096 Nov 17 15:46 ..

Then we do a switchover.

pgoperate --standbymgr --switchover --target zhh
Stopping master.
Getting master last checkpoint location and next XID.
waiting for server to promote.... done
server promoted
Replication slot slot_zhh created on vmpg155rz43.
Replication slot slot_olt created on vmpg155rz43.
Starting cluster pg000247.
WAL receiver in streaming mode.
Reinstation complete.
Executing sync on master zhh (vmpg155rz43ha)

Afterwards, the directory for the replication slot remains on the new standby. This prevents the standby from deleting the obsolete wal files!

### Standby
vmpg025rz44:/home/pg000247 [pg000247]$ echo $TVD_PGIS_STANDBY
YES
vmpg025rz44:/home/pg000247 [pg000247]$ ll $PGDATA/pg_replslot
total 4
drwx------.  3 pg000247 postgres   22 Nov 17 17:01 .
drwx------. 19 pg000247 postgres 4096 Nov 17 17:07 ..
drwx------.  2 pg000247 postgres   19 Nov 17 17:06 slot_zhh

The new primary gots the new slot (slot_olt) as expected and kept the old one (slot_zhh). Here is the question if it is smart to keep the "old" replication slot. Not sure whether it should be deleted or not. We should discuss that.

### Primary
vmpg155rz43:/home/pg000247 [pg000247]$ echo $TVD_PGIS_STANDBY
NO
vmpg155rz43:/home/pg000247 [pg000247]$ ll $PGDATA/pg_replslot
total 4
drwx------.  4 pg000247 postgres   38 Nov 17 17:07 .
drwx------. 19 pg000247 postgres 4096 Nov 17 17:07 ..
drwx------.  2 pg000247 postgres   19 Nov 17 17:07 slot_olt
drwx------.  2 pg000247 postgres   19 Nov 17 17:06 slot_zhh

postgres=# select * from pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 slot_zhh  |        | physical  |        |          | f         | f      |            |      |              |             |                     |            |               | f
 slot_olt  |        | physical  |        |          | f         | t      |      15914 |      |              | 0/170011E8  |                     | reserved   |               | f
(2 rows)

Lets have a look at the wal logs.

### Primary
vmpg155rz43:/home/pg000247 [pg000247]$ ll $PGDATA/pg_wal/
total 131096
drwx------.  3 pg000247 postgres     4096 Nov 17 17:10 .
drwx------. 19 pg000247 postgres     4096 Nov 17 17:07 ..
-rw-------.  1 pg000247 postgres       41 Nov 17 15:48 00000002.history
-rw-------.  1 pg000247 postgres       83 Nov 17 15:54 00000003.history
-rw-------.  1 pg000247 postgres      126 Nov 17 17:06 00000004.history
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001A
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001B
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001C
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001D
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001E
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001F
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000020
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:06 000000040000000000000021
drwx------.  2 pg000247 postgres     4096 Nov 17 17:10 archive_status


### Standby
ll $PGDATA/pg_wal/
total 163868
drwx------.  3 pg000247 postgres     4096 Nov 17 17:10 .
drwx------. 19 pg000247 postgres     4096 Nov 17 17:07 ..
-rw-------.  1 pg000247 postgres      371 Nov 17 15:48 000000010000000000000008.00000028.backup
-rw-------.  1 pg000247 postgres       41 Nov 17 15:48 00000002.history
-rw-------.  1 pg000247 postgres       83 Nov 17 15:54 00000003.history
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:06 000000030000000000000016
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:07 000000030000000000000017
-rw-------.  1 pg000247 postgres      126 Nov 17 17:07 00000004.history
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000017
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000018
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000019
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001A
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001B
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001C
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001D
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001E
drwx------.  2 pg000247 postgres     4096 Nov 17 17:10 archive_status

As we see the standby has wal logs which are actually not required anymore like:
000000030000000000000016
000000030000000000000017
...
000000040000000000000019
I did a manual checkpoint on both sites but the logs remain.

Manual solution/workaround
I removed the directory on the standby site, restarted it and executed a manual checkpoint on both sites.

### Standby
ls -l $PGDATA/pg_replslot
rm -rf $PGDATA/pg_replslot/slot_zhh

pgoperate --stop
pgoperate --start

psql -c "checkpoint;"


### Primary
psql -c "checkpoint;"

Finally, we can see that the obsolete logs have been deleted on the standby site as well.

### Primary
ll $PGDATA/pg_wal/
total 131092
drwx------.  3 pg000247 postgres     4096 Nov 17 17:14 .
drwx------. 19 pg000247 postgres     4096 Nov 17 17:07 ..
-rw-------.  1 pg000247 postgres       41 Nov 17 15:48 00000002.history
-rw-------.  1 pg000247 postgres       83 Nov 17 15:54 00000003.history
-rw-------.  1 pg000247 postgres      126 Nov 17 17:06 00000004.history
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:14 00000004000000000000001E
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 00000004000000000000001F
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000020
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:06 000000040000000000000021
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000022
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000023
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000024
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000025
drwx------.  2 pg000247 postgres       93 Nov 17 17:14 archive_status


### Standby
ll $PGDATA/pg_wal/
total 163868
drwx------.  3 pg000247 postgres     4096 Nov 17 17:14 .
drwx------. 19 pg000247 postgres     4096 Nov 17 17:13 ..
-rw-------.  1 pg000247 postgres      371 Nov 17 15:48 000000010000000000000008.00000028.backup
-rw-------.  1 pg000247 postgres       41 Nov 17 15:48 00000002.history
-rw-------.  1 pg000247 postgres       83 Nov 17 15:54 00000003.history
-rw-------.  1 pg000247 postgres      126 Nov 17 17:07 00000004.history
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:14 00000004000000000000001E
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:06 00000004000000000000001F
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:07 000000040000000000000020
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000021
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000022
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000023
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000024
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000025
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000026
-rw-------.  1 pg000247 postgres 16777216 Nov 17 17:10 000000040000000000000027
drwx------.  2 pg000247 postgres     4096 Nov 17 17:14 archive_status

add FAQ

add FAQ for pgoperate
with common questions , issues and so on regarding the configuration of pgoperate

Add parameter for --wal-segsize of initdb to parameters_*.conf File

Add a new parameter to the parameter file (template: parameters_mycls.conf.tpl) to define the WAL size used by initdb --wal-segsize=16.
Parameter name could be: PG_WAL_SEGSIZE.

Defining log_line_prefix within PG_DEF_PARAMS has no effect

Hello.

If the log_line_prefix parameter is defined in the cluster template file as a parameter to be set, this has no effect on the generated postgresql.conf file.

PG_DEF_PARAMS="
...
log_line_prefix = '%t [%p-%l]: user:%u,db:%d,app:%a,client:%h '
...
"
Results in postgresql.conf:
log_line_prefix = '%m [%p] ' # special values:

At first moment I thought of a similar problem as in Issue#27, however even simple assignments (e.g. log_line_prefix = 'easy ') are not taken over.

Best regards,
Marian

Support custom restore_command parameter with --standbymgr

pgoperate adds/changes the parameter restore_command within various sub-commands. It uses currently the following value which cannot be changed:
restore_command = 'cp /u01/app/postgres/pg000235/backup/*/wal/%f %p || cp /u01/app/postgres/pg000235/arch/%f %p'

In case the pgoperate --backup/--restore is in use this works fine. However, if we use a third party backup tool and disable the pgoperate sub-commands --backup/--restore in the parameters file (DISABLE_BACKUP_SCRIPTS=yes) this static value will most likely not work anymore.
Therefore, the restore_command must become configurable in pgoperate. I propose to add a parameter in the parameter file (parameters_mycls.conf.tpl) called RESTORE_COMMAND. The default remains as it is now with the static value. But for example with pgbackrest we must set this parameter to:
restore_command = 'pgbackrest --stanza=pg000235 archive-get %f "%p"'

A quick grep showed that the following sub-commands are affected.
grep restore_command pgoperate/bin/* | awk '{ print $1 }' | awk -F : '{ print $1 }' | uniq
pgoperate/bin/create_slave.sh
pgoperate/bin/restore.sh
pgoperate/bin/standbymgr.sh

However, the restore.sh must probably not be changed because this sub-command cannot be used if DISABLE_BACKUP_SCRIPTS=yes.

pgoperate --restore

Current cluster: pg000235
INFO: Backup/recovery disabled by DISABLE_BACKUP_SCRIPTS variable.

change/adapt pid file path

move pid file path from $PGOPERATE_BASE/bin to $PGOPERATE_BASE/run

Error message during switchover

During the switchover an psql error message is shown on the STDOUT but the operation itself succeeds.

vmpg156rz43ha:/home/pg000131 [pg000131]$ pgoperate --standbymgr --switchover
Stopping master.
Getting master last checkpoint location and next XID.
waiting for server to promote.... done
server promoted
Replication slot slot_vmpg025rz44ha created on vmpg156rz43ha.
psql: error: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.49157"?
Starting cluster pg000131.
WAL receiver in streaming mode.
Reinstation complete.
Executing sync on master vmpg156rz43ha
vmpg156rz43ha:/home/pg000131 [pg000131]$

pgoperate --reload does not execute the reload

The following command seems to end up in /dev/null. No reload is executed at all and no log entry is visible.

pgoperate --reload

If I execute a reload with the PostgreSQL command itself I see it in the log.

SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)


# Log entry
2021-05-27 09:04:27.275 CEST [28806] LOG:  received SIGHUP, reloading configuration files

Dedicated configuration parameter to define the hostname for --standbymgr communication between master/standby

Currently, the --standbymgr subcommand uses the $HOSTNAME for the communication between master and standby. This must be configurable. Reason:
I have a host with three network interfaces (eth0/1/2). The hostname refers to eth0 but I want to communicate between master and standby via eth2. This cannot be configured properly at the moment, except changing the hostname.

provide clean postgresql.conf

would be great to have a quite clean postgresql.conf
remove all not needed entries from postgresql.conf and keep only
the needed parameters and settings (e.g have only 70 parameters in the file instead of 600)
keep the original file as postgressql.orig for reference

Comments within PG_DEF_PARAMS lead to wrong postgresql.conf entry at cluster creation

If the parameter file of an instance contains comments within the parameter PG_DEF_PARAMS the operation --create-cluster fails respectively the cluster cannot be started.
Example:

vi ~/tvdtoolbox/pgoperate/etc/parameters_pg000400.conf

PG_DEF_PARAMS="
<...>
# Some comment
#log_statement='ddl'
<...>
"

Create cluster fails:

pgoperate --create-cluster -a pg000400.conf
<...>
waiting for server to start.... stopped waiting
pg_ctl: could not start server
Examine the log output.

Reason: The following line can be found at the end of postgresql.conf.
= # Added by PgOperate

The option --bidirectional for --set-sync/--set-async works only from master

If I set the standby configuration to sync or async with the option --bidirectional it works only when I execute the command on the master.
pgoperate --standbymgr --set-sync --target vmpg155rz43ha --bidirectional

In case the command is executed on the standby the parameter synchronous_standby_names is only set on the master side but not on the standby. Means after a switchover operation the log transport is not sync anymore.