

pg_prefaulter

Introduction

pg_prefaulter is used to mitigate the effects of PostgreSQL replication lag.

In a primary-follower database cluster using PostgreSQL streaming replication, the primary database instance enjoys the benefit of being able to parallelize its random IO workload across many backend processes. The follower is not so fortunate: it receives a sequential stream of WAL records and schedules all of its pread(2) IOs in order, in a single process, with blocking IO. If the average latency of a random pread(2) is ~8ms, the follower can complete only ~125 random pread(2) calls per second.

The pg_prefaulter reads the stream of WAL records using pg_xlogdump(1) and schedules concurrent IOs to prefault PostgreSQL heap pages into the follower’s filesystem cache. The effect is dramatic: the follower applies WAL records against pages that are already filesystem cache hits.

pg_prefaulter is also smart enough to pull the WAL filename from the PostgreSQL process arguments, which allows it to prefault pages during crash recovery and thereby accelerate PostgreSQL’s startup. If you’ve ever sat twiddling your thumbs waiting for the database to start up, pg_prefaulter should improve the startup time.

If you’ve seen replication lag due to pg_repack, VACUUM, or a write-heavy workload, the pg_prefaulter should reduce apply lag considerably. See the Background section for additional details.

Usage

$ go get github.com/joyent/pg_prefaulter                // (1)
$ pg_prefaulter run                                     // (2)
$ pg_prefaulter -l DEBUG run                            // (3)
$ pg_prefaulter -l DEBUG run -D .pgdata_primary -p 5433 // (4)
$ pg_prefaulter -h                                      // (5)
$ pg_prefaulter run -h                                  // (6)
  1. Build and install pg_prefaulter into $GOPATH/bin (where GOPATH is likely go env GOPATH).

  2. Run pg_prefaulter in its default mode.

  3. Run pg_prefaulter with DEBUG-level logging.

  4. Run pg_prefaulter with various PG connect options.

  5. Show pg_prefaulter's global help options.

  6. Show pg_prefaulter's run-specific options.

Note
pg_prefaulter exposes itself as a gops(1) agent and also exposes expvar metrics.

expvarmon can be used to interactively monitor the state of pg_prefaulter:

$ expvarmon -ports 4242 -i 1s -vars "mem:memstats.Alloc,mem:memstats.Sys,mem:memstats.HeapAlloc,mem:memstats.HeapInuse,duration:memstats.PauseNs,duration:memstats.PauseTotalNs,str:sender-state,str:peer-sync-state,mem:durability-lag,mem:flush-lag,mem:visibility-lag,duration:visibility-lag_ms,str:db-state,str:connected,str:timeline-id,num-wal-files,num-concurrent-reads,num-concurrent-wals"

Configuration Files

The following is an example configuration file that can be used in place of CLI arguments. The default configuration filename is pg_prefaulter.toml; the example below is a reasonable starting point.

[log]
# level can be set to "DEBUG", "INFO", "WARN", "ERROR", or "FATAL"
#level = "INFO"

[postgresql]
#pgdata = "pgdata"
#database = "postgres"
#host = "/tmp"
#password = ""
#port = 5432
#user = "postgres"

[postgresql.xlog]
#pg_xlogdump-path = "/usr/local/bin/pg_xlogdump"

[circonus]
#enabled = true

[circonus.api]
# token is a required value if circonus.enabled=true
#token = ""

[circonus.broker]
# id should be set to "35" (the public HTTPTrap broker) if you have enterprise
# brokers configured in your account.
#id = 35

[circonus.check]
#tags = "dc1:mydc"

A complete configuration file with all options listed can be seen in pg_prefaulter.toml.sample-all.

CLI Help

Global Help

$ pg_prefaulter --help

PostgreSQL's WAL-receiver applies WAL files in serial.  This design implicitly
assumes that the heap page required to apply the WAL entry is within the
operating system's filesystem cache.  If the filesystem cache does not contain
the necessary heap page, the PostgreSQL WAL apply process will block while
the OS faults in the page from its storage.  For large working sets of data or
when the filesystem cache is cold, this is problematic for streaming replicas
because they will lag and fall behind.

pg_prefaulter(1) mitigates this serially scheduled IO problem by
reading WAL entries via pg_xlogdump(1) and performing parallel pread(2) calls in
order to "pre-fault" the page into the OS's filesystem cache so that when the
PostgreSQL WAL receiver goes to apply a WAL entry to its heap, the page is
already loaded into the OS's filesystem cache.

Usage:
  pg_prefaulter [command]

Available Commands:
  help        Help about any command
  run         Run pg_prefaulter
  version     pg_prefaulter version information

Flags:
  -a, --circonus-api-key string                         Circonus API token
      --circonus-api-url string                         Circonus API URL (default "https://api.circonus.com/v2")
      --circonus-broker-id string                       Circonus Broker ID
      --circonus-broker-max-response-time string        Circonus Broker Max Response Time (default "500ms")
      --circonus-broker-select-tag string               Circonus Broker Select Tag
      --circonus-check-display-name string              Circonus Check Display Name (default "pg_prefaulter")
      --circonus-check-force-metric-activation string   Circonus Check Force Metric Activation (default "false")
      --circonus-check-id string                        Circonus Check ID
      --circonus-check-instance-id string               Circonus Check Instance ID (default "tenacity.local:pg_prefaulter")
      --circonus-check-max-url-age string               Circonus Check Max URL Age (default "5m")
      --circonus-check-search-tag string                Circonus Check Search Tag (default "app:pg_prefaulter,host:tenacity.local")
      --circonus-check-secret string                    Circonus Check Secret
      --circonus-check-tags string                      Circonus Check Tags (default "app:pg_prefaulter")
      --circonus-check-target-host string               Circonus Check Target Host (default "tenacity.local")
      --circonus-debug                                  Enable Circonus Debug
      --circonus-enable-metrics                         Enable Circonus metrics (default true)
      --circonus-submission-url string                  Circonus Check Submission URL
      --config string                                   config file (default "pg_prefaulter.toml")
  -d, --database string                                 postgres (default "postgres")
      --enable-agent                                    Enable the gops(1) agent interface (default true)
      --enable-pprof                                    Enable the pprof endpoint interface (default true)
  -h, --help                                            help for pg_prefaulter
  -H, --hostname string                                 Hostname to connect to PostgreSQL (default "/tmp")
  -F, --log-format string                               Specify the log format ("auto", "zerolog", "human", or "bunyan") (default "auto")
  -l, --log-level string                                Log level (default "INFO")
  -D, --pgdata string                                   Path to PGDATA (default "pgdata")
  -p, --port uint                                       Hostname to connect to PostgreSQL (default 5432)
      --pprof-port uint16                               Specify the pprof port (default 4242)
  -C, --use-color                                       Use ASCII colors (default true)
  -U, --username string                                 Username to connect to PostgreSQL (default "postgres")

Use "pg_prefaulter [command] --help" for more information about a command.

Run Help

$ pg_prefaulter run --help
Run pg_prefaulter and begin faulting in PostgreSQL pages

Usage:
  pg_prefaulter run [flags]

Flags:
  -h, --help                         help for run
  -m, --mode string                  Mode of operation of the database: "auto", "primary", "follower" (default "auto")
  -N, --num-io-threads uint          Number of IO threads to spawn for IOs
  -i, --poll-interval string         Interval to poll the database for state change (default "1s")
      --retry-db-init                Retry connecting to the database during initialization
  -n, --wal-readahead-bytes string   Maximum number of bytes to pre-fault (default "32MiB")
  -X, --xlog-mode string             pg_xlogdump(1) variant: "xlog" or "pg" (default "pg")
  -x, --xlogdump-bin string          Path to pg_xlogdump(1) (default "/usr/local/bin/pg_xlogdump")

Global Flags:
  -a, --circonus-api-key string                         Circonus API token
      --circonus-api-url string                         Circonus API URL (default "https://api.circonus.com/v2")
      --circonus-broker-id string                       Circonus Broker ID
      --circonus-broker-max-response-time string        Circonus Broker Max Response Time (default "500ms")
      --circonus-broker-select-tag string               Circonus Broker Select Tag
      --circonus-check-display-name string              Circonus Check Display Name (default "pg_prefaulter")
      --circonus-check-force-metric-activation string   Circonus Check Force Metric Activation (default "false")
      --circonus-check-id string                        Circonus Check ID
      --circonus-check-instance-id string               Circonus Check Instance ID (default "tenacity.local:pg_prefaulter")
      --circonus-check-max-url-age string               Circonus Check Max URL Age (default "5m")
      --circonus-check-search-tag string                Circonus Check Search Tag (default "app:pg_prefaulter,host:tenacity.local")
      --circonus-check-secret string                    Circonus Check Secret
      --circonus-check-tags string                      Circonus Check Tags (default "app:pg_prefaulter")
      --circonus-check-target-host string               Circonus Check Target Host (default "tenacity.local")
      --circonus-debug                                  Enable Circonus Debug
      --circonus-enable-metrics                         Enable Circonus metrics (default true)
      --circonus-submission-url string                  Circonus Check Submission URL
      --config string                                   config file (default "pg_prefaulter.toml")
  -d, --database string                                 postgres (default "postgres")
      --enable-agent                                    Enable the gops(1) agent interface (default true)
      --enable-pprof                                    Enable the pprof endpoint interface (default true)
  -H, --hostname string                                 Hostname to connect to PostgreSQL (default "/tmp")
  -F, --log-format string                               Specify the log format ("auto", "zerolog", "human", or "bunyan") (default "auto")
  -l, --log-level string                                Log level (default "INFO")
  -D, --pgdata string                                   Path to PGDATA (default "pgdata")
  -p, --port uint                                       Hostname to connect to PostgreSQL (default 5432)
      --pprof-port uint16                               Specify the pprof port (default 4242)
  -C, --use-color                                       Use ASCII colors (default true)
  -U, --username string                                 Username to connect to PostgreSQL (default "postgres")

Development

  1. make freshdb-primary

  2. make freshdb-follower

  3. make resetdb

  4. make build

  5. make test

  6. ./pg_prefaulter … # Iterate

To cross-compile and build a release, use goreleaser and the make release or make release-snapshot targets. For development, the following should be sufficient:

  1. go get -u github.com/goreleaser/goreleaser

  2. goreleaser --snapshot --skip-validate --skip-publish --rm-dist

$ make
pg_prefaulter make(1) targets:
build            Build pg_prefaulter binary
check            Run go test
fmt              fmt and simplify the code
release-snapshot  Build a snapshot release
release          Build a release
vendor-status    Display the vendor/ status
vet              vet the binary (excluding dependencies)

cleandb-primary  Clean primary database
freshdb-primary  Drops and recreates the primary database
initdb-primary   initdb(1) a primary database
psql-primary     Open a psql(1) shell to the primary
startdb-primary  Start the primary database

cleandb-follower  Clean follower database
freshdb-follower  Drops and recreates the follower database
initdb-follower  initdb(1) a follower database
psql-follower    Open a psql(1) shell to the follower
startdb-follower  Start the follower database

createdb         Create the test database
dropdb           Reset the test database
gendata          Generate data in the primary
resetdb          Drop and recreate the database
testdb           Run database tests

controldata      Display pg_controldata(1) of the primary
psql             Open a psql(1) shell to the primary

clean            Clean target
cleandb-shard    Clean entire shard

help             This help message

Background

PostgreSQL forks a separate process to handle each connection to the database. As worker processes handle queries, they modify database files and commit transactions. Committing a transaction flushes an entry to the write-ahead log (WAL).

PostgreSQL’s streaming replication is based on "log shipping": the WAL is streamed to remote hosts. Database replicas apply the instructions recorded in the WAL using the PostgreSQL crash-recovery mechanism: they read the log from beginning to end, find the changes to the underlying data files (the "PostgreSQL heap," or "heap" for short), and make the corresponding changes to the database files.

The replica’s WAL-apply operation is performed in a single process using a single thread. Each time a WAL entry is consumed, the replica reads in the corresponding page of the underlying data file, and the WAL-replay process waits for that serial disk IO to complete. Unlike the primary, the follower cannot spread this work across multiple processes. As a consequence, the replica performs only single-process, single-threaded, blocking IO and cannot apply the WAL as quickly as the primary, which generates the WAL files while using parallel IO. To add insult to injury, the follower is prone to filesystem cache misses, each resulting in a physical disk IO that further slows the apply process.

Fortunately, the WAL itself gives the replica perfect information about which IO operations will be required (for example, "write 50 bytes to relation 16439 at block number 2357235235"). However, for transaction-commit correctness, the WAL must be replayed in order. The pg_prefaulter is a helper application for the replica: it queries the replica’s PostgreSQL process for its position in the replay, then uses the information encoded in the WAL files themselves to initiate asynchronous IO operations (with a configurable degree of concurrency), giving the operating system a chance to load each file into the filesystem cache before the PostgreSQL startup process (e.g. postgres: startup process recovering 00000001000002870000005E) fetches the page from the filesystem. Once the relevant portions of the database files are resident in the filesystem cache, the replica can access them without a long wait: even though it remains a single process with a single thread, access to random pages returns quickly because each pread(2) is a filesystem cache hit (versus the pessimistic path, where the page misses both PostgreSQL’s buffer cache and the filesystem cache, and the apply process blocks on an extended pread(2)).

Before

Replication apply lag before and during recovery, once pg_prefaulter was deployed. (Figure: Apply Lag recovery)

After

Replication apply lag observed under the same workload while pg_prefaulter remains running. (Figure: Apply Lag steady-state)

pg_prefaulter's People

Contributors

gwydirsam, sean-


pg_prefaulter's Issues

Any updates?

Hey guys,

I've been wondering why this project appears to have been abandoned. I found it pretty interesting, but I was eager to know the reason for this inactivity. Is it just a loss of interest from the contributors, or has it somehow proven harmful to use?

I would be glad to hear from you.
Thanks.

Commits for pg10+ change functional behaviour in a possibly unintended manner.

In this commit, a translation layer is added to enable pg_prefaulter to work with postgres 10+.

However, the old SQL query

SELECT timeline_id, redo_location, pg_last_xlog_replay_location() FROM pg_control_checkpoint()

is translated as

SELECT timeline_id, redo_lsn, pg_last_wal_receive_lsn() FROM pg_control_checkpoint()

This seems to have the effect of causing the code to attempt to prefault files just ahead of the most-recently-received WAL. The old behaviour appears to have been to prefault just ahead of the most-recently-replayed WAL. I am unsure whether this change in functionality was intended, but the old behaviour does seem more logical to me.

To revert to the old behaviour, change pg_last_wal_receive_lsn() to pg_last_wal_replay_lsn().

logs should be in the bunyan format

Though at least some log messages are presently formatted as linefeed-delimited JSON records, the content of the messages is not quite right. The log records must contain at least the Bunyan core fields in order for the bunyan tools to parse them correctly.

Pre-fault in the entire WAL file for the pg_last_xlog_receive_location() segment...

pg_prefaulter should pre-fault in the entire WAL segment referenced by pg_last_xlog_receive_location() and the necessary segments as specified by the read ahead.

It is conceivable that the WAL receiver could be blocked on a RMW cycle when using a page size that is different than PostgreSQL's 8K pages. Warming the filesystem cache in advance of the receiver writing a WAL record would prevent the synchronous read(2) call from blocking replication when synchronous_commit=remote_write is being used. This problem will likely only manifest itself after PostgreSQL has begun recycling its WAL files.

Is this project still being maintained?

I am considering using this tool to alleviate lag on slaves due to heavy primary activity, but I see many issues created 3 years ago with no responses and the last update was about a year and a half ago.

Can I get a response, please, to let me know about the future of this project?

Getting error while starting pg_prefaulter.

Getting error while starting pg_prefaulter.
Error: unable to start agent: unable to create a stats agent: invalid check manager configuration
I tried to disable Circonus but still no luck.

Any pointers would be greatly appreciated.
Thanks
J

program handles SIGTERM but does not shut down

The prefaulter was running under SMF. I disabled the service with the usual svcadm disable pg_prefaulter, but the usual stop method timed out. It appears that the software does not correctly shut down on SIGTERM. Log entries from the failure, including SMF log messages:

[ Sep 13 03:16:25 Stopping because service disabled. ]
[ Sep 13 03:16:25 Executing stop method (:kill). ]
{"time":"2017-09-13T03:16:25.340818386Z","level":"info","signal":"terminated","message":"Received signal"}
{"time":"2017-09-13T03:16:25.341335305Z","level":"debug","message":"Shutting down"}
{"time":"2017-09-13T03:16:26.167822291Z","level":"error","error":"unable to execute primary check: dial unix /tmp/.s.PGSQL.5432: connect: connection refused","message":"unable to determine if database is primary or not, retrying"}
[ Sep 13 03:17:25 Method or service exit timed out.  Killing contract 1830449. ]
[ Sep 13 03:17:26 Method or service exit timed out.  Killing contract 1830449. ]
[ Sep 13 03:17:27 Method or service exit timed out.  Killing contract 1830449. ]
[ Sep 13 03:17:28 Method or service exit timed out.  Killing contract 1830449. ]
[ Sep 13 03:17:29 Method or service exit timed out.  Killing contract 1830449. ]

Note that the failure to connect to the database is expected, as this was during zone shutdown. The manatee-sitter service (and thus PostgreSQL) were already offline at the time.

Stream WAL information to a sidecar pg_prefaulter process via a bgworker...

It would be clever to hook in to PostgreSQL as a background worker. There are actually two candidate paths for this enhancement in the future:

  1. stream the WAL information over an RPC socket to a sidecar pg_prefaulter process
  2. outright convert pg_prefaulter into a background worker

The former has benefits because you could stop and upgrade the sidecar process without restarting PostgreSQL. The latter is more in line with the traditional monolithic C mindset. Given the low volume of RPCs, the former is probably preferred, but additional investigation is required before making a determination.

Inspiration taken from: https://github.com/prest/bgworker

close all files when PostgreSQL is not running

When PostgreSQL is not running, the prefaulter appears to retain a large number of open files within the PostgreSQL data directory. So that we can cleanly unmount and remount the data file system in Manatee, the prefaulter needs to close these descriptors promptly, and avoid opening any new ones, until it is determined that PostgreSQL is running again. It would also be good to make sure the current working directory of the prefaulter process never resides within the PostgreSQL data directory.

Add status command

Add a generic status command that shows:

  • how far behind each follower is, when connected to the primary
  • how far behind a given follower is, when connected to that follower
  • various cache stats

Add a SIGINFO handler...

Change the SIGINFO handler to dump stats (vs show the stack). The current behavior dumps the stack of the process, which is useful, but not necessarily desired.

Reduce ARC usage of the IO cache...

While writing this I defaulted to using string slices as the arguments, which was expedient but not necessarily ideal.

The current _IOCacheKey is too large:

type _IOCacheKey struct {
        Tablespace string
        Database   string
        Relation   string
        Block      string
}

_IOCacheKey.Tablespace string: 0-16 (size 16, align 8)
_IOCacheKey.Database string: 16-32 (size 16, align 8)
_IOCacheKey.Relation string: 32-48 (size 16, align 8)
_IOCacheKey.Block string: 48-64 (size 16, align 8)

Plus the size of the byte array backing each string. It could/should instead be:

// type'ed uint64's
type _IOCacheKey struct {
        Tablespace uint64
        Database   uint64
        Relation   uint64
        Block      uint64
}

_IOCacheKey.Tablespace uint64: 0-8 (size 8, align 8)
_IOCacheKey.Database uint64: 8-16 (size 8, align 8)
_IOCacheKey.Relation uint64: 16-24 (size 8, align 8)
_IOCacheKey.Block uint64: 24-32 (size 8, align 8)

This change will incur parsing overhead, but we have CPU to burn in the common case and are memory-limited, not CPU-limited. The current CPU overhead of pg_prefaulter running on a busy follower is between 1% and 2%, so we have CPU budget to spare.

Dump caches on signal...

On a running process, when it receives a signal (probably a SIGHUP or SIGUSR2), purge all of the caches.

panic: bad, open vs close count not the same after purge

During a recent incident, the prefaulter panicked on more than 200 PostgreSQL instances, with the following set of log messages:

19:01:00.598Z ERROR pg_prefaulter: unable to find WAL files (next step=retrying)
    error: unable to query PostgreSQL checkpoint information: write unix ->/tmp/.s.PGSQL.5432: write: broken pipe: (retriable: true, purge cache: true
19:01:01.599Z LVL70 pg_prefaulter: bad, open vs close count not the same after purge (close-count=769548809, open-count=981315695)
panic: bad, open vs close count not the same after purge

Core files for the failures are tagged with this ticket in thoth.

expose statistics through the Prometheus protocol

As part of the broader effort to provide statistics and monitoring for Manta components, we are moving towards exposing a Prometheus client in each software component. The rationale and some details about the plan for this are described in RFD 99.

If this was a Node program, I'd say we would use artedi. I'm not sure what the Prometheus client library landscape is like in the Go ecosystem, but I have to imagine it's pretty reasonable as Prometheus itself is written in Go!

The listen port for Prometheus connections should be configurable.

Scrape the WAL file from the process args when PostgreSQL is starting up...

When PostgreSQL is starting up, it's not possible to connect to PG to query the WAL file being replayed. Fortunately this information is available in the process title. Unfortunately, this is very OS-specific. Doubly unfortunate, this matters most in recovery situations when PG is busy applying WAL files, which means it's absolutely worthwhile to do something OS-specific and dirty (read: scraping information from the OS to figure out which WAL file to interrogate).

prefaulter must start up even if database is not available

It would appear that the prefaulter does not start up correctly if the database is not available initially. When run as an SMF service, it fails fast this way a few times and then the service goes into maintenance:

...
unable to start agent: unable to initialize db connection pool: unable to create a new DB connection pool: dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory
[ Sep 13 03:18:10 Stopping because all processes in service exited. ]
[ Sep 13 03:18:10 Executing stop method (:kill). ]
[ Sep 13 03:18:10 Executing start method ("/opt/pg_prefaulter/pg_prefaulter run --config=/opt/pg_prefaulter/pg_prefaulter.toml &"). ]
[ Sep 13 03:18:10 Method "start" exited with status 0. ]
{"time":"2017-09-13T03:18:10.84142588Z","level":"debug","config-file":"/opt/pg_prefaulter/pg_prefaulter.toml"}
{"time":"2017-09-13T03:18:10.841509168Z","level":"debug","message":"args: []"}
{"time":"2017-09-13T03:18:10.841778917Z","level":"debug","message":"starting gops(1) agent"}
{"time":"2017-09-13T03:18:10.841794234Z","level":"debug","postgresql.pgdata":"/manatee/pg/data","postgresql.host":"/tmp","postgresql.port":5432,"postgresql.user":"postgres","postgresql.xlog.mode":"pg","postgresql.xlog.pg_xlogdump-path":"/opt/postgresql/current/bin/pg_xlogdump","postgresql.poll-interval":1000,"message":"flags"}
{"time":"2017-09-13T03:18:10.842199269Z","level":"info","message":"Starting pg_prefaulter"}
{"time":"2017-09-13T03:18:10.842205727Z","level":"info","message":"Stopped pg_prefaulter"}
Error: unable to start agent: unable to initialize db connection pool: unable to create a new DB connection pool: dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory
Usage:
  pg_prefaulter run [flags]

Flags:
  -h, --help                   help for run
  -m, --mode string            Mode of operation of the database: "auto", "primary", "follower" (default "auto")
  -N, --num-io-threads uint    Number of IO threads to spawn for IOs
  -i, --poll-interval string   Interval to poll the database for state change (default "1s")
  -n, --wal-readahead uint     Number of WAL entries to perform read-ahead into (default 4)
  -X, --xlog-mode string       pg_xlogdump(1) variant: "xlog" or "pg" (default "pg")
  -x, --xlogdump-bin string    Path to pg_xlogdump(1) (default "/usr/local/bin/pg_xlogdump")

Global Flags:
  -a, --circonus-api-key string                         Circonus API token
      --circonus-api-url string                         Circonus API URL (default "https://api.circonus.com/v2")
      --circonus-broker-id string                       Circonus Broker ID
      --circonus-broker-max-response-time string        Circonus Broker Max Response Time (default "500ms")
      --circonus-broker-select-tag string               Circonus Broker Select Tag
      --circonus-check-display-name string              Circonus Check Display Name (default "pg_prefaulter")
      --circonus-check-force-metric-activation string   Circonus Check Force Metric Activation (default "false")
      --circonus-check-id string                        Circonus Check ID
      --circonus-check-instance-id string               Circonus Check Instance ID (default "4042552b-5fe0-45c4-9df4-f21eca88464d:pg_prefaulter")
      --circonus-check-max-url-age string               Circonus Check Max URL Age (default "5m")
      --circonus-check-search-tag string                Circonus Check Search Tag (default "app:pg_prefaulter,host:4042552b-5fe0-45c4-9df4-f21eca88464d")
      --circonus-check-secret string                    Circonus Check Secret
      --circonus-check-tags string                      Circonus Check Tags (default "app:pg_prefaulter")
      --circonus-check-target-host string               Circonus Check Target Host (default "4042552b-5fe0-45c4-9df4-f21eca88464d")
      --circonus-debug                                  Enable Circonus Debug
      --circonus-enable-metrics                         Enable Circonus metrics (default true)
      --circonus-submission-url string                  Circonus Check Submission URL
      --config string                                   config file (default "pg_prefaulter.toml")
  -d, --database string                                 postgres (default "postgres")
      --enable-agent                                    Enable the gops(1) agent interface (default true)
  -H, --hostname string                                 Hostname to connect to PostgreSQL (default "/tmp")
  -l, --log-level string                                Log level (default "INFO")
  -D, --pgdata string                                   Path to PGDATA (default "pgdata")
  -p, --port uint                                       Hostname to connect to PostgreSQL (default 5432)
  -U, --username string                                 Username to connect to PostgreSQL (default "postgres")

unable to start agent: unable to initialize db connection pool: unable to create a new DB connection pool: dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory
[ Sep 13 03:18:10 Stopping because all processes in service exited. ]
[ Sep 13 03:18:10 Executing stop method (:kill). ]
[ Sep 13 03:18:10 Restarting too quickly, changing state to maintenance. ]

Once the database is running, I can clear the maintenance status and the service appears to start correctly:

[ Sep 13 03:20:11 Leaving maintenance because clear requested. ]
[ Sep 13 03:20:11 Enabled. ]
[ Sep 13 03:20:11 Executing start method ("/opt/pg_prefaulter/pg_prefaulter run --config=/opt/pg_prefaulter/pg_prefaulter.toml &"). ]
[ Sep 13 03:20:11 Method "start" exited with status 0. ]
{"time":"2017-09-13T03:20:11.206137022Z","level":"debug","config-file":"/opt/pg_prefaulter/pg_prefaulter.toml"}
{"time":"2017-09-13T03:20:11.20630263Z","level":"debug","message":"args: []"}
{"time":"2017-09-13T03:20:11.20648666Z","level":"debug","message":"starting gops(1) agent"}
{"time":"2017-09-13T03:20:11.20670297Z","level":"debug","postgresql.pgdata":"/manatee/pg/data","postgresql.host":"/tmp","postgresql.port":5432,"postgresql.user":"postgres","postgresql.xlog.mode":"pg","postgresql.xlog.pg_xlogdump-path":"/opt/postgresql/current/bin/pg_xlogdump","postgresql.poll-interval":1000,"message":"flags"}
{"time":"2017-09-13T03:20:11.207273131Z","level":"info","message":"Starting pg_prefaulter"}
{"time":"2017-09-13T03:20:11.226491329Z","level":"debug","backend-pid":63146,"version":"PostgreSQL 9.6.4 on i386-pc-solaris2.11, compiled by gcc (GCC) 4.9.3, 64-bit","message":"established DB connection"}
{"time":"2017-09-13T03:20:11.226914389Z","level":"debug","rlimit-nofile":65536,"filehandle-cache-size":65486,"filehandle-cache-ttl":3600000,"message":"filehandle cache initialized"}
{"time":"2017-09-13T03:20:11.243983658Z","level":"info","io-worker-threads":1000,"message":"started IO worker threads"}
{"time":"2017-09-13T03:20:11.244136082Z","level":"debug","message":"Starting wait"}
{"time":"2017-09-13T03:20:11.244148015Z","level":"debug","wal-worker-thread-id":2,"message":"starting WAL worker thread"}
{"time":"2017-09-13T03:20:11.244151999Z","level":"debug","wal-worker-thread-id":1,"message":"starting WAL worker thread"}
{"time":"2017-09-13T03:20:11.244147788Z","level":"debug","wal-worker-thread-id":3,"message":"starting WAL worker thread"}
{"time":"2017-09-13T03:20:11.244208579Z","level":"debug","wal-worker-thread-id":0,"message":"starting WAL worker thread"}
{"time":"2017-09-13T03:20:11.244208705Z","level":"debug","message":"Starting pg_prefaulter agent"}
{"time":"2017-09-13T03:21:11.227143486Z","level":"debug","hit":23176,"miss":646,"lookup":23822,"hit-rate":0.9728822097221056,"message":"filehandle-stats"}
{"time":"2017-09-13T03:21:11.24432601Z","level":"debug","hit":1,"miss":2,"lookup":3,"hit-rate":0.3333333333333333,"message":"walcache-stats"}
{"time":"2017-09-13T03:21:11.244413444Z","level":"debug","hit":8178,"miss":41757,"lookup":49935,"hit-rate":0.16377290477620907,"message":"iocache-stats"}
{"time":"2017-09-13T03:22:11.228855455Z","level":"debug","hit":113623,"miss":734,"lookup":114357,"hit-rate":0.9935815035371687,"message":"filehandle-stats"}
{"time":"2017-09-13T03:22:11.244911682Z","level":"debug","hit":3,"miss":4,"lookup":7,"hit-rate":0.42857142857142855,"message":"walcache-stats"}
{"time":"2017-09-13T03:22:11.244947642Z","level":"debug","hit":57646,"miss":191404,"lookup":249050,"hit-rate":0.23146356153382855,"message":"iocache-stats"}
{"time":"2017-09-13T03:23:11.229500586Z","level":"debug","hit":199725,"miss":754,"lookup":200479,"hit-rate":0.9962390075768535,"message":"filehandle-stats"}
{"time":"2017-09-13T03:23:11.245253655Z","level":"debug","hit":6,"miss":8,"lookup":14,"hit-rate":0.42857142857142855,"message":"walcache-stats"}
{"time":"2017-09-13T03:23:11.246306122Z","level":"debug","hit":128811,"miss":331991,"lookup":460802,"hit-rate":0.279536547150403,"message":"iocache-stats"}
{"time":"2017-09-13T03:24:11.230050485Z","level":"debug","hit":240793,"miss":759,"lookup":241552,"hit-rate":0.9968578194343247,"message":"filehandle-stats"}
{"time":"2017-09-13T03:24:11.245531156Z","level":"debug","hit":8,"miss":10,"lookup":18,"hit-rate":0.4444444444444444,"message":"walcache-stats"}
{"time":"2017-09-13T03:24:11.246573786Z","level":"debug","hit":165312,"miss":400260,"lookup":565572,"hit-rate":0.29229169760879253,"message":"iocache-stats"}
{"time":"2017-09-13T03:25:11.230626347Z","level":"debug","hit":282746,"miss":766,"lookup":283512,"hit-rate":0.9972981743277181,"message":"filehandle-stats"}
{"time":"2017-09-13T03:25:11.24583358Z","level":"debug","hit":10,"miss":12,"lookup":22,"hit-rate":0.45454545454545453,"message":"walcache-stats"}
{"time":"2017-09-13T03:25:11.246880287Z","level":"debug","hit":203032,"miss":468442,"lookup":671474,"hit-rate":0.3023676270414044,"message":"iocache-stats"}
{"time":"2017-09-13T03:26:11.231247847Z","level":"debug","hit":325980,"miss":773,"lookup":326753,"hit-rate":0.9976342986904482,"message":"filehandle-stats"}
{"time":"2017-09-13T03:26:11.2461834Z","level":"debug","hit":12,"miss":14,"lookup":26,"hit-rate":0.46153846153846156,"message":"walcache-stats"}
{"time":"2017-09-13T03:26:11.247218878Z","level":"debug","hit":241664,"miss":538432,"lookup":780096,"hit-rate":0.3097875133316925,"message":"iocache-stats"}
{"time":"2017-09-13T03:27:11.231782623Z","level":"debug","hit":367433,"miss":778,"lookup":368211,"hit-rate":0.9978870810486379,"message":"filehandle-stats"}
{"time":"2017-09-13T03:27:11.246419942Z","level":"debug","hit":14,"miss":16,"lookup":30,"hit-rate":0.4666666666666667,"message":"walcache-stats"}
{"time":"2017-09-13T03:27:11.247501626Z","level":"debug","hit":278505,"miss":607739,"lookup":886244,"hit-rate":0.3142531853530179,"message":"iocache-stats"}
{"time":"2017-09-13T03:28:11.232316668Z","level":"debug","hit":430939,"miss":780,"lookup":431719,"hit-rate":0.9981932692329964,"message":"filehandle-stats"}
{"time":"2017-09-13T03:28:11.246682366Z","level":"debug","hit":18,"miss":20,"lookup":38,"hit-rate":0.47368421052631576,"message":"walcache-stats"}
{"time":"2017-09-13T03:28:11.247703691Z","level":"debug","hit":356427,"miss":744053,"lookup":1100480,"hit-rate":0.32388321459726666,"message":"iocache-stats"}
...

prefaulter's child process hangs onto xlog files after database is shut down

We've found another case where the prefaulter implicitly hangs onto references to a database's filesystem after the database itself has shut down. This is a problem for Manatee because Manatee attempts to unmount and remount the filesystem as part of starting the database (itself a separate problem), but it can't unmount while the prefaulter is holding these references. This is the same class of problem as #13, but a different way it can happen.

In this case, there was only one process holding open a file on the database's dataset:

[root@fa0e1075 (postgres) ~]$ fuser -c /manatee/pg
/manatee/pg:     8115o

It's a pg_xlogdump process forked by the prefaulter:

[root@HA99RHND2 (eu-central-1b) ~]# ptree 8115
892   zsched
  975   /sbin/init
    7127  /opt/pg_prefaulter/pg_prefaulter run --config=/opt/pg_prefaulter/pg_prefau
      8115  /opt/postgresql/current/bin/pg_xlogdump -f /manatee/pg/data/pg_xlog/0000

It's not holding a file via its working directory:

[root@fa0e1075 (postgres) ~]$ pwdx 8115
8115:   /root

Rather, it seems to be one of its open fds:

[root@HA99RHND2 (eu-central-1b) ~]# pfiles 8115
8115:   /opt/postgresql/current/bin/pg_xlogdump -f /manatee/pg/data/pg_xlog/00
  Current rlimit: 65536 file descriptors
   0: S_IFCHR mode:0666 dev:561,8 ino:3001421048 uid:0 gid:3 rdev:38,2
      O_RDONLY|O_LARGEFILE
      /zones/fa0e1075-8c65-4e2c-a062-e23013e40247/root/dev/null
      offset:0
   1: S_IFIFO mode:0000 dev:558,0 ino:29756267 uid:0 gid:0 rdev:0,0
      O_RDWR
   2: S_IFIFO mode:0000 dev:558,0 ino:29756268 uid:0 gid:0 rdev:0,0
      O_RDWR
   3: S_IFREG mode:0600 dev:90,65658 ino:65565 uid:907 gid:0 size:16777216
      O_RDONLY|O_LARGEFILE
      offset:14827520

Unfortunately we don't have the path for fd 3, but it is almost certainly the WAL file that was given as an argument to the program:

[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# pargs core.8115 
core 'core.8115' of 8115:	/opt/postgresql/current/bin/pg_xlogdump -f /manatee/pg/data/pg_xlog/00000001000
argv[0]: /opt/postgresql/current/bin/pg_xlogdump
argv[1]: -f
argv[2]: /manatee/pg/data/pg_xlog/000000010000118600000084

The inode number reported by stat on that file matches the one that pfiles reported:

[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# stat /zones/fa0e1075-8c65-4e2c-a062-e23013e40247/root/manatee/pg/data/pg_xlog/000000010000118600000084
  File: `/zones/fa0e1075-8c65-4e2c-a062-e23013e40247/root/manatee/pg/data/pg_xlog/000000010000118600000084'
  Size: 16777216  	Blocks: 14213      IO Block: 8192   regular file
Device: 169007ah/23658618d	Inode: 65565       Links: 1
Access: (0600/-rw-------)  Uid: (  907/ UNKNOWN)   Gid: (    0/    root)
Access: 2017-10-13 16:34:10.812840776 +0000
Modify: 2017-10-13 16:34:44.373763334 +0000
Change: 2017-10-13 16:34:44.373763334 +0000
 Birth: -

The "-f" (follow) flag appears to block when it reaches the end of valid WAL data, and it does look like we're at the end:

[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# ls -l /zones/fa0e1075-8c65-4e2c-a062-e23013e40247/root/manatee/pg/data/pg_xlog/ | tail
-rw-------   1 907      root     16777216 Oct 13 16:33 00000001000011860000007D
-rw-------   1 907      root     16777216 Oct 13 16:33 00000001000011860000007E
-rw-------   1 907      root     16777216 Oct 13 16:33 00000001000011860000007F
-rw-------   1 907      root     16777216 Oct 13 16:33 000000010000118600000080
-rw-------   1 907      root     16777216 Oct 13 16:33 000000010000118600000081
-rw-------   1 907      root     16777216 Oct 13 16:33 000000010000118600000082
-rw-------   1 907      root     16777216 Oct 13 16:34 000000010000118600000083
-rw-------   1 907      root     16777216 Oct 13 16:34 000000010000118600000084
-rw-------   1 907      root     16777216 Oct 13 05:46 000000010000118600000085
drwx------   2 907      907        20002 Oct 13 16:35 archive_status

There's one more file, but it has a much older mtime; it looks like a recycled file, and it's not valid at all:

[root@fa0e1075 (postgres) /manatee/pg/data/pg_xlog]$ pg_xlogdump 000000010000118600000085 
pg_xlogdump: FATAL:  could not find a valid record after 1186/85000000

and xxd shows that file "84" ends in all zeros, which suggests the rest of it contains no valid records:

[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# xxd -a 000000010000118600000084 | tail
...
0e22000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0fffff0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

Offset e22000 is 14819328, just 8K behind the fd 3 offset in the pfiles output. And pg_xlogdump is sleeping in a loop in main():

[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# pstack 8115
8115:	/opt/postgresql/current/bin/pg_xlogdump -f /manatee/pg/data/pg_xlog/00
 ffffbf7fff25532a pollsys  (ffffbf7fffdfd6a0, 0, ffffbf7fffdfd790, 0)
 ffffbf7fff1ea27b pselect (0, 0, 0, 0, ffffbf7fffdfd790, 0) + 26b
 ffffbf7fff1ea62a select (0, 0, 0, 0, ffffbf7fffdfd7f0) + 5a
 00000000004102cb pg_usleep () + 92
 000000000040853c main () + a5e
 000000000040603c _start () + 6c
[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# truss -p 8115
pollsys(0xFFFFBF7FFFDFD6A0, 0, 0xFFFFBF7FFFDFD790, 0x00000000) = 0
lseek(3, 0, SEEK_SET)				= 0
read(3, "93D007\001\0\0\0\0\0\084".., 8192)	= 8192
lseek(3, 14811136, SEEK_SET)			= 14811136
read(3, "93D005\001\0\0\0\0\0E284".., 8192)	= 8192
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 8192)	= 8192

which is exactly what pg_xlogdump's main() does when it reaches the end of the WAL file (pg_xlogdump.c, around line 1007):

for (;;)
{
        /* try to read the next record */
        record = XLogReadRecord(xlogreader_state, first_record, &errormsg);
        if (!record)
        {
                if (!config.follow || private.endptr_reached)
                        break;
                else
                {
                        pg_usleep(1000000L);    /* 1 second */
                        continue;
                }
        }

On timing, this process started at 16:34:11:

[root@HA99RHND2 (eu-central-1b) /var/tmp/dap]# ps -opid,stime,args  -p 8115
  PID    STIME COMMAND
 8115 16:34:11 /opt/postgresql/current/bin/pg_xlogdump -f /manatee/pg/data/pg_xlog/00000001000

and was still running until we killed it well after 16:58Z. The manatee-sitter log shows that it was at 16:40 that we brought down postgres, and that we started it again shortly after that.

I have saved:

  • core file, pfiles output, and the raw binary from pid 8115 (pg_xlogdump)
  • core file, pfiles output, and the raw binary from pid 7127 (pg_prefaulter)
  • the manatee-sitter log showing a timeline of events leading up to this.
  • the WAL file itself, 000000010000118600000084

I will tar all of these up and upload them to "/thoth/stor/tickets/pg_prefaulter#40" in the JPC Manta.

may miss some pages depending on xlog-mode

There are two regular expressions used to parse the output of either pg_xlogdump or https://github.com/snaga/xlogdump:

  • https://github.com/joyent/pg_prefaulter/blob/master/agent/walcache/cache.go#L50
  • https://github.com/joyent/pg_prefaulter/blob/master/agent/walcache/cache.go#L56

When operating in xlog-mode (i.e. using https://github.com/snaga/xlogdump), the output format differs per operation, and the regular expression may not match all operations. For example:

  • HEAP_INSERT: https://github.com/snaga/xlogdump/blob/master/xlogdump_rmgr.c#L726
  • HEAP_DELETE: https://github.com/snaga/xlogdump/blob/master/xlogdump_rmgr.c#L769

pg_xlogdump-mode is the default, and that mode does not appear to have this problem.
