crunchydata / crunchy-watch
A PostgreSQL Automated Failover Container
Home Page: http://www.crunchydata.com
License: Apache License 2.0
In testing watch, we need a means of preventing Postgres from starting up. This is a different test from just killing the primary pod to cause the failover.
In this scenario, watch will detect that it can't reach Postgres even if Kube restarts the pod over and over.
The Makefile does not have a build path for the docker.so module for this container.
Therefore, attempting to use CRUNCHY_WATCH_PLATFORM=docker results in a segfault.
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02 | plugin.Open(/opt/cpm/bin/crunchy-watch/plugins/docker.so): realpath failed
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02 | panic: runtime error: invalid memory address or nil pointer dereference
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02 | [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1571ebe]
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02 |
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02 | goroutine 1 [running]:
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02 | main.loadPlatformModule(0x7ffc0778ae27, 0x6, 0xc4200dbe08, 0x1)
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02 | /home/dwagneradm/buildenv/cdev/src/github.com/crunchydata/crunchy-watch/util.go:42 +0x1de
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02 | main.main()
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02 | /home/dwagneradm/buildenv/cdev/src/github.com/crunchydata/crunchy-watch/main.go:229 +0x316
There is a logic error in the main script that causes a panic if KUBE_PROJECT or OSE_PROJECT isn't set.
Starting crunchy watch
Started...
env var debug value is []
panic: runtime error: index out of range
goroutine 1 [running]:
main.main()
/opt/cdev/src/github.com/crunchydata/crunchy-watch/main.go:218 +0x1237
Correct me if I'm wrong, but os.Args always returns at least one value. This should be < 2.
The Dockerfile run script should also enforce this variable being set.
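The length check suggested above could look roughly like the following. This is only a sketch: it assumes the run script passes the project name as the first command-line argument, and `projectArg` is a hypothetical helper, not a function from main.go.

```go
package main

import (
	"errors"
	"fmt"
)

// projectArg returns the first argument after the program name.
// Hypothetical helper: the real main.go may read the project differently.
func projectArg(args []string) (string, error) {
	// args[0] is always the program name, so a missing argument
	// means len(args) < 2, not len(args) < 1.
	if len(args) < 2 || args[1] == "" {
		return "", errors.New("project not set: export KUBE_PROJECT or OSE_PROJECT")
	}
	return args[1], nil
}

func main() {
	// Sample argument lists standing in for os.Args.
	samples := [][]string{
		{"crunchy-watch"},         // variable unset: no argument passed
		{"crunchy-watch", "demo"}, // variable set
	}
	for _, args := range samples {
		project, err := projectArg(args)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("watching project:", project)
	}
}
```

Returning an error instead of indexing blindly turns the "index out of range" panic into a readable message that the caller can log before exiting.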
Hi, thanks for the very useful containers. Unfortunately, I have a problem with deleting the primary pod when crunchy-watch tries to replace it. The first step is to delete the old primary pod, and this step fails with: Error from server: pods "name=postgresql-primary" not found.
I found code responsible for that:
crunchy-watch/plugins/openshift/main.go
Line 93 in 703256e
When I tested it from the CLI, oc delete pod "name=postgres-primary" didn't work either.
The remedy for this issue is to add the -l flag, which allows pod deletion by label, e.g., oc delete pod -l "name=postgres-primary".
Instead of just Pods, support killing a Primary that is part of a Deployment.
After client-go is implemented, make sure the oc and kubectl binaries are removed.
Instead of running watch in a simple pod, run it within a Kube Deployment in all examples.
Line 322 in 242720c
Setting an env var named DEBUG should cause log.Debug messages to be produced in the container log.
The connection constructors are currently hardcoded to not use SSL:
Lines 266 to 282 in c0e8a18
Please consider changing this to allow or require SSL.
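A minimal sketch of making the SSL mode configurable follows. It builds a libpq-style keyword/value connection string and validates the mode against the `sslmode` values documented for libpq. The `PG_SSLMODE` variable and the `connString` helper are hypothetical, and individual Go drivers may accept only a subset of these modes.

```go
package main

import (
	"fmt"
	"os"
)

// sslModes lists the sslmode values defined by libpq.
var sslModes = map[string]bool{
	"disable": true, "allow": true, "prefer": true,
	"require": true, "verify-ca": true, "verify-full": true,
}

// connString builds a keyword/value connection string.
// An empty sslmode falls back to "disable", matching the hardcoded behavior
// described above. Values are assumed to contain no spaces or quotes.
func connString(host string, port int, user, dbname, sslmode string) (string, error) {
	if sslmode == "" {
		sslmode = "disable"
	}
	if !sslModes[sslmode] {
		return "", fmt.Errorf("invalid sslmode %q", sslmode)
	}
	return fmt.Sprintf("host=%s port=%d user=%s dbname=%s sslmode=%s",
		host, port, user, dbname, sslmode), nil
}

func main() {
	// PG_SSLMODE is a made-up variable name used only for this sketch.
	dsn, err := connString("crunchy-primary", 5432, "postgres", "postgres", os.Getenv("PG_SSLMODE"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(dsn)
}
```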
Thanks!
Below are the logs from when the primary is deleted:
Starting crunchy watch
Started...
INFO[2018-10-22T15:44:54Z] Loading Platform Module: kube
INFO[2018-10-22T15:44:54Z] Waiting for signal...
INFO[2018-10-22T15:44:54Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:44:54Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:04Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:04Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:14Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:14Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:24Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:24Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:34Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:34Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:44Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:44Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:54Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:54Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:46:04Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:46:04Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:46:14Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:46:14Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:46:24Z] Health Checking: 'crunchy-primary'
ERRO[2018-10-22T15:46:34Z] dial tcp 10.105.143.178:5432: i/o timeout
ERRO[2018-10-22T15:46:34Z] Could not reach 'crunchy-primary' (Attempt: 1)
INFO[2018-10-22T15:46:34Z] Executing pre-hook: /hooks/watch-pre-hook
INFO[2018-10-22T15:46:34Z] Processing Failover: Strategy - latest
INFO[2018-10-22T15:46:34Z] Deleting existing primary...
ERRO[2018-10-22T15:46:34Z] pods is forbidden: User "system:serviceaccount:default:pg-watcher" cannot deletecollection pods in the namespace "default"
ERRO[2018-10-22T15:46:34Z] An error occurred while deleting the old primary
INFO[2018-10-22T15:46:34Z] Deleted old primary
INFO[2018-10-22T15:46:34Z] Choosing failover replica...
ERRO[2018-10-22T15:46:34Z] Error getting pods command
INFO[2018-10-22T15:46:34Z] Chose failover target ()
INFO[2018-10-22T15:46:34Z] Promoting failover replica...
ERRO[2018-10-22T15:46:34Z] An error occurred while promoting the failover replica
ERRO[2018-10-22T15:46:34Z] Failover process failed: could not get pod info: resource name may not be empty
INFO[2018-10-22T15:46:34Z] Executing post-hook: /hooks/watch-post-hook
INFO[2018-10-22T15:46:44Z] Health Checking: 'crunchy-primary'
ERRO[2018-10-22T15:46:54Z] dial tcp 10.105.143.178:5432: i/o timeout
I am using Kubernetes:
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Currently, if you do not specify a valid namespace, or do not set a namespace env var at all, watch will default to using the 'default' namespace, which is not what we want.
Instead, watch should check that the NAMESPACE is set; if it is not, it should write an error message to the log and abort.
func (h failoverHandler) SetFlags(f *flag.FlagSet) {
flags.String(f, KubeNamespace, "default")
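The requested check could be sketched as below. The variable name `NAMESPACE` is taken from the wording of this issue and may not match the variable the plugin actually reads; `requireNamespace` is a hypothetical helper, and verifying that the namespace exists in the cluster is left out.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// requireNamespace returns the configured namespace, or an error when the
// variable is missing or blank. It never falls back to "default".
// Hypothetical helper; the lookup function is injected for testability.
func requireNamespace(lookup func(string) (string, bool)) (string, error) {
	ns, ok := lookup("NAMESPACE")
	ns = strings.TrimSpace(ns)
	if !ok || ns == "" {
		return "", errors.New("NAMESPACE is not set; refusing to fall back to the default namespace")
	}
	return ns, nil
}

func main() {
	ns, err := requireNamespace(os.LookupEnv)
	if err != nil {
		// A real entry point would log this and exit with a non-zero status.
		fmt.Println("error:", err)
		return
	}
	fmt.Println("using namespace:", ns)
}
```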
The copyrights need fixing to match the correct format:
a file created in 2016 would state "2016-2018"
a file created in 2017 would state "2017-2018"
a file created in 2018 would state "2018"
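The rule above reduces to a small function of the creation year and the current year. The sketch below uses a hypothetical helper name, `copyrightYears`, and is not part of the repository.

```go
package main

import "fmt"

// copyrightYears formats the year span for a copyright header:
// a single year when the file was created in the current year,
// otherwise "created-current".
func copyrightYears(created, current int) string {
	if created >= current {
		return fmt.Sprint(current)
	}
	return fmt.Sprintf("%d-%d", created, current)
}

func main() {
	for _, created := range []int{2016, 2017, 2018} {
		fmt.Printf("created %d: Copyright %s\n", created, copyrightYears(created, 2018))
	}
}
```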
pause and inFailOver are both only true when actively failing over, so can't we remove pause and just use inFailOver to signal that condition?
Evaluate the OCP privileges and determine the minimal requirements; where possible, de-escalate so that a cluster role is not required to run watch.
Using primary-replica, create the pr-primary, pr-replica and pr-replica-2 pods.
Run watch to monitor and switch the master in case of failure.
Kill pr-primary. The watcher identifies the master failure and promotes pr-replica to master.
After this we can insert/delete database entries (working fine, as expected).
Now kill pr-replica (labelled as pr-primary after the original pr-primary was killed).
The watcher does not initiate a failover.
Watcher logs for the 1st failover (successful) and the 2nd failover (does not fail over):
INFO[2018-08-06T10:25:14Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:25:44Z] Health Checking: 'pr-primary'
ERRO[2018-08-06T10:25:54Z] dial tcp 10.96.29.55:5432: i/o timeout
ERRO[2018-08-06T10:25:54Z] Could not reach 'pr-primary' (Attempt: 1)
INFO[2018-08-06T10:25:54Z] Executing pre-hook: /hooks/watch-pre-hook
INFO[2018-08-06T10:25:54Z] Processing Failover: Strategy - latest
INFO[2018-08-06T10:25:54Z] Deleting existing primary...
INFO[2018-08-06T10:25:54Z] Deleted old primary
INFO[2018-08-06T10:25:54Z] Choosing failover replica...
INFO[2018-08-06T10:25:54Z] Chose failover target (pr-replica)
INFO[2018-08-06T10:25:54Z] Promoting failover replica...
DEBU[2018-08-06T10:25:54Z] executing cmd: [/opt/cpm/bin/promote.sh] on pod pr-replica in namespace default container: postgres
INFO[2018-08-06T10:25:54Z] Relabeling failover replica...
DEBU[2018-08-06T10:25:54Z] label: name
DEBU[2018-08-06T10:25:54Z] label: replicatype
INFO[2018-08-06T10:25:54Z] Executing post-hook: /hooks/watch-post-hook
INFO[2018-08-06T10:26:24Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:26:24Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:26:54Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:26:54Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:27:24Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:27:24Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:27:54Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:27:54Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:28:24Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:28:24Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:28:54Z] Health Checking: 'pr-primary'
ERRO[2018-08-06T10:29:04Z] dial tcp 10.96.29.55:5432: i/o timeout
ERRO[2018-08-06T10:29:04Z] Could not reach 'pr-primary' (Attempt: 1)
INFO[2018-08-06T10:29:34Z] Health Checking: 'pr-primary'
ERRO[2018-08-06T10:29:44Z] dial tcp 10.96.29.55:5432: i/o timeout
ERRO[2018-08-06T10:29:44Z] Could not reach 'pr-primary' (Attempt: 1)
INFO[2018-08-06T10:30:14Z] Health Checking: 'pr-primary'
ERRO[2018-08-06T10:30:24Z] dial tcp 10.96.29.55:5432: i/o timeout
ERRO[2018-08-06T10:30:24Z] Could not reach 'pr-primary' (Attempt: 1)
Complete docker plugin support.
Have crunchy-watch watch a StatefulSet cluster and perform a failover.
This feature would let crunchy-watch support a manual failover, perhaps through a REST API. Another application, or an end user using curl for instance, might want to trigger a manual failover for scheduled maintenance or other reasons, and they need an API through which to invoke this function.
Currently, only in-cluster configurations are supported. It would be nice to have out-of-cluster support as well. This would help primarily with development testing.
Use the client-go API to replace the need to embed the kubectl and oc binaries.