The crunchy-watch from crunchydata

handle container restart scenario

When testing watch, we need a way to keep Postgres from starting up. This is a different test from simply killing the primary pod to trigger a failover.

In this scenario, watch will detect that it cannot reach Postgres even though Kube restarts the pod over and over.

Docker Module not Built - Cannot Use without Kube

The Makefile has no build target for the docker.so module for this container.

Therefore, attempting to use CRUNCHY_WATCH_PLATFORM=docker results in a segmentation fault.

pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | plugin.Open(/opt/cpm/bin/crunchy-watch/plugins/docker.so): realpath failed
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | panic: runtime error: invalid memory address or nil pointer dereference
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1571ebe]
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    |
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | goroutine 1 [running]:
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | main.loadPlatformModule(0x7ffc0778ae27, 0x6, 0xc4200dbe08, 0x1)
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | 	/home/dwagneradm/buildenv/cdev/src/github.com/crunchydata/crunchy-watch/util.go:42 +0x1de
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | main.main()
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | 	/home/dwagneradm/buildenv/cdev/src/github.com/crunchydata/crunchy-watch/main.go:229 +0x316
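
The trace above suggests that the result of plugin.Open is used even though the call failed. A minimal sketch of a loader that reports the error instead of dereferencing a nil plugin might look like the following. The two-argument signature and the error wording are assumptions for illustration and are not taken from util.go.

```go
package main

import (
	"fmt"
	"path/filepath"
	"plugin"
)

// loadPlatformModule opens <dir>/<platform>.so. On failure it returns an
// error, so the caller never receives a nil *plugin.Plugin to dereference.
// (Hypothetical signature; the real function in util.go may differ.)
func loadPlatformModule(dir, platform string) (*plugin.Plugin, error) {
	path := filepath.Join(dir, platform+".so")
	p, err := plugin.Open(path)
	if err != nil {
		return nil, fmt.Errorf("loading platform module %q: %w", platform, err)
	}
	return p, nil
}

func main() {
	// The directory matches the path in the log above; "docker" is the
	// platform whose module was never built.
	if _, err := loadPlatformModule("/opt/cpm/bin/crunchy-watch/plugins", "docker"); err != nil {
		// A real program would likely log this and exit non-zero.
		fmt.Println("error:", err)
		return
	}
	fmt.Println("platform module loaded")
}
```

With a check like this, a missing docker.so would produce a readable error message rather than a SIGSEGV.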

Panic if PROJECT variable not set

There is a logic error in the main script that causes a panic if KUBE_PROJECT or OSE_PROJECT isn't set.

Starting crunchy watch
Started...
env var debug value is []
panic: runtime error: index out of range

goroutine 1 [running]:
main.main()
	/opt/cdev/src/github.com/crunchydata/crunchy-watch/main.go:218 +0x1237

Correct me if I'm wrong, but os.Args always contains at least one value (the program name), so the length check should be < 2.

The Dockerfile run script should also enforce this variable being set.
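
As a sketch of the check being suggested, assuming the project name reaches the binary as its first command-line argument (the code at main.go:218 is not shown here, so that is a guess), the guard could look like this. projectArg is a hypothetical helper name.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// projectArg returns the first real argument. args[0] is always the program
// name, so at least two elements are needed before args[1] may be read.
// (Hypothetical helper; assumes the project is passed as the first argument.)
func projectArg(args []string) (string, error) {
	if len(args) < 2 || args[1] == "" {
		return "", errors.New("project not set: expected KUBE_PROJECT or OSE_PROJECT to be passed as the first argument")
	}
	return args[1], nil
}

func main() {
	project, err := projectArg(os.Args)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("project:", project)
}
```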

delete primary pod by label

Hi, thanks for the very useful containers. Unfortunately, I have a problem deleting the primary pod when crunchy-watch wants to replace it. The first step is to delete the old primary pod, and this step fails with: Error from server: pods "name=postgresql-primary" not found.

I found code responsible for that:

crunchy-watch/plugins/openshift/main.go

Line 93 in 703256e

fmt.Sprintf("name=%s", config.GetString("CRUNCHY_WATCH_PRIMARY")),

When I tested it from the CLI with oc delete pod "name=postgres-primary", it did not work either.
The remedy is to add the -l flag, which deletes pods by label, e.g., oc delete pod -l "name=postgres-primary".
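
In Go terms, the fix amounts to passing the label selector as its own argument after -l. A rough sketch of building that argument list is below. deletePodArgs is a hypothetical helper and not the code in plugins/openshift/main.go.

```go
package main

import (
	"fmt"
	"strings"
)

// deletePodArgs builds the argument list for deleting pods by label rather
// than by name. (Hypothetical helper for illustration only.)
func deletePodArgs(primary string) []string {
	return []string{"delete", "pod", "-l", fmt.Sprintf("name=%s", primary)}
}

func main() {
	// Shows the command line that would be handed to the oc binary.
	fmt.Println("oc", strings.Join(deletePodArgs("postgres-primary"), " "))
}
```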

support Kube Deployments as the targets

instead of just Pods, support killing a Primary as part of a Deployment.

remove binaries from container image

after client-go is implemented, make sure the oc and kubectl binaries are removed

convert all examples to use Deployments for watch

instead of running watch in a simple pod, run it within a Kube Deployment, in all examples.

Need to ensure that inFailover is reset

crunchy-watch/main.go

Line 318 in d8f8f72

var inFailOver int32 = 0

see #32 for fix

Need to reset the inFailOver flag

crunchy-watch/main.go

Line 322 in 242720c

if atomic.CompareAndSwapInt32(&inFailOver, 0, 1) == false {
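
One way to make sure the flag is always cleared is to reset it with defer right after the compare-and-swap succeeds. The following is only a sketch under that assumption. runFailover and errFailoverInProgress are made-up names, and the actual fix is the one referenced in #32.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// inFailOver is 1 while a failover is running and 0 otherwise.
var inFailOver int32

var errFailoverInProgress = errors.New("failover already in progress")

// runFailover runs failover unless another one is active. The deferred store
// resets the flag on every return path, including when failover fails.
// (Hypothetical wrapper, not the project's actual code.)
func runFailover(failover func() error) error {
	if !atomic.CompareAndSwapInt32(&inFailOver, 0, 1) {
		return errFailoverInProgress
	}
	defer atomic.StoreInt32(&inFailOver, 0)
	return failover()
}

func main() {
	// Two failovers in a row: the second is only possible because the flag
	// was reset after the first.
	for i := 1; i <= 2; i++ {
		err := runFailover(func() error { return nil })
		fmt.Printf("failover %d: err=%v\n", i, err)
	}
}
```

Without the reset, every failover after the first would be rejected by the compare-and-swap.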

add DEBUG env var

setting an env var named DEBUG should cause log.Debug messages to be produced in the container log.
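
A minimal sketch of the idea, using only the standard library rather than whatever logging package watch actually uses: treat a non-empty DEBUG value as "debug on" and gate debug output on it. leveledLogger and debugEnabled are illustrative names, and treating an empty value as "off" is an assumption.

```go
package main

import (
	"log"
	"os"
)

// debugEnabled reports whether DEBUG is set to a non-empty value. The lookup
// function is injected so the check can be exercised without touching the
// real environment. (Assumption: an empty value counts as "off".)
func debugEnabled(lookup func(string) (string, bool)) bool {
	v, ok := lookup("DEBUG")
	return ok && v != ""
}

// leveledLogger is a tiny stand-in for a real leveled logging library.
type leveledLogger struct {
	debug bool
	out   *log.Logger
}

func (l leveledLogger) Infof(format string, args ...interface{}) {
	l.out.Printf("INFO "+format, args...)
}

// Debugf only writes when debug output is enabled.
func (l leveledLogger) Debugf(format string, args ...interface{}) {
	if l.debug {
		l.out.Printf("DEBU "+format, args...)
	}
}

func main() {
	logger := leveledLogger{
		debug: debugEnabled(os.LookupEnv),
		out:   log.New(os.Stderr, "", log.LstdFlags),
	}
	logger.Infof("Health Checking: '%s'", "crunchy-primary")
	logger.Debugf("label: %s", "name") // printed only when DEBUG is set
}
```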

Please support SSL

The connection constructors are currently hardcoded to not use SSL:

crunchy-watch/main.go

Lines 266 to 282 in c0e8a18

    
    // Construct connection string to primary
    target := fmt.Sprintf("postgresql://%s:%s@%s:%d/%s?sslmode=disable&connect_timeout=%d",
        config.GetString(Username.EnvVar),
        config.GetString(Password.EnvVar),
        config.GetString(Primary.EnvVar),
        config.GetInt(PrimaryPort.EnvVar),
        config.GetString(Database.EnvVar),
        int(timeout.Seconds()),
    )
    pgconstr = fmt.Sprintf("postgresql://%s:%s@%s:%d/%s?sslmode=disable&connect_timeout=%d",
        config.GetString("postgres"),
        config.GetString(Password.EnvVar),
        config.GetString(Primary.EnvVar),
        config.GetInt(PrimaryPort.EnvVar),
        config.GetString(Database.EnvVar),
        int(timeout.Seconds()),
    )

Please consider changing this to allow or require SSL.

Thanks!

Crunchy-watch : Failover not happening.

Below are the logs from when the primary is deleted:
Starting crunchy watch
Started...
INFO[2018-10-22T15:44:54Z] Loading Platform Module: kube
INFO[2018-10-22T15:44:54Z] Waiting for signal...
INFO[2018-10-22T15:44:54Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:44:54Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:04Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:04Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:14Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:14Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:24Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:24Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:34Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:34Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:44Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:44Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:54Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:54Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:46:04Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:46:04Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:46:14Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:46:14Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:46:24Z] Health Checking: 'crunchy-primary'
ERRO[2018-10-22T15:46:34Z] dial tcp 10.105.143.178:5432: i/o timeout
ERRO[2018-10-22T15:46:34Z] Could not reach 'crunchy-primary' (Attempt: 1)
INFO[2018-10-22T15:46:34Z] Executing pre-hook: /hooks/watch-pre-hook
INFO[2018-10-22T15:46:34Z] Processing Failover: Strategy - latest
INFO[2018-10-22T15:46:34Z] Deleting existing primary...
ERRO[2018-10-22T15:46:34Z] pods is forbidden: User "system:serviceaccount:default:pg-watcher" cannot deletecollection pods in the namespace "default"
ERRO[2018-10-22T15:46:34Z] An error occurred while deleting the old primary
INFO[2018-10-22T15:46:34Z] Deleted old primary
INFO[2018-10-22T15:46:34Z] Choosing failover replica...
ERRO[2018-10-22T15:46:34Z] Error getting pods command
INFO[2018-10-22T15:46:34Z] Chose failover target ()
INFO[2018-10-22T15:46:34Z] Promoting failover replica...
ERRO[2018-10-22T15:46:34Z] An error occurred while promoting the failover replica
ERRO[2018-10-22T15:46:34Z] Failover process failed: could not get pod info: resource name may not be empty
INFO[2018-10-22T15:46:34Z] Executing post-hook: /hooks/watch-post-hook
INFO[2018-10-22T15:46:44Z] Health Checking: 'crunchy-primary'
ERRO[2018-10-22T15:46:54Z] dial tcp 10.105.143.178:5432: i/o timeout

I am using Kubernetes:
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

default namespace is not a valid default namespace

Currently, if you do not specify a valid namespace, or do not set a namespace env var at all, watch defaults to the 'default' namespace, which is not what we want.

Instead, watch should check that NAMESPACE is set; if it is not, it should write an error message to the log and abort.

func (h failoverHandler) SetFlags(f *flag.FlagSet) {
flags.String(f, KubeNamespace, "default")
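
A sketch of the requested behavior, assuming the namespace comes from an environment variable named NAMESPACE as written above (the real variable behind KubeNamespace may have a different name). requireNamespace is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
)

// requireNamespace returns the namespace from the given environment variable
// and fails when it is unset or empty, instead of falling back to "default".
// (Hypothetical helper; the variable name is passed in because the real one
// is an assumption here.)
func requireNamespace(envVar string, lookup func(string) (string, bool)) (string, error) {
	ns, ok := lookup(envVar)
	if !ok || ns == "" {
		return "", fmt.Errorf("%s must be set; refusing to fall back to the 'default' namespace", envVar)
	}
	return ns, nil
}

func main() {
	ns, err := requireNamespace("NAMESPACE", os.LookupEnv)
	if err != nil {
		// A real program would log this and abort with a non-zero exit code.
		fmt.Println("error:", err)
		return
	}
	fmt.Println("using namespace:", ns)
}
```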

fix copyrights

the copyrights need fixing to match the correct format of:

a file gets created in 2016 so it would state "2016-2018"

a file gets created in 2017 it would state "2017-2018"

a file gets created in 2018 it would state "2018"
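
The rule in the three examples can be written down as a small function, which may help if the headers are ever checked or regenerated by a script. This is only a sketch; copyrightYears is a made-up name.

```go
package main

import (
	"fmt"
	"strconv"
)

// copyrightYears formats the year part of a copyright header: a single year
// when the file was created in the current year, otherwise "created-current".
// A creation year later than the current year is treated like a single year.
func copyrightYears(created, current int) string {
	if created >= current {
		return strconv.Itoa(created)
	}
	return strconv.Itoa(created) + "-" + strconv.Itoa(current)
}

func main() {
	for _, created := range []int{2016, 2017, 2018} {
		fmt.Printf("created %d: %q\n", created, copyrightYears(created, 2018))
	}
}
```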

Why do both pause and inFailOver exist?

pause and inFailOver are both only true when actively failing over, so can't we remove pause and just use inFailOver to signal that condition?

evaluate de-escalation of privs on openshift

Evaluate the OCP privileges and determine the minimal requirements; where possible, de-escalate so that a cluster role is not required to run watch.

Unable to switch from replica (promoted to master when the master failed) to replica-2 when the replica fails

Using the primary-replica example, create the pr-primary, pr-replica and pr-replica-2 pods.
Run watch to monitor them and switch the master in case of failure.

Kill pr-primary. The watcher identifies the master failure and promotes pr-replica to master.

After this we can insert and delete database entries (working fine, as expected).

Now kill pr-replica (labelled as pr-primary after the original pr-primary was killed).
The watcher does not initiate a failover.
Watcher logs for the 1st failover (successful) and the 2nd failover (no failover happens):

INFO[2018-08-06T10:25:14Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:25:44Z] Health Checking: 'pr-primary'
ERRO[2018-08-06T10:25:54Z] dial tcp 10.96.29.55:5432: i/o timeout
ERRO[2018-08-06T10:25:54Z] Could not reach 'pr-primary' (Attempt: 1)
INFO[2018-08-06T10:25:54Z] Executing pre-hook: /hooks/watch-pre-hook
INFO[2018-08-06T10:25:54Z] Processing Failover: Strategy - latest
INFO[2018-08-06T10:25:54Z] Deleting existing primary...
INFO[2018-08-06T10:25:54Z] Deleted old primary
INFO[2018-08-06T10:25:54Z] Choosing failover replica...
INFO[2018-08-06T10:25:54Z] Chose failover target (pr-replica)
INFO[2018-08-06T10:25:54Z] Promoting failover replica...
DEBU[2018-08-06T10:25:54Z] executing cmd: [/opt/cpm/bin/promote.sh] on pod pr-replica in namespace default container: postgres
INFO[2018-08-06T10:25:54Z] Relabeling failover replica...
DEBU[2018-08-06T10:25:54Z] label: name
DEBU[2018-08-06T10:25:54Z] label: replicatype
INFO[2018-08-06T10:25:54Z] Executing post-hook: /hooks/watch-post-hook
INFO[2018-08-06T10:26:24Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:26:24Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:26:54Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:26:54Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:27:24Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:27:24Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:27:54Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:27:54Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:28:24Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:28:24Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:28:54Z] Health Checking: 'pr-primary'
ERRO[2018-08-06T10:29:04Z] dial tcp 10.96.29.55:5432: i/o timeout
ERRO[2018-08-06T10:29:04Z] Could not reach 'pr-primary' (Attempt: 1)
INFO[2018-08-06T10:29:34Z] Health Checking: 'pr-primary'
ERRO[2018-08-06T10:29:44Z] dial tcp 10.96.29.55:5432: i/o timeout
ERRO[2018-08-06T10:29:44Z] Could not reach 'pr-primary' (Attempt: 1)
INFO[2018-08-06T10:30:14Z] Health Checking: 'pr-primary'
ERRO[2018-08-06T10:30:24Z] dial tcp 10.96.29.55:5432: i/o timeout
ERRO[2018-08-06T10:30:24Z] Could not reach 'pr-primary' (Attempt: 1)

docker support

Complete docker plugin support.

support failover within a statefulset configuration

have crunchy-watch watch a statefulset cluster and perform a failover

Need to check to make sure the array is not empty

crunchy-watch/plugins/kube/failover_strategy.go

Line 130 in 242720c

selectedReplica := replicas[0]
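
A guard along these lines would avoid indexing an empty slice. The sketch uses plain strings as a stand-in for whatever type the replicas actually have in failover_strategy.go, and selectReplica and errNoReplicas are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoReplicas = errors.New("no replicas available for failover")

// selectReplica returns the first candidate, or an error when the list is
// empty, instead of panicking on replicas[0].
// (Strings stand in for the real replica type.)
func selectReplica(replicas []string) (string, error) {
	if len(replicas) == 0 {
		return "", errNoReplicas
	}
	return replicas[0], nil
}

func main() {
	if _, err := selectReplica(nil); err != nil {
		fmt.Println("error:", err)
	}
	name, _ := selectReplica([]string{"pr-replica", "pr-replica-2"})
	fmt.Println("selected:", name)
}
```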

support a manual failover being initiated by a user or application

This feature would let crunchy-watch support a manual failover, perhaps through a REST API. Another application, or an end user using curl for instance, might want to trigger a manual failover for scheduled maintenance or other reasons, and would need an API through which to invoke this function.
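
To make the idea concrete, here is a rough sketch of what such an endpoint could look like with net/http. The /failover path, the status codes, and the handler names are all assumptions for illustration; no such API exists in crunchy-watch as far as this issue shows.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// failoverStatus decides the HTTP status for a manual failover request:
// only POST triggers the failover, and a failed trigger maps to a 500.
func failoverStatus(method string, trigger func() error) int {
	if method != http.MethodPost {
		return http.StatusMethodNotAllowed
	}
	if err := trigger(); err != nil {
		return http.StatusInternalServerError
	}
	return http.StatusAccepted
}

// manualFailoverHandler wraps failoverStatus in an http.Handler.
func manualFailoverHandler(trigger func() error) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(failoverStatus(r.Method, trigger))
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/failover", manualFailoverHandler(func() error {
		fmt.Println("manual failover requested")
		return nil
	}))

	// Exercise the handler in-process; a real service would instead call
	// http.ListenAndServe with this mux.
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/failover", nil))
	fmt.Println("status:", rec.Code)
}
```

A user could then trigger a failover with something like curl -X POST against the chosen address. Authentication is deliberately left out of the sketch and would be needed in practice.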

Support in and out of cluster configs for kube/openshift

Currently, only in-cluster configurations are supported. It would be nice to have out-of-cluster support as well. This would help primarily with development testing.
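
As a sketch of how the choice between the two modes might be made, the code below follows the usual conventions: a pod running inside a cluster sees KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT, while outside a cluster the KUBECONFIG variable or ~/.kube/config is used. It only decides which source to use and does not call client-go. kubeConfigSource and chooseConfig are hypothetical names, and KUBECONFIG is treated as a single path for simplicity.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// kubeConfigSource describes where the cluster configuration should come from.
type kubeConfigSource struct {
	InCluster bool
	Path      string // kubeconfig path when InCluster is false
}

// chooseConfig prefers in-cluster configuration when the service host and
// port variables are present, then KUBECONFIG, then <home>/.kube/config.
// (Simplification: KUBECONFIG is treated as one path, not a path list.)
func chooseConfig(lookup func(string) (string, bool), home string) kubeConfigSource {
	host, _ := lookup("KUBERNETES_SERVICE_HOST")
	port, _ := lookup("KUBERNETES_SERVICE_PORT")
	if host != "" && port != "" {
		return kubeConfigSource{InCluster: true}
	}
	if path, ok := lookup("KUBECONFIG"); ok && path != "" {
		return kubeConfigSource{Path: path}
	}
	return kubeConfigSource{Path: filepath.Join(home, ".kube", "config")}
}

func main() {
	home, _ := os.UserHomeDir()
	src := chooseConfig(os.LookupEnv, home)
	fmt.Printf("in-cluster=%v kubeconfig=%q\n", src.InCluster, src.Path)
}
```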

replace kubectl and oc with client-go API calls

use the client-go API to replace the need to embed the kubectl and oc binaries

