crunchy-watch's Issues

handle container restart scenario

In testing watch, we need a means to cause Postgres to not start up; this is a different test from simply killing the primary pod to trigger the failover.

In this scenario, watch will detect that it can't reach Postgres even if Kube restarts the pod over and over.
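A minimal sketch of how watch could cover this case, assuming a consecutive-failure counter; the healthTracker type and its threshold are illustrative, not part of crunchy-watch:

```go
package main

import "fmt"

// healthTracker counts consecutive failed health checks and signals a
// failover once the threshold is reached. This covers the restart-loop
// case: even if Kube keeps restarting the pod, Postgres never becomes
// reachable, so the counter keeps climbing until failover fires.
type healthTracker struct {
	failures  int
	threshold int
}

// record registers one health-check result and reports whether a
// failover should be initiated. A successful check resets the count.
func (h *healthTracker) record(ok bool) bool {
	if ok {
		h.failures = 0
		return false
	}
	h.failures++
	return h.failures >= h.threshold
}

func main() {
	h := &healthTracker{threshold: 3}
	for _, ok := range []bool{true, false, false, false} {
		fmt.Println("failover:", h.record(ok))
	}
}
```

The threshold distinguishes a transient blip (pod restarting normally) from a pod that restarts but never serves connections.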

Docker Module not Built - Cannot Use without Kube

The Makefile does not have a build path for building the docker.so module for this container.

Therefore, any attempted use of CRUNCHY_WATCH_PLATFORM=docker results in a segmentation fault:

pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | plugin.Open(/opt/cpm/bin/crunchy-watch/plugins/docker.so): realpath failed
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | panic: runtime error: invalid memory address or nil pointer dereference
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1571ebe]
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    |
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | goroutine 1 [running]:
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | main.loadPlatformModule(0x7ffc0778ae27, 0x6, 0xc4200dbe08, 0x1)
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | 	/home/dwagneradm/buildenv/cdev/src/github.com/crunchydata/crunchy-watch/util.go:42 +0x1de
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | main.main()
pgwatch_pgwatcher.1.25qgkikv4am8@stldopsdock02    | 	/home/dwagneradm/buildenv/cdev/src/github.com/crunchydata/crunchy-watch/main.go:229 +0x316

Panic if PROJECT variable not set

There is a logic error in the main script that causes a panic if KUBE_PROJECT or OSE_PROJECT isn't set.

Starting crunchy watch
Started...
env var debug value is []
panic: runtime error: index out of range

goroutine 1 [running]:
main.main()
	/opt/cdev/src/github.com/crunchydata/crunchy-watch/main.go:218 +0x1237

Correct me if I'm wrong, but os.Args always contains at least one value (the program name), so this check should be len(os.Args) < 2.

The Dockerfile run script should also enforce this variable being set.
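A minimal sketch of the suggested guard, assuming the project name is read from os.Args; getProject is a hypothetical helper, not the actual function in main.go:

```go
package main

import (
	"fmt"
	"os"
)

// getProject returns the first CLI argument, guarding against the
// out-of-range panic described above: os.Args always contains at least
// the program name, so the check must be len(args) < 2, not < 1.
func getProject(args []string) (string, error) {
	if len(args) < 2 {
		return "", fmt.Errorf("KUBE_PROJECT or OSE_PROJECT must be set")
	}
	return args[1], nil
}

func main() {
	project, err := getProject(os.Args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("project:", project)
}
```

Exiting with an error message instead of panicking also gives the Dockerfile run script something sensible to surface in the container log.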

delete primary pod by label

Hi, thanks for the very useful containers. Unfortunately, I have a problem with the deletion of the pod that is the current primary when crunchy-watch tries to replace it. The first step is to delete the old primary pod, and this part does not work, ending with: Error from server: pods "name=postgresql-primary" not found.

I found the code responsible for this:

fmt.Sprintf("name=%s", config.GetString("CRUNCHY_WATCH_PRIMARY")),

When I tested it from the CLI, oc delete pod "name=postgres-primary" didn't work either.
The remedy for this issue is to add the -l flag, which allows pod deletion by label, e.g. oc delete pod -l "name=postgres-primary".
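A sketch of the proposed fix, assuming the delete arguments are assembled as a string slice before being passed to oc/kubectl; deleteArgs is a hypothetical helper, not the actual crunchy-watch function:

```go
package main

import "fmt"

// deleteArgs builds the CLI arguments for removing the primary pod.
// Passing the selector positionally ("oc delete pod name=...") fails
// because oc treats it as a pod name; the -l flag makes it a label
// selector, which is the fix proposed in this issue.
func deleteArgs(primaryLabel string) []string {
	return []string{"delete", "pod", "-l", fmt.Sprintf("name=%s", primaryLabel)}
}

func main() {
	// Equivalent to: oc delete pod -l "name=postgres-primary"
	fmt.Println(deleteArgs("postgres-primary"))
}
```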

add DEBUG env var

Setting an env var named DEBUG should cause log.Debug messages to be produced in the container log.

Please support SSL

The connection constructors are currently hardcoded to not use SSL:

crunchy-watch/main.go

Lines 266 to 282 in c0e8a18

// Construct connection string to primary
target := fmt.Sprintf("postgresql://%s:%s@%s:%d/%s?sslmode=disable&connect_timeout=%d",
	config.GetString(Username.EnvVar),
	config.GetString(Password.EnvVar),
	config.GetString(Primary.EnvVar),
	config.GetInt(PrimaryPort.EnvVar),
	config.GetString(Database.EnvVar),
	int(timeout.Seconds()),
)
pgconstr = fmt.Sprintf("postgresql://%s:%s@%s:%d/%s?sslmode=disable&connect_timeout=%d",
	config.GetString("postgres"),
	config.GetString(Password.EnvVar),
	config.GetString(Primary.EnvVar),
	config.GetInt(PrimaryPort.EnvVar),
	config.GetString(Database.EnvVar),
	int(timeout.Seconds()),
)

Please consider changing this to allow or require SSL.

Thanks!

Crunchy-watch: Failover not happening

Below are the logs when the primary is deleted:
Starting crunchy watch
Started...
INFO[2018-10-22T15:44:54Z] Loading Platform Module: kube
INFO[2018-10-22T15:44:54Z] Waiting for signal...
INFO[2018-10-22T15:44:54Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:44:54Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:04Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:04Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:14Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:14Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:24Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:24Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:34Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:34Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:44Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:44Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:45:54Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:45:54Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:46:04Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:46:04Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:46:14Z] Health Checking: 'crunchy-primary'
INFO[2018-10-22T15:46:14Z] Successfully reached 'crunchy-primary'
INFO[2018-10-22T15:46:24Z] Health Checking: 'crunchy-primary'
ERRO[2018-10-22T15:46:34Z] dial tcp 10.105.143.178:5432: i/o timeout
ERRO[2018-10-22T15:46:34Z] Could not reach 'crunchy-primary' (Attempt: 1)
INFO[2018-10-22T15:46:34Z] Executing pre-hook: /hooks/watch-pre-hook
INFO[2018-10-22T15:46:34Z] Processing Failover: Strategy - latest
INFO[2018-10-22T15:46:34Z] Deleting existing primary...
ERRO[2018-10-22T15:46:34Z] pods is forbidden: User "system:serviceaccount:default:pg-watcher" cannot deletecollection pods in the namespace "default"
ERRO[2018-10-22T15:46:34Z] An error occurred while deleting the old primary
INFO[2018-10-22T15:46:34Z] Deleted old primary
INFO[2018-10-22T15:46:34Z] Choosing failover replica...
ERRO[2018-10-22T15:46:34Z] Error getting pods command
INFO[2018-10-22T15:46:34Z] Chose failover target ()
INFO[2018-10-22T15:46:34Z] Promoting failover replica...
ERRO[2018-10-22T15:46:34Z] An error occurred while promoting the failover replica
ERRO[2018-10-22T15:46:34Z] Failover process failed: could not get pod info: resource name may not be empty
INFO[2018-10-22T15:46:34Z] Executing post-hook: /hooks/watch-post-hook
INFO[2018-10-22T15:46:44Z] Health Checking: 'crunchy-primary'
ERRO[2018-10-22T15:46:54Z] dial tcp 10.105.143.178:5432: i/o timeout

I am using Kubernetes:
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:53:20Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

default namespace is not a valid default namespace

Currently, if you do not specify a valid namespace, or do not set a namespace env var at all, watch will default to using the 'default' NAMESPACE, which is not what we want.

Instead, watch should check that the NAMESPACE is set; if it isn't, it should produce an error message in the log and abort.

func (h failoverHandler) SetFlags(f *flag.FlagSet) {
	flags.String(f, KubeNamespace, "default")
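A sketch of the requested check; the CRUNCHY_WATCH_KUBE_NAMESPACE variable name and the requireNamespace helper are assumptions for illustration, not the project's actual configuration keys:

```go
package main

import (
	"fmt"
	"os"
)

// requireNamespace returns the configured namespace, or an error when
// it is unset, matching the behavior this issue requests instead of
// silently falling back to "default".
func requireNamespace(value string) (string, error) {
	if value == "" {
		return "", fmt.Errorf("NAMESPACE must be set; refusing to assume 'default'")
	}
	return value, nil
}

func main() {
	os.Setenv("CRUNCHY_WATCH_KUBE_NAMESPACE", "pgwatch") // demo value
	ns, err := requireNamespace(os.Getenv("CRUNCHY_WATCH_KUBE_NAMESPACE"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // abort at startup instead of watching the wrong namespace
	}
	fmt.Println("namespace:", ns)
}
```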

fix copyrights

The copyrights need fixing to match the correct format:

a file created in 2016 would state "2016-2018"

a file created in 2017 would state "2017-2018"

a file created in 2018 would state "2018"

Unable to switch from the replica (promoted to master after the master failed) to replica-2 when that replica fails

Using primary-replica, create the pr-primary, pr-replica, and pr-replica-2 pods.
Run watch to monitor and switch the master in case of failure.

Kill pr-primary. The watcher identifies the master failure and promotes pr-replica as master.

After this we can insert/delete database entries (working fine as expected).

Now kill pr-replica (labelled as pr-primary after the original pr-primary was killed).
The watcher does not initiate failover.
Watcher logs for the 1st failover (successful) and the 2nd failover (no failover occurs):

INFO[2018-08-06T10:25:14Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:25:44Z] Health Checking: 'pr-primary'
ERRO[2018-08-06T10:25:54Z] dial tcp 10.96.29.55:5432: i/o timeout
ERRO[2018-08-06T10:25:54Z] Could not reach 'pr-primary' (Attempt: 1)
INFO[2018-08-06T10:25:54Z] Executing pre-hook: /hooks/watch-pre-hook
INFO[2018-08-06T10:25:54Z] Processing Failover: Strategy - latest
INFO[2018-08-06T10:25:54Z] Deleting existing primary...
INFO[2018-08-06T10:25:54Z] Deleted old primary
INFO[2018-08-06T10:25:54Z] Choosing failover replica...
INFO[2018-08-06T10:25:54Z] Chose failover target (pr-replica)
INFO[2018-08-06T10:25:54Z] Promoting failover replica...
DEBU[2018-08-06T10:25:54Z] executing cmd: [/opt/cpm/bin/promote.sh] on pod pr-replica in namespace default container: postgres
INFO[2018-08-06T10:25:54Z] Relabeling failover replica...
DEBU[2018-08-06T10:25:54Z] label: name
DEBU[2018-08-06T10:25:54Z] label: replicatype
INFO[2018-08-06T10:25:54Z] Executing post-hook: /hooks/watch-post-hook
INFO[2018-08-06T10:26:24Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:26:24Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:26:54Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:26:54Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:27:24Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:27:24Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:27:54Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:27:54Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:28:24Z] Health Checking: 'pr-primary'
INFO[2018-08-06T10:28:24Z] Successfully reached 'pr-primary'
INFO[2018-08-06T10:28:54Z] Health Checking: 'pr-primary'
ERRO[2018-08-06T10:29:04Z] dial tcp 10.96.29.55:5432: i/o timeout
ERRO[2018-08-06T10:29:04Z] Could not reach 'pr-primary' (Attempt: 1)
INFO[2018-08-06T10:29:34Z] Health Checking: 'pr-primary'
ERRO[2018-08-06T10:29:44Z] dial tcp 10.96.29.55:5432: i/o timeout
ERRO[2018-08-06T10:29:44Z] Could not reach 'pr-primary' (Attempt: 1)
INFO[2018-08-06T10:30:14Z] Health Checking: 'pr-primary'
ERRO[2018-08-06T10:30:24Z] dial tcp 10.96.29.55:5432: i/o timeout
ERRO[2018-08-06T10:30:24Z] Could not reach 'pr-primary' (Attempt: 1)

support a manual failover being initiated by a user or application

This feature would let crunchy-watch support a manual failover, perhaps via a REST API. Another application, or an end user using curl, might want to cause a manual failover for scheduled maintenance or other reasons; they need an API with which to invoke this function.
