JobMon is a distributed job manager designed to ease the process of running parallel simulations on heterogeneous computing infrastructure (groups of clusters and workstations).
Note: This package is probably not suitable for you. I wrote it for a specialized scenario of my own, and I would recommend one of the other, more mature Python distributed task managers (such as Celery) instead.
This software would not be possible without the following components:
- Python for the main driving and glue code
- Redis for the distributed job management
- rsyslog for remote logging
- ZeroMQ for some networking bliss
- Pytest for showing me how testing in Python can be
Tested in a clean Ubuntu 14.04.1 VM:
```
sudo apt-get install redis-server python-pip git libzmq3 libzmq3-dev python-dev
pip install git+git://github.com/binarybana/jobmon.git
```
And if you'd like to run the tests, or develop locally, then you'll need to clone from GitHub directly:
```
$ git clone https://github.com/binarybana/jobmon.git
$ cd jobmon
$ pip install -e .  # will install dependencies and a local development copy
$ py.test
============================= test session starts ==========================
platform linux2 -- Python 2.7.5 -- py-1.4.26 -- pytest-2.6.4
collected 2 items

tests/test_redis.py ..

=========================== 2 passed in 0.06 seconds =======================
```
First, set up the configuration script jobmon/config.py with the details of each cluster or workstation you wish to run jobs on: the Python environment, the number of processes, and so on. Jobs will then be spawned on those hosts through an SSH connection.
Currently this process is quite fragile, and there are other Python job servers that spawn workers over SSH which may serve you better. So caveat emptor!
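The exact contents of config.py are not shown here; the snippet below is only an illustrative sketch of the kind of per-host details it collects (the variable name, field names, and host names are hypothetical, not the actual schema):

```python
# Hypothetical per-host settings for jobmon/config.py -- field names are
# illustrative only; see the real config.py in the repository for the
# actual format.
hosts = {
    'workstation1': {
        'python': '/usr/bin/python',       # interpreter to run workers with
        'processes': 8,                    # worker processes to spawn over SSH
        'workdir': '~/jobmon-work',        # where synced code and results live
    },
    'cluster-head': {
        'python': '~/envs/sim/bin/python',
        'processes': 64,
        'workdir': '/scratch/jobmon',
    },
}
```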
Then start up the forking daemon:
jm spawn
After that, synchronize your local codebase with remote codebases using:
jm sync [hosts [...]]
And then launch the worker monitor daemons remotely with:
jm launch [hosts [...]]
Monitor the cluster status with:
jm net
Submit jobs to the cluster with:
jm postjob <module.py> <job description> [Number of tasks to run]
Monitor their progress with:
jm jobs
Run sweeps across a range of parameter values:
jm postsweep
jm post2Dsweep
And view and clean up posted experiment files with:
jm source
jm clean
jm gc
The Redis schema we are using:
- jobs:new, jobs:working
  - List: <jobhash>|<paramhash> experiment strings.
- jobs:numdone
  - Int: the number of completed jobs.
- jobs:sources
  - Hashmap: <jobhash> -> gzipped source text.
- jobs:descs
  - Hashmap: <jobhash> -> description string.
- params:sources
  - Hashmap: hashed JSON params -> the JSON param string.
- jobs:times, experiments:times
  - Hashmap: <jobhash> or <jobhash>|<paramhash> strings -> unix epoch time of submission.
- experiments:ground
  - Hashmap: <jobhash>|<paramhash> strings -> the zlib-compressed pickle of the ground truth object.
- jobs:githashes
  - Hashmap: <jobhash> -> the git hash of the superproject the job was posted under.
- workers:hbs
  - Sorted set (used as a heap): heartbeat times -> JSON-encoded info about the child worker (see the next section).
- workers:stop
  - A key whose presence indicates that the workers should stop.
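As a rough illustration of how a client might write into this schema with the redis-py package (a sketch only: the file name, hashing choices, and encoding details below are assumptions, not necessarily what jobmon's own postjob code does):

```python
import hashlib
import json
import time
import zlib

import redis

r = redis.StrictRedis(host='localhost', port=6379)

# Hypothetical sketch of posting one experiment under the schema above;
# jobmon's real postjob code may differ in hashing and encoding details.
source = open('experiment.py', 'rb').read()
params = json.dumps({'alpha': 0.1, 'samples': 1000}, sort_keys=True)

jobhash = hashlib.sha1(source).hexdigest()
paramhash = hashlib.sha1(params.encode('utf-8')).hexdigest()

r.hset('jobs:sources', jobhash, zlib.compress(source))  # compressed source text
r.hset('jobs:descs', jobhash, 'baseline run')           # human-readable description
r.hset('params:sources', paramhash, params)             # hashed JSON params -> JSON string
r.hset('jobs:times', jobhash, int(time.time()))         # submission time (unix epoch)

# One list entry per task; workers pull from jobs:new and move entries to jobs:working.
for _ in range(10):
    r.lpush('jobs:new', '{0}|{1}'.format(jobhash, paramhash))
```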
The heartbeats that the children will send back have yet to be defined, but the JSON will probably include:
- unique_id: as generated by the child
- status: text describing current state (working, resting)
- history: time at that state (seconds)
- job: if working, what job? (if resting, then blank)
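A minimal sketch of what a worker-side heartbeat could look like under those assumptions, writing into the workers:hbs sorted set with the heartbeat time as the score (the field values and the helper below are illustrative only, since the format is not yet fixed):

```python
import json
import time
import uuid

import redis

r = redis.StrictRedis(host='localhost', port=6379)

worker_id = uuid.uuid4().hex  # unique_id generated once by the child


def send_heartbeat(status, job='', history=0.0):
    """Push one heartbeat into workers:hbs, scored by the current time."""
    info = json.dumps({
        'unique_id': worker_id,
        'status': status,     # e.g. 'working' or 'resting'
        'history': history,   # seconds spent in the current state
        'job': job,           # '<jobhash>|<paramhash>' when working, blank otherwise
    })
    # redis-py 2.x StrictRedis signature: zadd(name, score, member)
    r.zadd('workers:hbs', time.time(), info)


send_heartbeat('resting')

# Workers can also poll the stop key and shut down when it appears.
if r.exists('workers:stop'):
    pass  # clean up and exit
```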