JobMon

JobMon is a distributed job manager designed to ease the process of running parallel simulations on heterogeneous computing infrastructure (groups of clusters and workstations).

Note: This package is probably not suitable for you. I wrote it for a specialized scenario, and I would recommend one of the many more mature Python distributed task managers (like Celery) instead.

Building Blocks

This software would not be possible without the following components:

  • Python for the main driving and glue code
  • Redis for the distributed job management
  • rsyslog for remote logging
  • ZeroMQ for some networking bliss
  • Pytest for showing me how testing in Python can be

Installation

Tested in a clean Ubuntu 14.04.1 VM:

sudo apt-get install redis-server python-pip git libzmq3 libzmq3-dev python-dev
pip install git+https://github.com/binarybana/jobmon.git

If you'd like to run the tests or develop locally, clone from GitHub directly:

$ git clone https://github.com/binarybana/jobmon.git
$ cd jobmon
$ pip install -e . # will install dependencies and a local development copy
$ py.test

============================= test session starts ==========================
platform linux2 -- Python 2.7.5 -- py-1.4.26 -- pytest-2.6.4
collected 2 items

tests/test_redis.py ..

=========================== 2 passed in 0.06 seconds =======================

Usage

First, set up the configuration script jobmon/config.py with the details (Python environment, number of processes, etc.) of each cluster/workstation that you wish to run jobs on. These jobs will then be spawned through an SSH connection.

Currently this process is quite fragile, and there are other Python job servers that spawn through SSH which are probably better for you. So caveat emptor!

Then start up the forking daemon:

jm spawn

After that, synchronize your local codebase with remote codebases using:

jm sync [hosts [...]]

And then launch the worker monitor daemons remotely with:

jm launch [hosts [...]]

Monitor the cluster status with:

jm net

Submit jobs to the cluster with:

jm postjob <module.py> <job description> [Number of tasks to run]

Monitor their progress with:

jm jobs

Run parameter sweeps across a range of parameters:

jm postsweep
jm post2Dsweep
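
To illustrate what a sweep amounts to, here is a minimal sketch of expanding one or two parameter ranges into individual parameter sets. It is not how jm postsweep or jm post2Dsweep are implemented, and it says nothing about their arguments (neither is described here); the function names sweep_1d and sweep_2d and the parameter names in the usage note are hypothetical.

```python
import itertools


def sweep_1d(base, name, values):
    """One parameter dict per value of `name`; other settings are copied from `base`."""
    return [dict(base, **{name: v}) for v in values]


def sweep_2d(base, name_a, values_a, name_b, values_b):
    """One parameter dict per (a, b) pair: the Cartesian product of both ranges."""
    return [
        dict(base, **{name_a: a, name_b: b})
        for a, b in itertools.product(values_a, values_b)
    ]
```

For example, sweep_2d({"seed": 0}, "rate", [0.1, 0.2], "size", [10, 20, 30]) yields six parameter sets, each of which could be posted as its own experiment.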

And view and clean up posted experiment files with:

jm source
jm clean
jm gc

Redis Schema

The Redis schema we are using:

jobs:new,jobs:working
List: <jobhash>|<paramhash> experiment strings.
jobs:numdone
Int: The number of done jobs.
jobs:sources
Hashmap: A hashmap from job hash to gzipped source text.
jobs:descs
Hashmap: <jobhash> -> Description string
params:sources
Hashmap: from hashed JSON params to the JSON param string.
jobs:times,experiments:times
Hashmap: from <jobhash> or <jobhash>|<paramhash> strings to unix epoch times of submission.
experiments:ground
Hashmap: <jobhash>|<paramhash> strings to the zlib compressed pickle of the ground truth object
jobs:githashes
Hashmap: jobhash to the githash of the superproject that the job was posted under
workers:hbs
Sorted set (heap): heartbeat times to JSON-encoded info about the child. (See next section)
workers:stop
A key to indicate that the workers should stop.
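
As a rough illustration of how the keys above fit together, the sketch below records a single experiment. Several things in it are assumptions rather than facts about jobmon: the name post_job, the use of SHA-1 for both hashes, gzip for the source text, and sorted-key JSON for the parameters. The client r is expected to provide redis-py style hset and rpush methods.

```python
import gzip
import hashlib
import json
import time


def short_hash(text):
    """Hex SHA-1 of a string (an assumed hash; jobmon's real choice may differ)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def post_job(r, source, params, desc, now=None):
    """Record one experiment under the schema above and return its queue string.

    `r` is any client exposing redis-py style hset and rpush methods.
    """
    now = time.time() if now is None else now
    param_json = json.dumps(params, sort_keys=True)  # stable text gives a stable hash
    jobhash = short_hash(source)
    paramhash = short_hash(param_json)
    experiment = jobhash + "|" + paramhash

    r.hset("jobs:sources", jobhash, gzip.compress(source.encode("utf-8")))
    r.hset("jobs:descs", jobhash, desc)
    r.hset("params:sources", paramhash, param_json)
    r.hset("jobs:times", jobhash, now)
    r.hset("experiments:times", experiment, now)
    r.rpush("jobs:new", experiment)  # queue the experiment as new work
    return experiment
```

With a running Redis server and redis-py installed, r could be redis.Redis(); the function itself only relies on the two methods named above.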

Heartbeat Schema

The heartbeats that the children will send back have yet to be defined, but the JSON will probably include:

  • unique_id: as generated by the child
  • status: text describing current state (working, resting)
  • history: time spent in that state (seconds)
  • job: if working, what job? (if resting, then blank)
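
Since the heartbeat format is not finalized, the following is only a sketch of what building and storing one might look like with the four fields listed above. The function names, the use of a UUID for unique_id, and the choice to score each entry by its send time are all assumptions. The client r is expected to provide a redis-py style zadd(name, mapping) method.

```python
import json
import time
import uuid


def new_worker_id():
    """One possible way for a child to generate its own unique id (assumed)."""
    return uuid.uuid4().hex


def make_heartbeat(unique_id, status, seconds_in_state, job=""):
    """Build a JSON heartbeat from the fields listed above."""
    if status != "working":
        job = ""  # a resting child reports no job
    return json.dumps(
        {
            "unique_id": unique_id,
            "status": status,
            "history": seconds_in_state,
            "job": job,
        },
        sort_keys=True,
    )


def send_heartbeat(r, heartbeat, now=None):
    """Store the heartbeat in workers:hbs, scored by the time it was sent."""
    now = time.time() if now is None else now
    r.zadd("workers:hbs", {heartbeat: now})
```

A child might call send_heartbeat(r, make_heartbeat(my_id, "working", 42, job=experiment)) on a timer, where my_id comes from new_worker_id() and experiment is the string it took from the job queue.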
