saga-project / bigjob

SAGA-based Pilot-Job Implementation for Compute and Data

Home Page: http://saga-project.github.com/BigJob/

License: Other

Python 76.43% Shell 0.09% Makefile 0.02% Jupyter Notebook 23.46%

bigjob's People

Contributors

andre-merzky, drelu, icheckmate, melrom, oleweidner, pradeepmantha

bigjob's Issues

Parallelism Option in P* Compute API

        # Parallelism
        'number_of_processes',      # Total number of processes to start
        'processes_per_host',       # Nr of processes per host
        'threads_per_process',      # Nr of threads to start per process
        'total_core_count',         # total number of cores requested
        'spmd_variation',           # Type and startup mechanism

This has been a constant source of annoyance and misunderstanding in SAGA. I strongly suggest that this is something we should not adopt from it!
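To illustrate how these attributes can contradict each other, here is a minimal sketch of a consistency check. The function name `check_parallelism` is hypothetical, and the reading used here (total cores = processes x threads, and processes spread evenly over hosts) is an assumption made for illustration only; it is not taken from the SAGA specification or from BigJob.

```python
def check_parallelism(desc):
    """Return a list of conflicts between the parallelism attributes in `desc`.

    Assumed reading (illustrative only): total_core_count should equal
    number_of_processes * threads_per_process, and number_of_processes
    should be a multiple of processes_per_host.
    """
    problems = []
    procs = desc.get("number_of_processes")
    threads = desc.get("threads_per_process", 1)
    cores = desc.get("total_core_count")
    per_host = desc.get("processes_per_host")

    if procs is not None and cores is not None and procs * threads != cores:
        problems.append(
            "number_of_processes * threads_per_process = {0}, "
            "but total_core_count = {1}".format(procs * threads, cores))
    if procs is not None and per_host is not None and procs % per_host != 0:
        problems.append(
            "number_of_processes = {0} is not a multiple of "
            "processes_per_host = {1}".format(procs, per_host))
    return problems


# A description that sets both a process count and a core count can disagree with itself.
conflicts = check_parallelism({"number_of_processes": 1, "total_core_count": 4})
```

With overlapping attributes like these, a user can write a description that is valid attribute by attribute yet ambiguous as a whole, which is the kind of misunderstanding described above.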

number_of_processes jd attribute is ignored or is always "1"

Hi!

I think number_of_processes is being ignored and is always 1 for both single and MPI jobs. Please see the job description populated in the agent log file below.

The example I used is

(mypython_repo)login1$ cat example_local_single.py
""" Example application demonstrating job submission via bigjob

DON'T EDIT THIS FILE (UNLESS THERE IS A BUG)

THIS FILE SHOULD NOT BE COMMITTED TO SVN WITH USER-SPECIFIC PATHS!

"""
import os
import time
import pdb
import sys

# configuration

""" This variable defines the coordination system that is used by BigJob
e.g.
advert://localhost (SAGA/Advert SQLITE)
advert://advert.cct.lsu.edu:8080 (SAGA/Advert POSTGRESQL)
redis://localhost:6379 (Redis at localhost)
tcp://localhost (ZMQ)
tcp://* (ZMQ - listening to all interfaces)
"""

#COORDINATION_URL = "advert://localhost/?dbtype=sqlite3"
#COORDINATION_URL = "tcp://*"
#COORDINATION_URL = "redis://localhost:6379"
COORDINATION_URL = "redis://[email protected]:6379"

# for running BJ from local dir

sys.path.insert(0, os.getcwd() + "/../")

from bigjob import bigjob, subjob, description

def main():
    # Start BigJob

    ##########################################################################################
    # Edit parameters for BigJob
    queue=None # if None default queue is used
    project=None # if None default allocation is used
    walltime=10
    processes_per_node=4
    number_of_processes = 8
    workingdirectory=os.path.join(os.getcwd(), "agent")  # working directory for agent
    userproxy = None # userproxy (not supported yet due to context issue w/ SAGA)


    """
    URL of the SAGA Job Service that is used to dispatch the pilot job.
    The following URLs are accepted:

    lrms_url = "gram://oliver1.loni.org/jobmanager-pbs" # globus resource url used when globus is used. (LONI)
    lrms_url = "pbspro://louie1.loni.org" # pbspro resource url used when pbspro scheduling system is used.(Futuregrid or LSU Machines)
    lrms_url = "ssh://louie1.loni.org" # ssh resource url which launches jobs on target machine. Jobs not submitted to scheduling system.
    lrms_url = "pbs-ssh://louie1.loni.org" # Submit jobs to scheduling system of remote machine.
    lrms_url = "xt5torque://localhost" # torque resource url.

    Please ensure that the respective SAGA adaptor is installed and working
    """
    lrms_url = "fork://localhost" # resource url to run the jobs on localhost

    ##########################################################################################

    print "Start Pilot Job/BigJob at: " + lrms_url
    bj = bigjob(COORDINATION_URL)
    bj.start_pilot_job( lrms_url,
                        None,
                        number_of_processes,
                        queue,
                        project,
                        workingdirectory,
                        userproxy,
                        walltime,
                        processes_per_node)

    print "Pilot Job/BigJob URL: " + bj.pilot_url + " State: " + str(bj.get_state())

    ##########################################################################################
    # Submit SubJob through BigJob
    jd = description()
    jd.executable = "/bin/echo"
    #jd.executable = "$HOME/hello.sh"
    jd.number_of_processes = "4"
    jd.arguments = ["$HELLOWORLD"]
    jd.environment = ['HELLOWORLD=hello_world']
    jd.spmd_variation="single"

    # specify an optional working directory if sub-job should be executed outside of bigjob sandbox
    #jd.working_directory = "/tmp"
    jd.output = "stdout.txt"
    jd.error = "stderr.txt"
    sj = subjob()
    sj.submit_job(bj.pilot_url, jd)

    #########################################
    # busy wait for completion
    while 1:
        state = str(sj.get_state())
        print "state: " + state
        if(state=="Failed" or state=="Done"):
            break
        time.sleep(2)

    ##########################################################################################
    # Cleanup - stop BigJob
    bj.cancel()
    #time.sleep(30)

""" Test Job Submission via Advert """
if __name__ == "__main__":
    main()

(mypython_repo)login1$ cat stderr-bj-ef77aad0-bf16-11e1-aa15-f04da2005de7-agent.txt
06/25/2012 05:41:44 PM - bigjob - INFO - Loading BigJob version: 0.4.70 on login1.ls4.tacc.utexas.edu
06/25/2012 05:41:44 PM - bigjob - WARNING - SAGA C++ and Python bindings not found. Using Bliss.
06/25/2012 05:41:45 PM - root - WARNING - SAGA could not be found. Not all functionalities working
06/25/2012 05:41:45 PM - bigjob - DEBUG - Python Version: sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)
06/25/2012 05:41:45 PM - bigjob - DEBUG - Create output tar: False
/bin/sh: aprun: command not found
06/25/2012 05:41:45 PM - bigjob - DEBUG - aprun: False ssh: True Launch method: ssh
06/25/2012 05:41:45 PM - bigjob - DEBUG - External queue:
06/25/2012 05:41:45 PM - bigjob - DEBUG - parsing ID out of URL: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost
06/25/2012 05:41:45 PM - bigjob - DEBUG - BigJob Agent arguments: ['bigjob_agent.py', 'redis://[email protected]:6379', 'bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost', '']
06/25/2012 05:41:45 PM - bigjob - DEBUG - Initialize C&C subsystem to pilot-url: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost
06/25/2012 05:41:45 PM - bigjob - DEBUG - BigJob ID: bj-ef77aad0-bf16-11e1-aa15-f04da2005de7
06/25/2012 05:41:45 PM - bigjob - DEBUG - Utilizing Redis Backend: redis://[email protected]:6379. Please make sure Redis server is configured in bigjob_coordination_redis.py
06/25/2012 05:41:45 PM - bigjob - DEBUG - Connect to Redis: gw68.quarry.iu.teragrid.org Port: 6379
06/25/2012 05:41:45 PM - bigjob - DEBUG - set state to : Running
06/25/2012 05:41:45 PM - bigjob - DEBUG - update state of pilot job to: Running stopped: False
06/25/2012 05:41:45 PM - bigjob - DEBUG - ##################################### New POLL/MONITOR cycle ##################################
06/25/2012 05:41:45 PM - bigjob - DEBUG - Free nodes: 12 Busy Nodes: 0
06/25/2012 05:41:46 PM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
06/25/2012 05:41:46 PM - bigjob - DEBUG - Pilot job entry: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost exists. Pilot job not in state stopped.
06/25/2012 05:41:46 PM - bigjob - DEBUG - Monitor jobs - # current jobs: 0
06/25/2012 05:41:46 PM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
06/25/2012 05:41:46 PM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost
06/25/2012 05:41:46 PM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost:queue number queued items: 1
06/25/2012 05:41:46 PM - bigjob - DEBUG - Dequeued: ('bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost:queue', 'bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost:jobs:sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7')
06/25/2012 05:41:46 PM - bigjob - DEBUG - Dequed:bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost:jobs:sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7
06/25/2012 05:41:46 PM - bigjob - DEBUG - Get job description
06/25/2012 05:41:46 PM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
06/25/2012 05:41:46 PM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost
06/25/2012 05:41:46 PM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost:queue number queued items: 0
06/25/2012 05:41:46 PM - bigjob - DEBUG - start job: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost:jobs:sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7 data: {'Executable': '/bin/echo', 'NumberOfProcesses': '1', 'SPMDVariation': 'single', 'start_time': '1340664104.82', 'Environment': "['HELLOWORLD=hello_world']", 'state': 'Unknown', 'Arguments': "['$HELLOWORLD']", 'Error': 'stderr.txt', 'Output': 'stdout.txt', 'job-id': 'sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7', 'TotalCPUCount': '4'}
06/25/2012 05:41:46 PM - bigjob - DEBUG - set job state to: New
06/25/2012 05:41:46 PM - bigjob - DEBUG - Start job id sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7 specification {'Executable': '/bin/echo', 'NumberOfProcesses': '1', 'SPMDVariation': 'single', 'start_time': '1340664104.82', 'Environment': "['HELLOWORLD=hello_world']", 'state': 'New', 'Arguments': "['$HELLOWORLD']", 'Error': 'stderr.txt', 'Output': 'stdout.txt', 'job-id': 'sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7', 'TotalCPUCount': '4'}:
06/25/2012 05:41:46 PM - bigjob - DEBUG - Environment: ['HELLOWORLD=hello_world']
06/25/2012 05:41:46 PM - bigjob - DEBUG - Eval HELLOWORLD=hello_world
06/25/2012 05:41:46 PM - bigjob - DEBUG - export HELLOWORLD=hello_world;
06/25/2012 05:41:46 PM - bigjob - DEBUG - Expanded directory: /bin/echo to /bin/echo
06/25/2012 05:41:46 PM - bigjob - DEBUG - stdout: /home1/01539/pmantha/BigJob/examples/agent/bj-ef77aad0-bf16-11e1-aa15-f04da2005de7/sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7/stdout.txt stderr: /home1/01539/pmantha/BigJob/examples/agent/bj-ef77aad0-bf16-11e1-aa15-f04da2005de7/sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7/stderr.txt
06/25/2012 05:41:46 PM - bigjob - DEBUG - allocate: localhost
number nodes: 12 current busy nodes: [] free nodes: ['localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n']
06/25/2012 05:41:46 PM - bigjob - DEBUG - wrote machinefile: /home1/01539/pmantha/advert-launcher-machines-sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7 Nodes: ['localhost\n']
06/25/2012 05:41:46 PM - bigjob - DEBUG - execute: cd /home1/01539/pmantha/BigJob/examples/agent/bj-ef77aad0-bf16-11e1-aa15-f04da2005de7/sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7; export HELLOWORLD=hello_world; /bin/echo $HELLOWORLD in /home1/01539/pmantha/BigJob/examples/agent/bj-ef77aad0-bf16-11e1-aa15-f04da2005de7/sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7 from: login1.ls4.tacc.utexas.edu (Shell: /bin/bash)
06/25/2012 05:41:46 PM - bigjob - DEBUG - started cd /home1/01539/pmantha/BigJob/examples/agent/bj-ef77aad0-bf16-11e1-aa15-f04da2005de7/sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7; export HELLOWORLD=hello_world; /bin/echo $HELLOWORLD
06/25/2012 05:41:46 PM - bigjob - DEBUG - set job state to: Running
06/25/2012 05:41:48 PM - bigjob - DEBUG - Dequed:None
06/25/2012 05:41:48 PM - bigjob - DEBUG - Dequeue sub-job from:
06/25/2012 05:41:48 PM - bigjob - DEBUG - Dequeue sub-job from: :queue number queued items: 0
06/25/2012 05:41:50 PM - bigjob - DEBUG - Dequed:None
06/25/2012 05:41:51 PM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
06/25/2012 05:41:51 PM - bigjob - DEBUG - Pilot job entry: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost exists. Pilot job not in state stopped.
06/25/2012 05:41:51 PM - bigjob - DEBUG - Monitor jobs - # current jobs: 1
06/25/2012 05:41:51 PM - bigjob - DEBUG - Job: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost:jobs:sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7 Excutable: /bin/echo state: 0 return code: 0
06/25/2012 05:41:51 PM - bigjob - DEBUG - Job successful: Job: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost:jobs:sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7 Excutable: /bin/echo - set state to Done
06/25/2012 05:41:51 PM - bigjob - DEBUG - Create NO output.tar. Enable output.tar file creation in bigjob_agent.conf
06/25/2012 05:41:51 PM - bigjob - DEBUG - set job state to: Done
06/25/2012 05:41:51 PM - bigjob - DEBUG - Machine file: /home1/01539/pmantha/advert-launcher-machines-sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7
06/25/2012 05:41:51 PM - bigjob - DEBUG - Free nodes: ['localhost\n']
06/25/2012 05:41:51 PM - bigjob - DEBUG - free node: localhost
current busy nodes: ['localhost\n'] free nodes: ['localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n']
06/25/2012 05:41:51 PM - bigjob - DEBUG - Delete /home1/01539/pmantha/advert-launcher-machines-sj-f0bddbd0-bf16-11e1-aa15-f04da2005de7
06/25/2012 05:41:53 PM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
06/25/2012 05:41:53 PM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost
06/25/2012 05:41:53 PM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost:queue number queued items: 0
06/25/2012 05:41:53 PM - bigjob - DEBUG - Dequeued: ('bigjob:bj-ef77aad0-bf16-11e1-aa15-f04da2005de7:localhost:queue', 'STOP')
06/25/2012 05:41:53 PM - bigjob - DEBUG - Dequed:STOP
06/25/2012 05:41:53 PM - bigjob - DEBUG - Terminating Agent - Dequeue Sub-Jobs Thread
06/25/2012 05:41:56 PM - bigjob - DEBUG - Pilot State: {}
06/25/2012 05:41:56 PM - bigjob - DEBUG - Pilot job entry deleted - terminate agent
06/25/2012 05:41:56 PM - bigjob - DEBUG - Terminating Agent - Background Thread
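
The mismatch reported here can be checked mechanically against the agent log. The following is a minimal sketch, not part of BigJob: the helper name `parallelism_from_log` and the regular expression are assumptions based only on the shape of the "start job" lines shown above, and `sample_line` is a shortened stand-in for one of those lines with placeholder job IDs.

```python
import ast
import re

# Matches the dict that follows "data:" on a "start job" line of the agent log.
START_JOB = re.compile(r"start job: \S+ data: (\{.*\})\s*$")


def parallelism_from_log(line):
    """Return (NumberOfProcesses, TotalCPUCount) from a 'start job' line, or None."""
    match = START_JOB.search(line)
    if match is None:
        return None
    # The logged job description is printed as a Python dict literal.
    data = ast.literal_eval(match.group(1))
    return data.get("NumberOfProcesses"), data.get("TotalCPUCount")


# Shortened stand-in for the "start job" line in the log above (IDs are placeholders).
sample_line = (
    "06/25/2012 05:41:46 PM - bigjob - DEBUG - start job: "
    "bigjob:bj-x:localhost:jobs:sj-y data: {'Executable': '/bin/echo', "
    "'NumberOfProcesses': '1', 'SPMDVariation': 'single', 'TotalCPUCount': '4'}"
)
result = parallelism_from_log(sample_line)  # ('1', '4')
```

For the log above this gives NumberOfProcesses '1' and TotalCPUCount '4', although the script set jd.number_of_processes = "4".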

BigJob Doesn't Quit when it fails on agent directory

10/25/2012 10:15:14 AM - bigjob - DEBUG - BigJob working directory: ssh://[email protected]/N/u/melrom/agent/bj-c6478f36-1eb6-11e2-8eb7-a4badb0c3696
10/25/2012 10:15:15 AM - bigjob - DEBUG - Directory not found: /N/u/melrom/agent/bj-c6478f36-1eb6-11e2-8eb7-a4badb0c3696
10/25/2012 10:15:15 AM - bigjob - DEBUG - Create directory at: lonestar.tacc.utexas.edu
10/25/2012 10:15:16 AM - bigjob - ERROR - Error creating directory: /N/u/melrom/agent/bj-c6478f36-1eb6-11e2-8eb7-a4badb0c3696 at: lonestar.tacc.utexas.edu

If the agent directory cannot be created, the script still goes into the queue and reaches the Error qw (Eqw) state. This causes the script to run in an infinite loop. BigJob should quit after a failure to access the agent directory and notify the user that they have entered an invalid agent directory.
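
The requested fail-fast behaviour could look roughly like the sketch below. It is not BigJob's code: `AgentDirectoryError` and `ensure_agent_directory` are hypothetical names, and the sketch handles only a local path, whereas the log above shows a directory being created on a remote host over ssh.

```python
import os


class AgentDirectoryError(Exception):
    """Raised when the agent working directory cannot be created (hypothetical name)."""


def ensure_agent_directory(path):
    """Create `path` if it does not exist; raise AgentDirectoryError on failure."""
    try:
        if not os.path.isdir(path):
            os.makedirs(path)
    except OSError as exc:
        # Stop here with a clear message instead of submitting a job that cannot run.
        raise AgentDirectoryError(
            "Invalid agent directory {0!r}: {1}".format(path, exc))
    return path
```

A caller would invoke this before submitting the pilot job and let the exception end the run, so that the job never reaches the queue with an unusable working directory.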

compute_data_service.wait() doesn't wait until all units are completed.

Hi!

The wait function doesn't wait until all subjobs are completed. I am not sure whether advert entries are created for all subjobs.
BigJob version used - 0.4.46
Machine - sierra
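
For reference, the behaviour expected from wait() can be sketched as a polling loop over all submitted units. This is an illustration under assumptions, not the ComputeDataService implementation: `wait_for_all` and `_StubUnit` are hypothetical names, and the terminal states "Done" and "Failed" are the ones the example scripts in these issues poll for.

```python
import time

TERMINAL_STATES = ("Done", "Failed")  # states the example scripts treat as final


def wait_for_all(units, poll_interval=2.0, sleep=time.sleep):
    """Block until every unit reports a terminal state; return the final states.

    `units` is any iterable of objects that offer a get_state() method.
    """
    units = list(units)
    while True:
        states = [str(unit.get_state()) for unit in units]
        if all(state in TERMINAL_STATES for state in states):
            return states
        sleep(poll_interval)


class _StubUnit(object):
    """Stand-in for a compute unit that walks through a fixed list of states."""

    def __init__(self, states):
        self._states = list(states)

    def get_state(self):
        # Advance through the scripted states, then keep reporting the last one.
        if len(self._states) > 1:
            return self._states.pop(0)
        return self._states[0]


stub_units = [_StubUnit(["Running", "Done"]), _StubUnit(["New", "Running", "Failed"])]
final_states = wait_for_all(stub_units, sleep=lambda seconds: None)
```

The point of the sketch is that the loop returns only once every unit is final, not as soon as the first one finishes.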

Below is the example script.

(python)-bash-3.2$ cat example-pilot-api.py
import sys
import pdb
import os
import time
import logging
logging.basicConfig(level=logging.DEBUG)

sys.path.append(os.path.join(os.path.dirname(__file__), "../.."))
sys.path.insert(0, os.getcwd() + "/../")
from pilot import PilotComputeService, ComputeDataService, State

if __name__ == "__main__":

    pilot_compute_service = PilotComputeService()

    # create pilot job service and initiate a pilot job
    pilot_compute_description = {
                             "service_url": 'fork://localhost',
                             "number_of_processes": 32,
                             "processes_per_node":8,
                             "walltime":60,
                             "working_directory": "/N/u/pmantha/agent",
                             'affinity_datacenter_label': "eu-de-south",
                             'affinity_machine_label': "mymachine"
                            }

    pilot_compute_description2 = {
                             "service_url": 'pbs-ssh://india.futuregrid.org',
                             "number_of_processes": 32,
                             "processes_per_node":8,
                             "walltime":60,
                             "working_directory": "/N/u/pmantha/agent",
                             'affinity_datacenter_label': "eu-de-south",
                             'affinity_machine_label': "mymachine"
                            }

    pilotjob = pilot_compute_service.create_pilot(pilot_compute_description=pilot_compute_description)
    #pilotjob1 = pilot_compute_service.create_pilot(pilot_compute_description=pilot_compute_description2)

    compute_data_service = ComputeDataService()
    compute_data_service.add_pilot_compute_service(pilot_compute_service)

    # start work unit
    for i in range(0,10):
        compute_unit_description = {
            "executable": "/bin/hostname",
            "arguments": [""],
            "total_core_count": 1,
            "number_of_processes": 1,
            "output": "stdout.txt",
            "error": "stderr.txt",
        }
    compute_unit = compute_data_service.submit_compute_unit(compute_unit_description)


    compute_data_service.wait()

    logging.debug("Finished setup. Waiting for scheduling of CU")
    """while compute_unit != State.Done:
        logging.debug("Check state")

        state_cu = compute_unit.get_state()
        print "PCS State %s" % pilot_compute_service
        print "CU: %s State: %s"%(compute_unit, state_cu)
        if state_cu==State.Done:
            break
        time.sleep(2) """

    logging.debug("Terminate Pilot Compute and Compute Data Service")
    compute_data_service.cancel()
    pilot_compute_service.cancel()

(python)-bash-3.2$

Below is the agent log file

(python)-bash-3.2$ cat stderr-bigjob_agent.txt
04/07/2012 03:18:05 PM - bigjob - INFO - Loading BigJob version: 0.4.46
04/07/2012 03:18:05 PM - bigjob - DEBUG - Using SAGA C++/Python.
04/07/2012 03:18:05 PM - root - DEBUG - ['/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/setuptools-0.6c12dev_r88846-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/uuid-1.30-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/paramiko_on_pypi-1.7.6-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/furl-0.3-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/openssh_wrapper-0.2.2-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/pycrypto_on_pypi-2.3-py2.7-linux-x86_64.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/orderedmultidict-0.7-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/setuptools-0.6c12dev_r88846-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/pip-1.1-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/bliss-0.1.20-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/virtualenv-1.7.1.2-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/uuid-1.30-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/paramiko_on_pypi-1.7.6-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/furl-0.3-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/openssh_wrapper-0.2.2-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/pycrypto_on_pypi-2.3-py2.7-linux-x86_64.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/orderedmultidict-0.7-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages', '/N/u/pmantha/.bigjob/python/lib/python27.zip', 
'/N/u/pmantha/.bigjob/python/lib/python2.7', '/N/u/pmantha/.bigjob/python/lib/python2.7/plat-linux2', '/N/u/pmantha/.bigjob/python/lib/python2.7/lib-tk', '/N/u/pmantha/.bigjob/python/lib/python2.7/lib-old', '/N/u/pmantha/.bigjob/python/lib/python2.7/lib-dynload', '/N/u/pmantha/agent/bj-896426b0-80ff-11e1-8eed-002215124496/../', '', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/setuptools-0.6c12dev_r88846-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/uuid-1.30-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/paramiko_on_pypi-1.7.6-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/furl-0.3-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/openssh_wrapper-0.2.2-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/pycrypto_on_pypi-2.3-py2.7-linux-x86_64.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/orderedmultidict-0.7-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools_git-0.4.2-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/bliss-0.1.17-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.7-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/paramiko_on_pypi-1.7.6-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/openssh_wrapper-0.2-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/pycrypto_on_pypi-2.3-py2.7-linux-x86_64.egg', 
'/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/setuptools-0.6c12dev_r88846-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/pip-1.1-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/bliss-0.1.20-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/virtualenv-1.7.1.2-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/uuid-1.30-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/paramiko_on_pypi-1.7.6-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/furl-0.3-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/openssh_wrapper-0.2.2-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/pycrypto_on_pypi-2.3-py2.7-linux-x86_64.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/orderedmultidict-0.7-py2.7.egg', '/N/u/pmantha/BigJob', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.4.23-py2.7.egg', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages', '/N/soft/python/2.7/lib/python2.7/site-packages', '/N/u/pmantha/.bigjob/python/lib/python27.zip', '/N/u/pmantha/.bigjob/python/lib/python2.7', '/N/u/pmantha/.bigjob/python/lib/python2.7/plat-linux2', '/N/u/pmantha/.bigjob/python/lib/python2.7/lib-tk', '/N/u/pmantha/.bigjob/python/lib/python2.7/lib-old', '/N/u/pmantha/.bigjob/python/lib/python2.7/lib-dynload', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2', 
'/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk', '/N/u/pmantha/.local/lib/python2.7/site-packages', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages', '/N/u/pmantha/BigJob/bigjob', '/N/u/pmantha/BigJob/bigjob/../../ext/threadpool-1.2.7/src/']
04/07/2012 03:18:05 PM - bigjob - DEBUG - Python Version: sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
04/07/2012 03:18:05 PM - root - DEBUG - read configfile: /N/u/pmantha/BigJob/bigjob/../bigjob_agent.conf
/bin/sh: aprun: command not found
Intel compiler suite version 11.1/072 loaded
OpenMPI version 1.4.2 loaded
04/07/2012 03:18:06 PM - bigjob - DEBUG - aprun: False ssh: True Launch method: ssh
04/07/2012 03:18:06 PM - root - DEBUG - Launch Method: ssh mpi: mpirun shell: /bin/bash
04/07/2012 03:18:06 PM - bigjob - DEBUG - parsing ID out of URL: bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost
04/07/2012 03:18:06 PM - bigjob - DEBUG - BigJob Agent arguments: ['bigjob_agent.py', 'redis://[email protected]:6379', 'bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost']
04/07/2012 03:18:06 PM - bigjob - DEBUG - Initialize C&C subsystem to pilot-url: bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost
04/07/2012 03:18:06 PM - bigjob - DEBUG - BigJob ID: bj-896426b0-80ff-11e1-8eed-002215124496
04/07/2012 03:18:06 PM - root - DEBUG - ['/N/u/pmantha/BigJob/coordination/../ext/redis-2.4.9/', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/setuptools-0.6c12dev_r88846-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/uuid-1.30-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/paramiko_on_pypi-1.7.6-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/furl-0.3-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/openssh_wrapper-0.2.2-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/pycrypto_on_pypi-2.3-py2.7-linux-x86_64.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/orderedmultidict-0.7-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/setuptools-0.6c12dev_r88846-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/pip-1.1-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/bliss-0.1.20-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/virtualenv-1.7.1.2-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/uuid-1.30-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/paramiko_on_pypi-1.7.6-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/furl-0.3-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/openssh_wrapper-0.2.2-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/pycrypto_on_pypi-2.3-py2.7-linux-x86_64.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/orderedmultidict-0.7-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages', 
'/N/u/pmantha/.bigjob/python/lib/python27.zip', '/N/u/pmantha/.bigjob/python/lib/python2.7', '/N/u/pmantha/.bigjob/python/lib/python2.7/plat-linux2', '/N/u/pmantha/.bigjob/python/lib/python2.7/lib-tk', '/N/u/pmantha/.bigjob/python/lib/python2.7/lib-old', '/N/u/pmantha/.bigjob/python/lib/python2.7/lib-dynload', '/N/u/pmantha/agent/bj-896426b0-80ff-11e1-8eed-002215124496/../', '', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/setuptools-0.6c12dev_r88846-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/uuid-1.30-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/paramiko_on_pypi-1.7.6-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/furl-0.3-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/openssh_wrapper-0.2.2-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/pycrypto_on_pypi-2.3-py2.7-linux-x86_64.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/orderedmultidict-0.7-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools_git-0.4.2-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/bliss-0.1.17-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.7-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/paramiko_on_pypi-1.7.6-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/openssh_wrapper-0.2-py2.7.egg', 
'/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/pycrypto_on_pypi-2.3-py2.7-linux-x86_64.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/setuptools-0.6c12dev_r88846-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/pip-1.1-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/bliss-0.1.20-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/virtualenv-1.7.1.2-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/uuid-1.30-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/paramiko_on_pypi-1.7.6-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/furl-0.3-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/openssh_wrapper-0.2.2-py2.7.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/pycrypto_on_pypi-2.3-py2.7-linux-x86_64.egg', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages/orderedmultidict-0.7-py2.7.egg', '/N/u/pmantha/BigJob', '/N/u/pmantha/.bigjob/python/lib/python2.7/site-packages', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.4.23-py2.7.egg', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages', '/N/soft/python/2.7/lib/python2.7/site-packages', '/N/u/pmantha/.bigjob/python/lib/python27.zip', '/N/u/pmantha/.bigjob/python/lib/python2.7', '/N/u/pmantha/.bigjob/python/lib/python2.7/plat-linux2', '/N/u/pmantha/.bigjob/python/lib/python2.7/lib-tk', '/N/u/pmantha/.bigjob/python/lib/python2.7/lib-old', '/N/u/pmantha/.bigjob/python/lib/python2.7/lib-dynload', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7', 
'/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk', '/N/u/pmantha/.local/lib/python2.7/site-packages', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages', '/N/u/pmantha/BigJob/bigjob', '/N/u/pmantha/BigJob/bigjob/../../ext/threadpool-1.2.7/src/']
04/07/2012 03:18:06 PM - bigjob - DEBUG - Utilizing Redis Backend: redis://[email protected]:6379. Please make sure Redis server is configured in bigjob_coordination_redis.py
04/07/2012 03:18:06 PM - bigjob - DEBUG - Connect to Redis: gw68.quarry.iu.teragrid.org Port: 6379
04/07/2012 03:18:06 PM - bigjob - DEBUG - set state to : Running
04/07/2012 03:18:06 PM - bigjob - DEBUG - update state of pilot job to: Running
04/07/2012 03:18:07 PM - bigjob - DEBUG - ##################################### New POLL/MONITOR cycle ##################################
04/07/2012 03:18:07 PM - bigjob - DEBUG - Free nodes: 8 Busy Nodes: 0
04/07/2012 03:18:07 PM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/07/2012 03:18:07 PM - bigjob - DEBUG - Pilot job entry: bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost exists. Pilot job not in state stopped.
04/07/2012 03:18:07 PM - bigjob - DEBUG - Monitor jobs - # current jobs: 0
04/07/2012 03:18:07 PM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/07/2012 03:18:07 PM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost
04/07/2012 03:18:07 PM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost:queue number queued items: 1
04/07/2012 03:18:07 PM - bigjob - DEBUG - Dequeued: ('bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost:queue', 'bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost:jobs:sj-8c7e4d8a-80ff-11e1-92e4-002215124496')
04/07/2012 03:18:07 PM - bigjob - DEBUG - Dequed:bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost:jobs:sj-8c7e4d8a-80ff-11e1-92e4-002215124496
04/07/2012 03:18:07 PM - bigjob - DEBUG - Get job description
04/07/2012 03:18:07 PM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/07/2012 03:18:07 PM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost
04/07/2012 03:18:07 PM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost:queue number queued items: 0
04/07/2012 03:18:07 PM - bigjob - DEBUG - start job: bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost:jobs:sj-8c7e4d8a-80ff-11e1-92e4-002215124496 data: {'Executable': '/bin/hostname', 'NumberOfProcesses': '1', 'state': 'Unknown', 'Arguments': "['']", 'Error': 'stderr.txt', 'Output': 'stdout.txt', 'job-id': 'sj-8c7e4d8a-80ff-11e1-92e4-002215124496', 'SPMDVariation': 'single'}
04/07/2012 03:18:07 PM - bigjob - DEBUG - set job state to: New
04/07/2012 03:18:08 PM - bigjob - DEBUG - Start job id sj-8c7e4d8a-80ff-11e1-92e4-002215124496 specification {'Executable': '/bin/hostname', 'NumberOfProcesses': '1', 'state': 'New', 'Arguments': "['']", 'Error': 'stderr.txt', 'Output': 'stdout.txt', 'job-id': 'sj-8c7e4d8a-80ff-11e1-92e4-002215124496', 'SPMDVariation': 'single'}:
04/07/2012 03:18:08 PM - root - DEBUG - Sub-Job: sj-8c7e4d8a-80ff-11e1-92e4-002215124496, Working_directory: /N/u/pmantha/agent/bj-896426b0-80ff-11e1-8eed-002215124496/sj-8c7e4d8a-80ff-11e1-92e4-002215124496
04/07/2012 03:18:08 PM - bigjob - DEBUG - stdout: /N/u/pmantha/agent/bj-896426b0-80ff-11e1-8eed-002215124496/sj-8c7e4d8a-80ff-11e1-92e4-002215124496/stdout.txt stderr: /N/u/pmantha/agent/bj-896426b0-80ff-11e1-8eed-002215124496/sj-8c7e4d8a-80ff-11e1-92e4-002215124496/stderr.txt
04/07/2012 03:18:08 PM - bigjob - DEBUG - allocate: localhost
number nodes: 8 current busy nodes: [] free nodes: ['localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n', 'localhost\n']
04/07/2012 03:18:08 PM - bigjob - DEBUG - wrote machinefile: /N/u/pmantha/advert-launcher-machines-sj-8c7e4d8a-80ff-11e1-92e4-002215124496 Nodes: ['localhost\n']
04/07/2012 03:18:08 PM - bigjob - DEBUG - execute: cd /N/u/pmantha/agent/bj-896426b0-80ff-11e1-8eed-002215124496/sj-8c7e4d8a-80ff-11e1-92e4-002215124496; /bin/hostname in /N/u/pmantha/agent/bj-896426b0-80ff-11e1-8eed-002215124496/sj-8c7e4d8a-80ff-11e1-92e4-002215124496 from: s1 (Shell: /bin/bash)
04/07/2012 03:18:08 PM - bigjob - DEBUG - started cd /N/u/pmantha/agent/bj-896426b0-80ff-11e1-8eed-002215124496/sj-8c7e4d8a-80ff-11e1-92e4-002215124496; /bin/hostname
04/07/2012 03:18:08 PM - bigjob - DEBUG - set job state to: Running
04/07/2012 03:18:12 PM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/07/2012 03:18:12 PM - bigjob - DEBUG - Pilot job entry: bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost exists. Pilot job not in state stopped.
04/07/2012 03:18:12 PM - bigjob - DEBUG - Monitor jobs - # current jobs: 1
04/07/2012 03:18:12 PM - bigjob - DEBUG - Job: bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost:jobs:sj-8c7e4d8a-80ff-11e1-92e4-002215124496 Excutable: /bin/hostname state: 0 return code: 0
04/07/2012 03:18:12 PM - bigjob - DEBUG - Job successful: Job: bigjob:bj-896426b0-80ff-11e1-8eed-002215124496:localhost:jobs:sj-8c7e4d8a-80ff-11e1-92e4-002215124496 Excutable: /bin/hostname - set state to Done
04/07/2012 03:18:12 PM - bigjob - DEBUG - Update output file...
04/07/2012 03:18:12 PM - bigjob - DEBUG - Files: ['sj-8c7e4d8a-80ff-11e1-92e4-002215124496', 'stderr-bigjob_agent.txt', 'stdout-bigjob_agent.txt']
04/07/2012 03:18:12 PM - bigjob - DEBUG - set job state to: Done
(python)-bash-3.2$

faust.cct.lsu.edu still in easy_install

from Melissa: When I type easy_install bigjob, I get:

Reading http://faust.cct.lsu.edu/trac/bigjob
Download error: [Errno 111] Connection refused -- Some packages may
not be found!

This does not stop BigJob from installing, but the dead URL should be removed from the easy_install script.

Unable to control SPMDVariation during pilot_job submission

When I try to run BigJob via the Condor adaptor, I get the following error:

2012-01-04 01:24:49,684 - bigjob - DEBUG - Submit pilot job to: condor://localhost/
2012-01-04 01:24:49,691 - bigjob.server - ERROR - Exception: SAGA(BadParameter): condor_job: Problem launching condor job: (std::exception caught: SAGA(NotImplemented): condor_job: Condor adaptor does not support the 'SPMDVariation' attribute.

While this error comes from the condor adaptor (it doesn't support SPMDVariation), I can't find a way to unset jd.spmd_variation. It seems that it is set explicitly in bigjob/bigjob_manager.py:239.

Would it be possible to make this an option for the bigjob.start_pilot_job() method, give it a default value of "None", and not set it at all in that case?

Is SPMDVariation relevant at all during pilot_job submission, or is it just set "for completeness"? In that case, we could remove it completely.
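A minimal sketch of what the proposed behaviour could look like. `JobDescription` is a hypothetical stand-in for the SAGA job description and `build_pilot_jd` is an illustrative helper, not an existing BigJob function; only the idea of a `None` default that leaves the attribute unset comes from the request above.

```python
class JobDescription(object):
    """Hypothetical stand-in for a SAGA job description (not the real class)."""
    pass


def build_pilot_jd(executable, number_of_processes=1, spmd_variation=None):
    """Build a job description; leave SPMDVariation unset when it is None."""
    jd = JobDescription()
    jd.executable = executable
    jd.number_of_processes = str(number_of_processes)
    if spmd_variation is not None:
        # Adaptors that do not support SPMDVariation never see the attribute.
        jd.spmd_variation = spmd_variation
    return jd
```

With this shape, a Condor submission would simply omit the argument, while an MPI pilot would pass `spmd_variation="mpi"` explicitly.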

Ctrl+C Doesn't Seem To Work

I see BigJob 'hanging' at this point:

...
Finished Pilot-Job setup. Submitting compute units
Waiting for compute units to complete

In a different window, I watch the PBS queue status. Even though the pilot job has finished, the BigJob 'master' hangs at that point and won't progress to 'finish' state. When I hit CTRL-C, I get the following exception:

Traceback (most recent call last):
 File "bigjob-example.py", line 51, in <module>
   compute_data_service.wait()
 File "/N/u/oweidner/software/bigjob-bliss/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg/pilot/impl/pilot_manager.py",     line 216, in wait
   i.wait()     
 File "/N/u/oweidner/software/bigjob-bliss/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg/pilot/impl/pilot_manager.py", line 405, in wait
   time.sleep(2) 
KeyboardInterrupt

But BigJob keeps on hanging and I can't get back to the command prompt. Does BigJob use threads? It looks to me like a threading issue where threads are not properly joined.

The only way to get my shell back is to log in via another SSH session and run killall python.
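If the hang is indeed caused by background threads that are never joined, as suspected above, one possible pattern is sketched below. It is an assumption about a fix, not BigJob's actual code: `wait_for_units` and `start_worker` are hypothetical names, and the 2-second poll interval mirrors the `time.sleep(2)` in the traceback.

```python
import threading


def wait_for_units(is_done, stop_event, poll_interval=2.0):
    """Poll until is_done() returns True or stop_event is set.

    Returns True if the units finished, False if the wait was interrupted.
    """
    try:
        while not is_done():
            # Event.wait sleeps like time.sleep but wakes early once the event is set.
            if stop_event.wait(poll_interval):
                return False
        return True
    except KeyboardInterrupt:
        # Signal background threads to exit; the caller can then join them.
        stop_event.set()
        return False


def start_worker(target, stop_event):
    """Start a background thread as a daemon so it cannot keep the process alive."""
    worker = threading.Thread(target=target, args=(stop_event,))
    worker.daemon = True
    worker.start()
    return worker
```

Worker loops would check `stop_event.is_set()` on each iteration, so Ctrl+C in the main thread ends both the wait and the workers.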

BigJob agent launch fails when there is no BigJob installation on the remote machine

Hi!

I tried to launch BigJob from eric to oliver. eric has the latest BigJob version (0.4.61) and oliver doesn't have any BigJob installation.

The agent failed with the error below.

BigJob agent output file:

New python executable in /home/pmantha/.bigjob/python/bin/python
Installing setuptools............................done.
Installing pip.....................done.
Searching for bigjob
Reading http://pypi.python.org/simple/bigjob/
Reading http://faust.cct.lsu.edu/trac/bigjob
Download error: [Errno 111] Connection refused -- Some packages may not be found!
Reading https://github.com/saga-project/BigJob
Best match: BigJob 0.4.61
Downloading http://pypi.python.org/packages/source/B/BigJob/BigJob-0.4.61.tar.gz#md5=2ac0ddb5ad4f7d34d6267052772cb1f7
Processing BigJob-0.4.61.tar.gz
Running BigJob-0.4.61/setup.py -q bdist_egg --dist-dir /tmp/easy_install-nU5TDe/BigJob-0.4.61/egg-dist-tmp-R6KS1J
warning: no files found matching '*' under directory 'package'
no previously-included directories found matching 'doc'
zip_safe flag not set; analyzing archive contents...
bigjob_dynamic.many_job: module references file
bigjob_dynamic.many_job_affinity: module references file
coordination.bigjob_coordination_zmq: module references file
coordination.bigjob_coordination_advert: module references file
coordination.bigjob_coordination_redis: module references file
bigjob.init: module references file
bigjob.bigjob_agent_condor: module references file
bigjob.bigjob_manager: module references file
bigjob.bigjob_agent: module references file
examples.example_local_multiple_reconnect: module references file
examples.example_local_single_filestaging: module references file
examples.example_single_filestaging_globusonline: module references file
pilot.impl.pilotdata_manager: module references file
pilot.filemanagement.globusonline_adaptor: module references file
pilot.filemanagement.ssh_adaptor: module references file
pilot.filemanagement.webhdfs_adaptor: module references file
bootstrap.bigjob-bootstrap: module references file
bootstrap.bigjob-bootstrap: module references path
Adding BigJob 0.4.61 to easy-install.pth file
Installing test-bigjob-dynamic script to /home/pmantha/.bigjob/python/bin
Installing test-bigjob script to /home/pmantha/.bigjob/python/bin

Installed /home/pmantha/.bigjob/python/lib/python2.6/site-packages/BigJob-0.4.61-py2.6.egg
Processing dependencies for bigjob
Searching for bliss
Reading http://pypi.python.org/simple/bliss/
Reading http://oweidner.github.com/bliss/
Reading http://saga-project.github.com/bliss/
Best match: bliss 0.1.20
Downloading http://pypi.python.org/packages/source/b/bliss/bliss-0.1.20.tar.gz#md5=f2f2e376b75b9750e77ce3cede488c18
Processing bliss-0.1.20.tar.gz
Running bliss-0.1.20/setup.py -q bdist_egg --dist-dir /tmp/easy_install-sl1o39/bliss-0.1.20/egg-dist-tmp-RvyTw2
zip_safe flag not set; analyzing archive contents...
bliss.init: module references file
Adding bliss 0.1.20 to easy-install.pth file

Installed /home/pmantha/.bigjob/python/lib/python2.6/site-packages/bliss-0.1.20-py2.6.egg
Searching for redis==2.2.4
Reading http://pypi.python.org/simple/redis/
Reading http://github.com/andymccurdy/redis-py
Best match: redis 2.2.4
Downloading http://cloud.github.com/downloads/andymccurdy/redis-py/redis-2.2.4.tar.gz
Python path: ['/home/pmantha/agent/../', '', '/usr/local/packages/saga/1.5.3/gcc-4.3.2/lib/python2.6.4/site-packages', '/usr/local/packages/python/2.6.4/gcc-4.3.2/lib/python26.zip', '/usr/local/packages/python/2.6.4/gcc-4.3.2/lib/python2.6', '/usr/local/packages/python/2.6.4/gcc-4.3.2/lib/python2.6/plat-linux2', '/usr/local/packages/python/2.6.4/gcc-4.3.2/lib/python2.6/lib-tk', '/usr/local/packages/python/2.6.4/gcc-4.3.2/lib/python2.6/lib-old', '/usr/local/packages/python/2.6.4/gcc-4.3.2/lib/python2.6/lib-dynload', '/usr/local/packages/python/2.6.4/gcc-4.3.2/lib/python2.6/site-packages']
Python version: (2, 6, 4, 'final', 0)
BigJob not installed. Attempt to install it.
Execute: python /home/pmantha/.bigjob/bigjob-bootstrap.py /home/pmantha/.bigjob/python/
Logging level: logging.DEBUG

Set logging level: logging.DEBUG/10

Running PBS epilogue script - 23-Apr-2012 21:15:36

BigJob agent error file:

[pmantha@oliver1 agent]$ cat stderr-bj-14d79a8c-8db3-11e1-a0cf-0060dd46c5e2-agent.txt
error: Download error for http://cloud.github.com/downloads/andymccurdy/redis-py/redis-2.2.4.tar.gz: [Errno 113] No route to host
04/23/2012 09:15:35 PM - bigjob - INFO - Loading BigJob version: 0.4.61 on oliver033
04/23/2012 09:15:35 PM - bigjob - DEBUG - Using SAGA C++/Python.
Traceback (most recent call last):
File "", line 46, in
File "/home/pmantha/.bigjob/python/lib/python2.6/site-packages/BigJob-0.4.61-py2.6.egg/bigjob/bigjob_agent.py", line 28, in
from threadpool import *
ImportError: No module named threadpool
[pmantha@oliver1 agent]$

Pilot-API doesn't consider environment and spmd_variation while creating job description.

The Pilot-API doesn't consider environment and spmd_variation when creating the job description. In the code below, spmd_variation is hard-wired to "single" and environment is never copied.

pilot_manager.py:

def __translate_cu_sj_description(self, compute_unit_description):
    jd = saga.job.description()
    if compute_unit_description.has_key("executable"):
        jd.executable = compute_unit_description["executable"]
    jd.spmd_variation = "single"
    if compute_unit_description.has_key("arguments"):
        jd.arguments = compute_unit_description["arguments"]

    if compute_unit_description.has_key("number_of_processes"):
        jd.number_of_processes=str(compute_unit_description["number_of_processes"])
    else:
        jd.number_of_processes="1"

    if compute_unit_description.has_key("working_directory"):
        jd.working_directory = compute_unit_description["working_directory"]
    if compute_unit_description.has_key("output"):
        jd.output =  compute_unit_description["output"]
    if compute_unit_description.has_key("error"):
        jd.error = compute_unit_description["error"]
    if compute_unit_description.has_key("file_transfer"):
        jd.file_transfer=compute_unit_description["file_transfer"]
    return jd
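A sketch of how the translation could pass both attributes through. `JobDescription` is a hypothetical stand-in for `saga.job.description`, and `translate_cu_description` and `PASS_THROUGH_KEYS` are illustrative names; the defaults ("single" and "1") are the values the function above already uses.

```python
class JobDescription(object):
    """Hypothetical stand-in for saga.job.description, for illustration only."""
    pass


# Compute-unit keys that are copied through unchanged when present.
PASS_THROUGH_KEYS = ("executable", "arguments", "environment",
                     "working_directory", "output", "error", "file_transfer")


def translate_cu_description(cu_description):
    """Copy a compute unit description (a dict) into a job description."""
    jd = JobDescription()
    for key in PASS_THROUGH_KEYS:
        if key in cu_description:
            setattr(jd, key, cu_description[key])
    # Fall back to the values the current code hard-wires.
    jd.spmd_variation = cu_description.get("spmd_variation", "single")
    jd.number_of_processes = str(cu_description.get("number_of_processes", 1))
    return jd
```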

Example Ambiguities

It would help if someone went through the examples one by one and made sure they are consistent. When I browse through the examples, I often come across things that are not clear. For example:

https://github.com/saga-project/BigJob/blob/master/examples/pilot-api/example-pilot-compute-direct.py

           
"total_core_count": 1,
"number_of_processes": 1,

Why is it using both?

In other examples:


        advert://advert.cct.lsu.edu:8080 (SAGA/Advert POSTGRESQL)
        advert://advert.cct.lsu.edu:5432 (SAGA/Advert POSTGRESQL)
        tcp://localhost (ZMQ)
        tcp://* (ZMQ - listening to all interfaces)

Do we really still support ZMQ or advert?


"working_directory": os.path.join(os.getcwd(),"work"),

Sometimes it is "agent", sometimes "work". Also, os.getcwd() should really be replaced in all examples with something like $HOME/agent or os.getenv("HOME")+"/agent"; otherwise the script might fail if it is not started in the right directory.

There are many more things that will confuse someone who is just getting started with BigJob. In terms of examples, I really think that less is more. Every line of code that can be omitted will make the example easier to understand.
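As a sketch of the working-directory suggestion, an example could derive the path from the home directory instead of the current directory. `default_working_directory` is a hypothetical helper and the description keys are just the ones quoted above; `os.path.expanduser("~")` is used here as one alternative to reading `HOME` directly.

```python
import os


def default_working_directory(name="agent"):
    """Return <home>/<name>, independent of the directory the script starts in."""
    return os.path.join(os.path.expanduser("~"), name)


compute_unit_description = {
    "executable": "/bin/date",
    "number_of_processes": 1,
    "working_directory": default_working_directory(),
}
```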

Pilot-API doesn't schedule CUs in parallel between multiple resources

I have the following scenario. I want to run:
250 subjobs on india
250 subjobs on sierra
250 subjobs on hotel
250 subjobs on alamo

When all these jobs are submitted via a Pilot-API script (https://github.com/saga-project/experiments-SC12/blob/master/scripts/generic/bfast_perf.py), the CUs are scheduled sequentially, one resource at a time. (I observed the order in which the subjob directories were created.)

That is, 250 CUs (subjob directories) are created on india (none are created on sierra, hotel, or alamo),
then 250 CUs (subjob directories) are created on sierra (none are created on hotel or alamo),
then 250 CUs (subjob directories) are created on hotel (none are created on alamo).

Performance could be improved if the CUs were scheduled in parallel across the resources.
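One simple way to avoid filling one resource before touching the next is to interleave submissions round-robin. The sketch below only computes the submission order; `interleave` is a hypothetical helper, not part of the Pilot-API, and a real fix might instead submit to each resource from its own thread.

```python
from itertools import zip_longest

_MISSING = object()


def interleave(units_by_resource):
    """Yield (resource, unit) pairs round-robin across resources."""
    resources = list(units_by_resource)
    columns = [units_by_resource[r] for r in resources]
    for row in zip_longest(*columns, fillvalue=_MISSING):
        for resource, unit in zip(resources, row):
            if unit is not _MISSING:
                yield resource, unit
```

With four resources and 250 units each, the first four submissions would go to india, sierra, hotel, and alamo in turn, so all pilots start receiving work immediately.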

Environment variables not added to the final command executed on Kraken

I tried submitting a job from Ranger to Kraken using the xt5torque+gsissh adaptor. The environment variables are not added to the final command that gets executed. Judging from the logs below, I don't see this as a Bliss problem; the launch method seems to be causing it. Could you please check?

10/20/2012 10:16:44 PM - bigjob - DEBUG - aprun: False ssh: False
Launch method: local
10/20/2012 10:16:44 PM - bigjob - DEBUG - host: aprun8
nodes: 2

10/20/2012 10:16:45 PM - bigjob - DEBUG - Start job id
sj-5c8d6870-1b25-11e2-9ac5-00144f785784 specification {'Executable':
'/bin/echo', 'NumberOfProcesses': '2', 'start_time': '1350785805.63',
'Environment': "['ENV1=env_arg1', 'ENV2=env_arg2']", 'state': 'New',
'Arguments': "['Hello', '$ENV1', '$ENV2']", 'Error': 'stderr.txt',
'Output': 'stdout.txt', 'job-id':
'sj-5c8d6870-1b25-11e2-9ac5-00144f785784', 'SPMDVariation': 'mpi'}:

10/20/2012 10:16:45 PM - bigjob - DEBUG - Environment:
['ENV1=env_arg1', 'ENV2=env_arg2']
10/20/2012 10:16:45 PM - bigjob - DEBUG - Eval ENV1=env_arg1
10/20/2012 10:16:45 PM - bigjob - DEBUG - export ENV1=env_arg1;
10/20/2012 10:16:45 PM - bigjob - DEBUG - Eval ENV2=env_arg2
10/20/2012 10:16:45 PM - bigjob - DEBUG - export ENV1=env_arg1; export
ENV2=env_arg2;
10/20/2012 10:16:45 PM - bigjob - DEBUG - Expanded directory:
/bin/echo to /bin/echo
10/20/2012 10:16:45 PM - bigjob - DEBUG - stdout:
/nics/b/home/pmantha/agent/bj-f87c5a8a-1b24-11e2-8b61-00144f785784/sj-5c8d6870-1b25-11e2-9ac5-00144f785784/stdout.txt
stderr: /nics/b/home/pmantha/agent/bj-f87c5a8a-1b24-11e2-8b61-00144f785784/sj-5c8d6870-1b25-11e2-9ac5-00144f785784/stderr.txt
10/20/2012 10:16:45 PM - bigjob - DEBUG - allocate: aprun8
number nodes: 2 current busy nodes: [] free nodes: ['aprun8\n', 'aprun8\n']
10/20/2012 10:16:45 PM - bigjob - DEBUG - wrote machinefile:
/nics/b/home/pmantha/advert-launcher-machines-sj-5c8d6870-1b25-11e2-9ac5-00144f785784
Nodes: ['aprun8\n', 'aprun8\n']
10/20/2012 10:16:45 PM - bigjob - DEBUG - execute: cd
/nics/b/home/pmantha/agent/bj-f87c5a8a-1b24-11e2-8b61-00144f785784/sj-5c8d6870-1b25-11e2-9ac5-00144f785784;
/bin/echo Hello $ENV1 $ENV2 in
/nics/b/home/pmantha/agent/bj-f87c5a8a-1b24-11e2-8b61-00144f785784/sj-5c8d6870-1b25-11e2-9ac5-00144f785784
from: aprun8 (Shell: /bin/bash)
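The log shows the `export ...;` prefix being assembled, yet the executed command does not contain it. A minimal sketch of joining the two, assuming a plain shell launch; `build_command` is a hypothetical helper, not the agent's actual code, and it does not quote values that contain spaces.

```python
def build_command(executable, arguments, environment):
    """Return a shell command with 'export NAME=value; ' prepended per entry."""
    exports = "".join("export %s; " % entry for entry in environment)
    return exports + " ".join([executable] + list(arguments))
```

For the job in the trace above, this would yield `export ENV1=env_arg1; export ENV2=env_arg2; /bin/echo Hello $ENV1 $ENV2`.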

URL parsing problem

I see the problem below when COORDINATION_URL="redis://[email protected]:6379" is used.

(mypython_repo)login2$ python example_local_single.py
Start Pilot Job/BigJob at: fork://localhost
Traceback (most recent call last):
File "/home1/01539/pmantha/BigJob/examples/../bigjob/bigjob_manager.py", line 670, in __parse_url
if query.endswith("/"):
AttributeError: 'NoneType' object has no attribute 'endswith'
Pilot Job/BigJob URL: bigjob:bj-08660808-c597-11e1-8c33-f04da2004b3c:localhost State: Unknown
state: Unknown
state: Running
state: Running
state: Running
state: Done
(mypython_repo)login2$
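The traceback points at `query.endswith("/")` being called while `query` is `None`, which happens when the URL has no query part. A guarded version could look like the sketch below; `normalize_query` is a hypothetical helper, and what `__parse_url` does with the result afterwards is not shown in the report.

```python
def normalize_query(query):
    """Strip one trailing '/' from a URL query string; treat None as empty."""
    if not query:
        return ""
    if query.endswith("/"):
        return query[:-1]
    return query
```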

Pilot-api fails to cancel pj's - BJ version 0.4.64 and Bliss 0.2.0

Here is the output trace of the example execution.

I submitted 4 pilots using the Pilot-API and Bliss.

  1. ssh://localhost - worked and terminated automatically
  2. pbs://localhost - queued and not terminated automatically.
  3. pbs+ssh://sierra.futuregrid.org - worked and terminated automatically
  4. pbs+ssh://alamo.futuregrid.org - worked and terminated automatically

The threading error below needs to be handled properly.

05/16/2012 03:58:21 PM - bigjob - DEBUG - ### END WAIT ###
05/16/2012 03:58:21 PM - bigjob - DEBUG - Cancel Pilot Job
05/16/2012 03:58:21 PM - bigjob - DEBUG - stop pilot job: bigjob:bj-7ab51404-9f99-11e1-adff-02215ecdd007:localhost
05/16/2012 03:58:21 PM - bigjob - DEBUG - update state of pilot job to: Done stopped: True
05/16/2012 03:58:21 PM - bigjob - DEBUG - delete pilot job: bigjob:bj-7ab51404-9f99-11e1-adff-02215ecdd007:localhost
05/16/2012 03:58:21 PM - bigjob - DEBUG - Cancel Pilot Job finished
05/16/2012 03:58:21 PM - bigjob - DEBUG - Cancel Pilot Job
05/16/2012 03:58:21 PM - bigjob - DEBUG - stop pilot job: bigjob:bj-7b52bb64-9f99-11e1-adff-02215ecdd007:localhost
05/16/2012 03:58:21 PM - bigjob - DEBUG - update state of pilot job to: Done stopped: True
05/16/2012 03:58:21 PM - bigjob - DEBUG - delete pilot job: bigjob:bj-7b52bb64-9f99-11e1-adff-02215ecdd007:localhost
05/16/2012 03:58:21 PM - bigjob - DEBUG - Cancel Pilot Job finished
05/16/2012 03:58:21 PM - bigjob - DEBUG - Cancel Pilot Job
05/16/2012 03:58:21 PM - bigjob - DEBUG - stop pilot job: bigjob:bj-7be6714c-9f99-11e1-adff-02215ecdd007:sierra.futuregrid.org
05/16/2012 03:58:21 PM - bigjob - DEBUG - update state of pilot job to: Done stopped: True
05/16/2012 03:58:21 PM - bigjob - DEBUG - delete pilot job: bigjob:bj-7be6714c-9f99-11e1-adff-02215ecdd007:sierra.futuregrid.org
05/16/2012 03:58:21 PM - bigjob - DEBUG - Cancel Pilot Job finished
05/16/2012 03:58:21 PM - bigjob - DEBUG - Cancel Pilot Job
05/16/2012 03:58:21 PM - bigjob - DEBUG - stop pilot job: bigjob:bj-8d7afdd8-9f99-11e1-adff-02215ecdd007:alamo.futuregrid.org
05/16/2012 03:58:21 PM - bigjob - DEBUG - update state of pilot job to: Done stopped: True
05/16/2012 03:58:21 PM - bigjob - DEBUG - delete pilot job: bigjob:bj-8d7afdd8-9f99-11e1-adff-02215ecdd007:alamo.futuregrid.org
05/16/2012 03:58:21 PM - bigjob - DEBUG - Cancel Pilot Job finished
05/16/2012 03:58:27 PM - bigjob - DEBUG - Re-Scheduler terminated
Exception in thread Thread-2 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/soft/python/gnu-4.1/2.7/lib/python2.7/threading.py", line 530, in __bootstrap_inner
File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1574, in run
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'error'
Exception in thread Thread-6 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/soft/python/gnu-4.1/2.7/lib/python2.7/threading.py", line 530, in __bootstrap_inner
File "build/bdist.linux-x86_64/egg/paramiko/transport.py", line 1574, in run
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'error'
(mypython_repo)[pmantha@login2 futuregrid]$

BigJob 'swallows' Bliss log message

If I export SAGA_VERBOSE=5

and then run

BIGJOB_VERBOSE=100 python bigjob-example.py

I only see BigJob log messages and no Bliss log messages.

Logging of BigJob/Pilot-API: Include Runtime statistics and information

Several modifications need to be made to the Pilot-API/BigJob logging system.

The current BigJob logging is more useful to developers and testers than to new users. Users get very little information from the output trace.

The logging hierarchy should look something like this, with a different handler used at each level:

  1. Application level (the user is responsible)
    • Users use this for their own information.
  2. Pilot-API level (Pilot-API logging handler should be used)
    • Pilot-API should print useful information such as:
      Number of PJs created.
      Number of compute units created.
      Time taken for creation.

Any information that the user doesn't need to know should be a debug message.

  3. BigJob level (BigJob logging handler should be used)
    • BigJob queue wait time.
    • Number of subjobs associated with BigJob and their statuses.
    • Average run time of all jobs.
  4. BigData level (BigData logging handler should be used)
    • How many files were moved.
    • File movement information (such as source, destination, working directory) and associated compute information.
    • Amount of data moved, in MB.
  5. Bliss level (Bliss logging handler should be used)

All these messages should be printed at the logging.INFO level.
Users can vary the log messages using environment variables such as PILOT_VERBOSE and SAGA_VERBOSE.
Currently, I think the logging at one level affects logging at another level.
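A sketch of per-layer verbosity driven by environment variables. PILOT_VERBOSE and SAGA_VERBOSE are named in the proposal and BIGJOB_VERBOSE appears in another report; the logger names and the reading of the values as Python numeric logging levels are assumptions made for illustration.

```python
import logging
import os

# Logger name -> environment variable controlling its level.
# The logger names are assumptions; only the variable names come from the reports.
LEVEL_VARIABLES = {
    "pilot": "PILOT_VERBOSE",
    "bigjob": "BIGJOB_VERBOSE",
    "saga": "SAGA_VERBOSE",
}


def configure_levels(environ=None, default=logging.INFO):
    """Set each layer's level from its own variable, defaulting to INFO."""
    environ = os.environ if environ is None else environ
    levels = {}
    for name, variable in LEVEL_VARIABLES.items():
        raw = environ.get(variable)
        level = int(raw) if raw and raw.isdigit() else default
        logging.getLogger(name).setLevel(level)
        levels[name] = level
    return levels
```

Because every layer reads its own variable, raising the verbosity of one layer leaves the others at their defaults.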

Tutorial issue: Pilot-API not cancelling pilot-jobs.

In pilotcompute_manager.py, cancel() is not cancelling all pilots:

def cancel(self):
    """ Cancel the PilotJobService.

        This also cancels all the PilotJobs that were under control of this PJS.

        Keyword arguments:
        None

        Return value:
        Result of operation
    """
    pass
    #self.__mjs.cancel()
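A minimal sketch of a `cancel()` that actually walks the pilots. The class below is a simplified, hypothetical stand-in: the `pilot_computes` list and the assumption that each pilot exposes its own `cancel()` are illustrative and not taken from the BigJob source.

```python
class PilotComputeService(object):
    """Hypothetical, simplified service that tracks the pilots it created."""

    def __init__(self):
        self.pilot_computes = []

    def cancel(self):
        """Cancel every pilot under control of this service.

        Returns the pilots whose cancel() raised, so one failure does not
        stop the remaining pilots from being cancelled.
        """
        failed = []
        for pilot in self.pilot_computes:
            try:
                pilot.cancel()
            except Exception:
                failed.append(pilot)
        return failed
```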

number_of_processes not recognized in case of Bliss XT5torque plugin

Hi!

I used the pilot and compute unit descriptions below to submit a pilot and CUs on Kraken, but number_of_processes is always 1. The output trace follows; could you please check this?

pilot_compute_description.append({ "service_url": 'xt5torque://localhost',
                                   "number_of_processes": 12,
                                   "walltime": 10,
                                   "processes_per_node": 1,
                                   "queue": "small",
                                   "allocation": "TG-MCB100111",
                                   "working_directory": "/lustre/scratch/rmukherj/agent/",
                                 })

compute_unit_description = {
    "executable": "$HOME/NAMD",
    "arguments": ["dyn1.conf"],
    "total_cpu_count": 12,
    "number_of_processes": 12,
    "spmd_variation": "mpi",
    "working_directory": "$HOME/WORK/floer-sims-Na/"+chrom_no+dir_no,
    "output": "dyn1.out",
    "error": "dyn1.err"
}

08/07/2012 10:47:05 AM - bigjob - DEBUG - Submit pilot job to: xt5torque://localhost
08/07/2012 10:47:06 AM - bigjob - DEBUG - Create PilotCompute for BigJob: bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost
08/07/2012 10:47:06 AM - bigjob - DEBUG - nocoord://localhost/bigjob/
08/07/2012 10:47:06 AM - bigjob - DEBUG - CDS URL: nocoord://localhost/bigjob/pilot/cds/cds-c22ad72e-e09e-11e1-825e-00003e980000
08/07/2012 10:47:06 AM - bigjob - DEBUG - Create CDS directory at nocoord://localhost/bigjob/pilot/cds/cds-c22ad72e-e09e-11e1-825e-00003e980000?dbtype=sqlite3
08/07/2012 10:47:06 AM - bigjob - DEBUG - Scheduler Thread: <class 'pilot.impl.pilot_manager.ComputeDataService'> Pilot Data
08/07/2012 10:47:06 AM - bigjob - DEBUG - Created CU: nocoord://localhost/bigjob/pilot/cds/cds-c22ad72e-e09e-11e1-825e-00003e980000/cu-c22b05a0-e09e-11e1-825e-00003e980000
Waiting for compute units to complete
08/07/2012 10:47:06 AM - bigjob - DEBUG - ### START WAIT ###
08/07/2012 10:47:07 AM - bigjob - DEBUG - Scheduler Thread: <class 'pilot.impl.pilot_manager.ComputeDataService'> Pilot Job
08/07/2012 10:47:07 AM - bigjob - DEBUG - Schedule CU
08/07/2012 10:47:07 AM - bigjob - DEBUG - __update_scheduler_resources
08/07/2012 10:47:07 AM - bigjob - DEBUG - Pilot-Jobs: [bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost]
08/07/2012 10:47:07 AM - bigjob - DEBUG - Schedule to PJ - # Avail PJs: 1
08/07/2012 10:47:07 AM - bigjob - DEBUG - BJ: bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost State: Unknown
08/07/2012 10:47:07 AM - bigjob - DEBUG - Candidate PJs: []
08/07/2012 10:47:07 AM - bigjob - DEBUG - No resource found.
08/07/2012 10:47:12 AM - bigjob - DEBUG - Scheduler Thread: <class 'pilot.impl.pilot_manager.ComputeDataService'> Pilot Data
08/07/2012 10:47:13 AM - bigjob - DEBUG - Scheduler Thread: <class 'pilot.impl.pilot_manager.ComputeDataService'> Pilot Job
08/07/2012 10:47:13 AM - bigjob - DEBUG - Schedule CU
08/07/2012 10:47:13 AM - bigjob - DEBUG - __update_scheduler_resources
08/07/2012 10:47:13 AM - bigjob - DEBUG - Pilot-Jobs: [bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost]
08/07/2012 10:47:13 AM - bigjob - DEBUG - Schedule to PJ - # Avail PJs: 1
08/07/2012 10:47:13 AM - bigjob - DEBUG - BJ: bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost State: Unknown
08/07/2012 10:47:13 AM - bigjob - DEBUG - Candidate PJs: []
08/07/2012 10:47:13 AM - bigjob - DEBUG - No resource found.
08/07/2012 10:47:18 AM - bigjob - DEBUG - Scheduler Thread: <class 'pilot.impl.pilot_manager.ComputeDataService'> Pilot Data
08/07/2012 10:47:19 AM - bigjob - DEBUG - Scheduler Thread: <class 'pilot.impl.pilot_manager.ComputeDataService'> Pilot Job
08/07/2012 10:47:19 AM - bigjob - DEBUG - Schedule CU
08/07/2012 10:47:19 AM - bigjob - DEBUG - __update_scheduler_resources
08/07/2012 10:47:19 AM - bigjob - DEBUG - Pilot-Jobs: [bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost]
08/07/2012 10:47:19 AM - bigjob - DEBUG - Schedule to PJ - # Avail PJs: 1
08/07/2012 10:47:19 AM - bigjob - DEBUG - BJ: bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost State: Unknown
08/07/2012 10:47:19 AM - bigjob - DEBUG - Candidate PJs: []
08/07/2012 10:47:19 AM - bigjob - DEBUG - No resource found.
08/07/2012 10:47:24 AM - bigjob - DEBUG - Scheduler Thread: <class 'pilot.impl.pilot_manager.ComputeDataService'> Pilot Data
08/07/2012 10:47:25 AM - bigjob - DEBUG - Scheduler Thread: <class 'pilot.impl.pilot_manager.ComputeDataService'> Pilot Job
08/07/2012 10:47:25 AM - bigjob - DEBUG - Schedule CU
08/07/2012 10:47:25 AM - bigjob - DEBUG - __update_scheduler_resources
08/07/2012 10:47:25 AM - bigjob - DEBUG - Pilot-Jobs: [bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost]
08/07/2012 10:47:25 AM - bigjob - DEBUG - Schedule to PJ - # Avail PJs: 1
08/07/2012 10:47:25 AM - bigjob - DEBUG - BJ: bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost State: Unknown
08/07/2012 10:47:25 AM - bigjob - DEBUG - Candidate PJs: []
08/07/2012 10:47:25 AM - bigjob - DEBUG - No resource found.
08/07/2012 10:47:30 AM - bigjob - DEBUG - Scheduler Thread: <class 'pilot.impl.pilot_manager.ComputeDataService'> Pilot Data
08/07/2012 10:47:31 AM - bigjob - DEBUG - Scheduler Thread: <class 'pilot.impl.pilot_manager.ComputeDataService'> Pilot Job
08/07/2012 10:47:31 AM - bigjob - DEBUG - Schedule CU
08/07/2012 10:47:31 AM - bigjob - DEBUG - __update_scheduler_resources
08/07/2012 10:47:31 AM - bigjob - DEBUG - Pilot-Jobs: [bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost]
08/07/2012 10:47:31 AM - bigjob - DEBUG - Schedule to PJ - # Avail PJs: 1
08/07/2012 10:47:31 AM - bigjob - DEBUG - BJ: bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost State: Unknown
08/07/2012 10:47:31 AM - bigjob - DEBUG - Candidate PJs: []
08/07/2012 10:47:31 AM - bigjob - DEBUG - No resource found.
08/07/2012 10:47:36 AM - bigjob - DEBUG - Scheduler Thread: <class 'pilot.impl.pilot_manager.ComputeDataService'> Pilot Data
08/07/2012 10:47:37 AM - bigjob - DEBUG - Scheduler Thread: <class 'pilot.impl.pilot_manager.ComputeDataService'> Pilot Job
08/07/2012 10:47:37 AM - bigjob - DEBUG - Schedule CU
08/07/2012 10:47:37 AM - bigjob - DEBUG - __update_scheduler_resources
08/07/2012 10:47:37 AM - bigjob - DEBUG - Pilot-Jobs: [bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost]
08/07/2012 10:47:37 AM - bigjob - DEBUG - Schedule to PJ - # Avail PJs: 1
08/07/2012 10:47:38 AM - bigjob - DEBUG - BJ: bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost State: Running
08/07/2012 10:47:38 AM - bigjob - DEBUG - Candidate PJs: [bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost]
08/07/2012 10:47:38 AM - bigjob - DEBUG - Submit CU to big-job
08/07/2012 10:47:38 AM - bigjob - DEBUG - add subjob to queue of PJ: bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost
08/07/2012 10:47:38 AM - bigjob - DEBUG - create dictionary for job description. Job-URL: bigjob:bj-bffb4c90-e09e-11e1-825e-00003e980000:localhost:jobs:sj-d5006f62-e09e-11e1-ae02-00003e980000
08/07/2012 10:47:38 AM - bigjob - DEBUG - SJ Attributes: {'Executable' : '$HOME/NAMD','WorkingDirectory' : '$HOME/WORK/floer-sims-Na/chr03/0078850','SPMDVariation' : 'mpi','Arguments' : '['dyn1.conf']','Error' : 'dyn1.err','Output' : 'dyn1.out','TotalCPUCount' : '12',}
08/07/2012 10:47:38 AM - bigjob - DEBUG - job dict: {'Executable': '$HOME/NAMD', 'NumberOfProcesses': '1', 'WorkingDirectory': '$HOME/WORK/floer-sims-Na/chr03/0078850', 'TotalCPUCount': 12, 'state': 'Unknown', 'Arguments': ['dyn1.conf'], 'Error': 'dyn1.err', 'Output': 'dyn1.out', 'job-id': 'sj-d5006f62-e09e-11e1-ae02-00003e980000', 'SPMDVariation': 'mpi'}

Allow for default coordination mechanism

Especially with BigJob deployment on XSEDE and the fact that we decided to move to Redis, it doesn't make too much sense to have the user explicitly define the URL every single time:

COORDINATION_URL = "redis://password@hostname:6379"
bj = bigjob(COORDINATION_URL)

since it will always be the same for a given deployment. I propose adding an option to setup.py that sets a default URL for BigJob coordination, e.g., BIGJOB_DEFAULT_COORDINATION=redis://quary... easy_install BigJob, similar to what I do with the connection setting in bigjob-server: https://github.com/oweidner/bigjob-server/wiki/Installation.

Subsequently, the user can leave the coordination_url in the BigJob constructor blank and/or use a predefined variable, i.e.,

bj = bigjob()

or

bj = bigjob(bigjob.DEFAULT_COORD)

One could think about the same for the pilot-job LRMS, e.g.,

bj.start_pilot_job(bigjob.DEFAULT_RM,  (instead of having to remember e.g., gram://lonestar...whatever)

I think being able to define optional default options for machine-specific deployments would make BigJob more user-friendly.
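A minimal sketch of the lookup order proposed above: an explicitly passed URL wins, otherwise a deployment-wide default is used. Reading the default from an environment variable at run time is an assumption made here for illustration (the proposal sets it at install time), and `resolve_coordination_url` is a hypothetical helper, not part of BigJob.

```python
import os

# Hypothetical: BigJob does not define this helper. The variable name is
# borrowed from the proposal; using it as an environment variable is an
# assumption of this sketch.
ENV_VAR = "BIGJOB_DEFAULT_COORDINATION"


def resolve_coordination_url(explicit_url=None, environ=None):
    """Return the explicit URL if given, otherwise the deployment default."""
    environ = os.environ if environ is None else environ
    if explicit_url:
        return explicit_url
    default = environ.get(ENV_VAR)
    if not default:
        raise ValueError(
            "no coordination URL given and %s is not set" % ENV_VAR)
    return default
```

With something like this in place, a constructor call without arguments could fall back to the default, while existing scripts that pass a URL would keep working unchanged.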

Different log handlers need to be used for BigJob manager and agent

Hi!

I think the BigJob manager and the agent need separate log handlers.

Currently the log level is set once for all of BigJob. If it is set to FATAL so that only critical messages are printed, agent logging is affected as well, and nothing is written to the agent log file.

thanks
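One way this could look with the standard `logging` module is sketched below: two named loggers, each with its own handler and level. The logger names `bigjob.manager` and `bigjob.agent` are assumptions for illustration; the log output in these issues only shows a single `bigjob` logger. In-memory streams stand in for the real log files.

```python
import io
import logging

FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


def make_logger(name, level, stream):
    """Create a logger with its own handler and level."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep records out of parent handlers
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(FORMAT))
    logger.handlers = [handler]
    return logger


# Hypothetical logger names; StringIO replaces the manager console
# and the agent log file so the sketch is self-contained.
manager_out = io.StringIO()
agent_out = io.StringIO()
manager_log = make_logger("bigjob.manager", logging.CRITICAL, manager_out)
agent_log = make_logger("bigjob.agent", logging.DEBUG, agent_out)

manager_log.debug("scheduler tick")   # suppressed by the manager level
agent_log.debug("Dequeue sub-job")    # still written to the agent log
```

Because each logger carries its own level, quieting the manager no longer silences the agent.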

walltime should be wall_time_limit

The job description parameter in saga is called WallTimeLimit - by convention, this translates to wall_time_limit for python properties.

However, BigJob just uses 'walltime' even though it claims to be compatible with SAGA JD. Can the parameter be changed to wall_time_limit or at least can BigJob support both?

What's the parameter called in the Pilot-API ?
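Supporting both spellings could be as small as the sketch below, which maps the legacy `walltime` key onto `wall_time_limit` before the description is used. `normalize_walltime` is a hypothetical helper, not an existing BigJob function, and it assumes descriptions are plain dictionaries as in the example scripts.

```python
def normalize_walltime(description):
    """Return a copy that uses 'wall_time_limit', accepting 'walltime' as an alias."""
    result = dict(description)  # leave the caller's dictionary untouched
    if "walltime" in result:
        legacy = result.pop("walltime")
        if "wall_time_limit" in result and result["wall_time_limit"] != legacy:
            raise ValueError("conflicting 'walltime' and 'wall_time_limit'")
        result["wall_time_limit"] = legacy
    return result
```

Rejecting conflicting values, rather than silently preferring one key, is a design choice of this sketch.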

Resolve REDIS Server Issues [Wiki]

Currently, we need to inform users that they must use their own Redis server to run BigJob. We also need to come to a consensus on the easiest way for users to start a Redis server for the tutorials.

Andre Merzky proposes a solution in an email entitled "REDIS redux"

To quote his email:

I wrote a small server which does exactly that: it creates redis
instances on demand and returns its port and password. That instance
is then killed after a certain time (by default 24 hours, but can be
extended at will).

It should be quite easy to include that in the tutorials. For example:

echo "REDIS CREATE" | netcat -q 2 localhost 2000

201 creating redis instance '299ea641-94e0-4a2c-902b-f7000d8fd2b9':
redis://localhost:10000/

gives a unique redis instance -- and it gets cleaned up after the
tutorial. The instances are independent from each other, and as
secure as any other redis installation (AFAICS).
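For the tutorials, the reply from that server would have to be turned into a coordination URL. The sketch below parses a reply of the shape quoted above. The reply format is inferred from that single example only, so the pattern is an assumption and may not match everything the real server sends; `parse_create_reply` is a hypothetical helper.

```python
import re

# Pattern inferred from the one sample reply in the email (assumption).
REPLY = re.compile(
    r"^(\d{3}) creating redis instance '([^']+)':\s*(redis://\S+)\s*$")


def parse_create_reply(text):
    """Return (status code, instance id, redis URL) from a create reply."""
    match = REPLY.match(text.strip())
    if match is None:
        raise ValueError("unexpected reply: %r" % text)
    code, instance_id, url = match.groups()
    return int(code), instance_id, url
```

The returned URL could then be handed to a tutorial script as its coordination URL.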

Enhance language of tutorial to be more user-friendly [Wiki]

The term "virtual environment" is used in the first paragraph of the tutorial. To users who are unfamiliar with it, this could be a deterrent.

In general, the tutorial needs another read-over to make it more user-friendly and easy to understand for those that are unfamiliar with what BigJob is or how they can use it to do production-grade science.

Redis URL + Password (!!) hardcoded in BigJob

This is a major security risk. Please remove IMMEDIATELY!

grep -lr "ILikeBigJob_wITH-REdIS" * 
software/bigjob-bliss/lib64/python2.6/site-packages/BigJob-0.4.64-py2.6.egg/pilot/impl/pilotcompute_manager.py 
software/bigjob-bliss/lib64/python2.6/site-packages/BigJob-0.4.64-py2.6.egg/pilot/impl/pilotcompute_manager.pyc
software/bigjob-bliss/lib64/python2.6/site-packages/BigJob-0.4.64-py2.6.egg/bigjob/bigjob_manager.py
software/bigjob-bliss/lib64/python2.6/site-packages/BigJob-0.4.64-py2.6.egg/bigjob/bigjob_manager.pyc
software/bigjob-bliss/lib/python2.6/site-packages/BigJob-0.4.64-py2.6.egg/pilot/impl/pilotcompute_manager.py
software/bigjob-bliss/lib/python2.6/site-packages/BigJob-0.4.64-py2.6.egg/pilot/impl/pilotcompute_manager.pyc
software/bigjob-bliss/lib/python2.6/site-packages/BigJob-0.4.64-py2.6.egg/bigjob/bigjob_manager.py
software/bigjob-bliss/lib/python2.6/site-packages/BigJob-0.4.64-py2.6.egg/bigjob/bigjob_manager.pyc

And

cat software/bigjob-bliss/lib64/python2.6/site-packages/BigJob-0.4.64-py2.6.egg/pilot/impl/pilotcompute_manager.py | grep redispackages
COORDINATION_URL = "redis://[email protected]:6379"
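Besides removing the hardcoded value, it may help to keep the secret out of log output, since the agent prints the full coordination URL at debug level. A minimal sketch using only the standard library follows; `redact_userinfo` is a hypothetical helper, not something BigJob provides.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_userinfo(url):
    """Replace the user/password part of a URL with '***' for logging."""
    parts = urlsplit(url)
    if "@" not in parts.netloc:
        return url  # nothing secret in the authority part
    host = parts.netloc.rsplit("@", 1)[1]
    return urlunsplit(
        (parts.scheme, "***@" + host, parts.path, parts.query, parts.fragment))
```

Note that in the `redis://password@hostname:6379` form the password sits where a user name normally goes, which is why the whole part before `@` is masked.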

Tutorial Issue #2: Logging inconsistencies.

(mypython_repo)[pmantha@login2 ~]$ python example_fg.py
Read log level from bigjob.conf
Logging level: logging.INFO
06/07/2012 03:45:03 PM - bigjob - INFO - Loading BigJob version: 0.4.64 on login2.uc.futuregrid.org
06/07/2012 03:45:03 PM - bigjob - WARNING - SAGA C++ and Python bindings not found. Using Bliss.
06/07/2012 03:45:03 PM - bigjob - WARNING - WebHDFS package not found.
06/07/2012 03:45:03 PM - bigjob - WARNING - Globus Online package not found.
06/07/2012 03:45:04 PM - SSHJobPlugin(0x1ae2dcf8) - WARNING - Silently ignoring the walltime_limit attribute. It's not supported by SSH.
06/07/2012 03:45:04 PM - SSHJobPlugin(0x1ae2dcf8) - WARNING - Silently ignoring the total_cpu_count attribute. It's not supported by SSH.
06/07/2012 03:45:04 PM - SSHJobPlugin(0x1ae2dcf8) - WARNING - Silently ignoring the spmd_variation attribute. It's not supported by SSH.
06/07/2012 03:45:06 PM - PJApp - INFO - Finished setup for single core jobs
06/07/2012 03:45:15 PM - PJApp - INFO - Finished Waiting for scheduled single core compute units
06/07/2012 03:45:15 PM - PJApp - INFO - Finished setup for mpi jobs
06/07/2012 03:45:31 PM - PJApp - INFO - Finished Waiting for scheduled mpi compute units
06/07/2012 03:45:31 PM - PJApp - INFO - Terminate Pilot Compute and Compute Data Service

Bliss log messages from "SSHJobPlugin" are printed here, but they are not printed when the same SSH job is run with standalone Bliss. BigJob seems to be interfering with Bliss logging somewhere.

(mypython_repo)[pmantha@login2 job-api]$ python ssh_simple_job.py
Job ID : [ssh://localhost]-[None]
Job State : saga.job.Job.New

...starting job...

Job ID : [ssh://localhost]-[19265]
Job State : saga.job.Job.Running

...waiting for job...

Job State : saga.job.Job.Done
Exitcode : 0
(mypython_repo)[pmantha@login2 job-api]$

The messages below are of no concern to users and would still be available as debug messages. Can we change them to debug level?
06/07/2012 03:45:03 PM - bigjob - WARNING - WebHDFS package not found.
06/07/2012 03:45:03 PM - bigjob - WARNING - Globus Online package not found.

Messages at Pilot-API level and statistics about subjob execution are helpful.
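Demoting the "package not found" messages could look roughly like the sketch below: optional packages are imported through a small helper that logs a miss at debug level instead of warning level. `optional_import` is a hypothetical helper and the logger name follows the log output above; how BigJob actually performs these imports is not shown in this issue.

```python
import importlib
import logging

logger = logging.getLogger("bigjob")


def optional_import(name):
    """Import an optional package; log at debug level if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError:
        logger.debug("%s package not found.", name)
        return None
```

Users running at INFO level would then no longer see these lines, while they remain available when debugging.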

Pilot-API doesn't automatically balance compute units between multiple Pilot-Jobs.

Two pilots were launched using pbs-ssh on india and sierra. The example script was executed on india.
A total of 10 compute units were created.

  1. Pilot on india is in queue state.
  2. Pilot on sierra is in running state.

4 compute units were executed on sierra. The remaining 6 are in Unknown state.

I think that since the pilot on india is stuck in the queue, the remaining 6 compute units should be pulled by the sierra pilot and executed on sierra. But this is not happening.
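The expected behaviour can be sketched as a policy that only considers pilots that are already running and picks the least loaded one. This is an illustration of the idea, not BigJob's scheduler: the dictionary-based pilot records and the state strings are placeholders, not Pilot-API objects.

```python
def pick_pilot(pilots):
    """Return the running pilot with the fewest queued units, or None."""
    running = [p for p in pilots if p["state"] == "Running"]
    if not running:
        return None
    return min(running, key=lambda p: len(p["queue"]))


def distribute(units, pilots):
    """Assign each unit to a running pilot; return the units left over."""
    unassigned = []
    for unit in units:
        pilot = pick_pilot(pilots)
        if pilot is None:
            unassigned.append(unit)
        else:
            pilot["queue"].append(unit)
    return unassigned


# Placeholder records mirroring the situation described in this report.
pilots = [
    {"name": "india", "state": "Queue", "queue": []},
    {"name": "sierra", "state": "Running", "queue": []},
]
left = distribute(["cu-%d" % i for i in range(10)], pilots)
```

Under this policy all ten units end up with the sierra pilot while india is still waiting in the batch queue.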

Example script used to reproduce this:

[sivak2606@i136 pilot-api]$ cat example-pilot-api-multiple.py
import sys
import os
import time
import logging
logging.basicConfig(level=logging.DEBUG)

sys.path.append(os.path.join(os.path.dirname(__file__), "../.."))
sys.path.insert(0, os.getcwd() + "/../")
from pilot import PilotComputeService, ComputeDataService, State

if __name__ == "__main__":

    pilot_compute_service = PilotComputeService()

    # create pilot job service and initiate a pilot job
    pilot_compute_description = {
                             "service_url": 'pbs-ssh://india.futuregrid.org',
                             "number_of_processes": 32,
                             "processes_per_node":8,
                             "working_directory": "/N/u/sivak2606/agent",
                             "walltime":60
                            }

    pilot_compute_description2 = {
                             "service_url": 'pbs-ssh://sierra.futuregrid.org',
                             "number_of_processes": 32,
                             "processes_per_node":8,
                             "working_directory": "/N/u/sivak2606/agent",
                             "walltime":60
                            }

    pilotjob = pilot_compute_service.create_pilot(pilot_compute_description=pilot_compute_description)
    pilotjob2 = pilot_compute_service.create_pilot(pilot_compute_description=pilot_compute_description2)

    compute_data_service = ComputeDataService()
    compute_data_service.add_pilot_compute_service(pilot_compute_service)

    # start work unit
    compute_unit_description = {
            "executable": "/bin/hostname",
            "arguments": [""],
            "total_core_count": 1,
            "number_of_processes": 1,
            "output": "stdout.txt",
            "error": "stderr.txt",
    }

    cus = []
    for i in range(0,10):
        compute_unit = compute_data_service.submit_compute_unit(compute_unit_description)


    logging.debug("Finished setup. Waiting for scheduling of CU")
    compute_data_service.wait()


    logging.debug("Terminate Pilot Compute and Compute Data Service")
    compute_data_service.cancel()
    pilot_compute_service.cancel()

[sivak2606@i136 pilot-api]$

The job on india is still in the queue:

387510.i136 bigjob_pbs_ssh sivak2606 0 Q batch
387511.i136 sub26448.16.sub charngda 0 Q batch
[sivak2606@i136 pilot-api]$ hostname
i136
[sivak2606@i136 pilot-api]$

Agent log file on sierra:

[sivak2606@s1 bj-4ab3dd80-8199-11e1-96da-e61f1322a75c]$ cat stderr-bigjob_agent.txt
torque/2.4.8 version 2.4.8 loaded
Intel compiler suite version 11.1/072 loaded
OpenMPI version 1.4.2 loaded
04/08/2012 09:38:51 AM - bigjob - INFO - Loading BigJob version: 0.4.47
04/08/2012 09:38:51 AM - bigjob - DEBUG - Using SAGA C++/Python.
04/08/2012 09:38:51 AM - root - DEBUG - ['/N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/../', '', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools_git-0.4.2-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/bliss-0.1.17-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.7-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/paramiko_on_pypi-1.7.6-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/openssh_wrapper-0.2-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/pycrypto_on_pypi-2.3-py2.7-linux-x86_64.egg', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/N/u/sivak2606/BigJob', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.4.23-py2.7.egg', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages', '/N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python27.zip', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-old', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-dynload', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages', '/N/u/sivak2606/BigJob/bigjob', 
'/N/u/sivak2606/BigJob/bigjob/../../ext/threadpool-1.2.7/src/']
04/08/2012 09:38:51 AM - bigjob - DEBUG - Python Version: sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
04/08/2012 09:38:51 AM - root - DEBUG - read configfile: /N/u/sivak2606/BigJob/bigjob/../bigjob_agent.conf
/bin/sh: aprun: command not found
torque/2.4.8 version 2.4.8 loaded
Intel compiler suite version 11.1/072 loaded
OpenMPI version 1.4.2 loaded
04/08/2012 09:38:52 AM - bigjob - DEBUG - aprun: False ssh: True Launch method: ssh
04/08/2012 09:38:52 AM - root - DEBUG - Launch Method: ssh mpi: mpirun shell: /bin/bash
04/08/2012 09:38:52 AM - bigjob - DEBUG - host: s15
nodes: 8
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s15
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s15
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s15
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s15
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s15
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s15
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s15
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s15
04/08/2012 09:38:52 AM - bigjob - DEBUG - host: s14
nodes: 8
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s14
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s14
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s14
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s14
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s14
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s14
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s14
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s14
04/08/2012 09:38:52 AM - bigjob - DEBUG - host: s16
nodes: 8
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s16
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s16
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s16
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s16
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s16
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s16
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s16
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s16
04/08/2012 09:38:52 AM - bigjob - DEBUG - host: s13
nodes: 8
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s13
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s13
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s13
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s13
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s13
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s13
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s13
04/08/2012 09:38:52 AM - bigjob - DEBUG - add host: s13
04/08/2012 09:38:52 AM - bigjob - DEBUG - parsing ID out of URL: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org
04/08/2012 09:38:52 AM - bigjob - DEBUG - BigJob Agent arguments: ['bigjob_agent.py', 'redis://[email protected]:6379', 'bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org']
04/08/2012 09:38:52 AM - bigjob - DEBUG - Initialize C&C subsystem to pilot-url: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org
04/08/2012 09:38:52 AM - bigjob - DEBUG - BigJob ID: bj-4ab3dd80-8199-11e1-96da-e61f1322a75c
04/08/2012 09:38:52 AM - root - DEBUG - ['/N/u/sivak2606/BigJob/coordination/../ext/redis-2.4.9/', '/N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/../', '', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools_git-0.4.2-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/bliss-0.1.17-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.7-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/paramiko_on_pypi-1.7.6-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/openssh_wrapper-0.2-py2.7.egg', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/pycrypto_on_pypi-2.3-py2.7-linux-x86_64.egg', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/N/u/sivak2606/BigJob', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages', '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.4.23-py2.7.egg', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages', '/N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python27.zip', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-old', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-dynload', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages', '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages', '/N/u/sivak2606/BigJob/bigjob', 
'/N/u/sivak2606/BigJob/bigjob/../../ext/threadpool-1.2.7/src/']
04/08/2012 09:38:52 AM - bigjob - DEBUG - Utilizing Redis Backend: redis://[email protected]:6379. Please make sure Redis server is configured in bigjob_coordination_redis.py
04/08/2012 09:38:52 AM - bigjob - DEBUG - Connect to Redis: gw68.quarry.iu.teragrid.org Port: 6379
04/08/2012 09:38:53 AM - bigjob - DEBUG - set state to : Running
04/08/2012 09:38:53 AM - bigjob - DEBUG - update state of pilot job to: Running
04/08/2012 09:38:53 AM - bigjob - DEBUG - ##################################### New POLL/MONITOR cycle ##################################
04/08/2012 09:38:53 AM - bigjob - DEBUG - Free nodes: 32 Busy Nodes: 0
04/08/2012 09:38:53 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:38:53 AM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org
04/08/2012 09:38:53 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:38:53 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:38:53 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 0
04/08/2012 09:38:53 AM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:queue number queued items: 1
04/08/2012 09:38:54 AM - bigjob - DEBUG - Dequeued: ('bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:queue', 'bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-52de8a96-8199-11e1-9833-e61f1322a75c')
04/08/2012 09:38:54 AM - bigjob - DEBUG - Dequed:bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-52de8a96-8199-11e1-9833-e61f1322a75c
04/08/2012 09:38:54 AM - bigjob - DEBUG - Get job description
04/08/2012 09:38:54 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:38:54 AM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org
04/08/2012 09:38:54 AM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:queue number queued items: 0
04/08/2012 09:38:54 AM - bigjob - DEBUG - start job: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-52de8a96-8199-11e1-9833-e61f1322a75c data: {'Executable': '/bin/hostname', 'NumberOfProcesses': '1', 'state': 'Unknown', 'Arguments': "['']", 'Error': 'stderr.txt', 'Output': 'stdout.txt', 'job-id': 'sj-52de8a96-8199-11e1-9833-e61f1322a75c', 'SPMDVariation': 'single'}
04/08/2012 09:38:54 AM - bigjob - DEBUG - set job state to: New
04/08/2012 09:38:54 AM - bigjob - DEBUG - Start job id sj-52de8a96-8199-11e1-9833-e61f1322a75c specification {'Executable': '/bin/hostname', 'NumberOfProcesses': '1', 'state': 'New', 'Arguments': "['']", 'Error': 'stderr.txt', 'Output': 'stdout.txt', 'job-id': 'sj-52de8a96-8199-11e1-9833-e61f1322a75c', 'SPMDVariation': 'single'}:
04/08/2012 09:38:54 AM - root - DEBUG - Sub-Job: sj-52de8a96-8199-11e1-9833-e61f1322a75c, Working_directory: /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-52de8a96-8199-11e1-9833-e61f1322a75c
04/08/2012 09:38:54 AM - bigjob - DEBUG - stdout: /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-52de8a96-8199-11e1-9833-e61f1322a75c/stdout.txt stderr: /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-52de8a96-8199-11e1-9833-e61f1322a75c/stderr.txt
04/08/2012 09:38:54 AM - bigjob - DEBUG - allocate: s15
number nodes: 8 current busy nodes: [] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n']
04/08/2012 09:38:54 AM - bigjob - DEBUG - allocate: s14
number nodes: 8 current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n']
04/08/2012 09:38:54 AM - bigjob - DEBUG - allocate: s16
number nodes: 8 current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n']
04/08/2012 09:38:54 AM - bigjob - DEBUG - allocate: s13
number nodes: 8 current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n']
04/08/2012 09:38:54 AM - bigjob - DEBUG - wrote machinefile: /N/u/sivak2606/advert-launcher-machines-sj-52de8a96-8199-11e1-9833-e61f1322a75c Nodes: ['s15\n']
04/08/2012 09:38:54 AM - bigjob - DEBUG - execute: ssh s15 "cd /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-52de8a96-8199-11e1-9833-e61f1322a75c; /bin/hostname " in /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-52de8a96-8199-11e1-9833-e61f1322a75c from: s13 (Shell: /bin/bash)
04/08/2012 09:38:54 AM - bigjob - DEBUG - started ssh s15 "cd /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-52de8a96-8199-11e1-9833-e61f1322a75c; /bin/hostname "
04/08/2012 09:38:54 AM - bigjob - DEBUG - set job state to: Running
04/08/2012 09:38:58 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:38:58 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:38:58 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 1
04/08/2012 09:38:58 AM - bigjob - DEBUG - Job: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-52de8a96-8199-11e1-9833-e61f1322a75c Excutable: /bin/hostname state: 0 return code: 0
04/08/2012 09:38:59 AM - bigjob - DEBUG - Job successful: Job: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-52de8a96-8199-11e1-9833-e61f1322a75c Excutable: /bin/hostname - set state to Done
04/08/2012 09:38:59 AM - bigjob - DEBUG - Update output file...
04/08/2012 09:38:59 AM - bigjob - DEBUG - Files: ['bigjob_pbs_ssh', 'sj-52de8a96-8199-11e1-9833-e61f1322a75c']
04/08/2012 09:38:59 AM - bigjob - DEBUG - set job state to: Done
04/08/2012 09:38:59 AM - bigjob - DEBUG - Machine file: /N/u/sivak2606/advert-launcher-machines-sj-52de8a96-8199-11e1-9833-e61f1322a75c
04/08/2012 09:38:59 AM - bigjob - DEBUG - Free nodes: ['s15\n']
04/08/2012 09:38:59 AM - bigjob - DEBUG - free node: s15
current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n']
04/08/2012 09:38:59 AM - bigjob - DEBUG - Delete /N/u/sivak2606/advert-launcher-machines-sj-52de8a96-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:04 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:04 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:39:04 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 1
04/08/2012 09:39:09 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:09 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:39:09 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 1
04/08/2012 09:39:09 AM - bigjob - DEBUG - Dequeued: ('bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:queue', 'bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c')
04/08/2012 09:39:09 AM - bigjob - DEBUG - Dequed:bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:09 AM - bigjob - DEBUG - Get job description
04/08/2012 09:39:09 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:09 AM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org
04/08/2012 09:39:10 AM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:queue number queued items: 0
04/08/2012 09:39:10 AM - bigjob - DEBUG - start job: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c data: {'Executable': '/bin/hostname', 'NumberOfProcesses': '1', 'state': 'Unknown', 'Arguments': "['']", 'Error': 'stderr.txt', 'Output': 'stdout.txt', 'job-id': 'sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c', 'SPMDVariation': 'single'}
04/08/2012 09:39:10 AM - bigjob - DEBUG - set job state to: New
04/08/2012 09:39:10 AM - bigjob - DEBUG - Start job id sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c specification {'Executable': '/bin/hostname', 'NumberOfProcesses': '1', 'state': 'New', 'Arguments': "['']", 'Error': 'stderr.txt', 'Output': 'stdout.txt', 'job-id': 'sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c', 'SPMDVariation': 'single'}:
04/08/2012 09:39:10 AM - root - DEBUG - Sub-Job: sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c, Working_directory: /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:10 AM - bigjob - DEBUG - stdout: /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c/stdout.txt stderr: /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c/stderr.txt
04/08/2012 09:39:10 AM - bigjob - DEBUG - allocate: s15
number nodes: 8 current busy nodes: [] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n']
04/08/2012 09:39:10 AM - bigjob - DEBUG - allocate: s14
number nodes: 8 current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n']
04/08/2012 09:39:10 AM - bigjob - DEBUG - allocate: s16
number nodes: 8 current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n']
04/08/2012 09:39:10 AM - bigjob - DEBUG - allocate: s13
number nodes: 8 current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n']
04/08/2012 09:39:10 AM - bigjob - DEBUG - wrote machinefile: /N/u/sivak2606/advert-launcher-machines-sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c Nodes: ['s15\n']
04/08/2012 09:39:10 AM - bigjob - DEBUG - execute: ssh s15 "cd /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c; /bin/hostname " in /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c from: s13 (Shell: /bin/bash)
04/08/2012 09:39:10 AM - bigjob - DEBUG - started ssh s15 "cd /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c; /bin/hostname "
04/08/2012 09:39:10 AM - bigjob - DEBUG - set job state to: Running
04/08/2012 09:39:14 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:14 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:39:14 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 2
04/08/2012 09:39:14 AM - bigjob - DEBUG - Job: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c Excutable: /bin/hostname state: 0 return code: 0
04/08/2012 09:39:14 AM - bigjob - DEBUG - Job successful: Job: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c Excutable: /bin/hostname - set state to Done
04/08/2012 09:39:14 AM - bigjob - DEBUG - Update output file...
04/08/2012 09:39:14 AM - bigjob - DEBUG - Files: ['bigjob_pbs_ssh', 'sj-52de8a96-8199-11e1-9833-e61f1322a75c', 'sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c']
04/08/2012 09:39:14 AM - bigjob - DEBUG - set job state to: Done
04/08/2012 09:39:14 AM - bigjob - DEBUG - Machine file: /N/u/sivak2606/advert-launcher-machines-sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:14 AM - bigjob - DEBUG - Free nodes: ['s15\n']
04/08/2012 09:39:14 AM - bigjob - DEBUG - free node: s15
current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n']
04/08/2012 09:39:14 AM - bigjob - DEBUG - Delete /N/u/sivak2606/advert-launcher-machines-sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:19 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:19 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:39:19 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 2
04/08/2012 09:39:21 AM - bigjob - DEBUG - Dequeued: ('bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:queue', 'bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-64c653ce-8199-11e1-9833-e61f1322a75c')
04/08/2012 09:39:21 AM - bigjob - DEBUG - Dequed:bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-64c653ce-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:21 AM - bigjob - DEBUG - Get job description
04/08/2012 09:39:21 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:21 AM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org
04/08/2012 09:39:22 AM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:queue number queued items: 0
04/08/2012 09:39:22 AM - bigjob - DEBUG - start job: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-64c653ce-8199-11e1-9833-e61f1322a75c data: {'Executable': '/bin/hostname', 'NumberOfProcesses': '1', 'state': 'Unknown', 'Arguments': "['']", 'Error': 'stderr.txt', 'Output': 'stdout.txt', 'job-id': 'sj-64c653ce-8199-11e1-9833-e61f1322a75c', 'SPMDVariation': 'single'}
04/08/2012 09:39:22 AM - bigjob - DEBUG - set job state to: New
04/08/2012 09:39:22 AM - bigjob - DEBUG - Start job id sj-64c653ce-8199-11e1-9833-e61f1322a75c specification {'Executable': '/bin/hostname', 'NumberOfProcesses': '1', 'state': 'New', 'Arguments': "['']", 'Error': 'stderr.txt', 'Output': 'stdout.txt', 'job-id': 'sj-64c653ce-8199-11e1-9833-e61f1322a75c', 'SPMDVariation': 'single'}:
04/08/2012 09:39:22 AM - root - DEBUG - Sub-Job: sj-64c653ce-8199-11e1-9833-e61f1322a75c, Working_directory: /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-64c653ce-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:22 AM - bigjob - DEBUG - stdout: /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-64c653ce-8199-11e1-9833-e61f1322a75c/stdout.txt stderr: /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-64c653ce-8199-11e1-9833-e61f1322a75c/stderr.txt
04/08/2012 09:39:22 AM - bigjob - DEBUG - allocate: s15
number nodes: 8 current busy nodes: [] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n', 's15\n']
04/08/2012 09:39:22 AM - bigjob - DEBUG - allocate: s14
number nodes: 8 current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n', 's15\n']
04/08/2012 09:39:22 AM - bigjob - DEBUG - allocate: s16
number nodes: 8 current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n', 's15\n']
04/08/2012 09:39:22 AM - bigjob - DEBUG - allocate: s13
number nodes: 8 current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n', 's15\n']
04/08/2012 09:39:22 AM - bigjob - DEBUG - wrote machinefile: /N/u/sivak2606/advert-launcher-machines-sj-64c653ce-8199-11e1-9833-e61f1322a75c Nodes: ['s15\n']
04/08/2012 09:39:22 AM - bigjob - DEBUG - execute: ssh s15 "cd /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-64c653ce-8199-11e1-9833-e61f1322a75c; /bin/hostname " in /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-64c653ce-8199-11e1-9833-e61f1322a75c from: s13 (Shell: /bin/bash)
04/08/2012 09:39:22 AM - bigjob - DEBUG - started ssh s15 "cd /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-64c653ce-8199-11e1-9833-e61f1322a75c; /bin/hostname "
04/08/2012 09:39:22 AM - bigjob - DEBUG - set job state to: Running
04/08/2012 09:39:24 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:24 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:39:24 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 3
04/08/2012 09:39:25 AM - bigjob - DEBUG - Job: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-64c653ce-8199-11e1-9833-e61f1322a75c Excutable: /bin/hostname state: 0 return code: 0
04/08/2012 09:39:25 AM - bigjob - DEBUG - Job successful: Job: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-64c653ce-8199-11e1-9833-e61f1322a75c Excutable: /bin/hostname - set state to Done
04/08/2012 09:39:25 AM - bigjob - DEBUG - Update output file...
04/08/2012 09:39:25 AM - bigjob - DEBUG - Files: ['bigjob_pbs_ssh', 'sj-52de8a96-8199-11e1-9833-e61f1322a75c', 'sj-64c653ce-8199-11e1-9833-e61f1322a75c', 'sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c']
04/08/2012 09:39:25 AM - bigjob - DEBUG - set job state to: Done
04/08/2012 09:39:25 AM - bigjob - DEBUG - Machine file: /N/u/sivak2606/advert-launcher-machines-sj-64c653ce-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:25 AM - bigjob - DEBUG - Free nodes: ['s15\n']
04/08/2012 09:39:25 AM - bigjob - DEBUG - free node: s15
current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n', 's15\n']
04/08/2012 09:39:25 AM - bigjob - DEBUG - Delete /N/u/sivak2606/advert-launcher-machines-sj-64c653ce-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:27 AM - bigjob - DEBUG - Dequeued: ('bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:queue', 'bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-685b1308-8199-11e1-9833-e61f1322a75c')
04/08/2012 09:39:27 AM - bigjob - DEBUG - Dequed:bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-685b1308-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:27 AM - bigjob - DEBUG - Get job description
04/08/2012 09:39:27 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:27 AM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org
04/08/2012 09:39:28 AM - bigjob - DEBUG - Dequeue sub-job from: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:queue number queued items: 0
04/08/2012 09:39:28 AM - bigjob - DEBUG - start job: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-685b1308-8199-11e1-9833-e61f1322a75c data: {'Executable': '/bin/hostname', 'NumberOfProcesses': '1', 'state': 'Unknown', 'Arguments': "['']", 'Error': 'stderr.txt', 'Output': 'stdout.txt', 'job-id': 'sj-685b1308-8199-11e1-9833-e61f1322a75c', 'SPMDVariation': 'single'}
04/08/2012 09:39:28 AM - bigjob - DEBUG - set job state to: New
04/08/2012 09:39:28 AM - bigjob - DEBUG - Start job id sj-685b1308-8199-11e1-9833-e61f1322a75c specification {'Executable': '/bin/hostname', 'NumberOfProcesses': '1', 'state': 'New', 'Arguments': "['']", 'Error': 'stderr.txt', 'Output': 'stdout.txt', 'job-id': 'sj-685b1308-8199-11e1-9833-e61f1322a75c', 'SPMDVariation': 'single'}:
04/08/2012 09:39:28 AM - root - DEBUG - Sub-Job: sj-685b1308-8199-11e1-9833-e61f1322a75c, Working_directory: /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-685b1308-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:28 AM - bigjob - DEBUG - stdout: /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-685b1308-8199-11e1-9833-e61f1322a75c/stdout.txt stderr: /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-685b1308-8199-11e1-9833-e61f1322a75c/stderr.txt
04/08/2012 09:39:28 AM - bigjob - DEBUG - allocate: s15
number nodes: 8 current busy nodes: [] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n', 's15\n', 's15\n']
04/08/2012 09:39:28 AM - bigjob - DEBUG - allocate: s14
number nodes: 8 current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n', 's15\n', 's15\n']
04/08/2012 09:39:28 AM - bigjob - DEBUG - allocate: s16
number nodes: 8 current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n', 's15\n', 's15\n']
04/08/2012 09:39:28 AM - bigjob - DEBUG - allocate: s13
number nodes: 8 current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n', 's15\n', 's15\n']
04/08/2012 09:39:28 AM - bigjob - DEBUG - wrote machinefile: /N/u/sivak2606/advert-launcher-machines-sj-685b1308-8199-11e1-9833-e61f1322a75c Nodes: ['s15\n']
04/08/2012 09:39:28 AM - bigjob - DEBUG - execute: ssh s15 "cd /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-685b1308-8199-11e1-9833-e61f1322a75c; /bin/hostname " in /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-685b1308-8199-11e1-9833-e61f1322a75c from: s13 (Shell: /bin/bash)
04/08/2012 09:39:28 AM - bigjob - DEBUG - started ssh s15 "cd /N/u/sivak2606/agent/bj-4ab3dd80-8199-11e1-96da-e61f1322a75c/sj-685b1308-8199-11e1-9833-e61f1322a75c; /bin/hostname "
04/08/2012 09:39:28 AM - bigjob - DEBUG - set job state to: Running
04/08/2012 09:39:30 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:30 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:39:30 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 4
04/08/2012 09:39:30 AM - bigjob - DEBUG - Job: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-685b1308-8199-11e1-9833-e61f1322a75c Excutable: /bin/hostname state: 0 return code: 0
04/08/2012 09:39:30 AM - bigjob - DEBUG - Job successful: Job: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org:jobs:sj-685b1308-8199-11e1-9833-e61f1322a75c Excutable: /bin/hostname - set state to Done
04/08/2012 09:39:30 AM - bigjob - DEBUG - Update output file...
04/08/2012 09:39:30 AM - bigjob - DEBUG - Files: ['sj-685b1308-8199-11e1-9833-e61f1322a75c', 'bigjob_pbs_ssh', 'sj-52de8a96-8199-11e1-9833-e61f1322a75c', 'sj-64c653ce-8199-11e1-9833-e61f1322a75c', 'sj-5d9d4ecc-8199-11e1-9833-e61f1322a75c']
04/08/2012 09:39:30 AM - bigjob - DEBUG - set job state to: Done
04/08/2012 09:39:30 AM - bigjob - DEBUG - Machine file: /N/u/sivak2606/advert-launcher-machines-sj-685b1308-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:30 AM - bigjob - DEBUG - Free nodes: ['s15\n']
04/08/2012 09:39:30 AM - bigjob - DEBUG - free node: s15
current busy nodes: ['s15\n'] free nodes: ['s15\n', 's15\n', 's15\n', 's15\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's14\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's16\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's13\n', 's15\n', 's15\n', 's15\n']
04/08/2012 09:39:30 AM - bigjob - DEBUG - Delete /N/u/sivak2606/advert-launcher-machines-sj-685b1308-8199-11e1-9833-e61f1322a75c
04/08/2012 09:39:35 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:35 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:39:35 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 4
04/08/2012 09:39:40 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:40 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:39:40 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 4
04/08/2012 09:39:45 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:45 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:39:45 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 4
04/08/2012 09:39:51 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:51 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:39:51 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 4
04/08/2012 09:39:56 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:39:56 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:39:56 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 4
04/08/2012 09:40:01 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:40:01 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:40:01 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 4
04/08/2012 09:40:06 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:40:06 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:40:06 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 4
04/08/2012 09:40:11 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:40:11 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:40:11 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 4
04/08/2012 09:40:16 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:40:16 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:40:16 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 4
04/08/2012 09:40:21 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:40:21 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:40:21 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 4
04/08/2012 09:40:26 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:40:26 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.
04/08/2012 09:40:26 AM - bigjob - DEBUG - Monitor jobs - # current jobs: 4
04/08/2012 09:40:31 AM - bigjob - DEBUG - Pilot State: {'state': 'Running', 'stopped': 'False'}
04/08/2012 09:40:31 AM - bigjob - DEBUG - Pilot job entry: bigjob:bj-4ab3dd80-8199-11e1-96da-e61f1322a75c:sierra.futuregrid.org exists. Pilot job not in state stopped.

BigJob Doesn't Terminate if Queueing System Removes Pilot

If a Pilot Job runs out of wall time or gets terminated, BigJob doesn't seem to be very impressed and happily keeps on reporting:

06/05/2012 01:58:29 PM - bigjob - DEBUG - Compute Unit - State: Running
06/05/2012 01:58:31 PM - bigjob - DEBUG - Compute Unit - State: Running

Some sort of mechanism needs to be implemented to prevent this from happening, e.g., the Agent could write a periodic heartbeat to the coordination backend, so that a missing heartbeat tells the manager the pilot is gone.
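A minimal sketch of the heartbeat idea, assuming the agent can write a timestamp under a key next to the pilot entry and the manager can read it back. The `InMemoryBackend` class, the `:heartbeat` key suffix, and the interval and timeout values are all hypothetical stand-ins for illustration; they are not BigJob's actual coordination API.

```python
import time

HEARTBEAT_INTERVAL = 5.0                     # assumed seconds between agent heartbeats
HEARTBEAT_TIMEOUT = 3 * HEARTBEAT_INTERVAL   # assumed silence after which the pilot counts as gone


class InMemoryBackend:
    """Hypothetical stand-in for the coordination backend (e.g. Redis)."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)


def write_heartbeat(backend, pilot_url, now=None):
    """Agent side: record the time of the latest sign of life."""
    backend.set(pilot_url + ":heartbeat", time.time() if now is None else now)


def pilot_is_alive(backend, pilot_url, now=None, timeout=HEARTBEAT_TIMEOUT):
    """Manager side: check the heartbeat before reporting a state such as 'Running'."""
    last = backend.get(pilot_url + ":heartbeat")
    if last is None:
        return False  # the agent never reported in
    now = time.time() if now is None else now
    return (now - last) <= timeout
```

With something like this in place, the manager could report the pilot and its compute units as failed once `pilot_is_alive` returns False, instead of repeating the last known state indefinitely.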

Log File generation for BigJob.

Hi!

Currently, setting BIGJOB_VERBOSE prints the trace messages to the screen. It would be better to write all log messages to a BigJob log file, which would make it easier to check the logs of long-running jobs.
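As a rough illustration of the request, the standard Python `logging` module can send the same messages to a file. The function name `configure_file_logging`, its parameters, and the log path are assumptions made for this sketch; the format string mirrors the timestamped style of the agent logs quoted on this page.

```python
import logging
import os


def configure_file_logging(log_path, verbose=False, logger_name="bigjob"):
    """Attach a file handler to the named logger (hypothetical helper)."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)

    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        datefmt="%m/%d/%Y %I:%M:%S %p",
    ))
    logger.addHandler(handler)

    # Keep the messages out of the root logger so they do not also hit the screen.
    logger.propagate = False
    return logger
```

A call such as `configure_file_logging(os.path.expanduser("~/bigjob.log"), verbose=True)` would then collect the trace messages of a long-running job in one place.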

Usability issue: logging output & formatting

BigJob seems to run at some crazy debug level by default. If I run a simple example without BIGJOB_VERBOSE set, I get the output below, which is really not something I want to see as a user. It looks more like a massive error than anything else, even though the job went through just fine. And if something does go wrong, the output looks just the same, so it is nearly impossible to figure out what went wrong.

Also, the way the messages are formatted, e.g., DEBUG:bigjob:Created without date, time, etc., is not very pretty or legible.

Can this be changed / made more user-friendly? This is the latest BigJob MASTER.

Thanks,
Ole
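A possible shape for a quieter default, sketched with the standard `logging` module: warnings only unless BIGJOB_VERBOSE is set, and every message prefixed with date and time. The mapping from BIGJOB_VERBOSE values to levels and both function names are assumptions for this example, not BigJob's documented behavior.

```python
import logging
import os


def level_from_env(environ=os.environ):
    """Translate BIGJOB_VERBOSE into a logging level (assumed mapping)."""
    value = environ.get("BIGJOB_VERBOSE")
    if value is None:
        return logging.WARNING      # quiet by default
    try:
        verbosity = int(value)
    except ValueError:
        return logging.WARNING      # unparseable value: stay quiet
    if verbosity >= 2:
        return logging.DEBUG
    if verbosity == 1:
        return logging.INFO
    return logging.WARNING


def configure_console_logging(environ=os.environ):
    """Set up the 'bigjob' logger with timestamps and an env-controlled level."""
    logger = logging.getLogger("bigjob")
    logger.setLevel(level_from_env(environ))
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    ))
    logger.addHandler(handler)
    return logger
```

With this setup, a run without BIGJOB_VERBOSE would print only the two WARNING lines from the transcript below, while the DEBUG and INFO messages would appear on request.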

(pyrep)-bash-3.2$ python bjtest.py 
INFO:bigjob:Loading BigJob version: 0.4.64 on login1
INFO:bigjob:Using SAGA Bliss.
DEBUG:bigjob:['/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg/pilot/filemanagement/../../../webhdfs-py/', '/home/oweidner', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/pip-1.1-py2.7.egg', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg', '/N/u/oweidner/software/pyrep/lib/python27.zip', '/N/u/oweidner/software/pyrep/lib/python2.7', '/N/u/oweidner/software/pyrep/lib/python2.7/plat-linux2', '/N/u/oweidner/software/pyrep/lib/python2.7/lib-tk', '/N/u/oweidner/software/pyrep/lib/python2.7/lib-old', '/N/u/oweidner/software/pyrep/lib/python2.7/lib-dynload', '/N/soft/python/2.7/lib/python2.7', '/N/soft/python/2.7/lib/python2.7/plat-linux2', '/N/soft/python/2.7/lib/python2.7/lib-tk', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages', '../..', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg/bigjob', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg/pilot/impl/../..', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg/pilot/filemanagement/../..']
WARNING:bigjob:WebHDFS package not found.
WARNING:bigjob:Globus Online package not found.
DEBUG:bigjob:Created Pilot Compute Service: redis://[email protected]/pcs/pcs-4698ee74-b97b-11e1-bc99-a4badb0c3696
DEBUG:bigjob:start bigjob at: fork://localhost
DEBUG:root:['/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg/coordination/../ext/redis-2.4.9/', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg/pilot/filemanagement/../..', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg/pilot/filemanagement/../../../webhdfs-py/', '/home/oweidner', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/pip-1.1-py2.7.egg', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg', '/N/u/oweidner/software/pyrep/lib/python27.zip', '/N/u/oweidner/software/pyrep/lib/python2.7', '/N/u/oweidner/software/pyrep/lib/python2.7/plat-linux2', '/N/u/oweidner/software/pyrep/lib/python2.7/lib-tk', '/N/u/oweidner/software/pyrep/lib/python2.7/lib-old', '/N/u/oweidner/software/pyrep/lib/python2.7/lib-dynload', '/N/soft/python/2.7/lib/python2.7', '/N/soft/python/2.7/lib/python2.7/plat-linux2', '/N/soft/python/2.7/lib/python2.7/lib-tk', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages', '../..', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg/bigjob', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg/pilot/impl/../..', '/N/u/oweidner/software/pyrep/lib/python2.7/site-packages/BigJob-0.4.64-py2.7.egg/pilot/filemanagement/../..']
DEBUG:bigjob:Utilizing Redis Backend
DEBUG:bigjob:Parsing URL: redis://[email protected]
INFO:Runtime:BLISS runtime instance created at 0xe6cc68
INFO:Runtime:Found plugin saga.plugin.job.local supporting URL schmemas ['fork'] and API type(s) ['saga.job']
INFO:Runtime:Plugin saga.plugin.job.local internal sanity check passed
INFO:Runtime:Registered plugin saga.plugin.job.local as handler for URL schema fork://
INFO:Runtime:Found plugin saga.plugin.job.pbssh supporting URL schmemas ['pbs+ssh', 'pbs', 'torque', 'torque+ssh', 'xt5torque', 'xt5torque+ssh'] and API type(s) ['saga.job', 'saga.sd']
INFO:Runtime:Plugin saga.plugin.job.pbssh internal sanity check passed
INFO:Runtime:Registered plugin saga.plugin.job.pbssh as handler for URL schema pbs+ssh://
INFO:Runtime:Registered plugin saga.plugin.job.pbssh as handler for URL schema pbs://
INFO:Runtime:Registered plugin saga.plugin.job.pbssh as handler for URL schema torque://
INFO:Runtime:Registered plugin saga.plugin.job.pbssh as handler for URL schema torque+ssh://
INFO:Runtime:Registered plugin saga.plugin.job.pbssh as handler for URL schema xt5torque://
INFO:Runtime:Registered plugin saga.plugin.job.pbssh as handler for URL schema xt5torque+ssh://
INFO:Runtime:Found plugin saga.plugin.job.sgessh supporting URL schmemas ['sge+ssh', 'sge'] and API type(s) ['saga.job', 'saga.sd']
INFO:Runtime:Plugin saga.plugin.job.sgessh internal sanity check passed
INFO:Runtime:Registered plugin saga.plugin.job.sgessh as handler for URL schema sge+ssh://
INFO:Runtime:Registered plugin saga.plugin.job.sgessh as handler for URL schema sge://
INFO:Runtime:Found plugin saga.plugin.file.sftp supporting URL schmemas ['sftp'] and API type(s) ['saga.filesystem']
INFO:Runtime:Plugin saga.plugin.file.sftp internal sanity check passed
INFO:Runtime:Registered plugin saga.plugin.file.sftp as handler for URL schema sftp://
INFO:Runtime:Found plugin saga.plugin.job.ssh supporting URL schmemas ['ssh'] and API type(s) ['saga.job']
INFO:Runtime:Plugin saga.plugin.job.ssh internal sanity check passed
INFO:Runtime:Registered plugin saga.plugin.job.ssh as handler for URL schema ssh://
DEBUG:bigjob:redis:// gw68.quarry.iu.teragrid.org None
DEBUG:bigjob:Connect to Redis: gw68.quarry.iu.teragrid.org Port: 6379
DEBUG:bigjob:init BigJob w/: redis://[email protected]
DEBUG:bigjob:initialized BigJob: bigjob:bj-46bf294a-b97b-11e1-bc99-a4badb0c3696
DEBUG:bigjob:create pilot job entry on backend server: bigjob:bj-46bf294a-b97b-11e1-bc99-a4badb0c3696:localhost
DEBUG:bigjob:update state of pilot job to: Unknown stopped: False
DEBUG:bigjob:update description of pilot job to: None
DEBUG:bigjob:set pilot state to: Unknown
DEBUG:bigjob:Use SSH backend
DEBUG:bigjob:SSH: connect to: localhost
DEBUG:bigjob:BigJob working directory: ssh://localhost/home/oweidner/agent/bj-46bf294a-b97b-11e1-bc99-a4badb0c3696
DEBUG:bigjob:Directory not found: /home/oweidner/agent/bj-46bf294a-b97b-11e1-bc99-a4badb0c3696
DEBUG:bigjob:Create directory at: localhost
DEBUG:bigjob:Stage: None to ssh://localhost/home/oweidner/agent/bj-46bf294a-b97b-11e1-bc99-a4badb0c3696
DEBUG:bigjob:BJ Working Directory: /home/oweidner/agent/bj-46bf294a-b97b-11e1-bc99-a4badb0c3696
DEBUG:bigjob:Adaptor specific modifications: fork
DEBUG:bigjob:Escape Bliss
DEBUG:bigjob:"import sys
import os
import urllib
import sys
import time
start_time = time.time()
home = os.environ.get(\"HOME\")
#print \"Home: \" + home
if home==None: home = os.getcwd()
BIGJOB_AGENT_DIR= os.path.join(home, \".bigjob\")
if not os.path.exists(BIGJOB_AGENT_DIR): os.mkdir (BIGJOB_AGENT_DIR)
BIGJOB_PYTHON_DIR=BIGJOB_AGENT_DIR+\"/python/\"
if not os.path.exists(BIGJOB_PYTHON_DIR): os.mkdir(BIGJOB_PYTHON_DIR)
BOOTSTRAP_URL=\"https://raw.github.com/saga-project/BigJob/master/bootstrap/bigjob-bootstrap.py\"
BOOTSTRAP_FILE=BIGJOB_AGENT_DIR+\"/bigjob-bootstrap.py\"
#ensure that BJ in .bigjob is upfront in sys.path
sys.path.insert(0, os.getcwd() + \"/../\")
#sys.path.insert(0, /User/luckow/.bigjob/python/lib\")
#sys.path.insert(0, os.getcwd() + \"/../../\")
p = list()
for i in sys.path:
   if i.find(\".bigjob/python\")>1:
         p.insert(0, i)
for i in p: sys.path.insert(0, i)
print \"Python path: \" + str(sys.path)
print \"Python version: \" + str(sys.version_info)
try: import saga
except: print \"SAGA and SAGA Python Bindings not found: BigJob only work w/ non-SAGA backends e.g. Redis, ZMQ.\";
try: import bigjob.bigjob_agent
except: 
   print \"BigJob not installed. Attempt to install it.\"; 
   opener = urllib.FancyURLopener({}); 
   opener.retrieve(BOOTSTRAP_URL, BOOTSTRAP_FILE); 
   print \"Execute: \" + \"python \" + BOOTSTRAP_FILE + \" \" + BIGJOB_PYTHON_DIR
   os.system(\"/usr/bin/env\")
   try:
       os.system(\"python \" + BOOTSTRAP_FILE + \" \" + BIGJOB_PYTHON_DIR); 
       activate_this = BIGJOB_PYTHON_DIR+'bin/activate_this.py'; 
       execfile(activate_this, dict(__file__=activate_this))
   except:
       print \"BJ installation failed. Trying system-level python (/usr/bin/python)\";
       os.system(\"/usr/bin/python \" + BOOTSTRAP_FILE + \" \" + BIGJOB_PYTHON_DIR); 
       activate_this = BIGJOB_PYTHON_DIR+'bin/activate_this.py'; 
       execfile(activate_this, dict(__file__=activate_this))
#try to import BJ once again
import bigjob.bigjob_agent
# execute bj agent
args = list()
args.append(\"bigjob_agent.py\")
args.append(\"redis://[email protected]:6379\")
args.append(\"bigjob:bj-46bf294a-b97b-11e1-bc99-a4badb0c3696:localhost\")
args.append(\"MyTestQueue\")
print \"Bootstrap time: \" + str(time.time()-start_time)
print \"Starting BigJob Agents with following args: \" + str(args)
bigjob_agent = bigjob.bigjob_agent.bigjob_agent(args)
"
DEBUG:bigjob:Working directory: /home/oweidner/agent/bj-46bf294a-b97b-11e1-bc99-a4badb0c3696
DEBUG:bigjob:use standard proxy
INFO:Runtime:Instantiated plugin 'saga.plugin.job.local' for URL scheme fork:// and API type 'saga.job'
INFO:LocalJobPlugin:Registered new service object 
INFO:Service(0xedb650):Bound to plugin 
DEBUG:bigjob:Creating pilot job with description: {'Executable' : '/usr/bin/env','WorkingDirectory' : '/home/oweidner/agent/bj-46bf294a-b97b-11e1-bc99-a4badb0c3696','SPMDVariation' : 'single','WallTimeLimit' : '3600','Arguments' : '['python', '-c', '"import sys\nimport os\nimport urllib\nimport sys\nimport time\nstart_time = time.time()\nhome = os.environ.get(\\"HOME\\")\n#print \\"Home: \\" + home\nif home==None: home = os.getcwd()\nBIGJOB_AGENT_DIR= os.path.join(home, \\".bigjob\\")\nif not os.path.exists(BIGJOB_AGENT_DIR): os.mkdir (BIGJOB_AGENT_DIR)\nBIGJOB_PYTHON_DIR=BIGJOB_AGENT_DIR+\\"/python/\\"\nif not os.path.exists(BIGJOB_PYTHON_DIR): os.mkdir(BIGJOB_PYTHON_DIR)\nBOOTSTRAP_URL=\\"https://raw.github.com/saga-project/BigJob/master/bootstrap/bigjob-bootstrap.py\\"\nBOOTSTRAP_FILE=BIGJOB_AGENT_DIR+\\"/bigjob-bootstrap.py\\"\n#ensure that BJ in .bigjob is upfront in sys.path\nsys.path.insert(0, os.getcwd() + \\"/../\\")\n#sys.path.insert(0, /User/luckow/.bigjob/python/lib\\")\n#sys.path.insert(0, os.getcwd() + \\"/../../\\")\np = list()\nfor i in sys.path:\n    if i.find(\\".bigjob/python\\")>1:\n          p.insert(0, i)\nfor i in p: sys.path.insert(0, i)\nprint \\"Python path: \\" + str(sys.path)\nprint \\"Python version: \\" + str(sys.version_info)\ntry: import saga\nexcept: print \\"SAGA and SAGA Python Bindings not found: BigJob only work w/ non-SAGA backends e.g. Redis, ZMQ.\\";\ntry: import bigjob.bigjob_agent\nexcept: \n    print \\"BigJob not installed. 
Attempt to install it.\\"; \n    opener = urllib.FancyURLopener({}); \n    opener.retrieve(BOOTSTRAP_URL, BOOTSTRAP_FILE); \n    print \\"Execute: \\" + \\"python \\" + BOOTSTRAP_FILE + \\" \\" + BIGJOB_PYTHON_DIR\n    os.system(\\"/usr/bin/env\\")\n try:\n        os.system(\\"python \\" + BOOTSTRAP_FILE + \\" \\" + BIGJOB_PYTHON_DIR); \n        activate_this = BIGJOB_PYTHON_DIR+\'bin/activate_this.py\'; \n        execfile(activate_this, dict(__file__=activate_this))\n    except:\n        print \\"BJ installation failed. Trying system-level python (/usr/bin/python)\\";\n        os.system(\\"/usr/bin/python \\" + BOOTSTRAP_FILE + \\" \\" + BIGJOB_PYTHON_DIR); \n        activate_this = BIGJOB_PYTHON_DIR+\'bin/activate_this.py\'; \n        execfile(activate_this, dict(__file__=activate_this))\n#try to import BJ once again\nimport bigjob.bigjob_agent\n# execute bj agent\nargs = list()\nargs.append(\\"bigjob_agent.py\\")\nargs.append(\\"redis://[email protected]:6379\\")\nargs.append(\\"bigjob:bj-46bf294a-b97b-11e1-bc99-a4badb0c3696:localhost\\")\nargs.append(\\"MyTestQueue\\")\nprint \\"Bootstrap time: \\" + str(time.time()-start_time)\nprint \\"Starting BigJob Agents with following args: \\" + str(args)\nbigjob_agent = bigjob.bigjob_agent.bigjob_agent(args)\n"']','Error' : '/home/oweidner/agent/bj-46bf294a-b97b-11e1-bc99-a4badb0c3696/stderr-bj-46bf294a-b97b-11e1-bc99-a4badb0c3696-agent.txt','Output' : '/home/oweidner/agent/bj-46bf294a-b97b-11e1-bc99-a4badb0c3696/stdout-bj-46bf294a-b97b-11e1-bc99-a4badb0c3696-agent.txt','TotalCPUCount' : '8',}
INFO:Runtime:Found an existing plugin instance for url scheme fork://: []
INFO:Job(0xedb7d0):Bound to plugin 
DEBUG:bigjob:Submit pilot job to: fork://localhost
INFO:LocalJobPlugin:Trying to run: /usr/bin/env python -c "import sys
import os
import urllib
import sys
import time
start_time = time.time()
home = os.environ.get(\"HOME\")
#print \"Home: \" + home
if home==None: home = os.getcwd()
BIGJOB_AGENT_DIR= os.path.join(home, \".bigjob\")
if not os.path.exists(BIGJOB_AGENT_DIR): os.mkdir (BIGJOB_AGENT_DIR)
BIGJOB_PYTHON_DIR=BIGJOB_AGENT_DIR+\"/python/\"
if not os.path.exists(BIGJOB_PYTHON_DIR): os.mkdir(BIGJOB_PYTHON_DIR)
BOOTSTRAP_URL=\"https://raw.github.com/saga-project/BigJob/master/bootstrap/bigjob-bootstrap.py\"
BOOTSTRAP_FILE=BIGJOB_AGENT_DIR+\"/bigjob-bootstrap.py\"
#ensure that BJ in .bigjob is upfront in sys.path
sys.path.insert(0, os.getcwd() + \"/../\")
#sys.path.insert(0, /User/luckow/.bigjob/python/lib\")
#sys.path.insert(0, os.getcwd() + \"/../../\")
p = list()
for i in sys.path:
   if i.find(\".bigjob/python\")>1:
         p.insert(0, i)
for i in p: sys.path.insert(0, i)
print \"Python path: \" + str(sys.path)
print \"Python version: \" + str(sys.version_info)
try: import saga
except: print \"SAGA and SAGA Python Bindings not found: BigJob only work w/ non-SAGA backends e.g. Redis, ZMQ.\";
try: import bigjob.bigjob_agent
except: 
   print \"BigJob not installed. Attempt to install it.\"; 
   opener = urllib.FancyURLopener({}); 
   opener.retrieve(BOOTSTRAP_URL, BOOTSTRAP_FILE); 
   print \"Execute: \" + \"python \" + BOOTSTRAP_FILE + \" \" + BIGJOB_PYTHON_DIR
   os.system(\"/usr/bin/env\")
   try:
       os.system(\"python \" + BOOTSTRAP_FILE + \" \" + BIGJOB_PYTHON_DIR); 
       activate_this = BIGJOB_PYTHON_DIR+'bin/activate_this.py'; 
       execfile(activate_this, dict(__file__=activate_this))
   except:
       print \"BJ installation failed. Trying system-level python (/usr/bin/python)\";
       os.system(\"/usr/bin/python \" + BOOTSTRAP_FILE + \" \" + BIGJOB_PYTHON_DIR); 
       activate_this = BIGJOB_PYTHON_DIR+'bin/activate_this.py'; 
       execfile(activate_this, dict(__file__=activate_this))
#try to import BJ once again
import bigjob.bigjob_agent
# execute bj agent
args = list()
args.append(\"bigjob_agent.py\")
args.append(\"redis://[email protected]:6379\")
args.append(\"bigjob:bj-46bf294a-b97b-11e1-bc99-a4badb0c3696:localhost\")
args.append(\"MyTestQueue\")
print \"Bootstrap time: \" + str(time.time()-start_time)
print \"Starting BigJob Agents with following args: \" + str(args)
bigjob_agent = bigjob.bigjob_agent.bigjob_agent(args)
"
INFO:LocalJobPlugin:Started local process: /usr/bin/env ['python', '-c', '"import sys\nimport os\nimport urllib\nimport sys\nimport time\nstart_time = time.time()\nhome = os.environ.get(\\"HOME\\")\n#print \\"Home: \\" + home\nif home==None: home = os.getcwd()\nBIGJOB_AGENT_DIR= os.path.join(home, \\".bigjob\\")\nif not os.path.exists(BIGJOB_AGENT_DIR): os.mkdir (BIGJOB_AGENT_DIR)\nBIGJOB_PYTHON_DIR=BIGJOB_AGENT_DIR+\\"/python/\\"\nif not os.path.exists(BIGJOB_PYTHON_DIR): os.mkdir(BIGJOB_PYTHON_DIR)\nBOOTSTRAP_URL=\\"https://raw.github.com/saga-project/BigJob/master/bootstrap/bigjob-bootstrap.py\\"\nBOOTSTRAP_FILE=BIGJOB_AGENT_DIR+\\"/bigjob-bootstrap.py\\"\n#ensure that BJ in .bigjob is upfront in sys.path\nsys.path.insert(0, os.getcwd() + \\"/../\\")\n#sys.path.insert(0, /User/luckow/.bigjob/python/lib\\")\n#sys.path.insert(0, os.getcwd() + \\"/../../\\")\np = list()\nfor i in sys.path:\n    if i.find(\\".bigjob/python\\")>1:\n          p.insert(0, i)\nfor i in p: sys.path.insert(0, i)\nprint \\"Python path: \\" + str(sys.path)\nprint \\"Python version: \\" + str(sys.version_info)\ntry: import saga\nexcept: print \\"SAGA and SAGA Python Bindings not found: BigJob only work w/ non-SAGA backends e.g. Redis, ZMQ.\\";\ntry: import bigjob.bigjob_agent\nexcept: \n    print \\"BigJob not installed. Attempt to install it.\\"; \n    opener = urllib.FancyURLopener({}); \n    opener.retrieve(BOOTSTRAP_URL, BOOTSTRAP_FILE); \n    print \\"Execute: \\" + \\"python \\" + BOOTSTRAP_FILE + \\" \\" + BIGJOB_PYTHON_DIR\n    os.system(\\"/usr/bin/env\\")\n    try:\n        os.system(\\"python \\" + BOOTSTRAP_FILE + \\" \\" + BIGJOB_PYTHON_DIR); \n        activate_this = BIGJOB_PYTHON_DIR+\'bin/activate_this.py\'; \n        execfile(activate_this, dict(__file__=activate_this))\n    except:\n        print \\"BJ installation failed. 
Trying system-level python (/usr/bin/python)\\";\n        os.system(\\"/usr/bin/python \\" + BOOTSTRAP_FILE + \\" \\" + BIGJOB_PYTHON_DIR); \n        activate_this = BIGJOB_PYTHON_DIR+\'bin/activate_this.py\'; \n        execfile(activate_this, dict(__file__=activate_this))\n#try to import BJ once again\nimport bigjob.bigjob_agent\n# execute bj agent\nargs = list()\nargs.append(\\"bigjob_agent.py\\")\nargs.append(\\"redis://[email protected]:6379\\")\nargs.append(\\"bigjob:bj-46bf294a-b97b-11e1-bc99-a4badb0c3696:localhost\\")\nargs.append(\\"MyTestQueue\\")\nprint \\"Bootstrap time: \\" + str(time.time()-start_time)\nprint \\"Starting BigJob Agents with following args: \\" + str(args)\nbigjob_agent = bigjob.bigjob_agent.bigjob_agent(args)\n"']
DEBUG:bigjob:Create PilotCompute for BigJob: bigjob:bj-46bf294a-b97b-11e1-bc99-a4badb0c3696:localhost
DEBUG:bigjob:Submit CU to big-job
DEBUG:bigjob:add subjob to queue of PJ: bigjob:bj-46bf294a-b97b-11e1-bc99-a4badb0c3696:localhost
DEBUG:bigjob:create dictionary for job description. Job-URL: bigjob:bj-46bf294a-b97b-11e1-bc99-a4badb0c3696:localhost:jobs:sj-4729b92c-b97b-11e1-bc99-a4badb0c3696
DEBUG:bigjob:SJ Attributes: {'Executable' : '/bin/cat','NumberOfProcesses' : '1','SPMDVariation' : 'single','Arguments' : '['/etc/passwd']','Error' : 'stderr.txt','Output' : 'stdout.txt',}
DEBUG:root:Finished submission. Waiting for completion of CU
DEBUG:bigjob:Compute Unit: cu-4729acac-b97b-11e1-bc99-a4badb0c3696, State: Unknown
DEBUG:bigjob:Compute Unit: cu-4729acac-b97b-11e1-bc99-a4badb0c3696, State: Running
DEBUG:bigjob:Compute Unit: cu-4729acac-b97b-11e1-bc99-a4badb0c3696, State: Running
DEBUG:bigjob:Compute Unit: cu-4729acac-b97b-11e1-bc99-a4badb0c3696, State: Done
DEBUG:root:Terminate Pilot Compute Service
DEBUG:bigjob:Cancel Pilot Job
DEBUG:bigjob:stop pilot job: bigjob:bj-46bf294a-b97b-11e1-bc99-a4badb0c3696:localhost
DEBUG:bigjob:update state of pilot job to: Done stopped: True
DEBUG:bigjob:delete pilot job: bigjob:bj-46bf294a-b97b-11e1-bc99-a4badb0c3696:localhost
DEBUG:bigjob:Cancel Pilot Job finished

Syntax Error in Latest BigJob

I get the following error message from the agent:

(bigjob-bliss)-bash-3.00$ cat agent/stderr-bj-81d32a2a-af1e-11e1-84e0-0060dd46c5e6-agent.txt 
/var/spool/torque/mom_priv/jobs/625554.qb2.SC: line 29: syntax error near unexpected token `)'
/var/spool/torque/mom_priv/jobs/625554.qb2.SC: line 29: `#sys.path.insert(0, /User/luckow/.bigjob/python/lib")'

I'm using the latest bigjob 'master' installed via

pip install --upgrade -e git://github.com/saga-project/BigJob.git#egg=BigJob
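
A possible reading of the error above, which the report itself does not confirm: the commented-out `sys.path.insert` line carries a single, unpaired double quote. If the bootstrap script is embedded in a double-quoted `python -c "..."` shell argument without escaping, that stray quote closes the argument early and the shell then trips over the `)`. The sketch below reproduces this with Python 3's `shlex` as a stand-in for shell tokenizing. The helper names `wrap_for_shell` and `parses_as_single_argument` are hypothetical and are not part of BigJob.

```python
import shlex


def wrap_for_shell(script):
    """Embed the script in double quotes with no escaping (the assumed failure mode)."""
    return 'python -c "' + script + '"'


def parses_as_single_argument(script):
    """Return True if a POSIX-style tokenizer still sees the script as one argument."""
    try:
        tokens = shlex.split(wrap_for_shell(script))
    except ValueError:
        # shlex raises ValueError when a quotation is never closed
        return False
    return tokens == ["python", "-c", script]


# A script without any double quotes survives the naive wrapping.
good_script = "import sys\nprint(sys.path)"

# The commented-out line from the error message, with its unpaired double quote.
bad_script = 'import sys\n#sys.path.insert(0, /User/luckow/.bigjob/python/lib")\n'

print(parses_as_single_argument(good_script))
print(parses_as_single_argument(bad_script))

# Quoting the script with shlex.quote keeps it intact as a single argument.
safe_command = "python -c " + shlex.quote(bad_script)
print(shlex.split(safe_command) == ["python", "-c", bad_script])
```

Under this assumption, either removing the commented-out line or quoting the script programmatically (`shlex.quote` in Python 3, `pipes.quote` in Python 2) would avoid the problem.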

BigJob creates unnecessary file output.tar.gz in the working directory.

BJ Version - 0.4.47
Machine - sierra.

-rw-r--r-- 1 pmantha users 2323 Apr 7 15:27 example-pilot-api.py
drwxr-xr-x 3 pmantha users 4 Apr 7 15:31 work
(python)-bash-3.2$ cd work/
(python)-bash-3.2$ ls -ltr
total 42
drwxr-xr-x 3 pmantha users 5 Apr 7 15:31 bj-67536476-8101-11e1-b619-002215124496
-rw-r--r-- 1 pmantha users 40960 Apr 7 15:31 output.tar.gz
