ood_core's Introduction

OodCore

OnDemand core library with adapters for each batch scheduler.

Installation

Add this line to your application's Gemfile:

gem 'ood_core'

And then execute:

bundle

Or install it yourself as:

gem install ood_core

Usage

TODO: Write usage instructions here
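
In the meantime, here is a minimal sketch of typical usage, based on the adapter calls exercised in the issues below. The exact factory call is an assumption, and the config keys mirror the example Torque cluster config further down this page:

require "ood_core"

# Build a job adapter from the same keys used in a cluster config's job: section
# (Torque shown here as an illustration).
adapter = OodCore::Job::Factory.build(
  adapter: "torque",
  host: "ruby-batch.osc.edu",
  lib:  "/opt/torque/lib64",
  bin:  "/opt/torque/bin"
)

# Submit a script and query its status.
script = OodCore::Job::Script.new(content: "echo 'Hello World'")
id     = adapter.submit(script: script)
info   = adapter.info(id)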

Development

After checking out the repo, run bin/setup to install dependencies. Then, run bundle exec rspec spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update and commit the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/OSC/ood_core.

License

The gem is available as open source under the terms of the MIT License.

ood_core's People

Contributors

ashton22305, brianmcmichael, dependabot-preview[bot], dependabot[bot], ericfranz, georgiastuart, gerald-byrket, haroon26, hazelgrant, johrstrom, lukew3, matthu017, mjbludwig, mnakao, morganrodgers, nickjer, oglopf, plazonic, robinkar, scratchings, treydock, twavv, utkarshayachit

ood_core's Issues

LSF Adapter: add NodeInfo to Info using #procs for slots

Notes from an email:

Technically, LSF uses a construct called “job slots” which typically, as in our system, is configured to correspond to a core (although it needn’t necessarily). If a job runs on one job slot on one node, the execution hosts would be reported like “compute010”; two slots on a single node would be reported as “2*compute010”; and on up to multiple nodes, where a 48-slot job running on all slots of three nodes would be reported as “16*compute010 16*compute011 16*compute012”. Mixed numbers like “5*compute010 11*compute011 16*compute012” are valid as well.
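
A rough sketch (not the adapter's code) of turning such an exec host string into NodeInfo objects, treating a bare host name as one slot and assuming NodeInfo accepts name: and procs: keywords:

def parse_exec_host(exec_host)
  exec_host.split(/\s+/).map do |entry|
    # "16*compute010" => 16 slots on compute010; "compute010" => 1 slot
    procs, name = entry.include?("*") ? entry.split("*", 2) : ["1", entry]
    OodCore::Job::NodeInfo.new(name: name, procs: procs.to_i)
  end
end

parse_exec_host("16*compute010 16*compute011 16*compute012")
# => three NodeInfo objects, each with procs=16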

Strip leading and trailing whitespace for some `Job::Script` attributes

This was brought to our attention in the Service Now ticket INC0319296.

Basically a user was trying to submit a job with an account string that had a leading whitespace. So it was being submitted as:

qsub -A ' ACCOUNT'

The Job::Script#accounting_id should probably not return a string with leading or trailing whitespace. The same argument could probably be made for a few other attributes, such as #job_name and #reservation_id.
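
A minimal sketch of the proposed coercion (illustrative only, not the gem's current constructor code):

# Strip surrounding whitespace when the attribute is set, so a value like
# " ACCOUNT" is submitted as "ACCOUNT".
@accounting_id = accounting_id.nil? ? nil : accounting_id.to_s.strip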

Torque adapter should support an array for native

Right now the Torque adapter is the ood-ball for the native arguments. Every other adapter accepts an array of command line arguments, and the translation from job script headers to this array is easy to do once you understand the convention.

But with Torque it is a hash, and for each PBS header I have to find the corresponding key that the pbs-ruby gem uses and then use that in the hash.

A simple solution would be to extend the Torque adapter to accept a Hash or an Array and respond accordingly. I'm not sure, however, if it is easy in pbs-ruby to support arbitrary command line arguments to qsub.
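
A rough sketch of the proposed dispatch (headers and qsub_args are hypothetical locals, not the adapter's current variables):

case script.native
when Hash  then headers.merge!(script.native)                # current hash behavior
when Array then qsub_args.concat(script.native.map(&:to_s))  # raw CLI arguments
when nil   then nil                                          # nothing requested
else raise ArgumentError, "native must be a Hash or an Array"
end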

Add job name and account to info object

We are currently building LSF and Slurm adapters. These (and probably future adapters) should support the ability to get:

  • job name
  • account that a job is charged to

Duplication between Info#procs and Info#allocated_nodes?

      # Set of machines that is utilized for job execution
      # @return [Array<NodeInfo>] allocated nodes
      attr_reader :allocated_nodes

and

      # Number of procs allocated for job
      # @return [Fixnum, nil] allocated total number of procs
      attr_reader :procs

In what case will this be true:

info.allocated_nodes.map(&:procs).reduce(0, :+) != info.procs

Script#workdir as a Pathname object may be problematic for LSF 9+

Documentation for LSF 9.1.2: https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.2/lsf_command_ref/bsub.1.html

-cwd "current_working_directory"
Specifies the current working directory for job execution. The system creates the CWD if the path for the CWD includes dynamic patterns for both absolute and relative paths. LSF cleans the created CWD based on the time to live value set in the JOB_CWD_TTL parameter of the application profile or in lsb.params.
The path can include the following dynamic patterns:

  • %J - job ID
  • %JG - job group (if not specified, it will be ignored)
  • %I - index (default value is 0)
  • %EJ - execution job ID
  • %EI - execution index
  • %P - project name
  • %U - user name
  • %G - user group

For example, the following command creates /scratch/jobcwd/user1/_0/ for the job CWD:

bsub -cwd "/scratch/jobcwd/%U/%J_%I" myjob

Of course, this is an LSF specific feature for job submission.

We should change the documentation of Script#workdir to return a String, not a Pathname object. We would stop coercing workdir into a Pathname and just keep it as a String. The adapters would need to be updated accordingly, but since no one uses the accessors on Script outside of the adapters it should be a safe change.

The other option is the adapter should just use -cwd to the provided path if workdir is set, and then for LSF users if they want a path with "dynamic patterns" they can add that to the script headers. We would just need to be careful in apps like My Jobs to not "always set" the workdir to the job directory using Script#workdir because then we would remove the ability to customize the script via the headers in this regard. In that case, we would instead be sure to always cd to the desired job directory and then execute the script from there.

Note: LSF 8.3 doesn't support these dynamic parameters in cwd

LSF Adapter: Add support for LSF9+

In LSF 9+, bjobs offers more flags for requesting additional fields and regularly formatted output. With those we should be able to get an accurate runtime and other attributes.

Accounting for multiple queues available

In the cluster config, for jobs, we have this: (see original discussion OSC/ood_appkit#36)

jobs:
  adapter: torque
  host: "ruby-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"

Notice, there is no queue information listed. In particular, what default queue should be used, and what is the list of all queues available? Each resource manager offers a way to get a queue list (qstat -Q, squeue, bqueues, etc.). Perhaps ood_job adapters should have a method to return a list of queues (and information about them?) available for each resource manager; see the sketch after the list below.

At OSC we can currently submit jobs without specifying the queue and the job ends up in the appropriate queue. At other centers this may not be the case. For example, TSC's documentation asks users to specify the queue they want to submit to as a header in their batch scripts.

So the following issues exist:

  1. whether or not queue is a required argument when submitting a job
  2. if queue is required, how to get a list of available queues
  3. (optional) a default queue to use out of that list, if one must be set
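
A hypothetical shape for such a method (the name #queues and the parsing shown are illustrative and not part of the current API):

class OodCore::Job::Adapters::Torque
  # List the queues the resource manager knows about by shelling out to
  # `qstat -Q` and taking the first column after the two header lines.
  def queues
    `qstat -Q`.lines.drop(2).map { |line| line.split.first }.compact
  end
end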

Change definition of `Script#min_phys_memory`

Currently the definition of Script#min_phys_memory is:

The minimum amount of physical memory in kilobyte that should be available for the job

This is not possible for Slurm. Slurm only has:

The minimum amount of physical memory in kilobyte per node that should be available for the job

So one possible change in the definition is:

The minimum amount of physical memory in kilobyte across all nodes or per node (dependent upon the resource manager) that should be available for the job

Note: We can't compute the memory per node or memory across all nodes because the Script object may not know the number of nodes being requested.

Add feature to wait for web server port

The iHPC app's panel shows as "Running" in the Dashboard when the connection file is found in the Interactive Session's working directory. This file is generated right after the iHPC app's script is forked off, but not necessarily after the web server within that script is fully loaded. This leads to those "unable to connect" errors very early on for the user.

One option is to wait until the web server is fully loaded before providing the user with the "Connect To Server" button in the Dashboard panel. So we would need a Bash helper method that waits until the specified port that the web server listens on is used. Then we use this method right before we generate the connection file in the after.sh script.

An example being...

# after.sh

# Wait for the Jupyter Notebook server to start
echo "Waiting for Jupyter Notebook server to open port ${port}..."
if wait_until_port_used "${host}:${port}" 60; then
  echo "Discovered Jupyter Notebook listening on port ${port}!"
else
  echo "Timed out waiting for Jupyter Notebook to open port ${port}!" ; exit 1
fi
sleep 2

LSF Adapter: Add support for all job submission options

Most important:

  • 1. Script#native
  • 2. job_environment

Other Script parameters that look like they are supported:

  • start_time -b [[year:][month:]day:]hour:minute

  • submit_as_hold -H

  • rerunnable -r

  • email -u mail_user

    -u mail_user
    Sends mail to the specified email destination. To
    specify a Windows user account, include the domain
    name in uppercase letters and use a single
    backslash (DOMAIN_NAME\user_name) in
    a Windows command line or a double backslash
    (DOMAIN_NAME\\user_name) in a UNIX
    command line.

  • email_on_started -B

  • email_on_terminated -N

  • wall_time

    -W [hour:]minute[/host_name |
    /host_model]
    Sets the runtime limit of the batch job. If a UNIX
    job runs longer than the specified run limit, the
    job is sent a SIGUSR2 signal, and is killed if it
    does not terminate within ten minutes. If a
    Windows job runs longer than the specified run
    limit, it is killed immediately. (For a detailed
    description of how these jobs are killed, see
    bkill.)

  • error_path -e

  • output_path -o

  • input_path -i input_file (supports %I and %J in input file name) or -is

  • priority -sp priority (integer 1 - MAX_USER_PRIORITY)

  • queue_name

  • reservation_id: -U reservation_ID

These don't look like they are supported

  • args
  • join_files
  • min_phys_memory: -M [MB] according to the rosetta; the LSF 8.3 man page says:

    -M mem_limit
    Sets a per-process (soft) memory limit for all the
    processes that belong to this batch job (see
    getrlimit(2)).

    By default, the limit is specified in KB. Use
    LSF_UNIT_FOR_LIMITS in lsf.conf to specify a
    larger unit for the limit (MB, GB, TB, PB, or EB).
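
A rough sketch (not the adapter's code) of how some of the options listed above might be translated into bsub arguments, assuming the usual Script readers and that wall_time is given in seconds:

args = []
args.concat ["-q", script.queue_name]              unless script.queue_name.nil?
args.concat ["-U", script.reservation_id]          unless script.reservation_id.nil?
args.concat ["-W", (script.wall_time / 60).to_s]   unless script.wall_time.nil?
args.concat ["-u", Array(script.email).join(",")]  unless Array(script.email).empty?
args << "-H" if script.submit_as_hold
args << "-r" if script.rerunnable
args << "-B" if script.email_on_started
args << "-N" if script.email_on_terminated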

LSF Adapter: add "estimated runtime" by subtracting current time from start time

It is an estimate. If the job is never suspended after starting, this will be accurate.

LSF 9+ offers the ability to modify the output of bjobs and include the runtime in that output, so we will be able to provide a more accurate runtime for later versions at that time.

This will let a currently empty column in Active Jobs be populated with a value.
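
A minimal sketch of the estimate (assuming start_time is a Unix timestamp and the job is currently running):

# Estimated wallclock time: seconds elapsed since the job started. Accurate
# only if the job was never suspended after starting.
wallclock_time = start_time && (Time.now.to_i - start_time)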

The used_port helper fails if no host specified

The used_port bash helper in batch connect fails if no host is specified. You get:

$ expr "22" : '\(.*\):' 2>/dev/null || echo "localhost"

localhost

Notice the blank line that should not be present. But it succeeds if you pass in a host:

$ expr "host:22" : '\(.*\):' 2>/dev/null || echo "localhost"
host

Slurm node list

After using TACC I noticed a new format that the node list can come in:

c427-032,c429-002

I do not believe this is covered by the current Slurm adapter. In fact, I need to test the following formats:

c457-[011-012]
c439-021,c450-033
c439-[121-122]
c438-[062,104]
c433-[011,013]
c438-[052-053]
c431-[012,072]
c427-032,c429-002
c410-102,c414-004
c457-[001-002]
c474-[004,022]
c452-[054,121]
c453-[101,112]
c454-[021,064]
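
A rough sketch of expanding these expressions into individual node names (assuming the bracketed forms above are the only ones in play; this is not the adapter's actual parser):

def expand_nodelist(nodelist)
  # Split on commas that are not inside brackets, keeping any bracket group
  # attached to its prefix.
  entries = nodelist.scan(/[^,\[]+(?:\[[^\]]*\])?/)
  entries.flat_map do |entry|
    if entry =~ /\A(.+)\[(.+)\]\z/
      prefix, ranges = $1, $2
      ranges.split(",").flat_map do |range|
        if range =~ /\A(\d+)-(\d+)\z/
          width = $1.length
          ($1.to_i..$2.to_i).map { |n| "#{prefix}#{n.to_s.rjust(width, '0')}" }
        else
          ["#{prefix}#{range}"]
        end
      end
    else
      [entry]
    end
  end
end

expand_nodelist("c457-[011-012]")     # => ["c457-011", "c457-012"]
expand_nodelist("c427-032,c429-002")  # => ["c427-032", "c429-002"]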

Support `headers.sh` for adding directives

This would be loaded above the script_wrapper, if it exists in the template.

The goal of this would be to make it easier to add custom arguments using header directives that people are used to working with, instead of the only option being to modify submit.yml, which can be more challenging because you have to translate the header directives into either an array or a hash depending on which adapter you are using.

Also, in the future we may support adapters that use the C library for LSF, Slurm, PBSPro, etc., like we do for Torque. The headers.sh would remain the same for the specific resource manager, regardless of the adapter type used.

All of our example apps could have a headers.sh with a single comment # add custom resource manager directives here.

Remove Job::NodeRequest

Due to the complexity in requesting nodes, tasks, cores, gpus, and other properties on a node on PBS, Slurm, and LSF it may be best to remove support for OodCore::Job::NodeRequest for the time being.

If an app wants to request node-like options then it will use the #native feature for the corresponding resource manager library.

Inconsistency in slurm spec tests?

In slurm_spec we have this:

describe "#submit" do
  def build_script(opts = {})
    OodCore::Job::Script.new(
      {
        content: content
      }.merge opts
    )
  end
  # ...
  subject { adapter.submit(script: build_script) }

  it "returns job id" do
    is_expected.to eq("job.123")
    expect(slurm).to have_received(:submit_string).with(content, args: [], env: {})
  end

  context "with :queue_name" do
    before { adapter.submit(script: build_script(queue_name: "queue")) }

    it { expect(slurm).to have_received(:submit_string).with(content, args: ["-p", "queue"], env: {}) }
  end

We specify subject to be the return value of the method call adapter.submit(script: build_script). But this subject is only used for one test (it "returns job id" do), as what follows are multiple contexts where the "subject" of the context is actually the call made in the before block.

Would it be more appropriate for the initial test to work the same way? According to http://betterspecs.org/#subject the use of subject is for multiple tests sharing the same subject, but we don't seem to have that here.

Just trying to understand, see if I'm missing something.

LSF job not ending if batch script exits

If the batch script exits but the forked off template/script.sh is still running then LSF keeps the batch job alive.

This is problematic as I have the batch script exit if it times out waiting for the forked server to open its assigned port. The user will then see their Session in a perpetual "Starting..." state.

OodCluster needs documentation

OodCore is pretty much only documented at the code-level at this point.

The OodCluster object in particular is in need of README-level documentation of its public methods to really be usable for app development by outsiders.
https://github.com/OSC/ood_core/blob/master/lib/ood_core/cluster.rb

At 0.0.4, this repo probably isn't stable enough to undertake a full documentation workup, but I wanted to put it out there as a pain point.

Deprecate `v1` backwards compatibility?

I feel we can safely deprecate the following code:

# Parse a list of clusters from a 'v1' config
# NB: Makes minimum assumptions about config
def parse_v1(id:, cluster:)
  c = {
    id: id,
    metadata: {},
    login: {},
    job: {},
    acls: [],
    custom: {}
  }
  c[:metadata][:title] = cluster["title"] if cluster.key?("title")
  c[:metadata][:url] = cluster["url"] if cluster.key?("url")
  c[:metadata][:private] = true if cluster["cluster"]["data"]["hpc_cluster"] == false
  if l = cluster["cluster"]["data"]["servers"]["login"]
    c[:login][:host] = l["data"]["host"]
  end
  if rm = cluster["cluster"]["data"]["servers"]["resource_mgr"]
    c[:job][:adapter] = "torque"
    c[:job][:host] = rm["data"]["host"]
    c[:job][:lib] = rm["data"]["lib"]
    c[:job][:bin] = rm["data"]["bin"]
    c[:job][:acls] = []
  end
  if v = cluster["validators"]
    if vc = v["cluster"]
      c[:acls] = vc.map do |h|
        {
          adapter: "group",
          groups: h["data"]["groups"],
          type: h["data"]["allow"] ? "whitelist" : "blacklist"
        }
      end
    end
  end
  c
end

as all HPC centers that I have worked with on installing OOD use the new v2 cluster config.

Also, the v1 backwards compatibility wouldn't support My Jobs and Active Jobs.

Hostname doesn't give correct host all the time

This line:

"host=$(hostname)\n[[ -e \"#{before_file}\" ]] && source \"#{before_file}\""

uses hostname to get the host of the machine.

At Arizona, this gives:

┌─[jnicklas@i1n5][~]
└─▪ hostname
i1n5

which we are unable to SSH to from the OnDemand node. Maybe this can be fixed by the sys admins, but an alternative solution may need to be looked into. For example:

┌─[jnicklas@i1n5][~]
└─▪ hostname -A
i1n5.ocelote.hpc.arizona.edu i1n5.cm.cluster i1n5.ib.cluster 

where I am able to successfully SSH to i1n5.ocelote.hpc.arizona.edu from the OnDemand node.

Define Adapter#tr for localization support

In Qt, localization is done by wrapping a tr method around every string. In Rails, it is the I18n.translate method (see http://guides.rubyonrails.org/i18n.html), which has the shorthand t; i.e. instead of

  def index
    flash[:notice] = "Hello flash!"
  end

you would do

  def index
    flash[:notice] = t(:hello_flash)
  end

And then in config/locales/en.yml you would have:

en:
  hello_flash: Hello flash!

and in config/locales/fr.yml you would have:

fr:
  hello_flash: Bonjour Flash

Typically this works because there is some global value like I18n.locale == 'en' or I18n.locale == 'fr' specifying the locale to use, so when the translate or t method is called it knows what value to return for the key :hello_flash.

In OnDemand's case, our "locale" is the adapter subclass type being used (Slurm, PBS Pro, Torque), and an example of a word that needs translating is "queue" (for Torque) versus "partition" (for Slurm).

So instead we should use something like:

Adapter#tr(:queue), for which the base class returns "queue" (just :queue.to_s), the Slurm adapter returns "partition", and the Torque adapter returns "queue". I.e. the base class implementation:

def tr(word)
  word.to_s
end

and subclass:

def tr(word)
  { queue: "partition" }.fetch(word, super(word))
end

Export host and port

The host and port env vars defined in before.sh should be made available to the forked script.sh file. This can be done by exporting them.

Do not export the passwd env var though.

Should we keep maintaining separate gems for resource manager adapters?

This issue is to capture a discussion around the merits of continuing to maintain separate gems for resource manager adapters, now that we have ood_core.

Original suggestion from @nickjer on the LSF performance issues sparked this discussion:

@nickjer:

I feel if you go this route, it may be best to break this off into a separate gem much like pbs-ruby.

@ericfranz:

I don't think we will realize any benefits by breaking this off into a separate gem.

Add an Job::Adapter#info_where as alternative to info_all

An example: Adapter#info_where(user: "efranz") or maybe better is Adapter#info_where_user("efranz"). We could also add Adapter#info_where_queue("debug") etc.

The Adapter superclass can offer a default implementation, which does a select on the results from info_all.

Individual adapters can optionally override with an optimized implementation. For example, bjobs by default shows only the user's jobs.

It could also be an Adapter#info_where filter that accepts a hash whose keys are methods on Job::Info.
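
The default could look something like this (a sketch only; it assumes Info responds to the given attribute names):

# Possible default on the Adapter superclass: filter the results of info_all by
# comparing each requested attribute against the corresponding Info reader.
def info_where(attrs = {})
  info_all.select do |info|
    attrs.all? { |attr, value| info.respond_to?(attr) && info.send(attr) == value }
  end
end

# adapter.info_where(user: "efranz")       # only jobs owned by efranz
# adapter.info_where(queue_name: "debug")  # only jobs in the debug queue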

LSF Adapter: Address performance issues

Need to do a little performance analysis. A better algorithm might do. However, there is a lot of parsing and string manipulation going on. Using the C library might make it faster (via Fiddle http://ruby-doc.org/stdlib-2.2.0/libdoc/fiddle/rdoc/Fiddle.html).

It was observed when testing that 30 jobs can take 1-2 seconds and 4000 jobs could take 28 seconds. This is way too long.

Many Fiddle tutorials online. One example: http://blog.honeybadger.io/use-any-c-library-from-ruby-via-fiddle-the-ruby-standard-librarys-best-kept-secret/

See http://publibfp.dhe.ibm.com/epubs/pdf/c2753121.pdf and lsb_openjobinfo() and lsb_readjobinfo() and jobInfoEnt structure. Also, here is some example code that uses the C library: https://github.com/PlatformLSF/lsf-drmaa/blob/025f9c49af48e410dc0ab0b9c611c42935dc09eb/lsf_drmaa/job.c

Does not detect listening port if on specific ip

If a server is listening on a specific ip and port combination, then the bash helpers do not properly detect if the port is open. That is because the bash helpers just check for open ports on localhost.

The helper should allow the user to specify the ip and port combo when checking if the port is being used.

LSF adapter is treating cores as nodes

Requesting a job on a cluster whose nodes have 20 cores each, like so:

bsub -n 10 -R "span[ptile=10]"

will give you a single node with access to 10 out of the 20 cores on it.

This is what I currently see in the LSF adapter when viewing info for that job:

OodAppkit.clusters['ada'].job_adapter.info("7168816").allocated_nodes
=> [
     #<OodCore::Job::NodeInfo:0x000000021fbc18 @name="sx6036-1202", @procs=1>,
     #<OodCore::Job::NodeInfo:0x000000021fb920 @name="sx6036-1202", @procs=1>,
     #<OodCore::Job::NodeInfo:0x000000021fb538 @name="sx6036-1202", @procs=1>,
     #<OodCore::Job::NodeInfo:0x000000021fafc0 @name="sx6036-1202", @procs=1>,
     #<OodCore::Job::NodeInfo:0x000000021fa1b0 @name="sx6036-1202", @procs=1>,
     #<OodCore::Job::NodeInfo:0x000000021f9f08 @name="sx6036-1202", @procs=1>,
     #<OodCore::Job::NodeInfo:0x000000058cbf68 @name="sx6036-1202", @procs=1>,
     #<OodCore::Job::NodeInfo:0x000000058cbec8 @name="sx6036-1202", @procs=1>,
     #<OodCore::Job::NodeInfo:0x000000058cbe28 @name="sx6036-1202", @procs=1>,
     #<OodCore::Job::NodeInfo:0x000000058cbd88 @name="sx6036-1202", @procs=1>
   ]

This should instead be:

OodAppkit.clusters['ada'].job_adapter.info("7168816").allocated_nodes
=> [
     #<OodCore::Job::NodeInfo:0x000000021fbc18 @name="sx6036-1202", @procs=10>
   ]

Debugging info:

$ bjobs -a -w -W 7168816
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME  PROJ_NAME CPU_USED MEM SWAP PIDS START_TIME FINISH_TIME SLOTS
7168816 jnicklas RUN   sn_short   login7      sx6036-1202:sx6036-1202:sx6036-1202:sx6036-1202:sx6036-1202:sx6036-1202:sx6036-1202:sx6036-1202:sx6036-1202:sx6036-1202 sys/dashboard/dev/jupyter 01/24-14:53:50 082810563939 000:00:07.00 75     0      31211,31395,31399,31419,31761 01/24-14:53:51 -  10
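
A sketch of the intended collapsing (assuming NodeInfo takes name: and procs: keywords; not the adapter's current code):

# Collapse the colon-delimited EXEC_HOST field (one entry per slot) into one
# NodeInfo per distinct host, with procs set to the number of slots on it.
exec_host = "sx6036-1202:sx6036-1202:sx6036-1202"  # abbreviated example
allocated_nodes = exec_host.split(":").group_by(&:itself).map do |name, slots|
  OodCore::Job::NodeInfo.new(name: name, procs: slots.size)
end
# => [#<OodCore::Job::NodeInfo @name="sx6036-1202", @procs=3>]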

Test fails when local time is not Eastern Standard Time

Test fails because it is testing the conversion of a timestamp into a "year month day ..." string using a local time conversion. The expected result is hardcoded assuming Eastern Standard Time.

context "with :start_time" do
before { adapter.submit(script: build_script(start_time: 1478631234)) }
it { expect(pbs).to have_received(:submit_string).with(content, queue: nil, headers: {Execution_Time: "201611081353.54"}, resources: {}, envvars: {}) }
end

[efranz@gwdev02 ood_core]$ bundle exec rspec ./spec/job/adapters/torque_spec.rb:198
Run options: include {:locations=>{"./spec/job/adapters/torque_spec.rb"=>[198]}}
F

Failures:

  1) OodCore::Job::Adapters::Torque#submit with :start_time should have received submit_string("my batch script", {:queue=>nil, :headers=>{:Execution_Time=>"201611081353.54"}, :resources=>{}, :envvars=>{}}) 1 time
     Failure/Error: it { expect(pbs).to have_received(:submit_string).with(content, queue: nil, headers: {Execution_Time: "201611081353.54"}, resources: {}, envvars: {}) }

       #<Double (anonymous)> received :submit_string with unexpected arguments
         expected: ("my batch script", {:queue=>nil, :headers=>{:Execution_Time=>"201611081353.54"}, :resources=>{}, :envvars=>{}})
              got: ("my batch script", {:queue=>nil, :headers=>{:Execution_Time=>"201611081253.54"}, :resources=>{}, :envvars=>{}})
       Diff:
       @@ -1,6 +1,6 @@
        ["my batch script",
         {:queue=>nil,
       -  :headers=>{:Execution_Time=>"201611081353.54"},
       +  :headers=>{:Execution_Time=>"201611081253.54"},
          :resources=>{},
          :envvars=>{}}]

     # ./spec/job/adapters/torque_spec.rb:198:in `block (4 levels) in <top (required)>'

Finished in 0.02129 seconds (files took 0.22371 seconds to load)
1 example, 1 failure

Failed examples:

rspec ./spec/job/adapters/torque_spec.rb:198 # OodCore::Job::Adapters::Torque#submit with :start_time should have received submit_string("my batch script", {:queue=>nil, :headers=>{:Execution_Time=>"201611081353.54"}, :resources=>{}, :envvars=>{}}) 1 time
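
One possible way to make the expectation independent of the local zone (a sketch; it assumes the adapter formats Execution_Time with the local zone, as the failure above suggests):

context "with :start_time" do
  # Derive the expected Execution_Time from the same timestamp instead of
  # hardcoding an Eastern-time string.
  let(:execution_time) { Time.at(1478631234).strftime("%Y%m%d%H%M.%S") }

  before { adapter.submit(script: build_script(start_time: 1478631234)) }

  it { expect(pbs).to have_received(:submit_string).with(content, queue: nil, headers: {Execution_Time: execution_time}, resources: {}, envvars: {}) }
end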

Make some `Info` and `NodeInfo` attributes optional

Make some Info and NodeInfo attributes optional. For example:

job_info.cpu_time
# => nil

Some of these attributes (in particular Info#submit_host, Info#cpu_time, and NodeInfo#procs for Slurm) cannot be retrieved for a given resource manager. Setting them to nil will make them easy to check for existence and lets the app display a default value if it wants.

For example:

<li class="job-info">
  <%= content_tag :ul, "Job Id = #{info.job_id}" %>
  <%# Don't display list item if it doesn't exist %>
  <%= content_tag :ul, "Submit Host = #{info.submit_host}" if info.submit_host %>
  <%# Set a default value for list item if it doesn't exist %>
  <%= content_tag :ul, "CPU Time = #{info.cpu_time || "Not Supported"}" %>
</li>

Split Adapter#info into separate methods?

We have one method that has 2 different ways of executing (one with id specified and one without) and two different return types:

  1. if id is specified, implement algorithm optimized for getting the info of one job, and return a hash
  2. if id is not specified, implement an algorithm optimized for getting the info of all the jobs, and return an array

It seems like we should split these into two separate methods. I think we overlooked this because the underlying implementation uses a single command (qstat, bjobs, squeue).

Ideas: could be info_all and info_find(id:) or just info_all and info(id:). That said, it's rather late in the game to make this change, so it might be expensive.
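
A sketch of what the split might look like, with the superclass providing a slow default for the single-job variant (it assumes Info exposes the job's id):

# Possible default for the single-job variant: reuse the full listing. Each
# adapter can override this with a query that only asks the scheduler about
# the one id.
def info(id:)
  info_all.find { |job| job.id.to_s == id.to_s }
end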

Provide way to source Bash helper methods to remote hosts

The Bash helper methods:

  • create_passwd
  • find_port
  • ...

may need to be used on the host machines assigned to the batch job aside from the master node. One example is to start servers on the worker nodes using pbsdsh .... In order to start the servers we need to choose an available port to listen on using the find_port helper function.

One simple way to make the Bash helper methods more portable is to wrap them up in another Bash function such as...

source_helpers () {
  find_port () {
    # ...
  }
  create_passwd () {
    # ...
  }
  # ...
}
export -f source_helpers

By calling source_helpers in the main script, all of those functions become available to it.

To make it available to pbsdsh scripts we could do...

pbsdsh bash -c "
  $(declare -f source_helpers)
  source_helpers

  ./start_server --port \$(find_port)
" &

The declare statement basically dumps the code for the helper functions in-place. Then we call that function to make the helpers available.

Submitting with native arguments for all adapters that accept arrays should also accept hash

Currently, for all adapters except Torque, we can set Script#native to an array of custom arguments, i.e.

["-n", "5"]

Arrays don't work well for merging different sets of submission arguments. We could update these adapters to accept either an array or a hash. One issue with a hash is how to express flags that take no argument. A solution could be that a nil value is simply omitted, leaving the bare flag. Example:

native = { "-n" => "5", "-R" => "span[ptile=2]", "-B" => nil, "-N" => nil }
native.to_a.flatten.compact
# => ["-n", "5", "-R", "span[ptile=2]", "-B", "-N"]
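
A hash also merges cleanly when several sources contribute arguments (illustrative values only):

defaults  = { "-n" => "5", "-q" => "general" }
overrides = { "-q" => "debug", "-B" => nil }   # -B is a bare flag
defaults.merge(overrides).to_a.flatten.compact
# => ["-n", "5", "-q", "debug", "-B"]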
