flux-framework / flux-sched

Fluxion Graph-based Scheduler

License: GNU Lesser General Public License v3.0

Languages: Shell 16.24%, C 2.08%, Python 2.92%, C++ 75.09%, Dockerfile 0.25%, Perl 0.43%, Lua 0.23%, CMake 1.93%, Go 0.82%
Topics: hpc, job-scheduler, scheduling, workflows, radiuss

flux-sched's Introduction


NOTE: The interfaces of flux-sched are being actively developed and are not yet stable. The GitHub issue tracker is the primary way to communicate with the developers.

Fluxion: An Advanced Graph-Based Scheduler for HPC

Welcome to Fluxion¹, an advanced job scheduling software tool for High Performance Computing (HPC). Fluxion combines graph-based resource modeling with efficient temporal plan management schemes to schedule a wide range of HPC resources (e.g., compute, storage, and power) in a highly scalable, customizable, and effective fashion.

Fluxion has been integrated with flux-core to provide it with both system-level batch job scheduling and nested workflow-level scheduling.

If you want to test your advanced HPC resource modeling and selection ideas with Fluxion in a simplified, easy-to-use environment, see our resource-query utility.
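
For example, resource-query can load a resource graph definition and answer match queries interactively. A sketch (the file name is hypothetical; check resource-query --help for the authoritative flags):

resource-query -L my-cluster.graphml -S CA -P high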

Fluxion Scheduler in Flux

Fluxion introduces queuing and resource matching services to extend Flux to provide advanced batch scheduling. Jobs are submitted to Flux as usual, and Fluxion makes a schedule to assign available resources to the job requests according to its configured algorithm.

Fluxion installs two modules that are loaded by the Flux broker:

  • sched-fluxion-qmanager, which manages one or more prioritized job queues with configurable queuing policies (fcfs, easy, conservative, or hybrid); see the example below.
  • sched-fluxion-resource, which matches resource requests to available resources using Fluxion's graph-based matching algorithm.
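
For instance, a queuing policy can be selected when loading qmanager by hand. A sketch (the queue-policy option name follows flux-sched's documentation; verify against your installed version):

flux module load sched-fluxion-qmanager queue-policy=easy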

Building Fluxion

Fluxion requires an installed flux-core package. Instructions for installing flux-core can be found in the flux-core README.


Fluxion also requires the following packages to build:

redhat            ubuntu                    version            note
hwloc-devel       libhwloc-dev              >= 1.11.1
boost-devel       libboost-dev              == 1.53 or > 1.58  1
boost-graph       libboost-graph-dev        == 1.53 or > 1.58  1
boost-system      libboost-system-dev       == 1.53 or > 1.58  1
boost-filesystem  libboost-filesystem-dev   == 1.53 or > 1.58  1
boost-regex       libboost-regex-dev        == 1.53 or > 1.58  1
libedit-devel     libedit-dev               >= 3.0
python3-pyyaml    python3-yaml              >= 3.10
yaml-cpp-devel    libyaml-cpp-dev           >= 0.5.1

Note 1: Boost package versions 1.54-1.58 contain a bug that leads to a compilation error.

The following optional dependencies enable additional testing:

redhat            ubuntu     version
valgrind-devel    valgrind
jq                jq

Installing RedHat/CentOS Packages:

sudo dnf install hwloc-devel boost-devel boost-graph boost-system boost-filesystem boost-regex libedit-devel python3-pyyaml yaml-cpp-devel

Installing Ubuntu Packages:

sudo apt-get update
sudo apt install libhwloc-dev libboost-dev libboost-system-dev libboost-filesystem-dev libboost-graph-dev libboost-regex-dev libedit-dev libyaml-cpp-dev python3-yaml

Clone flux-sched (the repository name for Fluxion) from an upstream repo and prepare for configuration:

git clone <flux-sched repo of your choice>
cd flux-sched

Fluxion uses a CMake-based build system and can be configured and built as usual for CMake projects. If you wish, you can use one of our presets; the default is a RelWithDebInfo build using Ninja that is good for most purposes:

cmake -B build --preset default
cmake --build build
cmake --build build -t install
ctest --test-dir build
# OR
cmake -B build
make -C build
make -C build check
make -C build install
# OR
cmake -B build -G Ninja
ninja -C build
ninja -C build check
ninja -C build install

If you prefer the autotools style, we match the rest of the Flux project by offering a configure script that provides the familiar autotools interface but uses CMake underneath.

The build system will attempt to find flux-core in the same prefix as specified on the command line. If -DCMAKE_INSTALL_PREFIX (or --prefix for configure) is not specified, it defaults to the prefix used to install the first flux executable found in PATH. Therefore, if which flux returns the version of flux-core against which Fluxion should be compiled, the configuration may be run without any arguments. If flux-core is side-installed, the prefix should be set to the one used to install the target flux-core. For example, if flux-core was installed in $FLUX_CORE_PREFIX:

cmake -B build --preset default -DCMAKE_INSTALL_PREFIX="$FLUX_CORE_PREFIX"
cmake --build build
ctest --test-dir build
cmake --build build -t install
# OR
mkdir build
cd build
../configure --prefix=${FLUX_CORE_PREFIX}
make
make check
make install

To build the Go bindings, you will need Go (tested with 1.19.10) available, and then:

export WITH_GO=yes
cmake -B build
cmake --build build
ctest --test-dir build
cmake --build build -t install

To run just one test, you can cd into t in the build directory and then run the script from the source directory, or use the usual ctest options to filter by regex:

$ cd build/t
$ ../../t/t9001-golang-basic.t 
ok 1 - match allocate 1 slot: 1 socket: 1 core (pol=default)
ok 2 - match allocate 2 slots: 2 sockets: 5 cores 1 gpu 6 memory
# passed all 2 test(s)
1..2
# OR
cd build
ctest -R t9001 --output-on-failure

To run the full tests (more robust; this mimics what happens in CI), run:

ctest

Flux Instance

The examples below walk through exercising functioning flux-sched modules (i.e., sched-fluxion-qmanager and sched-fluxion-resource) in a Flux instance. The following examples assume that flux-core and Fluxion were both installed into ${FLUX_CORE_PREFIX}. For greater insight into what is happening, add the -v flag to each flux command below.

Create a comms session composed of three brokers:

${FLUX_CORE_PREFIX}/bin/flux start -s3

This will create a new shell in which you can issue various flux commands, such as the following.

Check to see whether the qmanager and resource modules are loaded:

flux module list

Submit jobs:

flux submit -N3 -n3 hostname
flux submit -N3 -n3 sleep 30

Examine the status of these jobs:

flux jobs -a

Examine the output of the first job:

flux job attach <jobid printed from the first submit>

Examine the ring buffer for details on what happened:

flux dmesg

Exit the Flux instance:

exit

¹ The name was inspired by Isaac Newton's Method of Fluxions, where fluxions and fluents are the key terms defining his calculus. As his calculus describes the motion of points in time for time-varying variables, our Fluxion scheduler uses scalable techniques to describe the motion of scheduled points in time for a diverse set of resources.

License

SPDX-License-Identifier: LGPL-3.0

LLNL-CODE-764420

flux-sched's People

Contributors

augu5te, bhatiaharsh, chu11, cmisale, cmoussa1, dmargala, dongahn, dupgit, garlick, grondo, heinzsamuel, jameshcorbett, jedbrown, jwakely, ksuarez1423, lipari, mergify[bot], milroy, morrone, stevwonder, tpatki, trws, vsoch, wihobbs, zekemorton


flux-sched's Issues

Need an ability to (re)construct an in-memory RDL from a serialized RDL subset

To study multilevel hierarchical scheduling schemes (by @SteVwonder), we need this capability.

Specifically, the scheduler of a Flux instance will select an RDL subset and serialize it and send it to a child Flux instance. The child will then deserialize this RDL subset into its own full RDL.

Though we don't have hierarchical launching yet, we can easily emulate the parent-child relationship by invoking multiple independent Flux instances and pretending that they form the relationship. But these instances will need to serialize/deserialize RDLs freely to study the attributes of multilevel hierarchical scheduling.

flux-sched needs testsuite and CI support

The flux-sched framework project (and other framework projects) should get tests and Travis CI integration like flux-core. In fact, we should have some scheme to easily add this support to all framework projects.

sched module should set errno when exiting mod_main()

If the sched module is loaded without an rdl-conf option, it returns -1 from mod_main. The broker code then does this:

if (p->main(p->h, ac, av) < 0) {
    err ("%s: mod_main returned error", p->name);
    goto done;
}

The err() function tries to decode errno and print it, but since errno is never set by the sched module, a residual value is picked up, e.g.

lt-flux-broker: sched: mod_main returned error: Bad file descriptor

I would just set errno = EINVAL in the !path case of load_rdl() so users don't see an error that throws them off course.
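
A minimal sketch of that fix inside load_rdl(), with the surrounding code paraphrased rather than quoted:

/* In load_rdl(): without this, mod_main's -1 leaves errno stale. */
if (!path) {
    errno = EINVAL;   /* give the broker's err() a meaningful code to decode */
    return NULL;
}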

Invoke scheduler automatically when launching flux from flux

Until now, the scheduler service has been invoked through a two-step command sequence:
~/flux-core/src/cmd/flux -M ~/flux-sched/sched -C ~/flux-sched/rdl/\?.so -L ~/flux-sched/rdl/\?.lua start -s 3
~/flux-core/src/cmd/flux -M ~/flux-sched/sched -C ~/flux-sched/rdl/\?.so -L ~/flux-sched/rdl/\?.lua module load sched rdl-conf=../conf/hype.lua

When flux launches flux, this needs to be automated. The simplest way seems to be adding sched[0] to the default_modules in the broker. That way, the root instance as well as any child instance can directly start the sched module during startup.
root:
flux -M ~/flux-sched/ start -s2
child:
flux -M ~/flux-sched/ broker

I am not sure if that's the way to go for production, but it serves my immediate purposes.

But a follow-up issue on that is the RDL. The scheduler needs to know its rdl-conf during startup. This can be achieved by having the root flux look into a default directory/file (~/flux-sched/conf/default-conf.lua) and then provide an additional argument to the broker (-C dir/conf.lua) when the parent flux launches the child flux. That means the parent scheduler prepares a conf for the child and then passes it as an argument.

Another way to do this without having to extend the broker: when the child scheduler starts up, it has to handshake with the parent (which is inevitable anyway for dynamic scheduling). As part of that handshake, the parent can pass the config (as JSON objects) to the child scheduler.

I am more inclined to do the latter.

If there is a really correct way to do this, I would switch to that approach. Or if the right approach needs to evolve with time/other aspects, I would go ahead with this for now so that I can start with the scheduling problem. Please let me know your thoughts.

Migrate backfill scheduling support from the sched framework module to its own plugin.

The latest enhancements that added backfill scheduling were helpful and carefully done so as to not impact the existing FCFS operation. However, the vision for the schedsrv framework service was to provide the foundation for a variety of scheduling algorithms. Backfill is the first such alternative algorithm to be developed. This issue is to move backfill scheduling out of the sched framework and into its own plugin.

Support for checking consistency between resrc and hwloc data

Some of the problems are mentioned in #58 as well as flux-core#469. We need a way to perform a consistency check between "self-discovered" hwloc resource data and resrc data in resrc reader mode. When the manually specified resrc/rdl configuration is inconsistent with the hwloc resource data, we need flux-sched to print out an error message and fall back to hwloc reader mode in populating the resrc structure.

With any inconsistency, certain process-spawn requests against a job allocation will not be recognized by broker ranks when the new scheme is designed/implemented as part of flux-core#469, and this would lead to other undesirable side effects. Of course, this check could be done at the flux-core level as well. For example, one could add a consistency-check service directly into flux-core, which would perform the check using a distributed algorithm similar to the wrexecs' intersection checks. If designed right, this should be more scalable, but it would come at the cost of increased complexity.

The easiest way to do this is for flux-sched to fetch the hwloc data and use it to check consistency against each node-type resrc object. The equality check can be customized if needed.
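
A minimal sketch of that per-node equality check; resrc_num_children_of_type is a hypothetical accessor, not an existing API:

#include <hwloc.h>
#include <stdbool.h>

/* Compare the core count hwloc discovered for a broker rank against the
 * node-type resrc object built from the rdl/resrc configuration. */
static bool node_matches_hwloc (resrc_t *node, hwloc_topology_t topo)
{
    int ncores = hwloc_get_nbobjs_by_type (topo, HWLOC_OBJ_CORE);
    return ncores == resrc_num_children_of_type (node, "core");
}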

I'd like to hear from others about a good way to implement this. I'm open to either approach at this point.

vpath functionality under automake

In autoconfiscating flux-sched, I rely on automake and libtool. The Makefiles that these two tools create do not elicit the vpath functionality that is present in the old, hand-written Makefiles.

For example, these are present in the flux-sched/Makefile.inc:

vpath %.c $(FLUX_SRCDIR)/src/bindings/lua
vpath %.c $(FLUX_SRCDIR)/src/common/liblsd
vpath %.c $(FLUX_SRCDIR)/src/common/libutil

And when make goes to find the .c source to make xzmalloc.o, for example, it searches the vpath directories above until it finds xzmalloc.c.

I have not found how this mechanism can work with automake/libtool. There is vpath support, but it is for building in a separate directory outside of the source directory.

I did not find a way to construct the Makefile.am files that would activate the vpath directory search that the old Makefiles do. I created a workaround that suffices, but it is admittedly a bit of a kludge. I'm all for replacing it if anyone knows the proper way to write a Makefile.am file that would do the vpath search.

Q: use of lwj.next-id by flux-sched/simulator

It doesn't appear that anything in flux-sched still depends on lwj.next-id as it is currently used by wreck's job module, i.e., as storage for the next identifier to use for a job. I was considering replacing this key with a builtin sequence number in the job module, but I do notice that the simulator at least installs a watch on this key. Since we have wrexec events working now, does any part of sched still depend on this key, and if so, could it easily be migrated? If not, it is no big deal to keep next-id up to date.

move off of deprecated flux-core functions

Since we're looking to be able to build flux-sched against a packaged and installed version of flux-core, flux-sched should stop using internal-only deprecated functions:

simulator.c: In function ‘send_alive_request’:
simulator.c:234: warning: ‘compat_rank’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/info.h:10)
simulator.c: In function ‘send_join_request’:
simulator.c:279: warning: ‘compat_rank’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/info.h:10)
simsrv.c: In function ‘send_start_event’:
simsrv.c:106: warning: ‘compat_rank’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/info.h:10)
simsrv.c: In function ‘join_cb’:
simsrv.c:225: warning: ‘compat_size’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/info.h:12)
simsrv.c: In function ‘mod_main’:
simsrv.c:429: warning: ‘compat_rank’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/info.h:10)
simsrv.c:463: warning: ‘compat_reactor_start’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/reactor.h:87)
submitsrv.c: In function ‘mod_main’:
submitsrv.c:356: warning: ‘compat_rank’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/info.h:10)
submitsrv.c:382: warning: ‘compat_reactor_start’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/reactor.h:87)
sim_execsrv.c: In function ‘mod_main’:
sim_execsrv.c:641: warning: ‘compat_rank’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/info.h:10)
sim_execsrv.c:658: warning: ‘compat_reactor_start’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/reactor.h:87)
scheduler.c: In function ‘init_and_start_scheduler’:
scheduler.c:1647: warning: ‘compat_rank’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/info.h:10)
scheduler.c:1717: warning: ‘jsc_notify_status_obj’ is deprecated (declared at /home/garlick/proj/flux-core/src/modules/libjsc/jstatctl.h:99)
scheduler.c:1727: warning: ‘compat_reactor_start’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/reactor.h:87)
../simulator.c: In function ‘send_alive_request’:
../simulator.c:234: warning: ‘compat_rank’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/info.h:10)
../simulator.c: In function ‘send_join_request’:
../simulator.c:279: warning: ‘compat_rank’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/info.h:10)
schedsrv.c: In function ‘fill_resource_req’:
schedsrv.c:295: warning: ‘jsc_query_jcb_obj’ is deprecated (declared at /home/garlick/proj/flux-core/src/modules/libjsc/jstatctl.h:110)
schedsrv.c: In function ‘update_state’:
schedsrv.c:326: warning: ‘jsc_update_jcb_obj’ is deprecated (declared at /home/garlick/proj/flux-core/src/modules/libjsc/jstatctl.h:122)
schedsrv.c: In function ‘reg_sim_events’:
schedsrv.c:867: warning: ‘jsc_notify_status_obj’ is deprecated (declared at /home/garlick/proj/flux-core/src/modules/libjsc/jstatctl.h:99)
schedsrv.c: In function ‘reg_events’:
schedsrv.c:946: warning: ‘jsc_notify_status_obj’ is deprecated (declared at /home/garlick/proj/flux-core/src/modules/libjsc/jstatctl.h:99)
schedsrv.c: In function ‘req_tpexec_allocate’:
schedsrv.c:1128: warning: ‘jsc_update_jcb_obj’ is deprecated (declared at /home/garlick/proj/flux-core/src/modules/libjsc/jstatctl.h:122)
schedsrv.c:1140: warning: ‘jsc_update_jcb_obj’ is deprecated (declared at /home/garlick/proj/flux-core/src/modules/libjsc/jstatctl.h:122)
schedsrv.c: In function ‘mod_main’:
schedsrv.c:1558: warning: ‘compat_reactor_start’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/reactor.h:87)
flux-waitjob.c: In function ‘wait_job_complete’:
flux-waitjob.c:202: warning: ‘compat_reactor_start’ is deprecated (declared at /home/garlick/proj/flux-core/src/common/libcompat/reactor.h:87)

failed to load sched plugin backfill.plugin1

With installed flux-sched, I'm seeing the following error loading backfill.plugin1 -- maybe a problem with my environment, so I wonder if anyone else can reproduce.

(In the output below both flux-core and flux-sched are installed with --prefix=/tmp/flux)

grondo@flux-core:~ $ /tmp/flux/bin/flux start flux module load sched plugin=backfill.plugin1
flux-start: setrlimit: could not remove core file size limit: Operation not permitted
[1456501325.764353] sched.err[0]: failed to open sched plugin: /tmp/flux/lib/flux/modules/backfillplugin1.so: undefined symbol: xzmalloc
[1456501325.764392] sched.err[0]: failed to load scheduler plugin
[1456501325.764417] sched.crit[0]: fatal error: Bad file descriptor

flux-sched travis fails because of no hostlist.h

Travis CI for flux-sched master fails with

resrc.c:35:40: fatal error: src/common/liblsd/hostlist.h: No such file or directory

This was mentioned in #400 of flux-core and we need a resolution to get Travis going again.

Need better mapping/scheduling between tasks and nodes/cores

This was discussed a bit as part of #58, but I wasn't sure if a new issue had been created for it. As is, even if you launch 4 brokers and then submit 2-node jobs back to back (e.g., my stress case), sched maps these 2-node jobs only to the first two ranks, never using the other two ranks.

This is just to discuss whether there is some better mapping/scheduling that sched can do as a short-term improvement.

@lipari: what are the semantics of find_resources and select_resources in the schedplugin1 plugin? Do they simply treat all of the resources as shared and grab the first available resources, regardless of whether they have running jobs? Are there any other scheduler plugins in the works with more smarts?

Also, if there have been updates to wreckrun with respect to job-to-resource mapping, it seems worthwhile to match that in flux-sched as well. If wreckrun launches jobs back to back, what are the current semantics? If anyone has issue numbers that capture some of these questions, would you mind listing them here?

symbol leakage in sched libraries and modules

libflux-rdl.so and libflux-sim.so, both installed libraries, don't have a symbol map or other mechanism in place to limit symbol exposure, so they are leaking internal symbols like xzmalloc, err_exit, etc.

Same for the sched plugins: backfillplugin1.so and schedplugin1.so.

libflux-resrc.so looks OK.

The comms modules: schedsrv.so, sim_execsrv.so, submitsrv.so, sim_sched_fcfs.so sim_sched_fcfs_aware.so, and sim_sched_easy.so all export only mod_main() so they are correct.
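
One common mechanism is a linker version script (passed via -Wl,--version-script) that exports only the public prefix. A sketch, assuming rdl_ is the intended public prefix for libflux-rdl.so (illustrative, not the project's actual map):

/* libflux-rdl.map */
{
  global:
    rdl_*;      /* export only the public rdl_ API */
  local:
    *;          /* hide xzmalloc, err_exit, and other internals */
};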

sched gets a "complete" event, but leaves state==submitted

I'm not sure how this is happening, but when using sched with cap, I'm noticing it work great for the first 15 or so jobs, and then I stop getting update events from the system. The funny thing is that sched seems to get the events, because it marks complete-time in the kvs, but somehow doesn't change the state. The contents of the kvs directory of a task where this has happened look like this:

lwj.75.cmdline = [ "hostname" ]
lwj.75.ntasks = 1
lwj.75.nnodes = 1
lwj.75.environ = { ... }
lwj.75.cwd = /g/g12/scogland/projects/flux/capacitor
lwj.75.create-time = 2015-09-02T10:37:04
lwj.75.rdl = { "cab2": { "socket0": { "core0": "core" } } }
lwj.75.rank.
lwj.75.starting-time = 2015-09-02T10:37:05
lwj.75.running-time = 2015-09-02T10:37:05
lwj.75.0.
lwj.75.complete-time = 2015-09-02T10:37:05
lwj.75.state = submitted

Errors in travis testing with clang

I'm now seeing some errors during build in travis. Not sure why I didn't see these before:

schedsrv.c:261:1: error: unused function 'set_event' [-Werror,-Wunused-function]

Will try to propose a fix today.

Add support for exclusive and shared allocations to the resource model

This issue grew out of a request to add the time dimension to the resource model (Issue 62). Within that discussion, a need for shared and exclusive allocation was presented.

We want to be able to support multiple job allocations per resource. In a simple example, one job runs on one core while another job runs on a different core from the same node. In this case, there needs to be a way to indicate that the parent node of both cores is running two jobs.

Currently in the code, when a core is allocated to a job, its parent node is exclusively allocated to the job as well. This must be corrected.

In addition, we need to support a request for exclusive access to a resource with all of its child resources, whatever they are. A job requesting exclusive access to a resource must receive a resource that is idle (not allocated to any jobs) and for which all of its descendant resources are idle as well.
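
The exclusive case amounts to a recursive idleness check over the resource subtree. A sketch, with all accessor names hypothetical:

#include <stdbool.h>

/* True iff this resource and every descendant is idle, i.e., the
 * subtree is eligible for an exclusive allocation. */
static bool resrc_tree_is_idle (resrc_tree_t *tree)
{
    if (resrc_state (resrc_tree_resrc (tree)) != RESOURCE_IDLE)
        return false;
    for (int i = 0; i < resrc_tree_num_children (tree); i++)
        if (!resrc_tree_is_idle (resrc_tree_child (tree, i)))
            return false;
    return true;
}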

flux-sched does not compile when ENABLE_TIMER_EVENT is switched on

When ENABLE_TIMER_EVENT is 1, the code doesn't compile because:

It can't find the SCHED_INTERVAL macro, which is actually defined in simulator/scheduler.h. Including this file in schedsrv.c or .h introduces too many errors (I guess this is the alternate scheduler implementation of @SteVwonder, which must not be mixed with the original scheduler).

After moving SCHED_INTERVAL to simulator/simulator.h, it still breaks with the following:
error: 'ssrvctx_t' has no member named 'run_schedule_loop'

Intermittent failure of t1001-rs2rank-basic.t

The t1001-rs2rank-basic.t tests were bypassed in commit 39d2b26. I think I see why they are intermittently failing, but I'm not sure how best to fix it. The test invokes a similar sequence of commands for every case, including a variation on the following:

    flux module load sched sched-once=true &&
    timed_wait_job 5 &&
    submit_1N_nproc_sleep_jobs ${shrd_1N4B_nc} 0 &&

My tests appear to show that when a failure happens, it is because a job was submitted before the sched module had reached its flux_reactor_run() call. This would imply that flux module load xxx returns right away, before the module has entered its reactor loop. I expect this is by design.

If so, the question becomes: how can we gracefully hold off submitting the first job before the sched module has entered its reactor loop?

config changes for sched module

Many flux path variables need to be extended in order to run flux outside of flux-core. The "start" and "load" Makefile targets in the flux-sched/sched directory automate path extensions so that the resulting "flux start" and "flux module load" commands work as expected.
Running straight flux commands out of flux-sched/sched requires a mechanism similar to what is done with the Makefile to augment path parameters. Whatever we do to address this need should be applicable to future projects that will be built against flux-core.

Deserialized id and name not consistent.

There seem to be two not-so-important issues with RDL serialize/deserialize.

  1. Serialization does not include any of the resources' ids. This is easily solved by adding one line to include an id in resrc_to_json(). Where this is important is when resources are passed down to child instances: they need to have consistent ids.
  2. The resource name is wrong after deserialization.
    Example: resrc->name = hype201, id=201.
    Serialization (including id): "name" : hype201, "id" : 201
    Deserialization: resrc->name = hype201201, id = 201

The id is concatenated with the name. That is, of course, because the deserialization code uses the same function that generates resources from the conf file.

This is not important, but I would just like to point it out. resrc_new_from_json() makes that concatenation. It may be a good idea not to do such a concatenation inside resrc_new_from_json() so that it stays generic. For the purpose of loading RDL from a file, the json object passed to resrc_new_from_json() can be modified to already contain the right, concatenated name; see the sketch below.
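
A sketch of that caller-side fix, using flux-core's old shortjson helpers for illustration (basename, id, o, and parent are hypothetical variables in the RDL file-loading path):

/* Build the final name before handing the object to resrc_new_from_json(),
 * so that function performs no concatenation and stays generic. */
char *name = xasprintf ("%s%"PRId64, basename, id);  /* e.g. "hype" + 201 */
Jadd_str (o, "name", name);
resrc_t *resrc = resrc_new_from_json (o, parent);
free (name);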

Floating point exception

There is now a version of capacitor with a flag to select the flux-sched backend on demand. The first run of that with flux-sched loaded and populated with a 4-node cab RDL produced this result:

scogland at cab668 in ~/projects/flux/capacitor (master●)
$ ./flux-capacitor -S flux -n 4 hostname
cl command : ['hostname']
flux-broker: [1440546532.910099] job.info[0] Setting job 3 to reserved
flux-broker: [1440546532.910133] sched.debug[0] new_job_cb invoked: key(lwj.next-id), val(4)
flux-broker: [1440546532.910175] sched.debug[0] jobstate_hdlr registered
flux-broker: [1440546532.910206] sched.debug[0] attempting job 3 state change from null to null
flux-broker: [1440546532.910413] sched.err[0] job_state_cb: key(lwj.3.state), val((null))
flux-broker: [1440546532.910431] sched.debug[0] registered job lwj.3.state CB
flux-broker: [1440546532.911769] sched.debug[0] attempting job 3 state change from null to reserved
lwj.3.state
3 reached reserved
flux-broker: [1440546532.914290] sched.debug[0] attempting job 3 state change from reserved to submitted
flux-broker: [1440546532.914592] sched.debug[0] extract lwj.3.nnodes: 0
lwj.3.state
3 reached submitted
flux-start: 0 (pid 188920) Floating point exception
flux-start: 2 (pid 188922) Killed
flux-start: 3 (pid 188923) Killed
flux-start: 1 (pid 188921) Killed

It seems to happen consistently. To reproduce, build and install flux-core (my cap_support branch is required for python fixes) and flux-sched in the same prefix, start flux with four ranks on one node, then load flux-sched with this RDL:

uses "Node"

Hierarchy "default" {
    Resource{ "cluster", name = "cab",
    children = { ListOf{ Node,
                  ids = "1-4",
                  args = { name = "cab", sockets = {"0-7", "8-15"} }
                 },
               }
    }
}

With sched loaded, run flux capacitor -n 4 hostname. The submit script also seems to be having an issue in this configuration, but it's less fatal.

Resource reading enhancements

This issue captures residual enhancements to the resource reader that were proposed in comments to PR 79 and Issue 58, but have yet to be implemented.

  • read resources from a file specified in the kvs config directory (config.sched.rld-conf) instead of arguments to the sched module load command (#58 (comment))
  • create a hwloc topology object from custom defined resources for testing instead of using the resources of the build system (#79 (comment))
  • add consistency checking of the valid arguments ("rdl-conf", "in-sim", etc.) to the sched module load command to ensure compatibility.
  • find a reliable source for the cluster's name instead of peeling away the id from the first node (Issue 442)

How to make cmbd quiet for make check?

As part of the new sharness testing rig, I found that cmbd prints out messages to the terminal for flux module load and flux module remove tests.

$ pwd
/g/g0/dahn/fluxdev_space/flux-sched/t

$ ./t0001-basic.t 
cmbd: [1416008070.397107] cmbd.info[0] insmod sched /tmp/flux-modctl-TFF1Yr
cmbd: [1416008070.397241] sched.info[0] sched comms module starting
cmbd: [1416008070.399503] sched.debug[0] loaded: sched.plugin1
cmbd: [1416008070.399527] sched.debug[0] LUA_PATH /g/g0/dahn/fluxdev_space/flux-sched/rdl/?.lua;/g/g0/dahn/fluxdev_space/flux-core/src/bindings/lua/?.lua;/usr/local/share/lua/5.1/?.lua;;;
cmbd: [1416008070.399538] sched.debug[0] LUA_CPATH /g/g0/dahn/fluxdev_space/flux-sched/rdl/?.so;/g/g0/dahn/fluxdev_space/flux-core/src/bindings/lua/.libs/?.so;/usr/local/lib64/lua/5.1/?.so;;;
ok 1 - flux-module load works
cmbd: [1416008070.409122] cmbd.info[0] rmmod sched
cmbd: [1416008070.574292] sched.info[0] using default rdl resource
cmbd: [1416008070.575344] sched.debug[0] registered lwj creation callback
ok 2 - flux-module remove works
cmbd: [1416008070.605001] cmbd.info[0] insmod sched /tmp/flux-modctl-sY6cCA
cmbd: [1416008070.605087] sched.info[0] sched comms module starting
cmbd: [1416008070.607416] sched.debug[0] loaded: sched.plugin1
cmbd: [1416008070.607442] sched.debug[0] LUA_PATH /g/g0/dahn/fluxdev_space/flux-sched/rdl/?.lua;/g/g0/dahn/fluxdev_space/flux-core/src/bindings/lua/?.lua;/usr/local/share/lua/5.1/?.lua;;;
cmbd: [1416008070.607453] sched.debug[0] LUA_CPATH /g/g0/dahn/fluxdev_space/flux-sched/rdl/?.so;/g/g0/dahn/fluxdev_space/flux-core/src/bindings/lua/.libs/?.so;/usr/local/lib64/lua/5.1/?.so;;;
cmbd: [1416008070.612124] cmbd.info[0] rmmod sched
cmbd: [1416008070.770566] sched.info[0] using default rdl resource
cmbd: [1416008070.771575] sched.debug[0] registered lwj creation callback
ok 3 - flux-module load works after a successful unload
cmbd: [1416008070.800638] cmbd.info[0] insmod sched /tmp/flux-modctl-Q5mMNJ
cmbd: [1416008070.800692] sched.info[0] sched comms module starting
cmbd: sched: mod_main returned error: Resource temporarily unavailable
cmbd: [1416008070.802883] sched.debug[0] loaded: sched.plugin1
cmbd: [1416008070.802917] sched.err[0] rdl-conf argument is not set
not ok 4 - this flux-module load should fail
#   
#       test_must_fail flux module load ${schedsrv} 
#   
not ok 5 - flux-module load works after a load faiure
#   
#       flux module load ${schedsrv} rdl-conf=${rdlconf}
#   
ok 6 - flux-module list
# failed 2 among 6 test(s)
1..6
flux-start: 0 (pid 97934) exited with rc=1

From test_under_flux() in sharness.d/flux-sharness, it seems the -o -q option is being passed to flux start, and I did check that -q has been passed to cmbd, so I don't understand why these messages still show up.

 99317 pts/51   Sl+    0:00 /g/g0/dahn/fluxdev_space/flux-core/src/broker/cmbd --size=1 --rank=0 --sid=99300 -q --command=sh ./t0001-basic.t 

Scheduling multiple jobs on the same node and resource states

At the moment, a node cannot be shared between jobs because, as soon as a job is scheduled on a node, the state of the node resource becomes ALLOCATED. The match function looks only for IDLE resources, and therefore the parent node and its children (cores) in the resource tree will not be considered for scheduling. A simple fix would be to keep the state of the node resource IDLE until all of its cores are ALLOCATED.
A second solution would be to hold the state of the node resource as something like PARTIALLY_ALLOCATED when at least one core is busy but not all cores have been allocated; see the sketch below. When considering other scheduling aspects, I am not sure whether it's useful or only adds complexity. But for the resource movements in dynamic scheduling, it certainly makes some checks a lot easier.
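
A minimal sketch of the second option, with illustrative names (not flux-sched's actual state enum):

/* Illustrative resource states for the PARTIALLY_ALLOCATED proposal */
typedef enum {
    RESOURCE_IDLE,                 /* no allocations on this resource */
    RESOURCE_PARTIALLY_ALLOCATED,  /* some, but not all, children allocated */
    RESOURCE_ALLOCATED,            /* fully allocated; skip during matching */
} resource_state_t;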

sched should provide rc1/rc3 scripts

With flux-framework/flux-core#597 merged, sched can now provide startup/shutdown scripts that cause it to be loaded by default when Flux is started in the normal ways with flux-start or srun flux-broker. Specifically something like this:

fluxrc1dir = $(sysconfdir)/flux/rc1.d
fluxrc3dir = $(sysconfdir)/flux/rc3.d

fluxrc1_SCRIPTS = sched-start
fluxrc3_SCRIPTS = sched-stop

sched-start:

#!/bin/bash -e
flux module load -r 0 sched

sched-stop:

#!/bin/bash -e
flux module remove -r 0 sched

Short-term work in support of UQ workload with flux-sched

This is just to summarize my take on a set of investigation points that can position flux-sched to cope better with uncertainty-quantification (UQ) workloads, an emerging workload type that is becoming increasingly important for the DOE labs. These came from looking more closely into MiniUQPipeline (MiniUQP), talking to a UQPipeline (UQP) developer again, and doing some literature search on the production solutions that aim to support this type of workload.

MiniUQP is a mock-up program that Luke Johnston (Tammy Dahlgren's summer intern) put together in order to express the scheduling behavior of UQP, as well as its resource management and scheduler needs, to Flux using much simpler software. The state-of-the-art production solutions all center on the notion of a job array, which is supported by many production schedulers: PBS, PBS Pro, LSF, and SLURM.

My high-level summary is three-fold:

  • There are at least three apparent challenges that UQ workloads present -- high job throughput challenge, affinity control challenge, and file system challenge;
  • The job-array approaches fall short of addressing the UQ challenges; and
  • There are some short-term investigations within flux-sched, which can be done NOW to position us better.

The following details each of them.

enhance travis testing

As mentioned in #121, flux-sched could add

  • coveralls support
  • travis distcheck
  • travis install sanity check

Not sure how convenient it is to do this, but would it make sense to have the "main" travis builds be against a tagged flux-core, and a one-off builder be against flux-core master? That way it will be easy to distinguish problems with a PR under test from breakage due to flux-core changes on master?

scheduling throughput degrades linearly and needs further investigation

This is just a placeholder to further investigate job scheduling and launching performance. Using a variant of one of the new test cases I've added (t/t1003-stress.t), it seems the job scheduling and launching throughput degrades linearly:

The testing configuration was 4 brokers, each loading a fake full cab xml file. The script then submits/schedules/launches 16-way sleep 0 jobs. Completing each successive batch of 1000 jobs took linearly longer.

I will validate/verify this further. But it seems clear we will need a better perf analysis. I plan to add some instrumentation so that I can apply a critical path analysis for various configurations. This may be the same issue that Tom has been seeing - but having a breakdown through CPA should be insightful.

cab690{dahn}43: cat begin.1 
Mon Jan 18 05:54:55 UTC 2016
cab690{dahn}44: cat end.1
Mon Jan 18 05:56:49 UTC 2016
=> 1 min 54 seconds  (115 seconds)

cab690{dahn}45: cat begin.2
Mon Jan 18 05:56:49 UTC 2016
cab690{dahn}46: cat end.2
Mon Jan 18 06:02:00 UTC 2016
=> 5 min 11 seconds  (311 seconds)

cab690{dahn}48: cat begin.3 
Mon Jan 18 06:02:01 UTC 2016
cab690{dahn}49: cat end.3 
Mon Jan 18 06:11:22 UTC 2016
=> 9 min 21 seconds (561 seconds)


cab690{dahn}50: cat begin.4 
Mon Jan 18 06:11:23 UTC 2016
cab690{dahn}51: cat end.4 
Mon Jan 18 06:25:28 UTC 2016
=> 14 min 5 seconds (845 seconds)

[schedsrv] convert to argc, argv style argument parsing

In PR flux-framework/flux-core#289 the module main function prototype changed to pass arguments in argc, argv format rather than a zhash_t. This was to eliminate another CZMQ exposure in the public API, and also to conform with RFC 5.

In my haste to get this fixed throughout the code base, I avoided doing the right thing in flux-sched/sched/schedsrv.c and just converted the args to a hash, letting the existing code run. This should probably be cleaned up, as a low priority.

This code is also duplicated in the simulator modules.

Add support necessary to time-based scheduling schemes

We discussed this at today's flux meeting (5/11/2015). Soon flux-sched will need support so that time-based scheduling algorithms/policies, including EASY and conservative backfill, can easily be implemented as scheduler plugins. We discussed at least two possibilities: 1) build time-range queries into the RDL abstraction, and/or 2) support data structures (e.g., a gap list) that can keep track of RDL states over time. Either has pros and cons; ideally, both can be built on top of the same underlying mechanisms. Regardless, this support must be resource-efficient and performant with respect to very high numbers of resources and jobs.
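
As a rough illustration of the second option, a gap-list element might pair a time window with the resources that remain free throughout it (a sketch; all names are assumptions):

#include <stdint.h>

/* A window of time during which a set of resources remains free.
 * end == -1 denotes an open-ended (infinite) window. */
typedef struct gap {
    int64_t start;        /* window start (epoch seconds) */
    int64_t end;          /* window end; -1 means infinity */
    resrc_tree_t *avail;  /* resources free throughout [start, end) */
    struct gap *next;     /* next window, in ascending start order */
} gap_t;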

Provide expanded resource input options

This issue resulted from the discussion in flux-sched Issues #33, #58, and #60. Currently, the scheduler discovers the resources it can schedule based on the rdl-conf= argument at module load time, and Lua is the only supported format for the RDL configuration file.

The supported resource input formats need to expand to include a serialized format (yet TBD) and resources from the resource.hwloc key in the KVS.

Furthermore, there needs to be the additional flexibility of specifying the source of the resource definition in the flux config keyspace, e.g., config.sched.rld-conf, resources assigned to the enclosing instance, as well as "fake" resources for testing purposes.

Finally, there needs to be validation of the resources read from the input against the resources available in the enclosing instance.

not ok 5 - jobs scheduled in correct order

I'm finding that test 5, "jobs scheduled in correct order", is consistently failing for me in the following tests:

t2000-fcfs

Expected: 1 2 3 4 5 6 7 8 9 10 11 12
Actual: 1 2 4 10 11 12

t2001-fcfs-aware

Expected: 1 2 3 4 5 6 7 8 9 10 11 12
Actual: 1 2 4 10 11 12

t2002-easy

Expected: 1 2 3 6 7 8 9 10 11 4 5 12
Actual: 1 2 4 5 6 7 9 11 10 8 12 3

This occurred on the current master of flux-sched (f912cb2) and flux-framework/flux-core@ad0f54b.
I also tried one PR back, flux-framework/flux-core@8d00634.

This is on a RHEL 6/TOSS 2 based desktop.

Travis fails with hostlist

I'm getting Travis CI build failures. I initially thought this was because of my changes in the waitjob_fix branch of my fork. But then, when I synced the master of my fork with the upstream sched master, I noticed I'm still getting the same failure. Travis CI didn't complain about @lipari's latest PR, so I'm kind of baffled.

ok 3 - resoure file readable
loading RDL: /home/travis/build/dongahn/flux-sched/rdl/RDL.lua:30: module 'hostlist' not found:
    no field package.preload['hostlist']
    no file '/usr/local/share/lua/5.1//hostlist.lua'
    no file '/usr/local/share/lua/5.1//hostlist/init.lua'
    no file '/home/travis/.luarocks/share/lua/5.1//hostlist.lua'
    no file '/home/travis/.luarocks/share/lua/5.1//hostlist/init.lua'
    no file '/usr/share/lua/5.1//hostlist.lua'
    no file '/usr/share/lua/5.1//hostlist/init.lua'
    no file './hostlist.lua'
    no file '/usr/local/share/lua/5.1/hostlist.lua'
    no file '/usr/local/share/lua/5.1/hostlist/init.lua'
    no file '/usr/local/lib/lua/5.1/hostlist.lua'
    no file '/usr/local/lib/lua/5.1/hostlist/init.lua'
    no file '/usr/share/lua/5.1/hostlist.lua'
    no file '/usr/share/lua/5.1/hostlist/init.lua'
    no file '/home/travis/build/dongahn/flux-sched/rdl/hostlist.lua'
    no file '/home/travis/flux-core/src/bindings/lua/hostlist.lua'
    no file './hostlist.lua'
    no file '/usr/local/share/lua/5.1/hostlist.lua'
    no file '/usr/local/share/lua/5.1/hostlist/init.lua'
    no file '/usr/local/lib/lua/5.1/hostlist.lua'
    no file '/usr/local/lib/lua/5.1/hostlist/init.lua'
    no file '/usr/share/lua/5.1/hostlist.lua'
    no file '/usr/share/lua/5.1/hostlist/init.lua'
    no file '/usr/local/lib/lua/5.1//hostlist.so'
    no file '/home/travis/.luarocks/lib/lua/5.1//hostlist.so'
    no file './hostlist.so'
    no file '/usr/local/lib/lua/5.1/hostlist.so'
    no file '/usr/lib/x86_64-linux-gnu/lua/5.1/hostlist.so'
    no file '/usr/lib/lua/5.1/hostlist.so'
    no file '/usr/local/lib/lua/5.1/loadall.so'
    no file '/home/travis/build/dongahn/flux-sched/rdl/hostlist.so'
    no file '/home/travis/flux-core/src/bindings/lua/.libs/hostlist.so'
    no file './hostlist.so'
    no file '/usr/local/lib/lua/5.1/hostlist.so'
    no file '/usr/lib/x86_64-linux-gnu/lua/5.1/hostlist.so'
    no file '/usr/lib/lua/5.1/hostlist.so'
    no file '/usr/local/lib/lua/5.1/loadall.so'
not ok 4 - resource generation took: 0.001116
#   Failed test 'resource generation took: 0.001116'
#   at tresrc.c line 130.
# Looks like you planned 13 tests but ran 4.
# Looks like you failed 1 test of 4 run.
make[3]: *** [check] Error 1
make[3]: Leaving directory `/home/travis/build/dongahn/flux-sched/resrc/test'
make[2]: *** [check] Error 2
make[2]: Leaving directory `/home/travis/build/dongahn/flux-sched/resrc'
make[1]: *** [resrc] Error 2
make[1]: Leaving directory `/home/travis/build/dongahn/flux-sched'
make: *** [check] Error 2

The command "./config $HOME/flux-core && make check" exited with 2.
store build cache
changes detected, packing new archive
uploading archive

Done. Your build exited with 1.

Resource hoarding support

This is to propose resource hoarding support into flux-sched.

I want to create automatic tests where 1-node job requests are submitted back to back until all of the node resources within the Flux instance are filled up.

Given how resrc walks the resource tree, scheduling will be deterministic and thus easy to verify automatically. In addition, how quickly schedsrv can fill up all of the node resources with such unit jobs can be a good high-level metric for performance/scalability.

But one problem I face is a race: some earlier jobs can complete before later jobs are scheduled, and then the released nodes can be reused for scheduling those later jobs -- making things potentially non-deterministic.

To avoid the race, my tests currently use sleep k where k is sufficiently large. Since I won't know how large the sleep needs to be for large-scale testing, this scheme is fragile. I thought about other ways, like using file IO to ensure earlier jobs don't complete before all of the jobs are scheduled and launched, but these can lead to pretty bad hacks.

Instead, perhaps simple resource hoarding support within sched would beef up such testing. I could have a hoard=true option to the schedsrv module load, which would simply not release resources even when a job completes. This way, sched will schedule all of the jobs to new nodes, since it cannot reuse the released ones.

If this support is there, I can use sleep 0 as a way to quickly fill up the instance resources and stress test the system. Of course, this will only be used for a subset of automatic tests.

Any blind spot in this line of thought?

sched use of hwloc data requires synchronization

If the sched module is loaded too early, hwloc data won't be there yet e.g.

$ flux start -o,--module=sched
[1456339141.938712] sched.err[0]: can't get hwloc data in kvs (resource.hwloc.xml.0)

There needs to be some synchronization here, but I'm not sure what's appropriate. Is it possible to use the resource-hwloc.topo query for the aggregated xml? That rpc at least returns EAGAIN, so you can use a backoff-retry to obtain the data; see the sketch below.
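
A hedged sketch of that backoff-retry, written against the modern flux-core future API for illustration (the RPC API at the time of this issue differed):

#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <flux/core.h>

/* Fetch the aggregated hwloc xml, retrying while the service reports EAGAIN. */
static int fetch_topo_xml (flux_t *h, char **xmlp)
{
    int delay = 1;
    for (;;) {
        flux_future_t *f = flux_rpc (h, "resource-hwloc.topo", NULL,
                                     FLUX_NODEID_ANY, 0);
        const char *xml;
        if (f && flux_rpc_get (f, &xml) == 0) {
            *xmlp = strdup (xml);        /* copy before destroying the future */
            flux_future_destroy (f);
            return *xmlp ? 0 : -1;
        }
        if (errno != EAGAIN) {           /* a real error: give up */
            flux_future_destroy (f);
            return -1;
        }
        flux_future_destroy (f);
        sleep (delay);
        if (delay < 16)
            delay *= 2;                  /* back off: 1, 2, 4, 8, 16s */
    }
}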

Also (minor): I think there's a memory leak in build_hwloc_rs2rank(): kvs_get_string() allocates a new rs_buf on each iteration of the for loop, but only the last copy is freed before the function returns. A sketch of the fix follows.
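
Paraphrased, the fix is simply to free each copy inside the loop:

/* build_hwloc_rs2rank(), paraphrased: free rs_buf every iteration */
for (uint32_t rank = 0; rank < size; rank++) {
    char *rs_buf = NULL;
    if (kvs_get_string (h, key, &rs_buf) < 0)   /* allocates rs_buf */
        goto error;
    /* ... digest rs_buf into the rs2rank table ... */
    free (rs_buf);                              /* not just the last copy */
}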

RDL: assertion failure in librdl

Hitting an assertion failure in list.c whenever using the rdl implementation in flux-sched (probably in rdl_destroy() or similar). Example:

 grondo@hype356:~/git/flux-sched.git/rdl$ ./flux-rdltool -f ../conf/cab.lua tree default:/cab/cab1
/cab1
 /socket0
  /core0
  /core1
  /core2
  /core3
  /core4
  /core5
  /core6
  /core7
 /socket1
  /core8
  /core9
  /core10
  /core11
  /core12
  /core13
  /core14
  /core15
flux-rdltool: /g/g0/grondo/git/flux-core/src/common/liblsd/list.c:244: list_destroy: Assertion `l->magic == 0xDEADBEEF' failed.
Aborted (core dumped)

flux-cpuset.so not installed correctly

I noticed that the Lua cpuset.so module was renamed to flux-cpuset.so (I assume to differentiate it from any installed cpuset module), but it is still installed into $(luaexecdir)/flux, so it will fail to load with require ("flux-cpuset").

Instead, we should either install the flux-cpuset.so module directly into $(luaexecdir), or alternately follow what we did in flux-core: install cpuset.so into $(luaexecdir)/flux and load the module as flux.cpuset. To do this, a link from RDL/flux/cpuset.so -> ../cpuset.so would need to be made at build time in the flux-sched project.

I'm going to work on the latter solution, unless I hear that others would rather keep the flux-cpuset.so name in the default LUA_CPATH.

Add generic RDL support

The scheduler framework service should be written to the general RDL interface so that its users (e.g., scheduler plugins) can choose an RDL implementation of choice while the framework service remains unchanged.

This support may be needed for @SteVwonder's summer investigation: understanding the attributes of the envisioned hierarchical scheduling scheme. He may want to flip back and forth between the old RDL and new RDL replacement. Depending on his direction, this item should get a different priority. TBD.

Need temporal allocations in resrc

@lipari already has the idea sketched out (which he shared with @surajpkn and me).
My summary from our short meeting is below, followed by a sketch of the possible interfaces:

  • Expand the resrc_allocate and resrc_release functions to include a start and end time
  • Either (or both) of these times could be -1, meaning infinity
  • Include a function that allows you to determine if a resource is allocated at a particular time
    • This would perform an intersection check for every allocation
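
A sketch of what the expanded interfaces might look like; these signatures are assumptions about the eventual design, not existing code:

#include <stdint.h>
#include <stdbool.h>

/* A starttime or endtime of -1 denotes infinity. */
int resrc_allocate (resrc_t *resrc, int64_t job_id,
                    int64_t starttime, int64_t endtime);
int resrc_release (resrc_t *resrc, int64_t job_id);

/* True if any allocation window on the resource intersects 'time';
 * implemented as an intersection check over every allocation. */
bool resrc_is_allocated_at (resrc_t *resrc, int64_t time);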

make sched plugins conform to RFC 5

There was a preliminary design expressed in RFC 5 which would allow flux module load to load modules that were plugins to comms modules, flux module list to list modules and their submodules, etc. The design was prototyped in the test programs in flux-core/src/test/module and driven by the t0003-module.t sharness test. (Note these test modules use deprecated API functions and need to be reworked.) Basically, a module that wants to load plugins implements insmod/rmmod/lsmod handlers.

I see that the sched plugin design initially followed RFC 5 in that its plugins define mod_name and it uses flux_modfind() to locate modules by name, but perhaps it was not made fully compliant due to some design deficiency in the RFC or in the flux_mod* functions exported by core.

I would like to propose that we revisit this and see if we can come up with a generalized module loading strategy that can be used for all framework projects including sched.

The current state of things in sched is problematic: Initially in #115 @lipari proposed installing sched plugins into $libdir/flux/modules, but during the iteration I requested that they go to their own location due to non-compliance with RFC 5. However, the sched module is still using FLUX_MODULE_PATH and flux_modfind() to locate its plugins, so that is not going to work for an installed flux-sched.

Need a fast rdl_find

The current prototype rdl_find is slow, presumably due to its use of an accumulator.

For the short term, attempt to hack together a faster or more lightweight rdl_find, as proposed by @SteVwonder.
