openpbs / openpbs

An HPC workload manager and job scheduler for desktops, clusters, and clouds.

Home Page: https://www.openpbs.org

License: Other

Makefile 0.70% Shell 2.29% Python 34.43% M4 0.61% C 52.56% Batchfile 0.04% Awk 0.24% C++ 8.93% Roff 0.09% SWIG 0.10%
hpc pbs pbs-professional job-scheduler cloud cluster green provisioning hooks plugins

openpbs's Introduction

OpenPBS Open Source Project

If you are new to this project, please start at https://www.openpbs.org/

Note: In May 2020, OpenPBS became the new name for the PBS Professional Open Source Project. (PBS Professional will be used to refer to the commercial version; OpenPBS to the Open Source version -- same code, easier naming.) As there are many parts to the project, it will take several weeks to change the name in all places, so you will continue to see references to PBS Pro -- stay tuned.

What is OpenPBS?

OpenPBS® software optimizes job scheduling and workload management in high-performance computing (HPC) environments – clusters, clouds, and supercomputers – improving system efficiency and people’s productivity. Built by HPC people for HPC people, OpenPBS is fast, scalable, secure, and resilient, and supports all modern infrastructure, middleware, and applications.

  • Scalability: supports millions of cores with fast job dispatch and minimal latency; tested beyond 50,000 nodes
  • Policy-Driven Scheduling: meets unique site goals and SLAs by balancing job turnaround time and utilization with optimal job placement
  • Resiliency: includes automatic fail-over architecture with no single point of failure – jobs are never lost, and jobs continue to run despite failures
  • Flexible Plugin Framework: simplifies administration with enhanced visibility and extensibility; customize implementations to meet complex requirements
  • Health Checks: monitors and automatically mitigates faults with a comprehensive health check framework
  • Voted #1 HPC Software by HPC Wire readers and proven for over 20 years at thousands of sites around the globe in both the private sector and public sector

Community and Ways to Participate

OpenPBS is a community effort and there are a variety of ways to engage, from helping answer questions to benchmarking to developing new capabilities and tests. We value being aggressively open and inclusive, but also aggressively respectful and professional. See our Code of Conduct.

The best place to start is by joining the community forum. You may sign up or view the archives via:

  • Announcements -- important updates relevant to the entire PBS Pro community
  • Users/Site Admins -- general questions and discussions among end users (system admins, engineers, scientists)
  • Developers -- technical discussions among developers

To dive in deeper and learn more about the project and what the community is up to, visit:

  • Contributor’s portal -- includes roadmaps, processes, how-to articles, coding standards, release notes, etc. (uses Confluence)
  • Source code -- includes full source code and test framework (uses GitHub)
  • Issue tracking system -- includes bugs, feature requests, and status (uses GitHub). Previously, we used JIRA, which contains older issues.

OpenPBS is also integrated into the OpenHPC software stack. The mission of OpenHPC is to provide an integrated collection of HPC-centric components that form full-featured HPC software stacks. OpenHPC is a Linux Foundation Collaborative Project. Learn more at the OpenHPC website.

Our Vision: One Scheduler for the whole HPC World

There is a huge opportunity to advance the state of the art in HPC scheduling by bringing the whole HPC world together: marrying public sector innovations with private sector enterprise know-how, and redirecting the effort wasted re-implementing the same old capabilities again and again toward pushing the envelope. At the heart of this vision is fostering common standards (at least de facto standards, like common software). To this end, Altair has made a big investment by releasing the PBS Professional technology as OpenPBS (under an Open Source license to meet the needs of the public sector), while also continuing to offer PBS Professional (under a commercial license to meet the needs of the private sector). One de facto standard that can work for the whole HPC community.

openpbs's People

Contributors

anamikau, arungrover, ashwathraop, bayucan, bhagat-rajput, bhroam, birbalsain, bremanandjk, dinesh1306, hirenvadalia, kjakkali, klovely, latha-subramanian, mahalty, mike0042, minghui-liu, nishiya, nithinj, npadole20, riyazhakki, sakshamgarg, sandisamp, shilpakodli, subhasisb, sudeshnamoulik, tpai26, vccardenas, vchlum, visheshh, vstumpf


openpbs's Issues

Building docker image fails

I was trying to build a Docker image with a newer version of PBS, but when I run a docker build,
I always get the following error:

checking for daemon coredump limit... unlimited
checking for a Python interpreter with version >= 3.5... none
configure: error: no suitable Python interpreter found
make: *** No rule to make target `dist'. Stop.
cp: cannot stat 'pbspro-*.tar.gz': No such file or directory
sh: line 0: fg: no job control
sh: line 0: fg: no job control
error: File /root/rpmbuild/SOURCES/pbspro-19.0.0.tar.gz: No such file or directory

Steps to reproduce:
git clone https://github.com/PBSPro/pbspro.git
mv pbspro/docker/centos7/Dockerfile.build pbspro/docker/centos7/Dockerfile

edit pbspro/docker/centos7/build.sh:
#!/bin/bash
cd /src/pbspro
./autogen.sh
./configure PBS_VERSION=19.1.3 --prefix=/opt/pbs
make dist
mkdir /root/rpmbuild /root/rpmbuild/SOURCES /root/rpmbuild/SPECS
cp pbspro-*.tar.gz /root/rpmbuild/SOURCES
cp pbspro-rpmlintrc /root/rpmbuild/SOURCES
cp pbspro.spec /root/rpmbuild/SPECS
cd /root/rpmbuild/SPECS
rpmbuild -ba pbspro.spec

docker build -t test/pbspro:19.1.3 pbspro/docker/centos7/

Using queuejob and periodic event for the same hook encounters issues

Goal: To be able to check each job's select statement before allowing the job to run (Note: some jobs do not have select statements until after they have reached the server).

I am using PBS 18.1.3 on CentOS 7.6. I am trying to use a queuejob hook to check job requests and modify the select statement as needed. Some jobs that use legacy (Torque) syntax need to be queued before I can check the select statement; in that case, I use the queuejob hook to put a hold on the job. I then use the same hook in a periodic event to find all jobs in a specific hold state (I use "so"), check their select statement, modify it, and release the hold. A sketch of the queuejob half is shown below.
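A minimal sketch of the queuejob portion of this approach (assuming the standard pbs hook module; the Hold_Types attribute and pbs.hold_types() helper follow the PBS Professional hooks interface, and the select check is only illustrative, not the poster's actual hook):

import pbs

e = pbs.event()
j = e.job

# Jobs submitted with legacy (Torque-style) syntax may not carry a select
# statement at queuejob time; hold them so a later periodic pass can inspect
# and fix the request, then release the hold.
if j.Resource_List["select"] is None:
    j.Hold_Types = pbs.hold_types("s")

e.accept()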

This works as expected initially. However, after some time (1-2 hrs) I see that the hook stops running periodically (as determined from the server logs). If I restart the server, the hook starts running again.

Also, I have seen jobs get rejected.

[testusera@ip-0A021004 default]$ qsub -lselect=2:ncpus=60 test.pbs
qsub: queuejob event: rejected request

If I remove the periodic event from the hook and then resubmit the job it submits without issue.

Any thoughts?

server attribute 'jobscript_max_size' not working properly.

Steps to reproduce:

  1. Set 'jobscript_max_size' to some value.
  2. Submit a job whose job script is larger than the configured jobscript_max_size; as expected, the job is rejected.
  3. Unset 'jobscript_max_size'.
  4. Submit a job whose job script is larger than the previously set jobscript_max_size; the job should be accepted and run, but instead it is terminated immediately with job exit status 127.

I submitted a job with the -koe option and with the server and mom log levels set high, and the job's error file contained an error message saying the script file was not available in /var/spool/pbs/mom_priv/jobs/ for the respective job.

I restarted the PBS server and submitted a new job with the same script; the job then ran fine and finished normally.

qstat JSON output and (non-)numeric environment variables

If an environment variable's value is all digits, but is not a valid python number, the JSON produced by qstat -W json cannot be parsed by python's json package.

Repeat by:

16:06 testpbs build.$ qsub -v foo=005 -h -- /usr/bin/sleep 100
53.testpbs.nas.nasa.gov
16:06 testpbs build.$ qstat -f -F json 53 > foo.json
16:06 testpbs build.$ grep foo foo.json
                "foo":005,
            "Submit_arguments":"-v foo=005 -h -- /usr/bin/sleep 100",
16:06 testpbs build.$ python
Python 2.7.13 (default, Jan 11 2017, 10:56:06) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> fd = open('foo.json')
>>> zz = json.load(fd)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/usr/lib64/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 41 column 24 (char 1676)
>>> 

The fix is to force environment variables to be written as strings:

$ git diff src/cmds/qstat.c 
diff --git a/src/cmds/qstat.c b/src/cmds/qstat.c
index 14619ccd..62a7f760 100644
--- a/src/cmds/qstat.c
+++ b/src/cmds/qstat.c
@@ -424,7 +424,7 @@ prt_attr(char *name, char *resource, char *value, int one_line) {
                                        *buf++ = *value++;
                                }
                                *buf = '\0';
-                               if (add_json_node(JSON_VALUE, JSON_NULL, JSON_ESCAPE, key, val) == NULL)
+                               if (add_json_node(JSON_VALUE, JSON_STRING, JSON_ESCAPE, key, val) == NULL)
                                        exit_qstat("out of memory");
                                if (*value != '\0')
                                        value++;

PBS not releasing dependent job

Dear all,
PBS isn't releasing a job that depends on another when the jobs are submitted from a compute node rather than the master.

I generated a very simple "pre" and "post" job to reproduce the issue, "pre":

#PBS -N pre
#PBS -l walltime=01:00:00
#PBS -q workq
#PBS -j oe
#PBS -l select=1:ncpus=1:mpiprocs=1

echo "$(date)"
echo $HOSTNAME
sleep 1m
echo "$(date)"

post:

#PBS -N post
#PBS -l walltime=01:00:00
#PBS -q workq
#PBS -j oe
#PBS -l select=1:ncpus=1:mpiprocs=1

echo "$(date)"
echo $HOSTNAME
sleep 1m
echo "$(date)"

Submitted them with:

$ qsub dep_test_pre.sh
$ qsub -W depend=afterok:<JOB-ID-HERE> dep_test_post.sh

After 1m, once the "pre" job completes, PBS notices and signals for "post" to be released:

$ grep '18.ip-' /var/spool/pbs/server_logs/20200417
04/17/2020 17:19:52;0100;Server@ip-0a000205;Job;18.ip-0A000205;enqueuing into workq, state 2 hop 1
04/17/2020 17:19:52;0008;Server@ip-0a000205;Job;18.ip-0A000205;Job Queued at request of daniele@ip-0a000207.htpsa2udrriuvayrbdpyqjz5ae.ax.internal.cloudapp.net, owner = daniele@ip-0a000207.htpsa2udrriuvayrbdpyqjz5ae.ax.internal.cloudapp.net, job name = post, queue = workq
04/17/2020 17:20:00;0006;Server@ip-0a000205;Hook;Server@ip-0a000205;cycle_periodic_hook_place - Jobs: ['18.ip-0A000205']
04/17/2020 17:20:00;0006;Server@ip-0a000205;Hook;Server@ip-0a000205;cycle_periodic_hook_place - Cmd: ['/opt/pbs/bin/qstat', '-f', '-F', 'json', '18.ip-0A000205']
04/17/2020 17:20:00;0006;Server@ip-0a000205;Hook;Server@ip-0a000205;cycle_periodic_hook_place - Key: 18.ip-0A000205
04/17/2020 17:20:00;0006;Server@ip-0a000205;Hook;Server@ip-0a000205;cycle_periodic_hook_place - full cmd ['/opt/pbs/bin/qalter', u'-lselect=1:ncpus=1:mpiprocs=1', u'-lplace=scatter:group=group_id', u'18.ip-0A000205']
04/17/2020 17:20:00;0006;Server@ip-0a000205;Hook;Server@ip-0a000205;cycle_periodic_hook_place - Cmd: ['/opt/pbs/bin/qalter', u'-lselect=1:ncpus=1:mpiprocs=1', u'-lplace=scatter:group=group_id', u'18.ip-0A000205']
04/17/2020 17:20:00;0008;Server@ip-0a000205;Job;18.ip-0A000205;Job Modified at request of root@ip-0a000205.htpsa2udrriuvayrbdpyqjz5ae.ax.internal.cloudapp.net
04/17/2020 17:20:00;0006;Server@ip-0a000205;Hook;Server@ip-0a000205;cycle_periodic_hook_place - Release the hold on job 18.ip-0A000205
04/17/2020 17:20:00;0006;Server@ip-0a000205;Hook;Server@ip-0a000205;cycle_periodic_hook_place - Cmd: ['/opt/pbs/bin/qrls', '-h', 'o', u'18.ip-0A000205']
04/17/2020 17:20:00;0008;Server@ip-0a000205;Job;18.ip-0A000205;Holds o released at request of root@ip-0a000205.htpsa2udrriuvayrbdpyqjz5ae.ax.internal.cloudapp.net

but nothing ever happens as the job is still being held:

$ qstat -f 18
Job Id: 18.ip-0A000205
    Job_Name = post
    Job_Owner = daniele@ip-0a000207.htpsa2udrriuvayrbdpyqjz5ae.ax.internal.clou
        dapp.net
    job_state = H
    queue = workq
    server = ip-0A000205
    Checkpoint = u
    ctime = Fri Apr 17 17:19:52 2020
    depend = afterok:17.ip-0A000205.htpsa2udrriuvayrbdpyqjz5ae.ax.internal.clou
        dapp.net@ip-0a000205.htpsa2udrriuvayrbdpyqjz5ae.ax.internal.cloudapp.ne
        t
    Error_Path = ip-0a000207.htpsa2udrriuvayrbdpyqjz5ae.ax.internal.cloudapp.ne
        t:/shared/home/daniele/good_tests_compute/post.e18
    Hold_Types = s
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Fri Apr 17 17:20:00 2020
    Output_Path = ip-0a000207.htpsa2udrriuvayrbdpyqjz5ae.ax.internal.cloudapp.n
        et:/shared/home/daniele/good_tests_compute/post.o18
    Priority = 0
    qtime = Fri Apr 17 17:19:52 2020
    Rerunable = True
    Resource_List.mpiprocs = 1
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.place = scatter:group=group_id
    Resource_List.select = 1:ncpus=1:mpiprocs=1
    Resource_List.ungrouped = false
    Resource_List.walltime = 01:00:00
    substate = 22
    Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,
        PBS_O_HOME=/shared/home/daniele,PBS_O_LOGNAME=daniele,
        PBS_O_WORKDIR=/shared/home/daniele/good_tests_compute,
        PBS_O_LANG=C.UTF-8,
        PBS_O_PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/cycl
        e/jetpack/bin:/opt/ibutils/bin:/opt/pbs/bin:/shared/home/daniele/.local
        /bin:/shared/home/daniele/bin,PBS_O_MAIL=/var/spool/mail/daniele,
        PBS_O_QUEUE=workq,
        PBS_O_HOST=ip-0a000207.htpsa2udrriuvayrbdpyqjz5ae.ax.internal.cloudapp
        .net
    Submit_arguments = -W depend=afterok:17 ../dep_test_post.sh
    project = _pbs_project_default

Please note this only fails when executing qsub from a compute node; when submitting from the master, all goes smoothly.
In order to be able to submit from the compute node, sudo qmgr -c "s s flatuid = true" has to be executed on the master, even though the UIDs and GIDs are the same for that user; might that be a factor?

Other details:

  • running on Azure cloud;
  • cluster created by CycleCloud;
  • PBS version: 18.1.4, but a customer has the same issue with a slightly older version;
  • CentOS 7.7;
  • SELinux set to Permissive.

Any suggestions on what it might be or how to troubleshoot further?

Thanks in advance for any help you might provide,
Daniele

PBS mail subject change

What is the syntax to modify the PBS mail subject and body lines from:
To: [email protected]
Subject: PBS JOB 19.master

PBS Job Id: 19.master
Job Name: testjob1

using pbspro-server-19.1.1-0.x86_64

Errors that show up when trying to compile PBSPro

Can someone please assist with resolving the errors that I receive below when trying to compile PBSPro:
root@:/opt/pbspro-master# make
Making all in scheduler
make[2]: Entering directory '/opt/pbspro-master/src/scheduler'
Makefile:769: ../../src/lib/Libpython/.deps/libpbs_sched_a-shared_python_utils.Po: No such file or directory
make[2]: *** No rule to make target '../../src/lib/Libpython/.deps/libpbs_sched_a-shared_python_utils.Po'. Stop.
make[2]: Leaving directory '/opt/pbspro-master/src/scheduler'
Makefile:511: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/opt/pbspro-master/src'
Makefile:547: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1

PBSPro: inflated memory readout for Singularity jobs

Hi,
I have a problem running jobs in Singularity containers due to an inflated memory readout causing the scheduler to kill the job. I have observations from two systems running the same job (same input data, same application container version, same command line):

System 1:

  • VM: 36 cores, ~1.4 TB RAM
  • CentOS Linux release 7.6.1810 (Core)
  • singularity version 3.4.2-1.1.el7
  • PBS/TORQUE Version: 6.1.1.1
    Several jobs of the same type completed w/o being killed. I once tracked memory consumption for one job/one specific input dataset every minute in a separate script, indicating that the job should complete with a peak consumption of ~600 GB

System 2:

  • 72 cores (only 36 or 24 used for testing), ~1.4 TB RAM
  • CentOS Linux release 7.7.1908 (Core)
  • singularity version 3.5.2
  • PBSPro pbs_version = 18.2.5.20190913061152
    This system is fully managed, so I only have the info from the local tech support that the job was killed by the scheduler (and not by the cgroup) after requesting > 2 TB RAM. The qstat output always indicated a peak of ~1 TB, but I was told that qstat is only updating its info every couple of minutes, so it does not tell me much.

System 1 was used for testing, and since all jobs always completed, I did not retain any (qstat) logs. The tech support for System 2 has no idea what's wrong, and it looks like they can't help me much if I don't find an angle where to start here. So, does anybody have any experience with PBSPro plus Singularity, or has even observed this behavior? Thanks in advance.

Best,
Peter

pbs_cgroups hook fails when CONFIG_MEMCG_SWAP_ENABLED is disabled

PBS Pro version: 18.1.4
OS: CentOS 7, SUSE Enterprise Linux 12/15, Ubuntu 16.04/18.04

Summary:
When the pbs_cgroups hook is enabled and the kernel (on the compute nodes) is configured to disable cgroups memory swap accounting (the CONFIG_MEMCG_SWAP_ENABLED kernel config is either unset or set to n), the hook fails and all jobs are placed in the (H)old state indefinitely.

Steps to reproduce:

  1. (Headnode) Enable the pbs_cgroups hook: qmgr -c "set hook pbs_cgroups enabled = true"
  2. (Compute node) Disable cgroups memory swap accounting by setting GRUB_CMDLINE_LINUX="swapaccount=0" in /etc/default/grub
  3. (Compute node) Update the grub config and reboot: update-grub && reboot
  4. (Headnode) Submit a job
  5. (Headnode) The job will be placed in the H state indefinitely

More information:
Mom log shows this error message:
07/16/2019 20:00:22;0080;pbs_python;Hook;pbs_python;['Traceback (most recent call last):', ' File "<embedded code object>", line 3971, in main', ' File "<embedded code object>", line 1763, in __init__', ' File "<embedded code object>", line 2160, in _get_assigned_cgroup_resources', "IOError: [Errno 2] No such file or directory: '/sys/fs/cgroup/blkio,cpuacct,memory,freezer/pbspro/0.headnode/memory.memsw.limit_in_bytes'"]

(Notice that, for some distros/kernels, the CONFIG_MEMCG_SWAP_ENABLED config is disabled by default)
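As an illustration of the failure mode only (not a patch to the shipped hook), a guard of roughly this shape would let the hook tolerate kernels without swap accounting; cgroup_dir is a hypothetical variable standing in for the job's cgroup directory:

import os

def read_memsw_limit(cgroup_dir):
    """Return the memsw limit as a string, or None when swap accounting is off."""
    memsw_path = os.path.join(cgroup_dir, "memory.memsw.limit_in_bytes")
    if not os.path.exists(memsw_path):
        # With swapaccount=0 (or CONFIG_MEMCG_SWAP_ENABLED unset) the memsw
        # files simply do not exist, which is what triggers the IOError above.
        return None
    with open(memsw_path) as f:
        return f.read().strip()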

pbsnodes shows state = initializing and down

I have a new PBS server + execution node setup. When I create the node in qmgr, pbsnodes -a shows this:

[root@gdcrstoap702 pbs]# pbsnodes -a
gdcrstoap702
Mom =
Port = 15002
pbs_version = 18.1.2
ntype = PBS
state =
pcpus =
comment = node down: communication closed
resv_enable = True
sharing =
last_state_change_time = Thu Jan 1 01:00:00 1970
last_used_time = Thu Jan 1 01:00:00 1970
resources_available.arch = linux
resources_available.host = gdcrstoap702
resources_available.mem = 24777144kb
resources_available.ncpus = 5
resources_available.vnode =
resources_available.accelerator_memory = 0kb
resources_available.hbmem = 0kb
resources_available.naccelerators = 0
resources_available.vmem = 0kb
resources_available.department =
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.vmem = 0kb

gdcrstoap011
Mom = gdcrstoap011, gdcrstoap702
Port = 15002
pbs_version = 18.1.2
ntype = PBS
state = initializing
pcpus = 2
resources_available.arch = linux
resources_available.host = gdcrstoap011
resources_available.mem = 7999928kb
resources_available.ncpus = 2
resources_available.vnode = gdcrstoap011
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed Nov 20 09:17:43 2019

and after a while, the gdcrstoap011 node shows state = down and the comment says "node down: communication closed"

my /etc/pbs.conf on server:
PBS_SERVER=gdcrstoap702
PBS_MOM_NODE_NAME=gdcrstoap011
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

and on execution node:
PBS_SERVER=gdcrstoap702
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0

All ports and connectivity work, and name resolution also works fine.
Am I missing some piece of configuration?

PTL expect() with MATCH_RE & op=NE doesn't seem to work

When calling expect() with MATCH_RE in the attribute list (e.g - self.server.expect(NODE, {'state': (MATCH_RE, 'down')}, op=NE, id=self.shortname)), MATCH_RE seems to override op=NE:

2019-03-08 07:31:17,308 INFO status on pbspro: node pbspro
2019-03-08 07:31:17,308 INFOCLI pbspro: /opt/pbs/bin/pbsnodes -s pbspro -v pbspro
2019-03-08 07:31:17,324 INFOCLI pbspro: /opt/pbs/bin/pbsnodes -s pbspro -v pbspro
2019-03-08 07:31:17,339 INFO expect on server pbspro: state ~ down node pbspro got: state = free
2019-03-08 07:31:17,844 INFOCLI pbspro: /opt/pbs/bin/pbsnodes -s pbspro -v pbspro
2019-03-08 07:31:17,861 INFO expect on server pbspro: state ~ down node pbspro attempt: 2 got: state = free
2019-03-08 07:31:18,365 INFOCLI pbspro: /opt/pbs/bin/pbsnodes -s pbspro -v pbspro
2019-03-08 07:31:18,382 INFO expect on server pbspro: state ~ down node pbspro attempt: 3 got: state = free
2019-03-08 07:31:18,887 INFOCLI pbspro: /opt/pbs/bin/pbsnodes -s pbspro -v pbspro
2019-03-08 07:31:18,903 INFO expect on server pbspro: state ~ down node pbspro attempt: 4 got: state = free
2019-03-08 07:31:19,410 INFOCLI pbspro: /opt/pbs/bin/pbsnodes -s pbspro -v pbspro
2019-03-08 07:31:19,429 INFO expect on server pbspro: state ~ down node pbspro attempt: 5 got: state = free
2019-03-08 07:31:19,932 INFOCLI pbspro: /opt/pbs/bin/pbsnodes -s pbspro -v pbspro
2019-03-08 07:31:19,948 INFO expect on server pbspro: state ~ down node pbspro attempt: 6 got: state = free
2019-03-08 07:31:20,454 INFOCLI pbspro: /opt/pbs/bin/pbsnodes -s pbspro -v pbspro
2019-03-08 07:31:20,473 INFO expect on server pbspro: state ~ down node pbspro attempt: 7 got: state = free
2019-03-08 07:31:20,979 INFOCLI pbspro: /opt/pbs/bin/pbsnodes -s pbspro -v pbspro
2019-03-08 07:31:20,998 INFO expect on server pbspro: state ~ down node pbspro attempt: 8 got: state = free
2019-03-08 07:31:21,503 INFOCLI pbspro: /opt/pbs/bin/pbsnodes -s pbspro -v pbspro

..... (and so on until it fails)

errors show up when trying to compile PBSPro

When trying to compile PBSPro I keep getting these errors. Can someone please assist with what could be causing this issue:

root@:/opt/pbspro-master# make

Generating ecl_resc_def_all.c from ../../../src/lib/Libattr/master_resc_def_all.xml
Traceback (most recent call last):
  File "../../../buildutils/attr_parser.py", line 78, in __iter__
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "../../../buildutils/attr_parser.py", line 947, in <module>
    main(sys.argv[1:])
  File "../../../buildutils/attr_parser.py", line 928, in main
    resc_attr(m_file, s_file, e_file)
  File "../../../buildutils/attr_parser.py", line 529, in resc_attr
    for case in switch(mflg):
RuntimeError: generator raised StopIteration
Makefile:3688: recipe for target 'ecl_resc_def_all.c' failed
make[3]: *** [ecl_resc_def_all.c] Error 1
make[3]: Leaving directory '/opt/pbspro-master/src/lib/Libpbs'
Makefile:509: recipe for target 'all-recursive' failed
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory '/opt/pbspro-master/src/lib'
Makefile:511: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/opt/pbspro-master/src'
Makefile:547: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1
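For context, this traceback matches the PEP 479 behavior change: from Python 3.7 onward, a StopIteration raised inside a generator is converted into RuntimeError ("generator raised StopIteration"). A minimal illustration of the pattern, assuming that is indeed the cause here and unrelated to the actual attr_parser.py code:

def gen_old_style(values):
    for v in values:
        if v is None:
            raise StopIteration  # pre-3.7 idiom for ending a generator early
        yield v

def gen_fixed(values):
    for v in values:
        if v is None:
            return               # PEP 479-compliant way to end the generator early
        yield v

print(list(gen_fixed([1, 2, None, 3])))      # [1, 2]
# On Python 3.7+ the next call raises: RuntimeError: generator raised StopIteration
print(list(gen_old_style([1, 2, None, 3])))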

Why can't PBS count the memory usage correctly when working with MPI?

Here is a simple experiment I tried:

Given a task called "sim.exe" which does a model simulation, I use MPI to launch x copies of "sim.exe" simultaneously on one node (a shared-memory system). I have tried four different runs with x set to different values (e.g., 1, 4, 8, 16). I then check the memory usage through the PBS report's "memory used" and "Vmem used". I observed that across these different runs the "memory used" and "Vmem used" stay the same, with "mem" = 8,432 KB and "vmem" = 489,716 KB.

However, if I do not use MPI but instead use other parallelization techniques ("threading" or "multiprocessing" in Python) to launch x copies of "sim.exe" simultaneously, the "mem" and "vmem" usage reported by PBS increases linearly with the value of x, which is more like what I would expect.

So why do "mem" and "vmem" stay the same even though x times more work is being done? What do the "mem used" and "vmem used" values reported by PBS mean when MPI is used?

Can anyone provide some answers or suggestions on this?

The original post is here: How to understand PBS output “mem” and “vmem” keep the same when the task is x-fold increased with mpirun -np x task

unauthorized host message after submitting job

I have a multihomed head node running PBS Pro CE 18.1.4. We can submit jobs, but they don't run.
In sched_logs, we see messages like the following:

05/18/2020 19:10:40;0001;pbs_sched;Svr;pbs_sched;badconn, headnode.company.com on port 896 unauthorized host

A reverse DNS lookup on the head node's cluster IP address returns "headnode.eth.local", whether I use "nslookup" or "dig -x". No other IP address on the head node gets a result from a reverse DNS lookup.

/etc/resolv.conf is set up with eth.local as the first search domain

The hosts file line for the IP address is "IPADDRESS headnode headnode.eth.local headnode.company.com".
Changing the position of the 2nd and 3rd names in that line makes no difference.

I've tried following pages I've found regarding multi-homed head nodes, but so far have not been successful. Is there something official regarding setting this up? Where is the scheduler getting the name "headnode.company.com", rather than "headnode.eth.local"?

configure and make errors on Ubuntu 18.04.4 LTS

Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-88-generic x86_64)

at step 6 of INSTALLATION GUIDE
$ ./configure --prefix=/opt/pbs

the error "configure: error: Python must be version 2.6 or 2.7" appeared

fixed with
$ sudo apt install python-minimal

then a make error appeared:

Making all in Libpython
make[3]: Entering directory '/home/adminrig/github/pbspro-19.1.3/src/lib/Libpython'
gcc -DHAVE_CONFIG_H -I. -I../../../src/include  -I../../../src/include -I/usr/include/python2.7 -I/usr/include/python2.7   -g -O2 -MT libpbspython_a-common_python_utils.o -MD -MP -MF .deps/libpbspython_a-common_python_utils.Tpo -c -o libpbspython_a-common_python_utils.o `test -f 'common_python_utils.c' || echo './'`common_python_utils.c
In file included from common_python_utils.c:40:0:
../../../src/include/pbs_python_private.h:64:10: fatal error: Python.h: No such file or directory
 #include <Python.h>
          ^~~~~~~~~~
compilation terminated.
Makefile:567: recipe for target 'libpbspython_a-common_python_utils.o' failed
make[3]: *** [libpbspython_a-common_python_utils.o] Error 1
make[3]: Leaving directory '/home/adminrig/github/pbspro-19.1.3/src/lib/Libpython'
Makefile:504: recipe for target 'all-recursive' failed
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory '/home/adminrig/github/pbspro-19.1.3/src/lib'
Makefile:506: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/adminrig/github/pbspro-19.1.3/src'
Makefile:542: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1

Output from pbsnodes incorrect when output in JSON format

When running the command pbsnodes -vSj a different result is obtained when outputting in JSON format vs. tabular. Here is an example:

pbsnodes -vSj node0017 (default tabular):

                                                        mem       ncpus   nmics   ngpus
vnode           state           njobs   run   susp      f/t        f/t     f/t     f/t   jobs
--------------- --------------- ------ ----- ------ ------------ ------- ------- ------- -------
node0017        offline              2     2      0     3gb/63gb    1/20     0/0     0/0 300065,300640

pbsnodes -vSj -F json node0017:

{
    "timestamp":1570045708,
    "pbs_version":"18.1.4",
    "pbs_server":"bright01-thx",
    "nodes":{
        "node0017":{
            "State":"offline",
            "Total Jobs":8,
            "Running Jobs":8,
            "Suspended Jobs":0,
            "mem f/t":"3gb/63gb",
            "ncpus f/t":"1/20",
            "nmics f/t":"0/0",
            "ngpus f/t":"0/0",
            "jobs":[
                "300065.bright01-thx",
                "300640.bright01-thx",
                "30right01-ght01-thx",
                "300640.bright01-thx",
                "300640.bright01-thx",
                "300640.bright01-thx",
                "300640.bright01-thx",
                "300640.bright01-thx"
            ]
        }
    }
}

From the above, when output is tabular, pbsnodes shows 2 jobs on the node, but when output is JSON it shows 8 jobs, 7 of which are the same job, and the third one in the list is garbled. qstat shows that the actual number of jobs on the node is 2.

qstat -a -n | grep -B 1 node0017

300065.bright01 username def-medi E-Ph_3341_   3608   1   1   35gb 72:00 R 08:05
   node0017/0
--
300640.bright01 username def-medi amine_prop 194350   1  18   25gb 72:00 R 51:26
   node0017/1*18

There are many other nodes displaying the same behavior, but I have only shown one node. One thing they have in common is that all of these nodes are offline. We put them offline to drain the jobs so that we could reboot them.

I'm not sure what the significance is of this issue, but I thought I would report it. I guess if someone were writing scripts using pbsnodes as a JSON API, then the misreporting of total job numbers and duplicate IDs could be an issue.

cgroup hook should tolerate missing hook events

The cgroup hook relies on certain hook events to be present. This prevents sites from taking a newer version of the hook and running it with an older version of PBS Pro. For example, PBS Pro 19.x provides an execjob_abort event that is not present in 18.x. If the hook checked for the existence of an event prior to registering it then it would be backward compatible. One way to accomplish this is as follows:

import pbs
if 'EXECJOB_ABORT' in dir(pbs):
    self.hook_events[pbs.EXECJOB_ABORT] = {
        'name': 'execjob_abort',
        'handler': self._execjob_end_handler
    }

It would also be possible to place the registration within a try/except block and ignore any errors if the event is not present.
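A sketch of that try/except alternative, mirroring the fragment above (self.hook_events and self._execjob_end_handler come from that snippet and assume the same class context):

import pbs

try:
    self.hook_events[pbs.EXECJOB_ABORT] = {
        'name': 'execjob_abort',
        'handler': self._execjob_end_handler
    }
except AttributeError:
    # Older PBS Pro versions (e.g. 18.x) do not define EXECJOB_ABORT; skip it.
    pass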

qsub blocking is not working

Hello,
It's probably not a bug, but I am still unsure what is causing the problem.
I am submitting a job using qsub from Python 3 as follows (here is the complete code):

scfc = ["qsub", "-Wblock=true", "script.sh"]
optstate = ["opt1", "opt2", "opt3"]

          for sstate in optstate:
            subprocess.call(scfc)
            sdir = optstate.index(sstate)
            print(sstate)
            genincar2(sdir)
            shutil.copy2("INCAR", "INCAR"+"."+str(sstate))
            shutil.copy2("CONTAR", "CONTCAR"+"."+str(sstate))

and the script.sh is:

#!/bin/bash
#PBS -S /bin/bash
#PBS -N Test
#PBS -l select=2:ncpus=24:mpiprocs=24
#PBS -q workq
#PBS -joe
#PBS -V
export I_MPI_FABRICS=shm:tmi
export I_MPI_PROVIDER=psm2
export I_MPI_FALLBACK=0
export KMP_AFFINITY=verbose,scatter
module load intel/2018
module load vasp/5.4.4
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > pbs_nodes
echo Working directory is $PBS_O_WORKDIR
NPROCS=`wc -l < $PBS_NODEFILE`
NNODES=`uniq $PBS_NODEFILE | wc -l`
mpirun -np $NPROCS --machinefile $PBS_NODEFILE vasp_std

I am expecting that, with -Wblock=true (and also subprocess.call), the code will wait for vasp_std to finish before moving on to the sdir = optstate.index(sstate) line.

But this is not the case, and I get an error because CONTCAR has not been generated yet.

Can someone kindly help?

syntax issue in install_db file

While running install_db for database creation, the script is supposed to preserve the existing password file.
There is a syntax issue while verifying the "upgrade" variable.

Execution logs:

[root@physics scripts]# /opt/pbs/libexec/install_db
/opt/pbs/libexec/install_db: line 223: [0: command not found
Creating the PBS Data Service...
Starting PBS Data Service..
Stopping PBS Data Service..
waiting for server to shut down.... done
server stopped

It does not cause harm when creating the database for a vanilla install, but it causes an issue while saving the password in an upgrade scenario.

Documentation

I am familiar with the PDF documentation for PBS Pro, but is there a website with the documentation that would be easier to search and view? If not, could I convert some or all of it to markdown in order to easily display on the web? I was reading the copyright notice on the user guide and am unclear whether that applies to the documentation or just the software.

Ideally, the Altair folks could compile the documentation into markdown or HTML and add it to a new GitHub repository and serve it using GitHub Pages.

Building from source missing file

When building from source (with make or make dist) the build fails due to missing src/resmom/mom_mach.c and mom_mach.h in either src/include or src/resmom/.

I was following the instructions here to build for aarch64 running CentOS.

jobscript_max_size is not working.

Job runs despite limiting jobscript_max_size to 1b.

[test@ohpc137pbsib-sms ~]$ qsub --version
pbs_version = 19.1.1
[test@ohpc137pbsib-sms ~]$

[root@ohpc137pbsib-sms ~]# qmgr -c "p s" | grep jobscript_max_size
[root@ohpc137pbsib-sms ~]# qmgr -c "set server jobscript_max_size = 1"
[root@ohpc137pbsib-sms ~]# qmgr -c "p s" | grep jobscript_max_size
set server jobscript_max_size = 1b
[root@ohpc137pbsib-sms ~]#

[test@ohpc137pbsib-sms ~]$ wc -c yes.sh
3897 yes.sh
[test@ohpc137pbsib-sms ~]$

[test@ohpc137pbsib-sms ~]$ ls -l yes.sh
-rw-rw-r-- 1 test test 3897 Jun 10 15:00 yes.sh
[test@ohpc137pbsib-sms ~]$

[test@ohpc137pbsib-sms ~]$ qsub yes.sh
1101.ohpc137pbsib-sms

[test@ohpc137pbsib-sms ~]$ qstat -an

ohpc137pbsib-sms:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1101.ohpc137pbs test     workq    yes         29423   1   1    --    --  R 00:00
   ohpc137pbsib-c001/1
[test@ohpc137pbsib-sms ~]$

scheduler may oversubscribe node with subjob

I observed some oversubscription: the scheduler can oversubscribe a node with a subjob. The problem is probably caused by the filter added in #872; the scheduler can now miscount free resources because it considers all non-running subjobs to be gone (has_ghost_job==1). To fix this without breaking #791, I would suggest changing the filter added in #872 to filter only jobs in JOB_STATE_EXPIRED.

Please, could someone check it with me (@arungrover)? I am not raising a PR because I am not really sure yet.

./autogen.sh gives subdirectory option error

Hi all, when I am trying to install pbspro I get this error:

src/lib/Libpbs/Makefile.am:57: warning: source file '../Libattr/attr_fn_arst.c' is in a subdirectory,
src/lib/Libpbs/Makefile.am:57: but option 'subdir-objects' is disabled
automake: warning: possible forward-incompatibility.

Does anyone know how to fix this? I read that I am supposed to edit the configure.ac file and enable subdir-objects, but I don't know exactly how to do that.

add autogen.sh to the %build section of the spec file

we need to add
[ -f configure ] || ./autogen.sh
to the %build section of both pbspro.spec and pbspro.spec.in

With the above modification, we can build RPMs with rpmbuild -ta SOURCE/pbspro-19.1.2.tar.gz both from https://github.com/PBSPro/pbspro/releases/download/v19.1.2/pbspro-19.1.2.tar.gz and from https://github.com/PBSPro/pbspro/archive/v19.1.2.tar.gz.

And https://github.com/PBSPro/pbspro/releases/download/v19.1.2/pbspro-19.1.2.tar.gz may no longer be required to be listed.

qsub: cannot access script file when TMPDIR is set to a long value

Hello,
I'm reporting an issue with qsub whereby setting TMPDIR to a long string is resulting in an error "qsub: cannot access script file". In Cadence's case, it is resulting in a segfault. I have verified the same behavior appears in 19.1.1 and 19.1.2.

jnewman@hydra:~> qstat --version
pbs_version = 19.1.1
jnewman@hydra:~> qsub -l select=1:ncpus=1:mem=1024MB -mn -o /dev/null -e /dev/null -k oed -W umask=022 -- sleep 5
3012.hydra
jnewman@hydra:~> export TMPDIR=/dv/scratch/relntr/vip_logs/patch_xcelium5_gather_update.073119-110237/venus_in_grid_scratch_2019-Aug-05-20-12-50_16701/tmpdir9999
jnewman@hydra:~> qsub -l select=1:ncpus=1:mem=1024MB -mn -o /dev/null -e /dev/null -k oed -W umask=022 -- sleep 5                                               
qsub: cannot access script file
jnewman@hydra:~> export TMPDIR=/dv/scratch/relntr/vip_logs/patch_xcelium5_gather_update.073119-110237/venus_in_grid_scratch_2019-Aug-05-20-12-50_16701         
jnewman@hydra:~> qsub -l select=1:ncpus=1:mem=1024MB -mn -o /dev/null -e /dev/null -k oed -W umask=022 -- sleep 5                                               
qsub: cannot access script file
jnewman@hydra:~> touch $TMPDIR/test.file
jnewman@hydra:~> rm -f $TMPDIR/test.file
jnewman@hydra:~> export TMPDIR=/dv/scratch/relntr/vip_logs/patch_xcelium5_gather_update.073119-110237
jnewman@hydra:~> qsub -l select=1:ncpus=1:mem=1024MB -mn -o /dev/null -e /dev/null -k oed -W umask=022 -- sleep 5
3014.hydra

Job array subjobs are not being held after failing/requeueing 20 times.

Currently, the run_count attribute of an instantiated subjob is not incremented for each retry by the server to run it on a mom. As a result, any exception or rejection by the mom causes the subjob to be transferred back and forth between the server and mom indefinitely, wasting mom and server CPU time and log space.

support for DRMAA?

Hi,
This is just a question I can't find an answer to elsewhere: is there support for DRMAA in PBSPro (or plans for it)?
There's a third-party library:
http://apps.man.poznan.pl/trac/pbs-drmaa/wiki
but it no longer seems to work for us (I'm a PBSPro user) and I am unsure whether it is still being maintained.
Best wishes,
Antonio

"Job reported running on node no longer exists or is not in running state" msgs

Since switching to version 19, we are getting hundreds of thousands of these scheduler messages per day. The message is issued by collect_jobs_on_nodes() when it finds a node with a running job where the job is not in the list of jobs given to collect_jobs_on_nodes(). The list of jobs comes from a call to resource_resv_filter() to find all jobs that are not in reservations.

This means that the message is logged for every node that has a job running in a reservation. This is a perfectly normal situation, and it is not clear why the code is remarking on it. The comment right before the message says:

/* Race Condition occurred: nodes were queried when a job existed.
 * Jobs were queried when the job no longer existed.  Make note
 * of it on the job so the node's resources_assigned values can be
 * recalculated later.
 */

The case described by the comment is not what is happening for us when the messages are output.

sched_preempt_enforce_resumption broken on master branch

Steps to reproduce issue:

  1. Single-node setup with 4 ncpus; an expressq with priority >150 and the sched attribute sched_preempt_enforce_resumption enabled
  2. Submit a normal job asking for 4 ncpus and 300 s walltime, then 2 single-ncpu jobs, one asking for 50 s walltime and the other for 200 s walltime
    => j1 runs; j2 and j3 are queued
  3. Submit a high-priority job to expressq asking for 2 ncpus and 100 s walltime
    => j4 runs after preempting j1. Due to sched_preempt_enforce_resumption, j3 should not have run, but it runs.

[saritah@nx15 jobs]$ qsub -lselect=ncpus=4 -l walltime=300 -- /bin/sleep 500
0.nx15
[saritah@nx15 jobs]$ sleep 1
[saritah@nx15 jobs]$ qsub -lselect=ncpus=1 -l walltime=50 -- /bin/sleep 500
1.nx15
[saritah@nx15 jobs]$ sleep 1
[saritah@nx15 jobs]$ qsub -lselect=ncpus=1 -l walltime=200 -- /bin/sleep 500
2.nx15
[saritah@nx15 jobs]$ sleep 4
[saritah@nx15 jobs]$ qsub -lselect=ncpus=2 -l walltime=100 -q expressq -- /bin/sleep 500
3.nx15
[saritah@nx15 jobs]$ qstat -s

nx15:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time


0.nx15* saritah workq STDIN 13671 1 4 -- 00:05 S 00:00
Not Running: Not enough free nodes available
1.nx15* saritah workq STDIN 13679 1 1 -- 00:00 R 00:00
Job run at Sun Jun 30 at 23:24 on (nx15:ncpus=1)
2.nx15* saritah workq STDIN 13680 1 1 -- 00:03 R 00:00
Job run at Sun Jun 30 at 23:24 on (nx15:ncpus=1)
3.nx15* saritah expressq STDIN 13678 1 2 -- 00:01 R 00:00
Job run at Sun Jun 30 at 23:24 on (nx15:ncpus=2)
[saritah@nx15 jobs]$ qmgr -c "p sched"

...

create sched default
...
set sched sched_cycle_length = 00:20:00
set sched sched_preempt_enforce_resumption = True
set sched sched_port = 15004
...
[saritah@nx15 jobs]$

qstat -a gives slurm error

Hi all, I followed the install steps, and then when I typed in qstat -a I got this error message:

perl: error: s_p_parse_file: unable to status file /etc/slurm-llnl/slurm.conf: No such file or directory, retrying in 1sec up to 60sec

I ran apt install slurm and slurmd, but that didn't create this conf file.

Default Job Array Output does not provide separate files (v18.1.3)

For v18.1.3, submitting a job array does not produce an output file for each subjob as it did in v14.1.0 (which we just upgraded from). For instance, a job array with two subjobs produces a single output file; for example,
2063[].auslynchpcc04.OU
instead of
2063[1].auslynchpcc04.OU
2063[2].auslynchpcc04.OU

The contents are overwritten by the last subjob to finish.

Is there a configuration change that I am missing?

Server failing to recover queue on restart and having panic shutdown

The server is unable to recover a queue that has preempt_targets set to a queue, and it has a panic shutdown when we try to re-create this unrecovered queue.

Steps to reproduce:

  1. Create workq2 and workq3
  2. On workq2, set preempt_targets to a queue
    set queue workq2 preempt_targets=queue=workq3
  3. Restart PBS
    /etc/init.d/pbs restart
    PBS fails to recover workq2!
  4. Try to create workq2
    create queue workq2 queue_type=e,enabled=t,started=t
    The server has a panic shutdown!

pbs_mom core dump in post_reply()

If a sister mom has trouble talking to the pbs_comm, the mom can core dump on EOJ hook completion:

(gdb) bt
 #0  0x000000000046d312 in post_reply (pjob=0x908160, err=0) at /home6/csjohn12/Projects/drtoss19/src/resmom/requests.c:1205
 #1  0x0000000000453737 in reply_hook_bg (pjob=0x908160) at /home6/csjohn12/Projects/drtoss19/src/resmom/mom_hook_func.c:3479
 #2  0x0000000000453969 in mom_process_background_hooks (ptask=0x7a4e00) at /home6/csjohn12/Projects/drtoss19/src/resmom/mom_hook_func.c:3548
 #3  0x00000000004ceaa0 in dispatch_task (ptask=0x7a4e00) at /home6/csjohn12/Projects/drtoss19/src/lib/Libutil/work_task.c:142
 #4  0x00000000004ceda2 in default_next_task () at /home6/csjohn12/Projects/drtoss19/src/lib/Libutil/work_task.c:308
 #5  0x00000000004640d8 in main (argc=1, argv=0x7fffffffed18) at /home6/csjohn12/Projects/drtoss19/src/resmom/mom_main.c:9803
 (gdb) p pjob->ji_hosts
 $5 = (hnodent *) 0x0

Suggested patch that prevents the core dump. However, someone with more knowledge should see how we got here with ji_hosts NULL:

diff --git a/src/resmom/requests.c b/src/resmom/requests.c
index 94a8ee01..f31febdb 100644
--- a/src/resmom/requests.c
+++ b/src/resmom/requests.c
@@ -1201,6 +1201,11 @@ post_reply(job *pjob, int err)
 
        if (pjob->ji_postevent == TM_NULL_EVENT)        /* no event */
                return;
+       if (pjob->ji_hosts == NULL) {           /* No one to talk to */
+               pjob->ji_postevent = TM_NULL_EVENT;
+               pjob->ji_taskid = TM_NULL_TASK;
+               return;
+       }
 
        stream = pjob->ji_hosts[0].hn_stream;   /* MS stream */
        cookie = pjob->ji_wattr[(int)JOB_ATR_Cookie].at_val.at_str;

SmokeTest.test_finished_jobs fails due to permission issue

Taken off master, 52dab14.
Installed pbspro_server.rpm
Installed ptl from source, using sudo pip install -U -r requirements.txt .

As a non-root user, I called pbs_benchpress -t SmokeTest.test_finished_jobs

Test fails.
I've attached the test log, but here's a snippet:

2019-08-26 12:04:11,211 INFOCLI2 shecil: which chmod
2019-08-26 12:04:11,215 INFOCLI2 shecil: /usr/bin/chmod 0755 /usr/lib/python2.7/site-packages/ptl/utils/jobs/eatcpu.py
2019-08-26 12:04:11,219 ERROR    err: ['/usr/bin/chmod: changing permissions of \xe2\x80\x98/usr/lib/python2.7/site-packages/ptl/utils/jobs/eatcpu.py\xe2\x80\x99: Operation not permitted']
2019-08-26 12:04:11,219 INFO     job: executable set to /usr/lib/python2.7/site-packages/ptl/utils/jobs/eatcpu.py with arguments: 15
2019-08-26 12:04:11,220 INFOCLI  shecil: sudo -H -u pbsuser /opt/pbs/bin/qsub -l ncpus=2 -- /usr/lib/python2.7/site-packages/ptl/utils/jobs/eatcpu.py 15

I'm not sure we should be changing the permissions of things in /usr/lib to executable. Maybe the job should be python <script.py>?

smoketest.txt

pbs_benchpress -> ptl_test_db.py JSONDecodeError

When running the full test suite, it fails to complete due to corrupted JSON at different points on debian9. We have seen this on the other distros, but we don't have the output. Specifically, we are using the debian9 Docker image with Python 3.5.9 pulled from python.org and compiled with valgrind support.

export PYTHON3_MAJOR_VERSION=3.5
export PYTHON3_VERSION=3.5.9
export PBS_BUILD_DIR=/tmp/build-pbspro
export PBS_INSTALL_BASE=/opt
export PBS_INSTALL_DIR=$PBS_INSTALL_BASE/pbs
export PTL_INSTALL_DIR=$PBS_INSTALL_BASE/ptl
export PYTHON3_INSTALL_DIR=$PBS_INSTALL_BASE/python-$PYTHON3_VERSION
export PATH=$PYTHON3_INSTALL_DIR/bin:$PBS_INSTALL_DIR/bin:$PTL_INSTALL_DIR/bin:$PATH
export PYTHONPATH=$PTL_INSTALL_DIR/lib/python$PYTHON3_MAJOR_VERSION/site-packages:$PYTHONPATH
# --- START PYTHON INSTALL ---
cd /tmp
wget https://www.python.org/ftp/python/$PYTHON3_VERSION/Python-$PYTHON3_VERSION.tgz
tar xzf Python-$PYTHON3_VERSION.tgz
cd Python-$PYTHON3_VERSION
./configure --prefix=$PYTHON3_INSTALL_DIR --without-pymalloc --with-pydebug --with-valgrind
make -j 8
sudo make install
sudo $PYTHON3_INSTALL_DIR/bin/pip3 install --upgrade pip
sudo $PYTHON3_INSTALL_DIR/bin/pip3 install nose
sudo $PYTHON3_INSTALL_DIR/bin/pip3 install pexpect
# --- END PYTHON INSTALL ---

time $PTL_INSTALL_DIR/bin/pbs_benchpress --db-name $PBS_BUILD_DIR/test/tests/ptl_test_results.json -f $PTL_INSTALL_DIR/tests -o $PBS_BUILD_DIR/test/tests/ptl_output_all.txt |& tee out.txt

Below is the Traceback:

2020-02-17 15:25:08,701 INFO     expect on server pbsdev-mgmt-ptl-debian9: job_state set 0 job ...  OK
2020-02-17 15:25:08,702 INFO     manager on pbsdev-mgmt-ptl-debian9 as root: set sched default {'scheduling': 'True'}
2020-02-17 15:25:08,703 INFOCLI  pbsdev-mgmt-ptl-debian9: sudo -H -u root /opt/pbs/bin/qmgr -c set sched default scheduling=True
2020-02-17 15:25:08,757 INFOCLI  pbsdev-mgmt-ptl-debian9: /opt/pbs/bin/qmgr -c list sched default
2020-02-17 15:25:08,774 INFO     expect on server pbsdev-mgmt-ptl-debian9: scheduling = True sched default ...  OK
2020-02-17 15:25:08,775 INFO     ====================================
2020-02-17 15:25:08,775 INFO     Completed TestHookSmokeTest tearDown
2020-02-17 15:25:08,776 INFO     ====================================
Traceback (most recent call last):
  File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/case.py", line 134, in run
    self.runTest(result)
  File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/case.py", line 152, in runTest
    test(result)
  File "/opt/python-3.5.9/lib/python3.5/unittest/case.py", line 653, in __call__
    return self.run(*args, **kwds)
  File "/opt/python-3.5.9/lib/python3.5/unittest/case.py", line 621, in run
    result.addSuccess(self)
  File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/proxy.py", line 164, in addSuccess
    self.plugins.addSuccess(self.test)
  File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/plugins/manager.py", line 99, in __call__
    return self.call(*arg, **kw)
  File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/plugins/manager.py", line 167, in simple
    result = meth(*arg, **kw)
  File "/opt/ptl/lib/python3.5/site-packages/ptl/utils/plugins/ptl_test_db.py", line 1820, in addSuccess
    self.__dbconn.write(self.__create_data(test, None, 'PASS'))
  File "/opt/ptl/lib/python3.5/site-packages/ptl/utils/plugins/ptl_test_db.py", line 1638, in write
    self.__write_test_data(data['testdata'])
  File "/opt/ptl/lib/python3.5/site-packages/ptl/utils/plugins/ptl_test_db.py", line 1623, in __write_test_data
    jdata = json.load(self.__dbobj[_TESTRESULT_TN])
  File "/opt/python-3.5.9/lib/python3.5/json/__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/opt/python-3.5.9/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/opt/python-3.5.9/lib/python3.5/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 7030 column 2 (char 270460)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
    File "/opt/ptl/bin/pbs_benchpress", line 508, in <module>
      nose.main(defaultTest=tests, argv=[sys.argv[0]], plugins=plugins)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/core.py", line 121, in __init__
      **extra_args)
    File "/opt/python-3.5.9/lib/python3.5/unittest/main.py", line 95, in __init__
      self.runTests()
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/core.py", line 207, in runTests
      result = self.testRunner.run(self.test)
    File "/opt/ptl/lib/python3.5/site-packages/ptl/utils/plugins/ptl_test_runner.py", line 498, in run
      test(result)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/suite.py", line 178, in __call__
      return self.run(*arg, **kw)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/suite.py", line 225, in run
      test(orig)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/suite.py", line 178, in __call__
      return self.run(*arg, **kw)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/suite.py", line 225, in run
      test(orig)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/suite.py", line 178, in __call__
      return self.run(*arg, **kw)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/suite.py", line 225, in run
      test(orig)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/suite.py", line 178, in __call__
      return self.run(*arg, **kw)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/suite.py", line 225, in run
      test(orig)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/case.py", line 46, in __call__
      return self.run(*arg, **kwarg)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/case.py", line 139, in run
      result.addError(self, err)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/proxy.py", line 131, in addError
      plugins.addError(self.test, err)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/plugins/manager.py", line 99, in __call__
      return self.call(*arg, **kw)
    File "/opt/python-3.5.9/lib/python3.5/site-packages/nose/plugins/manager.py", line 167, in simple
      result = meth(*arg, **kw)
    File "/opt/ptl/lib/python3.5/site-packages/ptl/utils/plugins/ptl_test_db.py", line 1814, in addError
      self.__dbconn.write(self.__create_data(test, err, 'ERROR'))
    File "/opt/ptl/lib/python3.5/site-packages/ptl/utils/plugins/ptl_test_db.py", line 1638, in write
      self.__write_test_data(data['testdata'])
    File "/opt/ptl/lib/python3.5/site-packages/ptl/utils/plugins/ptl_test_db.py", line 1623, in __write_test_data
      jdata = json.load(self.__dbobj[_TESTRESULT_TN])
    File "/opt/python-3.5.9/lib/python3.5/json/__init__.py", line 268, in load
      parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
    File "/opt/python-3.5.9/lib/python3.5/json/__init__.py", line 319, in loads
      return _default_decoder.decode(s)
    File "/opt/python-3.5.9/lib/python3.5/json/decoder.py", line 342, in decode
      raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 7030 column 2 (char 270460)
sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/build-pbspro/test/tests/ptl_test_results.json' mode='w+' encoding='ANSI_X3.4-1968'>

Below are 3 different tails of the json database.

>>>tail -n 16  ptl_test_results.json
      },
      "module": "tests.functional.pbs_accumulate_resc_used",
      "file": "tests/functional/pbs_accumulate_resc_used.py",
      "docstring": "This tests the feature in PBS that enables mom hooks to accumulate resources_used values for resources beside cput, cpupercent, and mem. This includes accumulation of custom resources. The mom hooks supported this feature are: exechost_periodic, execjob_prologue, and execjob_epilogue. PRE: Have a cluster of PBS with 3 mom hosts, with an exechost_startup that adds custom resources. POST: When a job ends, accounting_logs reflect the aggregated resources_used values. And with job_history_enable=true, one can do a 'qstat -x -f <jobid>' to obtain information of a previous job."
    }
  },
  "user": "pershey",
  "test_conf": {},
  "additional_data": {}
}ave a cluster of PBS with 3 mom hosts, with an exechost_startup that adds custom resources. POST: When a job ends, accounting_logs reflect the aggregated resources_used values. And with job_history_enable=true, one can do a 'qstat -x -f <jobid>' to obtain information of a previous job."
    }
  },
  "user": "pershey",
  "test_conf": {},
  "additional_data": {}
}


>>>tail -n 16  ptl_test_results2.json
  "run_id": "1581916219",
  "user": "pershey",
  "product_version": "19.0.0",
  "additional_data": {}
}l_data": {},
  "run_id": "1581916219",
  "user": "pershey",
  "test_conf": {},
  "machine_info": {
    "pbsdev-mgmt-ptl-debian9": {
      "pbs_install_type": "server",
      "platform": "Linux pbsdev-mgmt-ptl-debian9 5.0.0-38-generic #41-Ubuntu SMP Tue Dec 3 00:27:35 UTC 2019 x86_64 ",
      "os_info": "Linux-5.0.0-38-generic-x86_64-with-debian-9.11"
    }
  }
}


>>>tail -n 16  ptl_test_results3.json
  "product_version": "19.0.0",
  "additional_data": {}
}memory_spread_page",
      "test_cgroup_cpuset_ncpus_are_cores",
      "test_submit_job_comp",
      "test_fairshare_formula",
      "test_fairshare_formula2",
      "test_fairshare_formula4",
      "test_fairshare_formula6",
      "test_movejob_hook"
    ],
    "test_end_time": "2020-02-17 15:24:54.933983"
  },
  "test_conf": {},
  "product_version": "19.0.0"
}

The problem seems to be inside ptl_test_db.py in the class JSONDb.
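
All three tails end with a complete JSON document followed by leftover bytes from an earlier, longer write, which looks like the file being rewritten in place without being truncated. As a hedged illustration only (the function and file names below are made up; this is not the actual JSONDb code), the following Python sketch reproduces the resulting "Extra data" error and shows the truncate() call that would avoid it:

import json

def rewrite_buggy(fp, data):
    # rewrite the document from the start of the file ...
    fp.seek(0)
    json.dump(data, fp, indent=2)
    # ... but without fp.truncate(), trailing bytes of the previous,
    # longer document survive and corrupt the file

def rewrite_fixed(fp, data):
    fp.seek(0)
    json.dump(data, fp, indent=2)
    fp.truncate()   # drop whatever remains of the previous document
    fp.flush()

if __name__ == "__main__":
    with open("ptl_test_results_demo.json", "w+") as fp:
        rewrite_buggy(fp, {"run_id": "1", "tests": ["a", "b", "c"]})
        rewrite_buggy(fp, {"run_id": "2"})     # shorter than the first write
        fp.seek(0)
        try:
            json.load(fp)                      # fails with "Extra data", as above
        except json.JSONDecodeError as exc:
            print("corrupted:", exc)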

Mom: Double free with multiple exechost_periodic hooks

Processing two or more hooks in mom_process_hooks() results in a double free of the php structure in post_run_hook(). The problem is that the same php structure is passed to post_run_hook() from run_hook(), and the php structure is freed in each invocation of post_run_hook():

==4767== Invalid read of size 4
==4767== at 0x44E6DB: post_run_hook (mom_hook_func.c:3150)
==4767== by 0x4C7C4D: dispatch_task (work_task.c:142)
==4767== by 0x4C7F26: default_next_task (work_task.c:301)
==4767== by 0x45F039: main (mom_main.c:9739)
==4767== Address 0xa2130d0 is 32 bytes inside a block of size 64 free'd
==4767== at 0x4C29E90: free (vg_replace_malloc.c:473)
==4767== by 0x44E6F0: post_run_hook (mom_hook_func.c:3151)
==4767== by 0x4C7C4D: dispatch_task (work_task.c:142)
==4767== by 0x4C7F26: default_next_task (work_task.c:301)
==4767== by 0x45F039: main (mom_main.c:9739)
==4767==
==4767== Invalid free() / delete / delete[] / realloc()
==4767== at 0x4C29E90: free (vg_replace_malloc.c:473)
==4767== by 0x44E6F0: post_run_hook (mom_hook_func.c:3151)
==4767== by 0x4C7C4D: dispatch_task (work_task.c:142)
==4767== by 0x4C7F26: default_next_task (work_task.c:301)
==4767== by 0x45F039: main (mom_main.c:9739)
==4767== Address 0xa2130b0 is 0 bytes inside a block of size 64 free'd
==4767== at 0x4C29E90: free (vg_replace_malloc.c:473)
==4767== by 0x44E6F0: post_run_hook (mom_hook_func.c:3151)
==4767== by 0x4C7C4D: dispatch_task (work_task.c:142)
==4767== by 0x4C7F26: default_next_task (work_task.c:301)
==4767== by 0x45F039: main (mom_main.c:9739)
==4767==
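
For illustration only, the sketch below (hypothetical types; the names php and post_run_hook merely mirror the report, this is not the mom source) shows the ownership problem the valgrind trace points at: two completion tasks are handed the same heap block, and each one frees it. A fix in the spirit of the report is to give each hook run its own copy of the structure, or to make exactly one consumer responsible for freeing it.

#include <stdlib.h>
#include <string.h>

struct hook_input { char hook_name[64]; };

/* Completion routine for one hook run: consumes and frees its input. */
static void post_run_hook(struct hook_input *php)
{
    /* ... act on the hook's results ... */
    free(php);                 /* double free if two tasks share the same php */
}

int main(void)
{
    struct hook_input *php = calloc(1, sizeof *php);
    strcpy(php->hook_name, "periodic_a");

    /* Buggy pattern: the SAME php is queued for two hooks' completions. */
    post_run_hook(php);
    /* post_run_hook(php);     a second call frees the block again (the bug) */

    /* One fix: every hook gets its own copy, so ownership is unambiguous. */
    struct hook_input *copy = calloc(1, sizeof *copy);
    strcpy(copy->hook_name, "periodic_b");
    post_run_hook(copy);
    return 0;
}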

Fails to ./configure

Hi,

I get the following error when trying to build pbspro on both CentOS 6.10 and CentOS 7.6:

./configure: line 5895: syntax error near unexpected token `shared'
./configure: line 5895: `LT_INIT(shared static)'

This is reproducible by following the provided build instructions in ./INSTALL.

Can someone help me with this?
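
One likely cause, offered as an assumption rather than a confirmed diagnosis: LT_INIT is a libtool macro, and if the libtool m4 files are not present when configure is generated, the macro is left unexpanded and the shell then trips over it exactly as shown above. Under that assumption, installing the autotools packages and regenerating the build scripts before configuring is worth a try:

yum install -y autoconf automake libtool   # provides the LT_INIT macro definitions
autoreconf -fiv                            # regenerate configure with libtool support
./configure                                # then repeat the steps from ./INSTALL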

qsub -V does not work

qsub -V does not work:
echo 'echo $PATH' | qsub -cwd -V
The output file contains only:
:/usr/bin/

scheduler attribute preempt_sort needs to be revisited

Unsetting an attribute should restore it to its default value; this is not the case with the scheduler attribute preempt_sort.
A recent scheduler enhancement moved configuration of the sched parameter preempt_sort from the 'sched_config' file to the qmgr interface; the change is summarized below:

Old sched_config behavior:
Default:
preempt_sort = min_time_since_start
Non-default:
#preempt_sort = min_time_since_start

New qmgr behavior:
Default:
preempt_sort = min_time_since_start
Non-default: qmgr unset action
<Attribute removed from qmgr "p sched">

Now, in order to restore this attribute to its default value, we actually need to set it back to "min_time_since_start". By convention, unsetting "preempt_sort" should restore the default value, i.e. min_time_since_start, which is not the case right now: once "preempt_sort" is unset, getting it back to the default requires remembering the string "min_time_since_start" and setting it explicitly.
Other scheduler parameters do not behave this way, as shown below:

[root@n27 ~]# qmgr -c "print sched"
...
set sched scheduling = True
set sched scheduler_iteration = 600
set sched state = idle
set sched preempt_queue_prio = 150
set sched preempt_prio = "express_queue, normal_jobs"
set sched preempt_order = SCR
set sched preempt_sort = min_time_since_start
[root@n27 ~]# qmgr -c "unset sched preempt_sort"
[root@n27 ~]# qmgr -c "print sched"
...
set sched scheduling = True
set sched scheduler_iteration = 600
set sched state = idle
set sched preempt_queue_prio = 150
set sched preempt_prio = express_queue, normal_jobs
set sched preempt_order = SCR
[root@n27 ~]# qmgr -c "unset sched preempt_prio"
[root@n27 ~]# qmgr -c "print sched"
...
set sched scheduling = True
set sched scheduler_iteration = 600
set sched state = idle
set sched preempt_queue_prio = 150
set sched preempt_prio = express_queue, normal_jobs
set sched preempt_order = SCR
[root@n27 ~]# qmgr -c "u sched preempt_queue_prio"
[root@n27 ~]# qmgr -c "p sched"
...
set sched scheduling = True
set sched scheduler_iteration = 600
set sched state = idle
set sched preempt_queue_prio = 150
set sched preempt_prio = express_queue, normal_jobs
set sched preempt_order = SCR
[root@n27 ~]# qmgr -c "u sched preempt_order"
[root@n27 ~]# qmgr -c "p sched"
...
set sched scheduling = True
set sched scheduler_iteration = 600
set sched state = idle
set sched preempt_queue_prio = 150
set sched preempt_prio = express_queue, normal_jobs
set sched preempt_order = SCR
[root@n27 ~]#

Proposed solution:

  • Remove the preempt_sort configuration entirely, since it currently has only one value, and make minimum time since start the default behaviour; it can be added back when the sorting methods are enhanced.
  • Alternatively, give preempt_sort two values so that unsetting it reverts to the default behavior.

Reference:
RFE Discussion - #1131

Missing files in build from source

When building from source (with make or make dist), the build fails because src/resmom/mom_mach.c is missing, and mom_mach.h is missing from both src/include/ and src/resmom/.

I was following the instructions here to build for aarch64 running CentOS.
