reframe-hpc / reframe Goto Github PK

View Code? Open in Web Editor NEW

213.0 19.0 99.0 44.59 MB

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.

Home Page: https://reframe-hpc.readthedocs.org

License: BSD 3-Clause "New" or "Revised" License

Shell 0.76% Python 94.51% Makefile 0.01% C 0.01% C++ 1.76% Cuda 2.22% Dockerfile 0.38% Jupyter Notebook 0.35%

regression-testing python3 continuous-integration framework hpc benchmarking validation-testing performance-testing

reframe's Introduction

ReFrame in a Nutshell

ReFrame is a powerful framework for writing system regression tests and benchmarks, specifically targeted to HPC systems. The goal of the framework is to abstract away the complexity of the interactions with the system, separating the logic of a test from the low-level details, which pertain to the system configuration and setup. This allows users to write portable tests in a declarative way that describes only the test's functionality.

Tests in ReFrame are simple Python classes that specify the basic variables and parameters of the test. ReFrame offers an intuitive and very powerful syntax that allows users to create test libraries, test factories, as well as complete test workflows using other tests as fixtures. ReFrame will load the tests and send them down a well-defined pipeline that will execute them in parallel. The stages of this pipeline take care of all the system interaction details, such as programming environment switching, compilation, job submission, job status query, sanity checking and performance assessment.

Please visit the project's documentation page for all the details!

Installation

ReFrame is fairly easy to install. All you need is Python 3.6 or above and to run its bootstrap script:

git clone https://github.com/reframe-hpc/reframe.git
cd reframe
./bootstrap.sh
./bin/reframe -V

If you want a specific release, please refer to the documentation page.

Running the unit tests

You can optionally run the framework's unit tests with the following command:

./test_reframe.py -v

NOTE: Unit tests require a POSIX-compliant C compiler (available through the cc command), as well as the make utility.

Building the documentation locally

You may build the documentation of the master manually as follows:

./bootstrap.sh +docs

For viewing it, you may do the following:

cd docs/html
python3 -m http.server

The documentation is now up on localhost:8000, where you can navigate with your browser.

Test library

The framework comes with an experimental library of tests that users can either run from the command line directly or extend and fine tune them for their systems. See here for more details.

Public test repositories

The ReFrame HPC community Github page provides mirror forks of interesting ReFrame test repositories maintained by various sites or projects. You can use those tests as an additional guidance to implement your own tests.

If you maintain a public test repository and you would like it to be listed in the community page, feel free to open an issue or contact us through Slack.

Contact

You can get in contact with the ReFrame community in the following ways:

Slack

Please join the community's Slack channel for keeping up with the latest news about ReFrame, posting questions and, generally getting in touch with other users and the developers.

Contributing back

ReFrame is an open-source project and we welcome and encourage contributions! Check out our Contribution Guide here.

reframe's People

Stargazers

Watchers

Forkers

vkarak omlins kraushm teojgo havogt rsarm bcfriesen victorusu lucamar rafa-esco lxavier boegel bear-rsg ajocksch sekelle jgphpc itkovian zqyou emuhedo aurianer henkela computingdolas cosunae giuseppelore jfavre ekouts ddeeptimahanti pramodk samiilvonen ikirker smoors kmanalo ncar hpc-unibe-ch julianocristian shahzebsiddiqui mboisson franciscrickinstitute sleak-lbl bluebrain difey henrique maximevdb sabo jjotero christopherbignamini jonathanfrawley stevenvdb christiankniep akesandgren toxa81 enccs samcom12 giordano casparvl internetunexplorer hurricane642 jkalias homerdin lorienlv mahendrapaipuri sturmianseq rngoodner eshnil2000 srinathv twrobinson fl-cfd clip-hpc durhamarc jonshelley snes-chalmers lumi-supercomputer jacwah synscc maldil felilxtomski researchapps umeshpp skyv04 ethanjjjjjjj donion74 jack-morrison lewih madhureddyr brandonbiggs uelideschwert fenoyc adi611 tmjbios oliverperks dmargala crivella foobarquaxx zenwang00 createyourpersonalaccount

reframe's Issues

Automatic build of the master documentation

The current online documentation refers to the latest release. We should perhaps think of updating it automatically upon commits on master.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/478

Job's environment is set up twice

Requestor: @jgphpc
Internal issue: https://madra.cscs.ch/scs/reframe/issues/249

On dom, I get in perftools_check_c.err

perftools-base/6.4.3(43):ERROR:150: Module 'perftools-base/6.4.3' conflicts with 
the currently loaded module(s) 'perftools-lite'
perftools-base/6.4.3(43):ERROR:102: Tcl command execution failed: conflict perftools-lite

In /scratch/snx2000tds/piccinal/stage/gpu/perftools_check_c/PrgEnv-gnu/perftools_check_c_daint_gpu_PrgEnv-gnu.sh

module load daint-gpu
module unload PrgEnv-cray
module load PrgEnv-gnu
module load perftools-base
module load perftools-lite

Easy to reproduce:

module load perftools-base
module load perftools-lite
module load perftools-base

Does reframe purge the list of loaded modulefiles before submitting the slurm job ?

Import CSCS settings file

Performance numbers and units

It may be a good idea to add the performance number units on the performance logs.
Since it would help in the understanding of the logs.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/469

Module loading/unloading appears not to work

Hi,

I have some tests I would like to run with only the Cray compiler, via the PrgEnv-cray programming environment. My settings include:

                'PrgEnv-cray': {
                    'type': 'ProgEnvironment',
                    'modules': ['PrgEnv-cray'],
                    'cc':  'cc',
                    'cxx': 'CC',
                    'ftn': 'ftn',

and in the test script I have

        self.valid_prog_environs = ['PrgEnv-cray']

However, when I attempt to run the test with the PrgEnv-intel module loaded, ReFrame appears to make no attempt to swap or unload any modules before compiling the code (which then fails). If I manually switch to PrgEnv-cray and then invoke the test, it compiles and runs. (Of course this approach is impractical if running a series of tests which require different programming environments.)

The documentation claims that ReFrame is aware of conflicting modules and unloads/swaps them as necessary in order to achieve the desired programming environment. Any ideas what I am doing wrong such that this feature seems not to be active?

Thanks in advance.

Import framework issues from the internal repository

Support for different compilation and running environments

Some tests may require a different environment for running than for compiling. Additionally, this environment may be conflicting with the one used for compilation, so it cannot be included in self.modules. The only solution currently is to add an explicit module load command in self.pre_run, but this is not so portable and looks like a workaround. A solution would be to add two more RegressionTest attributes that will control the run environment, e.g., self.run_modules and self.run_variables. By default, these will be the same as of the compilation environment, if not specified.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/564

Unnecessary 'module unload' commands in the generated job script

Here is an example output for the PrgEnv-pgi environment:

#!/bin/bash -l
#SBATCH --job-name="rrtmgp_check_daint_gpu_PrgEnv-pgi"
#SBATCH --time=0:10:0
#SBATCH --ntasks=1
#SBATCH --partition=debug
#SBATCH --exclusive
#SBATCH --output=/scratch/snx3000/karakasv/reframe-tests/stage/gpu/rrtmgp_check/PrgEnv-pgi/rrtmgp_check.out
#SBATCH --error=/scratch/snx3000/karakasv/reframe-tests/stage/gpu/rrtmgp_check/PrgEnv-pgi/rrtmgp_check.err
#SBATCH --constraint=gpu
module load daint-gpu
module unload PrgEnv-cray
module load PrgEnv-pgi
module unload PrgEnv-cray
module load craype-accel-nvidia60
module unload PrgEnv-cray
module load cray-netcdf
module unload PrgEnv-cray
module unload pgi
module load pgi/17.7.0

I think the problem comes from the way we record the conflicts.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/565

Deep copying regression tests imposes unnecessary restrictions

We should rethink the fact that we deep copy the whole regression test instances for each test case (environment, system, partition). It is definitely necessary for isolating them and unties our hands in the asynchronous execution policy, but it imposes some restrictions on how we should write regression tests. Anything that cannot be deep-copied (e.g., I/O streams, generators etc.) it cannot be used in a regression test. For example, the following although it's pretty much good valid Python, we do not support it:

        self.sanity_patterns = sn.all(sn.assert_found(r'sth', fname)
                                      for fname in sn.iglob('file*.txt'))

The problem is that the argument to sn.all() is a generator and cannot be deep-copied. Instead, we should write this using a list comprehension:

        self.sanity_patterns = sn.all([sn.assert_found(r'sth', fname)
                                       for fname in sn.iglob('file*.txt')])

Internal issue: https://madra.cscs.ch/scs/reframe/issues/524

Consider using os.walk() for recursive descent into directories

Documentation is here.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/379

Give some context to the positional arguments of some assert functions

This has come up while working with @jgphpc on some of his tests. It would be nice if we give a context to the arguments of some assert functions that are otherwise equivalent. For example, in assert_eq(a, b) we treat a and b equivalently, so we cannot print anything better than 1 != 2, if a == 1 and b == 2. If we instead, always assumed that the first argument is the expected value, whereas the second is the found one, we could have better message like this: expected 1, found 2. We could do that also for the inequality assertions: expected <= 1, found 2.

This of course implies that the users pass the arguments in a meaningful order (either way, the result of the assertion would be correct, if the operation is commutative, it's only the message that is affected). Btw, this also the convention that Google Test follows with its EXPECTS_* functions.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/421

Launch ReFrame on a remote machine

It would be very convenient if I could run a similar command on any machine:

$ ./bin/reframe -c reframe_tests/ --system=<remote system> -r

and then ReFrame would automagically copy the various code into the specified remote system and run the regression there.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/229

RFE: allow additional headers in Slurm batch scripts besides "#SBATCH"

The "Cori" system at NERSC has a burst buffer which interfaces with Slurm via the #DW batch script headers. Consequently, jobs which use the burst buffer have a mix of #SBATCH and #DW headers, e.g.,

#!/bin/bash

#SBATCH -p regular
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 01:00:00
#DW jobdw capacity=10GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/username/path/to/filename destination=$DW_JOB_STRIPED/filename type=file
#DW stage_out source=$DW_JOB_STRIPED/dirname destination=/global/cscratch1/sd/username/path/to/dirname type=directory

srun ./hello.ex

Other examples of burst buffer scripts of varying complexity are provided here.

Would it be possible to allow a mix of such headers in the batch scripts that ReFrame generates?

Support for advanced selection criteria for tests

Currently, the selection criteria (tags and test names) can only by ANDed. We should add full flexibility for selecting tests. For example:

I want tests with tags either production or maintenance
I want all tests that their name starts with openfoam and their tag does not start with production.

Another requirement is that a subsequent selection expression should supersede any previously specified in the command-line. This should be useful with the execution modes.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/344

Add a reference guide for the command-line interface

It is good to have everything gathered somewhere.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/285

Unit test failure of the TestAsynchronousExecutionPolicy

The failure happens when running the unittest suite on a local machine (laptop) with 2 Codes - 4 Threads. The failure is the following:

FAIL: test_kbd_interrupt_in_wait_with_limited_concurrency (unittests.test_policies.TestAsynchronousExecutionPolicy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/manitart/reframe_dev_local/unittests/test_policies.py", line 310, in test_kbd_interrupt_in_wait_with_limited_concurrency
    self._run_checks(checks, 2)
  File "/home/manitart/reframe_dev_local/unittests/test_policies.py", line 289, in _run_checks
    self.assertEqual(4, self.runner.stats.num_cases())
AssertionError: 4 != 3

Internal issue: https://madra.cscs.ch/scs/reframe/issues/582

Enforce unique regression test names

Internal issue: https://madra.cscs.ch/scs/reframe/issues/300

Improve attribute validation

Currently, we can only define aggregate types of builtin types. For example, it is not possible to create a "list of alphanumeric fields". The solution would be to separate the type information from the field. Here's a proof-of-concept:

import re

# Type generators
class Type:
    def __init__(self, typespec):
        self.typespec = typespec


    def check(self, value):
        if not isinstance(value, self.typespec):
            raise TypeError


class SequenceType(Type):
    def __init__(self, container_type, elem_type):
        self.container_type = container_type
        self.elem_type = elem_type

    def check(self, container):
        if not isinstance(container, self.container_type):
            raise TypeError

        for e in container:
            print(e, self.elem_type)
            self.elem_type.check(e)

# class MapType(Type):
#    pass


# Concrete types
class AlphanumericType:
    base_type = Type(str)

    def check(value):
        AlphanumericType.base_type.check(value)
        if not re.fullmatch('\w*', value, re.ASCII):
            raise TypeError



class Field(object):
    def __init__(self, fieldname, fieldtype):
        self.name = fieldname
        self.type = fieldtype


    def __get__(self, obj, objtype):
        return obj.__dict__[self.name]


    def __set__(self, obj, value):
        self.type.check(value)
        obj.__dict__[self.name] = value


class C:
    v = Field('v', SequenceType(list, AlphanumericType))


if __name__ == '__main__':
    c = C()
    c.v = [ 'foo', 'bar', 'foo/bar' ]

The above program will raise a TypeError. The whole logic of the type system is separate from the fields and we only need a generic field that carries on its name and its type. The type verification happens in the type system. Notice that concrete types can be simple static classes.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/244

System and partition initialisation are random

This is due to the fact that both systems and partitions are defined as dictionaries. This has several repercussions:

No specific order that the regression goes through the partitions during the phase of test execution.
Hostname matching rules must be unique. No last resort, catch-all pattern may be placed, since this may be tried earlier and lead to false system detection.

With Python 3.7 normal dictionaries will become ordered, but it's better not to rely on that.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/234

Remove pre_run/post_run from Job

They are deprecated in 2.10, should be completely removed in 2.12.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/577

Revisit the way frontend options are passed to the scheduler backends

Currently, this is controlled by the sched_* arguments plus the sched_options. The problem with this approach is bugs like this. I think the best way to pass everything through sched_options or options and each back end should map specific properties to options. This is a rough idea, we should work on it a bit.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/543

The job script generators change too late to the working directory

Quoting from @bcfriesen:

Actually, now there is a new issue related to this.

When using DataWarp, the scheduler creates a new temporary directory on the DataWarp file system which is pointed to by an environment variable. For "striped" access, the environment variable is $DW_JOB_STRIPED. The data specified by the #DW stage_in header in the batch script is automatically copied to $DW_JOB_STRIPED. The remainder of the batch script typically involves copying the executable into $DW_JOB_STRIPED, because the value of that environment variable changes with every run (so it can't be hard-coded into the batch script, or into the code itself).

I tried to address this by writing a simple bash script which copies the executable into $DW_JOB_STRIPED:

EXE=a.out
cp $EXE $DW_JOB_STRIPED
cd $DW_JOB_STRIPED
and putting it into the self.job.pre_run stage as described here. The issue is that the auto-generated Slurm script will execute that bash script, but then follow it immediately with cd /path/to/stagedir, as specified by the stagedir in the _site_configuration. So the test never actually tests the functionality of DataWarp because it reads everything from stagedir, which is not on the DataWarp file system.

Yes, this is indeed a problem of our job script generator. The cd to the working directory should be done at the very beginning or we should try using the --workdir option of Slurm instead.

ReFrame exits immediately if interrupted while waiting for the compilation command to finish

This is present both on the current master and on the new execution policies. We should investigate whether it is a problem of the framework, or simply the subprocess.run() aborts immediately and exits.

Example output is here:

[14:59:24] karakasv@daint103 reframe [master] $ ./reframe.py --nodelist=nid00012 -c checks/microbenchmarks/stream/stream.py --exec-policy=serial -r
Command line: ./reframe.py --nodelist=nid00012 -c checks/microbenchmarks/stream/stream.py --exec-policy=serial -r
Reframe version: 2.9
Launched by user: karakasv
Launched on host: daint103
Reframe paths
=============
    Check prefix      :
    Check search path : 'checks/microbenchmarks/stream/stream.py'
    Stage dir prefix  : /users/karakasv/Devel/reframe/stage/
    Output dir prefix : /users/karakasv/Devel/reframe/output/
    Logging dir       : /users/karakasv/Devel/reframe/logs
[==========] Running 1 check(s)
[==========] Started on Fri Feb  2 14:59:29 2018

[----------] started processing stream_benchmark (STREAM Benchmark)
[ RUN      ] stream_benchmark on daint:gpu using PrgEnv-cray
[     FAIL ] stream_benchmark on daint:gpu using PrgEnv-cray
[ RUN      ] stream_benchmark on daint:gpu using PrgEnv-gnu
[     FAIL ] stream_benchmark on daint:gpu using PrgEnv-gnu
[ RUN      ] stream_benchmark on daint:gpu using PrgEnv-intel
^CTerminated

And the last lines of the log file are the following:

[2018-02-02T14:59:33] debug: stream_benchmark@daint:gpu using PrgEnv-intel: executing OS command: modulecmd python show PrgEnv-intel
[2018-02-02T14:59:33] debug: stream_benchmark@daint:gpu using PrgEnv-intel: executing OS command: modulecmd python unload PrgEnv-cray
[2018-02-02T14:59:33] debug: stream_benchmark@daint:gpu using PrgEnv-intel: executing OS command: modulecmd python load PrgEnv-intel
[2018-02-02T14:59:33] debug: stream_benchmark@daint:gpu using PrgEnv-intel: setting up paths
[2018-02-02T14:59:33] debug: stream_benchmark@daint:gpu using PrgEnv-intel: setting up the job descriptor
[2018-02-02T14:59:33] debug: stream_benchmark@daint:gpu using PrgEnv-intel: job scheduler backend: local
[2018-02-02T14:59:33] debug: stream_benchmark@daint:gpu using PrgEnv-intel: setting up performance logging
[2018-02-02T14:59:33] debug: stream_benchmark@daint:gpu using PrgEnv-intel: entering compilation stage
[2018-02-02T14:59:33] debug: stream_benchmark@daint:gpu using PrgEnv-intel: sourcepath: /users/karakasv/Devel/reframe/checks/microbenchmarks/stream/src/stream.c
[2018-02-02T14:59:33] debug: stream_benchmark@daint:gpu using PrgEnv-intel: executing OS command: cc   -qopenmp -O3 -I/users/karakasv/Devel/reframe/checks/microbenchmarks/stream/src /users/karakasv/Devel/reframe/checks/microbenchmarks/stream/src/stream.c -o /users/karakasv/Devel/reframe/stage/gpu/stream_benchmark/PrgEnv-intel/./stream_benchmark

Internal issue: https://madra.cscs.ch/scs/reframe/issues/563

Inconsistent use of "private" fields in subclasses

Subclasses of RegressionTest are not allowed to use private fields for obvious reasons. However, in several other places of the framework private fields (i.e. fields starting with _) have more the meaning of protected fields. We should figure this out and change the code accordingly. An option could be to treat fields starting with _ as protected and fields starting with __ are really private (which is actually true).

In the following, the _protected attribute can be accessed by subclasses, but the __private one is not accessible and an AttributeError is raised (technically, it can be accessed, but it's hacky and implementation dependent).

class C:
    def __init__(self):
        self._protected = 1
        self.__private = 2

class D(C):
    def __init__(self):
        super().__init__()
        self.a = self._protected
        self.b = self.__private

d = D()

This results to the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in __init__
AttributeError: 'D' object has no attribute '_D__private'

Internal issue: https://madra.cscs.ch/scs/reframe/issues/474

Possible problem with scoped dictionaries

The following code tries to generate a scoped dictionary for performance checks in two ways. The second one is not possible (see the output below).

        _kesch_cn = {}
        for device in range(0,8):
            _kesch_cn['perf_h2d_%i' % device] = (7213, -0.05, None)
            _kesch_cn['perf_d2h_%i' % device] = (7213, -0.05, None)
            _kesch_cn['perf_d2d_%i' % device] = (137347, -0.05, None)
        self.reference = {
            'kesch:cn' : _kesch_cn
        }
        print(self.reference)
        self.reference = {}
        self.reference['kesch:cn'] = {}
        self.reference['kesch:cn']['perf_xy'] = (0, 1, 2)
        print(self.reference)

{'kesch:cn': {'perf_h2d_0': (7213, -0.05, None), 'perf_d2h_0': (7213, -0.05, None), 'perf_d2d_0': (137347, -0.05, None), 'perf_h2d_1': (7213, -0.05, None), 'perf_d2h_1': (7213, -0.05, None), 'perf_d2d_1': (137347, -0.05, None), 'perf_h2d_2': (7213, -0.05, None), 'perf_d2h_2': (7213, -0.05, None), 'perf_d2d_2': (137347, -0.05, None), 'perf_h2d_3': (7213, -0.05, None), 'perf_d2h_3': (7213, -0.05, None), 'perf_d2d_3': (137347, -0.05, None), 'perf_h2d_4': (7213, -0.05, None), 'perf_d2h_4': (7213, -0.05, None), 'perf_d2d_4': (137347, -0.05, None), 'perf_h2d_5': (7213, -0.05, None), 'perf_d2h_5': (7213, -0.05, None), 'perf_d2d_5': (137347, -0.05, None), 'perf_h2d_6': (7213, -0.05, None), 'perf_d2h_6': (7213, -0.05, None), 'perf_d2d_6': (137347, -0.05, None), 'perf_h2d_7': (7213, -0.05, None), 'perf_d2h_7': (7213, -0.05, None), 'perf_d2d_7': (137347, -0.05, None)}}
{'kesch': {'cn': {'perf_xy': (0, 1, 2)}}}

Internal issue: https://madra.cscs.ch/scs/reframe/issues/567

Establish a terminology for words like "test", "check", "job", "case" and "testcase" within ReFrame and its documentation

Many of these words are used pretty much interchangeably what reduces the clarity of the code and the documentation. So, we should define a clear meaning to each of these words within ReFrame and its documentation and use only as few different words as minimally required (BTW in the documentation, this kind of definitions could go into a glossary if at some point we have a considerable amount of such).

Here are some notes from Vasilis on the subject:

"test" and "check" are indeed used interchangeably. The origin of the term "check" has though some historic background. When I started implementing the framework and describing/explaining it to people, I wanted to differentiate the unit "tests" of the framework itself from the regression "checks" written with the framework. That's why I came up with the word "check", which, I admit, I didn't find very attractive from the beginning.
The "job" is a separate concept from the the test case. A "job" refers to the job instance that is associated with the test. The "test case", on the other hand, is essentially the tuple "regression test, current partition, current environment".

Please note here your thoughts on this subject.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/422

Command line option for trying different module versions

Sysadmins usually come up with requests, like, does the regression works with the new (non-default) version of sth? Currently, there is no easy way to test this. One would need to modify the affected user checks or change the default module.

This issue proposes the addition of a new command line option, e.g., --try-module modname=new_modname that would load new_modname instead of modname. For example, if we need to try a new version of NAMD, say NAMD/2.12, we would do it easily as following:

./run_regression.py --try-module NAMD=NAMD/2.12 -n namd_gpu_check

Internal issue: https://madra.cscs.ch/scs/reframe/issues/158

Consider using the delegation pattern in the Job scheduler internal interface

And perhaps in other places, too.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/489

Function `assert_reference` is too restrictive

There are cases where the expected values differ more than 100% from the reference one. Currently, the limit values have to be between 0.0 and 1.0 thus restricting use of assert_reference in the cases mentioned above.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/575

Support for PBS submission system

Playing a bit around with Trque, one thing that initially bothered my was that there is now equivalent to sacct that we use for reliable job monitoring with Slurm. There is a tracejob command which provides the job history, but this is meant for admin usage only. However, there is a good thing with Torque that may allow us to have reliable job monitoring using the qstat command (the equivalent of Slurm's squeue). The job's standard output/error files are not created immediately in the directory the job was launched in, but rather they are copied there after the job has finished. As a matter of fact there is even a special command (qpeek) for looking at the output of a running job. So this feature (of lazy creation of job output) could be used as the regression's synchronisation point.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/215

Explain more the scoped dictionaries in the documentation

They can be quite useful also when writing tests, so we should also present some advanced examples.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/286

Add an environment type that would recognise easybuild files

I think that this is a nice feature to have. My suggestion is to automatically trigger an Easybuild build if a source file ends with .eb. In such case, we should perhaps automatically load the generated module, which we will infer from the easyconfig name. There might be other implications that are not clear to me now, but feel free to contribute in the discussion.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/486

The CI has to recognize machines to retry from the PR comments

Expose a full read-only view of System to user checks

Currently, the system-level directories are writeable, because the CLI must be able to set them. As a result, a user check may also set them. This does not affect the rest of the checks, since each check runs isolated. However, we want to disallow this. We can achieve this by creating an adapter layer above System that will expose a read-only view to the RegressionTest.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/393

Run commands locally for run-only tests

It would be useful to run commands also locally, i.e. on the machine where ReFrame is executed.

In my specific case, this new feature would be useful to clone a git repository during the check's setup phase. Please note that the compute nodes cannot perform a "git clone" operation due to network restrictions, hence I need to do it on the login node.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/277

Support for multiple steps in job scripts

Internal issue: https://madra.cscs.ch/scs/reframe/issues/127

This will allow a better implementation of the HDF5 check.

Things to consider:

Different launchers? Ability to run local commands before or after a job step.
Pass different options to the launcher?

Problem handling the decaying polling rate in the asynchronous execution policy

An error such as the following can appear:

./bin/reframe: unexpected error: math domain error
Traceback (most recent call last):
  File "/scratch/shared/leone/jenscscs/reframe-ci-leone-5d81168/reframe/frontend/cli.py", line 413, in main
    runner.runall(checks_matched, system)
  File "/scratch/shared/leone/jenscscs/reframe-ci-leone-5d81168/reframe/frontend/executors/__init__.py", line 176, in runall
    self._runall(checks, system)
  File "/scratch/shared/leone/jenscscs/reframe-ci-leone-5d81168/reframe/frontend/executors/__init__.py", line 246, in _runall
    self._policy.exit()
  File "/scratch/shared/leone/jenscscs/reframe-ci-leone-5d81168/reframe/frontend/executors/policies.py", line 288, in exit
    desired_rate = pollrate(t_elapsed, real_rate)
  File "/scratch/shared/leone/jenscscs/reframe-ci-leone-5d81168/reframe/frontend/executors/policies.py", line 88, in __call__
    self._init_poll_fn(init_rate)
  File "/scratch/shared/leone/jenscscs/reframe-ci-leone-5d81168/reframe/frontend/executors/policies.py", line 82, in _init_poll_fn
    self._c = math.log(self._a / (self._thres*self._b)) / self._decay
ValueError: math domain error

Problem can be reproduced on Leone.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/580

Examine alternatives for avoiding the boilerplate code in regression tests

Perhaps by using a class decorator.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/402

Some fields of `RegressionTest` are too restrictive

For example, the variables field. We now require it to be of type dict[str, str], but in fact any type of values could be used, since the str() is called internally. This would allow us to use the following syntax:

self.variables = {
    'OMP_NUM_THREADS': self.num_cpus_per_task
}

instead of the current one:

self.variables = {
    'OMP_NUM_THREADS': str(self.num_cpus_per_task)
}

Internal issue: https://madra.cscs.ch/scs/reframe/issues/562

Publish CSCS regression tests

Import relevant developer's documentation from the internal repository

Enable Jenkins CI on this repository

The solution should test the same aspects as in the internal repository.
This CI should be triggered for PRs automatically and upon pushes on a PR.
The CI may be triggered also manually by referring to the @jenkins-cscs user as follows:
- "@jenkins-cscs retry" -> Should trigger the CI on all the supported internal systems.
- "@jenkins-cscs retry [sysname] ..." -> Retest this PR on the specified systems.

Convert common variables that may vary depending on the system to scoped dictionaries

Examples are

num_tasks and the like
modules
valid_prog_environs, etc.

This will allow to different configurations of a test depending on the system it runs and/or the partition without the need to use ifs. The following will use 16 nodes if the check is run on Dom (any partition) and 64 if it is run on the Daint gpu partition; for all the rest of the case, it will use a single task:

self.num_tasks = {
    'dom' : 16,
    'daint:gpu' : 64,
    '*' : 1
}

The standard syntax self.num_tasks = 16 must also be supported, which will be equivalent to self.num_tasks = { '*' : 16 }.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/202

Support for loading custom settings file

We need a more flexible way to support site configurations and the framework's settings.

This will facilitate also front-end unit tests, since we can provide the settings files as unit test fixtures. It will also allow us to unify the public and private repositories.

Internal issue: https://madra.cscs.ch/scs/reframe/issues/233

need an option to run without "srun"

Visit uses the name of the scheduler "srun", as an option to its own scheduler. Thus, instead of running as a typical simulation such as:

srun visit [options]

I run it in the following way:

visit [options] -l srun

currently, I did not find a way to accomplish this with reframe. @vkarak has seen the demonstration

Module mapping file and corresponding command line option

A way should be defined for specifying module mappings. The most straightforward is to have files, that the user could specify in the command line, in the following form:

gromacs GROMACS/2016.1
amber Amber/16

Internal issue: https://madra.cscs.ch/scs/reframe/issues/552

./reframe.py -n hello_world_mpi_openmp_dynamic_cpp -r
Command line: ./reframe.py -n hello_world_mpi_openmp_dynamic_cpp -r
Reframe version: 2.10
Launched by user: hvictor
Launched on host: keschln-0002
Reframe paths
=============
    Check prefix      : /scratch-shared/meteoswiss/scratch/hvictor/reframe
(R) Check search path : 'checks/'
    Stage dir prefix  : /scratch-shared/meteoswiss/scratch/hvictor/reframe/stage/
    Output dir prefix : /scratch-shared/meteoswiss/scratch/hvictor/reframe/output/
    Logging dir       : /scratch-shared/meteoswiss/scratch/hvictor/reframe/logs
[==========] Running 1 check(s)
[==========] Started on Thu Feb  8 13:20:39 2018 

[----------] started processing hello_world_mpi_openmp_dynamic_cpp (C++ Hello World MPI + OpenMP Dynamic)
[ RUN      ] hello_world_mpi_openmp_dynamic_cpp on kesch:pn using PrgEnv-gnu
[       OK ] hello_world_mpi_openmp_dynamic_cpp on kesch:pn using PrgEnv-gnu
[ RUN      ] hello_world_mpi_openmp_dynamic_cpp on kesch:pn using PrgEnv-cray
[       OK ] hello_world_mpi_openmp_dynamic_cpp on kesch:pn using PrgEnv-cray
[ RUN      ] hello_world_mpi_openmp_dynamic_cpp on kesch:pn using PrgEnv-gnu-gdr
[     FAIL ] hello_world_mpi_openmp_dynamic_cpp on kesch:pn using PrgEnv-gnu-gdr
[ RUN      ] hello_world_mpi_openmp_dynamic_cpp on kesch:cn using PrgEnv-gnu
[       OK ] hello_world_mpi_openmp_dynamic_cpp on kesch:cn using PrgEnv-gnu
[ RUN      ] hello_world_mpi_openmp_dynamic_cpp on kesch:cn using PrgEnv-cray
[       OK ] hello_world_mpi_openmp_dynamic_cpp on kesch:cn using PrgEnv-cray
[----------] finished processing hello_world_mpi_openmp_dynamic_cpp (C++ Hello World MPI + OpenMP Dynamic)

[  FAILED  ] Ran 5 test case(s) from 1 check(s) (1 failure(s))
[==========] Finished on Thu Feb  8 13:20:54 2018 
==============================================================================
SUMMARY OF FAILURES
------------------------------------------------------------------------------
FAILURE INFO for hello_world_mpi_openmp_dynamic_cpp
  * System partition: kesch:pn
  * Environment: PrgEnv-gnu-gdr
  * Stage directory: /scratch-shared/meteoswiss/scratch/hvictor/reframe/stage/pn/hello_world_mpi_openmp_dynamic_cpp/PrgEnv-gnu-gdr
  * Job type: batch job (id=None)
  * Maintainers: ['CB', 'VK']
  * Failing phase: compile
  * Reason: caught framework exception: : 'PrgEnv-gnu-gdr'

Internal issue: https://madra.cscs.ch/scs/reframe/issues/574

Support for sending performance logs to Graylog

Internal issue: https://madra.cscs.ch/scs/reframe/issues/196