pmix-tests's Introduction

OpenPMIx

Official documentation

The OpenPMIx documentation can be viewed in the following ways:

  1. Online at https://docs.openpmix.org/
  2. In self-contained form (i.e., suitable for local viewing without an internet connection) in official distribution tarballs, under docs/_build/html/index.html.

Building the documentation locally

The source code for OpenPMIx's docs can be found in the OpenPMIx Git repository under the docs folder.

Developers who clone the OpenPMIx Git repository will not have the HTML documentation and man pages by default; they must be built. Instructions for building the OpenPMIx documentation can be found here: https://docs.openpmix.org/en/latest/developers/prerequisites.html#sphinx-and-therefore-python.

Security policy

The OpenPMIx security policy can be viewed online at https://docs.openpmix.org/en/latest/security.html.

NOTE: any potential security issue should be reported immediately to us at [email protected]

pmix-tests's People

Contributors

abouteiller, awlauria, cpshereda, drwootton, hppritcha, jjhursey, naughtont3, rhc54

pmix-tests's Issues

PMIx Fence: single-job wildcard barrier

Test description

Verifies that the Fence is synchronizing

Test sketch

#include "pmix.h"
double max_fence_time()
{
	double fence_time = 0, ts1, ts2;
	int i;
	
	/* Measure the typical fence execution time */
	for(i = 0; i < 100; i++) {
		ts1 = timestamp();
		PMIx_Fence(without_data_collection);
		ts2 = timestamp();
		fence_time = max(fence_time, ts2 - ts1);
	}
	return fence_time;
}

int main() 
{
    double T, fence_time, ts1, ts2;
	
    PMIx_Init();
	
    fence_time = max_fence_time();
    T = Ratio * fence_time; // Ratio might be 100, should be selected for the particular system

    PMIx_Fence(without_data_collection);

    if( rank == 0){
        sleep(T);
    }
    ts1 = timestamp();
    PMIx_Fence(without_data_collection);
    ts2 = timestamp();
    if( rank == 0 ){
        assert((ts2 - ts1) ~ fence_time);
    } else {
        assert((ts2 - ts1) ~ T);
    }
    PMIx_Finalize();
}
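
The sketch above leaves timestamp() and the "~" comparison undefined. A minimal way to fill them in, assuming a POSIX monotonic clock and a relative tolerance, could look like the following. The name approx_equal and the rel_tol parameter are hypothetical and not part of the test suite.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Monotonic time in seconds; stands in for timestamp() in the sketch. */
static double timestamp(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* "measured ~ expected": true when measured lies within
 * rel_tol * |expected| of expected. rel_tol is system dependent. */
static bool approx_equal(double measured, double expected, double rel_tol)
{
    double diff = measured - expected;
    double bound = rel_tol * (expected < 0.0 ? -expected : expected);

    assert(rel_tol >= 0.0);
    if (diff < 0.0) {
        diff = -diff;
    }
    return diff <= bound;
}
```

With these helpers, assert((ts2 - ts1) ~ T) would become assert(approx_equal(ts2 - ts1, T, rel_tol)), with rel_tol chosen per system in the same way as Ratio.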

Execution details

  • 4 servers
  • 16 clients
  • Predefined (passed through cmdline) namespace
  • Predefined process placement: "0:0,1,2,3; 1:4,5,6,7; 2:8,9,10,11; 3:12,13,14,15;"
  • Ratio and "~" are selected to match the system
    • The time-dependent checks can be turned off
  • Execute M times to capture race conditions
  • The first rank is simulating the delay. The test verifies that the Fence is really synchronizing.
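
The placement string above can also serve as side-channel data for checking what PMIx reports. The following is a rough sketch of a parser for it, assuming the "node:rank,rank,...;" layout shown in the list and assuming the local rank is the position of a rank within its node's list. rank_pos_t, parse_placement and MAX_RANKS are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_RANKS 64

/* Expected position of one rank, derived from the placement string. */
typedef struct {
    int node_id;     /* node the rank is placed on */
    int local_rank;  /* position of the rank within that node's list */
} rank_pos_t;

/* Parse "node:rank,rank,...; node:rank,...;" into pos[rank].
 * Returns the number of ranks found, or -1 on a malformed string. */
static int parse_placement(const char *s, rank_pos_t pos[MAX_RANKS])
{
    int count = 0;
    char *end;

    assert(s != NULL && pos != NULL);
    while (*s != '\0') {
        while (*s == ' ' || *s == ';') {
            s++;                          /* skip separators */
        }
        if (*s == '\0') {
            break;
        }
        long node = strtol(s, &end, 10);
        if (end == s || *end != ':') {
            return -1;                    /* expected "node:" */
        }
        s = end + 1;

        int local = 0;
        for (;;) {
            long rank = strtol(s, &end, 10);
            if (end == s || rank < 0 || rank >= MAX_RANKS) {
                return -1;                /* expected a valid rank */
            }
            pos[rank].node_id = (int)node;
            pos[rank].local_rank = local++;
            count++;
            s = end;
            if (*s != ',') {
                break;                    /* end of this node's list */
            }
            s++;
        }
    }
    return count;
}
```

For the 16-client layout above, rank 5 would be expected on node 1 with local rank 1.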

Client-side expectations:

  1. All PMIx calls return PMIX_SUCCESS
  2. All ranks (except rank=0) observe > T of Fence execution.

Server-side expectations:

  1. N invocations of:
    • client_connected
    • client_finalized
  2. Verify that the proc structure was set to the individual ranks.
  3. 2 Fence callback invocations with WILDCARD.
  4. Distance between Fences on node0 is > T
  5. Starting from "openpmix/openpmix#1135" the size of Fence should be 0B.
  6. No other callbacks are called (no direct modex requests)
  7. (? Any event-related activity?)
Reference implementation:

TBD

PMIx job info: other ranks positions

Test description

Verification of the information about other ranks' positions within the job.

Test sketch

int main() {
    PMIx_Init();
    Get(&this_proc,  [PMIX_LOCAL_RANK, PMIX_NODE_RANK, PMIX_NODEID])
    local_peers = get_local_ranks()
    for(P in local_peers):
        Get(P,  [PMIX_LOCAL_RANK, PMIX_NODE_RANK, PMIX_NODEID, PMIX_HOSTNAME])
        verify_against_side_channel_data();
    remote_peers = { all_ranks - local_peers }
    for(P in remote_peers):
        Get(P,  [PMIX_LOCAL_RANK, PMIX_NODE_RANK, PMIX_NODEID, PMIX_HOSTNAME])
        verify_against_side_channel_data();
    PMIx_Finalize();
}

For get_local_ranks() implementation, see [example].

Client-side expectations

  • [side channel check] PMIx provides expected:
    • Job size and local size
    • LOCAL and NODE ranks
    • NODEID

Server-side expectations:

  • All clients connect and disconnect
  • All client processes are successfully terminated
  • No requests (i.e. direct modex or notifications) are observed.

Reference implementation

openpmix/openpmix#1943

Possible data races without locking

Hi, it seems the lock_handle should be protected by locks?

static int lock_handle;

void lock_stream() {
    if (0 < lock_handle) {
        flock(lock_handle, LOCK_EX);
    }
    pthread_mutex_lock(&thread_lock);
}

void unlock_stream() {
    pthread_mutex_unlock(&thread_lock);
    if (0 < lock_handle) {
        flock(lock_handle, LOCK_UN);
    }
}

Something like this, with the mutex taken before lock_handle is read:

 void lock_stream() { 
     pthread_mutex_lock(&thread_lock); 
     if (0 < lock_handle) { 
         flock(lock_handle, LOCK_EX); 
     } 
 } 
  
 void unlock_stream() { 
     if (0 < lock_handle) { 
         flock(lock_handle, LOCK_UN); 
     } 
     pthread_mutex_unlock(&thread_lock); 
 } 
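
To illustrate the proposed ordering, here is a small self-contained sketch that takes the mutex first and the file lock second, then releases them in reverse order, while several threads update a shared counter. The counter, worker, run_workers, the temporary lock file, and the thread and iteration counts are assumptions made for the demonstration and do not come from the test suite.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/file.h>

#define MAX_THREADS 16
#define ITERATIONS  10000

static pthread_mutex_t thread_lock = PTHREAD_MUTEX_INITIALIZER;
static int lock_handle = -1;   /* descriptor of the lock file, if any */
static long counter = 0;       /* shared state guarded by lock_stream() */

static void lock_stream(void)
{
    pthread_mutex_lock(&thread_lock);      /* mutex first ... */
    if (0 < lock_handle) {
        flock(lock_handle, LOCK_EX);       /* ... then the file lock */
    }
}

static void unlock_stream(void)
{
    if (0 < lock_handle) {
        flock(lock_handle, LOCK_UN);       /* release in reverse order */
    }
    pthread_mutex_unlock(&thread_lock);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        lock_stream();
        counter++;
        unlock_stream();
    }
    return NULL;
}

/* Run nthreads workers and return the final counter value. */
static long run_workers(int nthreads)
{
    pthread_t tids[MAX_THREADS];
    FILE *lock_file = tmpfile();           /* hypothetical lock file */

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    counter = 0;
    if (lock_file != NULL) {
        lock_handle = fileno(lock_file);
    }
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tids[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
    }
    lock_handle = -1;
    if (lock_file != NULL) {
        fclose(lock_file);
    }
    return counter;
}
```

With this ordering every read of lock_handle inside lock_stream() and unlock_stream() happens while thread_lock is held, so no increment should be lost and run_workers(4) is expected to return 4 * ITERATIONS.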

PMIx Fence: single-job wildcard barrier WITH timeout

Test description

Verifies that the Fence is synchronizing

Test sketch

#include "pmix.h"

double max_fence_time()
{
	double fence_time = 0, ts1, ts2;
	int i;
	
	/* Measure the typical fence execution time */
	for(i = 0; i < 100; i++) {
		ts1 = timestamp();
		PMIx_Fence(without_data_collection);
		ts2 = timestamp();
		fence_time = max(fence_time, ts2 - ts1);
	}
	return fence_time;
}

int main() {
    double T, fence_time, ts1, ts2;
	
    PMIx_Init();
	
    fence_time = max_fence_time();
    T = Ratio * fence_time; // Ratio might be 100, should be selected for the particular system    
   
    PMIx_Fence(without_data_collection);
    if( rank == 0){
        sleep(T);
    }
    ts1 = timestamp();
    rc = PMIx_Fence(without_data_collection, timeout = T/2);
    ts2 = timestamp();
    assert(rc == PMIX_ERR_TIMEOUT);
    assert((ts2 - ts1) ~ (T/2));
    PMIx_Finalize();
}

Execution details

  • 4 servers
  • 16 clients
  • Predefined (passed through cmdline) namespace
  • Predefined process placement: "0:0,1,2,3; 1:4,5,6,7; 2:8,9,10,11; 3:12,13,14,15;"
  • Ratio and "~" are selected to match the system
    • The time-dependent checks can be turned off
  • Execute M times to capture race conditions
  • The first rank is simulating the delay. The test verifies that the Fence is really synchronizing.

Client-side expectations:

  1. All PMIx calls other than the timed Fence return PMIX_SUCCESS
  2. All ranks (except rank=0) experience Fence timeout.

Server-side expectations:

  1. N invocations of:
    • client_connected
    • client_finalized
  2. Verify that the proc structure was set to the individual ranks.
  3. 2 Fence callback invocations with WILDCARD.
  4. Distance between Fences on node0 is > T
  5. Starting from "openpmix/openpmix#1135" the size of Fence should be 0B.
  6. No other callbacks are called (no direct modex requests)
  7. The timeout should be observed, and the RTE server has to act accordingly by informing the PMIx server about it.
  8. (? Any event-related activity?)

Reference implementation:

TBD

Notes

The test suite's RTE component should implement support for the PMIX_TIMEOUT info key in the pmix_server_fencenb_fn_t callback.
Currently, this support is missing.
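
As a starting point for that support, the bookkeeping an RTE component might keep per fence can be sketched as follows. This is only an illustration under stated assumptions: fence_tracker_t, fence_start, fence_arrive and fence_poll are hypothetical names, time is passed in explicitly as seconds, and the sketch does not call any PMIx API. A real component would presumably read the timeout from the PMIX_TIMEOUT info entry and report PMIX_ERR_TIMEOUT once the tracker times out.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FENCE_PENDING, FENCE_COMPLETE, FENCE_TIMED_OUT } fence_state_t;

/* Hypothetical per-fence state kept by the RTE component. */
typedef struct {
    int expected;     /* contributions needed to complete the fence */
    int arrived;      /* contributions received so far */
    double deadline;  /* absolute time in seconds; negative means none */
} fence_tracker_t;

/* Begin tracking a fence; timeout <= 0 means "wait forever". */
static void fence_start(fence_tracker_t *t, int expected,
                        double now, double timeout)
{
    assert(t != NULL && expected > 0);
    t->expected = expected;
    t->arrived = 0;
    t->deadline = (timeout > 0.0) ? now + timeout : -1.0;
}

/* Current state of the fence at time now. */
static fence_state_t fence_poll(const fence_tracker_t *t, double now)
{
    if (t->arrived >= t->expected) {
        return FENCE_COMPLETE;
    }
    if (t->deadline >= 0.0 && now > t->deadline) {
        return FENCE_TIMED_OUT;
    }
    return FENCE_PENDING;
}

/* Record one contribution unless the fence already finished or expired. */
static fence_state_t fence_arrive(fence_tracker_t *t, double now)
{
    if (fence_poll(t, now) == FENCE_PENDING) {
        t->arrived++;
    }
    return fence_poll(t, now);
}
```

In the test above, the contribution from rank 0's server would arrive after the T/2 deadline, so the tracker would report FENCE_TIMED_OUT instead of completing.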

Re-enable the debugger PRRTE tests

We need to get the PRRTE debugger tests running again. They hit a signature issue last week and they had to be disabled.

To do:

  • Fix baseline signatures in the pmix-tests branch
  • Fix tests in PRRTE examples/debugger, as needed

Discussion context:

PMIx Hello-world test

Test description:
Verify the very basic functionality: initialization and finalization.

Client-side expectations:

  • [side channel] PMIx provides expected namespace and rank
  • Both PMIx_Init and PMIx_Finalize complete with PMIX_SUCCESS

Server-side expectations:

  • All clients connect and disconnect
  • All client processes are successfully terminated
  • No requests (i.e. direct modex or notifications) are observed.

Reference implementation:
openpmix/openpmix#1820

Add a 'make distcheck' test to CI

Occasionally OpenPMIx/PRRTE hit build system issues that only show up when making a distribution tarball. It has been suggested to me in the past that the community add a "special build" of OpenPMIx and PRRTE that checks for this scenario.

Suggested process:

  • Build PRRTE and/or OpenPMIx release tarball
  • Extract the tarball - so we are only building with what is packaged
  • Build it a few different ways. For example, with and without VPATH
  • Run a few simple runtime tests, such as 'make check'

We have a "special builds" CI already active, so this would be adding a new directory and test script to the following directory in pmix-tests (CI will pick it up automatically once it is committed to master):

Problem compiling code in unit folder

Version

master@304ecf001d1ec4ca02e97b03ff04868b3e8c31cf on Ubuntu Eoan both arm64 and x86_64

Description of the problem

When trying to compile the code, I get the following error:

$ make V=1
gcc -DPACKAGE_NAME=\"pmix-unit\" -DPACKAGE_TARNAME=\"pmix-unit\" -DPACKAGE_VERSION=\"1.0\" -DPACKAGE_STRING=\"pmix-unit\ 1.0\" -DPACKAGE_BUGREPORT=\"http://pmix.org\" -DPACKAGE_URL=\"\" -DPMIXUNIT_CONFIGURE_USER=\"user\" -DPMIXUNIT_CONFIGURE_HOST=\"geoffroy-jetsonnano\" -DPMIXUNIT_CONFIGURE_DATE=\"Fri\ Mar\ 13\ 20:09:06\ EDT\ 2020\" -DPACKAGE=\"pmix-unit\" -DVERSION=\"1.0\" -DPMIXTEST_ARCH=\"aarch64-unknown-linux-gnu\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_HWLOC_H=1 -DHAVE_EVENT_H=1 -DPMIX_HAVE_LIBEVENT=0 -I.   -I/home/user/install/pmix_master/include -I/home/user/install/hwloc-2.0.4/include  -g -MT pmix_test.o -MD -MP -MF .deps/pmix_test.Tpo -c -o pmix_test.o pmix_test.c
In file included from server_callbacks.h:16,
                 from pmix_test.c:35:
cli_stages.h:28:10: fatal error: event.h: No such file or directory
   28 | #include <event.h>
      |          ^~~~~~~~~
compilation terminated.
make: *** [Makefile:540: pmix_test.o] Error 1

Note that the path to the libevent lib and include is not properly inserted.

Interesting output from configure:

checking for libevent in... /home/user/install/libevent-2.1.10/include and /home/user/install/libevent-2.1.10/lib
looking for header in /home/user/install/libevent-2.1.10/include
checking event.h usability... yes
checking event.h presence... yes
checking for event.h... yes
checking for library containing event_config_new... -levent
checking for evthread_set_lock_callbacks in -levent... no
configure: WARNING: External libevent does not have thread support
configure: WARNING: PMIx_unit requires libevent to be compiled with
configure: WARNING: thread support enabled
checking will libevent support be built... no
checking --with-pmix value... sanity check ok (/home/user/install/pmix_master/include)
checking libpmix.* in /home/user/install/pmix_master/lib64... not found
checking libpmix.* in /home/user/install/pmix_master/lib... found
checking for PMIx version file... found
checking version 4x... found
checking pmix.h usability... yes
checking pmix.h presence... yes
checking for pmix.h... yes
checking for PMIx_Init... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: executing depfiles commands
config.status: executing libtool commands

libevent has been configured and installed based on the following command:

$ ./configure --prefix=/home/user/install/libevent-2.1.10 --enable-thread-support

PMIx was configured with the following command:

$ ./configure --prefix=/home/user/install/pmix_master --with-libevent=/home/user/install/libevent-2.1.10 --with-hwloc=/home/user/install/hwloc-2.0.4

I did not really have the time to investigate further than that. config/pmix_unit_setup_libevent.m4 seems to look for evthread_set_lock_callbacks and does not find it; however, it is defined in the library:

$ nm libevent.so | grep evthread_set_lock_callbacks
0000000000027628 T evthread_set_lock_callbacks

Updating PMIx tests

I know we are looking at updating the testing infrastructure, both for CI and in general. I just wanted to bring people's attention to a couple of suggested methods we could employ:

Jim Garelick (LLNL) suggested the following unit testing infrastructure:
openpmix/openpmix#103

Google released their testing tool a year ago:
https://opensource.googleblog.com/2019/02/open-sourcing-clusterfuzz.html

Fuzzing is an automated method for detecting bugs in software that works by feeding unexpected inputs to a target program. It is effective at finding memory corruption bugs, which often have serious security implications. Manually finding these issues is both difficult and time consuming, and bugs often slip through despite rigorous code review practices. For software projects written in an unsafe language such as C or C++, fuzzing is a crucial part of ensuring their security and stability.

We may run across other useful tools - we can capture those here as people find them.

PMIx Fence: single-job partial barrier

Test description

Verifies that the partial Fence works properly

Test sketch

#include "pmix.h"

double max_fence_time()
{
	double fence_time = 0, ts1, ts2;
	int i;
	
	/* Measure the typical fence execution time */
	for(i = 0; i < 100; i++) {
		ts1 = timestamp();
		PMIx_Fence(without_data_collection);
		ts2 = timestamp();
		fence_time = max(fence_time, ts2 - ts1);
	}
	return fence_time;
}

int main() 
{
    double T, fence_time, ts1, ts2;
	
    PMIx_Init();
	
    fence_time = max_fence_time();
    T = Ratio * fence_time; // Ratio might be 100, should be selected for the particular system
	
    if( rank == 1){
        sleep(T);
    }
    if( rank % 2 ){
        ts1 = timestamp();
        PMIx_Fence(without_data_collection, only-odd-procs);
        ts2 = timestamp();
		// Odd ranks should not be affected by the rank = 1 delay
        assert( (ts2 - ts1) ~ fence_time);
    }
    ts1 = timestamp();
    PMIx_Fence(without_data_collection);
    ts2 = timestamp();
	
    if(rank != 1) {
        assert( (ts2 - ts1) ~ T);
    } else {
        assert( (ts2 - ts1) ~ fence_time);
    }
    PMIx_Finalize();
}

Execution details

  • 4 servers
  • 16 clients
  • Predefined (passed through cmdline) namespace
  • Predefined process placement: "0:0,1,2,3; 1:4,5,6,7; 2:8,9,10,11; 3:12,13,14,15;"
  • Ratio and "~" are selected to match the system
    • The time-dependent checks can be turned off
  • Execute M times to capture race conditions
  • The first rank is simulating the delay. The test verifies that the Fence is really synchronizing.

Client-side expectations:

  1. All PMIx calls return PMIX_SUCCESS
  2. All ranks (except rank=1) observe > T of execution in the final, job-wide Fence.

Server-side expectations:

  1. N invocations of:
    • client_connected
    • client_finalized
  2. Verify that the proc structure was set to the individual ranks.
  3. 2 Fence callback invocations with WILDCARD.
  4. Distance between Fences on node0 is > T
  5. Starting from "openpmix/openpmix#1135" the size of Fence should be 0B.
  6. No other callbacks are called (no direct modex requests)
  7. (? Any event-related activity?)

Reference implementation:

TBD

Notes

The test suite's RTE component should implement support for multiple in-flight Fences.
This is currently not supported.
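
One possible shape for that support is a small table with one slot per outstanding fence, matched by participant list. The sketch below is hypothetical throughout: fence_slot_t, fence_contribute and the table sizes are invented for illustration, it counts one contribution per participating rank for simplicity, and it assumes every contributor passes the participant list in the same order.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PROCS  16
#define MAX_FENCES 8

/* One outstanding fence, identified by its ordered participant list. */
typedef struct {
    bool in_use;
    int nprocs;
    int procs[MAX_PROCS];
    int arrived;
} fence_slot_t;

static bool same_set(const fence_slot_t *s, const int *procs, int nprocs)
{
    return s->nprocs == nprocs &&
           memcmp(s->procs, procs, (size_t)nprocs * sizeof(int)) == 0;
}

/* Record one contribution to the fence over procs[0..nprocs-1].
 * Returns 1 when that fence completes, 0 while it is still pending,
 * and -1 when no slot is free for a new fence. */
static int fence_contribute(fence_slot_t table[MAX_FENCES],
                            const int *procs, int nprocs)
{
    fence_slot_t *slot = NULL, *free_slot = NULL;

    assert(nprocs > 0 && nprocs <= MAX_PROCS);
    for (int i = 0; i < MAX_FENCES; i++) {
        if (table[i].in_use && same_set(&table[i], procs, nprocs)) {
            slot = &table[i];
            break;
        }
        if (!table[i].in_use && free_slot == NULL) {
            free_slot = &table[i];
        }
    }
    if (slot == NULL) {                 /* first contribution to this fence */
        if (free_slot == NULL) {
            return -1;
        }
        slot = free_slot;
        slot->in_use = true;
        slot->nprocs = nprocs;
        slot->arrived = 0;
        memcpy(slot->procs, procs, (size_t)nprocs * sizeof(int));
    }
    if (++slot->arrived < slot->nprocs) {
        return 0;
    }
    slot->in_use = false;               /* fence done, release the slot */
    return 1;
}
```

Keeping the slots independent is what would let a partial Fence complete while a job-wide Fence is still waiting for a delayed rank.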

PMIx/job info: rank positioning

Test description

Verification of the information about this rank's position within the job.

Client-side expectations

  • [side channel check] PMIx provides expected:
    • Using wildcard process:
      • PMIX_JOB_SIZE, PMIX_UNIV_SIZE
      • PMIX_LOCAL_RANK
    • Using this rank as process:
      • PMIX_LOCAL_RANK, PMIX_NODEID, PMIX_HOSTNAME
      • PMIX_LOCAL_PEERS

Server-side expectations

  • All clients connect and disconnect
  • All client processes are successfully terminated
  • No requests (i.e. direct modex or notifications) are observed.

Reference implementation

openpmix/openpmix#1831

Side notes

It was observed that PMIx_Get(PMIX_RANK) hangs. @cpshereda what is the status?
