
open-mpi / ompi

Open MPI main development repository

Home Page: https://www.open-mpi.org

License: Other

Makefile 2.40% Perl 1.20% Shell 3.86% C 85.95% Python 0.03% C++ 0.30% M4 1.05% Tcl 0.01% Fortran 3.26% TeX 0.24% Java 1.67% Roff 0.02% sed 0.01%
c hpc mpi fortran hacktoberfest openmpi

ompi's People

Contributors

abouteiller, alex-mikheev, artpol84, awlauria, bosilca, bwbarrett, ddaniel, devreal, edgargabriel, ggouaillardet, gpaulsen, gshipman, hjelmn, hpcraink, hppritcha, janjust, jjhursey, jsquyres, jurenz, kawashima-fj, mike-dubman, rhc54, rlgraham32, rolfv, samuelkgutierrez, timattox, tkordenbrock, wckzhang, wenduwan, yosefe

ompi's Issues

Make mpi_preconnect_all work for multiple interfaces

The current preconnect code is BTL agnostic and uses send/recv from/to each proc. This has the benefit of pre-connecting any BTLs that used lazy connection establishment. The problem here is that when multiple BTLs are active for a given process (i.e. there are multiple endpoints) then this code only preconnects one of the BTLs and not the other.

One possible solution is to pre-connect using the BTL interface directly, looping through all the available endpoints for a proc and pre-connecting all of them.
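
As a self-contained illustration of the send/recv style of preconnect described above (this is not OMPI's actual preconnect code), the loop below exchanges a zero-byte message with every peer. Note that going through MPI_Send/MPI_Recv can only warm up whichever endpoint the PML happens to select, which is exactly the limitation this ticket is about.

    /* Minimal sketch of a send/recv-style preconnect: exchange a zero-byte
     * message with every peer so that lazily-established connections get set
     * up.  Because this goes through MPI_Send/MPI_Recv, it only warms up
     * whichever endpoint the PML selects -- exactly the limitation above. */
    #include <mpi.h>

    static void preconnect_all(MPI_Comm comm)
    {
        int rank, size, peer;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (peer = 0; peer < size; ++peer) {
            if (peer == rank) {
                continue;
            }
            /* Lower rank sends first so the two sides never block in
             * MPI_Send against each other. */
            if (rank < peer) {
                MPI_Send(NULL, 0, MPI_BYTE, peer, 0, comm);
                MPI_Recv(NULL, 0, MPI_BYTE, peer, 0, comm, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(NULL, 0, MPI_BYTE, peer, 0, comm, MPI_STATUS_IGNORE);
                MPI_Send(NULL, 0, MPI_BYTE, peer, 0, comm);
            }
        }
    }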

openib btl: APM and async events support

APM support should always be in a background thread so that it can be handled ASAP. The async event handler can be a bit lazier.

For v1.3, the async handler and APM handler are always off in a separate progression thread. For v1.3.1, we should change this strategy:

  • The async handler should use the openib_fd stuff, such that it will be off in its own thread in the case of HAVE_THREAD_SUPPORT. Otherwise, it uses libevent and is called back in the usual single-threaded model.
  • The APM handler should use the openib fd stuff in the case of HAVE_THREAD_SUPPORT, but otherwise should fork off its own thread.

Fortran wrappers are calling f2c/c2f MPI functions which get intercepted by PMPI routines

The current Fortran wrappers make calls to several f2c/c2f MPI functions. This causes any PMPI interposed library to intercept these calls erroneously (i.e., think that the user has called these routines). Though the MPI spec (http://www.mpi-forum.org/docs/mpi-11-html/node162.html#Node163) does not disallow this, it goes against the general OMPI rule of never calling an MPI function from inside the library. It is also a regression from what Sun did originally.

I've talked with Jeff about this issue and the below is what would need to be done to fix this issue:

We can't assume that the PMPI functions are there because there is a --disable-mpi-profile configure switch that will turn off the PMPI layer (it's there for platforms that don't have weak symbols, like OS X -- so the PMPI layer means compiling the entire MPI layer a 2nd time, which takes a lot of time; disabling it means a much faster build [for developers]).

So you just need to convert these functions to ompi_*() functions (vs. PMPI_*() functions) and then call those instead. Then also convert the various C MPI_*_F2C/C2F() functions to call these ompi_*() functions as well -- so everything uniformly calls these functions: the MPI_*_C2F/F2C functions and the Fortran functions.
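
As a toy, self-contained illustration of that layering (none of the names below are OMPI's real internal symbols): both the public, interceptable conversion function and the Fortran wrapper call a shared internal helper, so a PMPI-style interposer never sees the library's own conversions.

    /* Toy, self-contained illustration of the proposed layering.  None of
     * these names are OMPI's real internal symbols. */
    #include <stdio.h>

    typedef int toy_Fint;                   /* stand-in for MPI_Fint   */
    typedef struct { int id; } toy_comm_t;  /* stand-in for a C handle */

    static toy_comm_t toy_comm_table[2] = { { 0 }, { 1 } };

    /* Internal conversion helper: a profiling library can never intercept it. */
    static toy_comm_t *toy_internal_comm_f2c(toy_Fint f)
    {
        return &toy_comm_table[f];
    }

    /* Public, interceptable function: just a thin veneer over the helper. */
    toy_comm_t *Toy_Comm_f2c(toy_Fint f)
    {
        return toy_internal_comm_f2c(f);
    }

    /* Fortran wrapper: calls the helper directly, so an interposer layered
     * on Toy_Comm_f2c never "sees" the library's own conversions. */
    void toy_fortran_wrapper_(toy_Fint *f)
    {
        printf("wrapper got C handle %d\n", toy_internal_comm_f2c(*f)->id);
    }

    int main(void)
    {
        toy_Fint f = 1;
        toy_fortran_wrapper_(&f);
        return 0;
    }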

Make F90 MPI_IN_PLACE, MPI_ARGVS_NULL, and MPI_STATUSES_IGNORE be unique types

Per an e-mail exchange with Michael Kluskens, there appears to be benefit from making several of the MPI constants in F90 be unique types so that we can have unique interfaces that match just that sentinel. Specifically:

  • We can disallow using those constants where they are not allowed
  • We can prevent accidental bad arguments to functions (E.g., MPI_ARGVS_NULL is currently a double precision -- so someone could accidentally pass a double precision into MPI_COMM_SPAWN_MULTIPLE. Unlikely, but still possible -- a unique type for MPI_ARGVS_NULL would prevent this possibility)

See http://www.open-mpi.org/community/lists/users/2006/11/2115.php.

installdir functionality does not work with TV message queue plugin

The debugging message queue functionality will not work if the installdirs functionality is used at run-time to change the location of the OMPI installation. This is because the TV message queue functionality requires a hard-coded location that is read before main() to know where the OMPI MQS DLL is located.

It is unknown at this time how to fix this problem; something will have to be worked out with Etnus and Allinea to change how the global symbol is used (e.g., only examine it after some defined point where we have had a chance to change its value)? [shrug]

MPI attribute code needs threading audit

In reviewing bug https://svn.open-mpi.org/trac/ompi/ticket/176, I have determined that the locking code in source:/trunk/ompi/attribute/attribute.c may not be thread safe in all cases and needs to be audited. It was written with the best of intentions :-) but then never tested and I think there are some obscure race conditions that ''could'' happen.

For example, in ompi_attr_create_keyval(), we have the following:

    OPAL_THREAD_LOCK(&alock);
    ret = CREATE_KEY(key);
    if (OMPI_SUCCESS == ret) {
        ret = opal_hash_table_set_value_uint32(keyval_hash, *key, attr);
    }
    OPAL_THREAD_UNLOCK(&alock);
    if (OMPI_SUCCESS != ret) {
        return ret;
    }

    /* Fill in the list item */

    attr->copy_attr_fn = copy_attr_fn;
    /* ...fill in more attr->values ... */

This could clearly be a problem since we set the empty keyval on the hash and therefore it's available to any other thread as soon as the lock is released -- potentially ''before'' we finish setting all the values on the attr variable (which is poorly named -- it's a keyval, not an attribute).

This one problem is easily fixed (ensure that attr is fully set up before we assign it to the keyval hash -- see the sketch below), but it reflects that the rest of the attribute code should really be audited. Hence, this ticket is a placeholder to remember to audit this code because it may not be thread safe.
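
A minimal sketch of that fix (not the actual committed change): fill in the keyval before it is published in the hash, so no other thread can ever see a half-initialized entry once the lock is released.

    /* Possible fix (sketch only): fully initialize the keyval *before* it
     * becomes visible to other threads through the hash table. */
    attr->copy_attr_fn = copy_attr_fn;
    /* ...fill in the rest of the attr-> fields here... */

    OPAL_THREAD_LOCK(&alock);
    ret = CREATE_KEY(key);
    if (OMPI_SUCCESS == ret) {
        ret = opal_hash_table_set_value_uint32(keyval_hash, *key, attr);
    }
    OPAL_THREAD_UNLOCK(&alock);
    if (OMPI_SUCCESS != ret) {
        return ret;
    }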

LN, LN_S and RM

When running configure under Cygwin, there is no way to force these 3 variables to anything other than the default values. On Windows, LN will not work as expected; "cp -p" should be used instead.

MPI_Pack_external_size is returning the wrong values in 64-bit applications

We have a program that tests for the size returned from MPI_Pack_external_size with the external32 data representation. It should return the same value for both 32-bit and 64-bit applications, but it is returning different values.

 burl-ct-v40z-0 65 =>mpicc ext32.c -o ext32
"ext32.c", line 105: warning: shift count negative or too big: << 32
 burl-ct-v40z-0 66 =>mpirun -np 2 ext32
First test passed
Second test passed
Third test passed
ext32: PASSED
 burl-ct-v40z-0 67 =>mpicc -xarch=amd64 ext32.c -o ext32_amd64
 burl-ct-v40z-0 68 =>mpirun -np 2 ext32_amd64 
First test passed
Second test failed. Got size of 80, expected 40
Third test failed. Got size of 6400, expected 3200
[burl-ct-v40z-0:13864] *** An error occurred in MPI_Pack_external
[burl-ct-v40z-0:13864] *** on communicator MPI_COMM_WORLD
[burl-ct-v40z-0:13864] *** MPI_ERR_TRUNCATE: message truncated
[burl-ct-v40z-0:13864] *** MPI_ERRORS_ARE_FATAL (goodbye)
 burl-ct-v40z-0 69 =>
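
For reference, a minimal standalone check in the same spirit as ext32.c (this is not the original test program): external32 sizes are fixed by the MPI standard, so the result must not depend on whether the application is built 32-bit or 64-bit.

    /* Minimal check in the spirit of ext32.c (not the original test):
     * external32 sizes are fixed by the standard, so the answer must be the
     * same for 32-bit and 64-bit builds. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Aint size;

        MPI_Init(&argc, &argv);

        /* 10 MPI_INTs are 4 bytes each in external32 => expect 40 bytes,
         * regardless of the ABI of the application. */
        MPI_Pack_external_size("external32", 10, MPI_INT, &size);
        printf("external32 size of 10 MPI_INT: %ld (expected 40)\n", (long)size);

        MPI_Finalize();
        return 0;
    }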

Orted prolog and epilog hooks

Terry and I were talking about the possibility of having per-job prolog and epilog steps in the orted. That is, an MCA parameter that identifies an argv to run before the first local proc of a job is launched on the node and after the last local proc of a job has completed. The argv would typically be a local script (perhaps to perform some site-specific administrative stuff). If the argv for the prolog/epilog is blank (which would be the default), then nothing would be launched for these steps. Hence, these would be hooks available to sysadmins if they want to use them.

I'm guessing/assuming that this would not be difficult to do -- it's mainly a matter of:

  • Finding the right place in the orted to run the prolog and epilog
  • Deciding what information to give to the prolog and epilog (e.g., passing a pile of relevant info in environment variables, such as the job ID, the session directory, the argv of the job, the exit conditions of the job, etc. -- anything that the prolog and epilog might want to know. Just about every resource manager has prolog/epilog functionality -- we might look to them for inspiration on what kind of information could be useful).

It ''might'' be useful to also have the same prolog/epilog hooks for each process in a job on the host as well. [shrug]

I'm initially marking this as a 1.3 milestone, but have no real requirement for it in v1.3 -- it seems like an easy / neat / useful idea, but there is no ''need'' to have it in v1.3. It could be pushed forward.
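
A rough sketch of what the orted-side hook could look like, assuming the argv comes from an MCA parameter and job information is passed through environment variables (the variable names below are invented for illustration; they are not existing ORTE names):

    /* Rough sketch of an orted-side prolog/epilog hook; the environment
     * variable names are invented for illustration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int run_hook(char *const argv[], const char *jobid,
                        const char *session_dir)
    {
        pid_t pid;
        int status = 0;

        if (NULL == argv || NULL == argv[0] || '\0' == argv[0][0]) {
            return 0;                 /* no hook configured: nothing to do */
        }

        pid = fork();
        if (0 == pid) {
            /* Pass job information to the script via the environment. */
            setenv("OMPI_HOOK_JOBID", jobid, 1);             /* hypothetical */
            setenv("OMPI_HOOK_SESSION_DIR", session_dir, 1); /* hypothetical */
            execvp(argv[0], argv);
            perror("execvp");         /* only reached if the exec failed */
            _exit(127);
        } else if (pid < 0) {
            return -1;
        }

        waitpid(pid, &status, 0);
        return status;
    }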

Rename openib BTL to ofrc

It was decided a while ago to rename the openib BTL to be "ofrc" (OpenFabrics using reliable connections).

This is mostly menial labor, but there is one significant problem: we '''must''' have backwards compatibility for all the "openib" MCA parameter names because they've been on our web site and mailing list posts for (literally) years. For example, the following must work:

shell$ mpirun --mca btl openib,self ...
shell$ mpirun --mca btl_some_well_known_mca_param foo ...

This part is likely to be a bit harder than the menial labor of simply renaming the directory and all the symbols from "openib_" to "ofrc_", because it will likely involve adding functionality to the MCA parameter engine. Care must be taken with this, of course, because the MCA parameter engine is kinda central to, well, everything. :-)

Paul Hargrove had some excellent suggestions on the devel list about this kind of stuff; be sure to see http://www.open-mpi.org/community/lists/devel/2007/10/2394.php.

I'm initially assigning this ticket to Andrew Friedley because he was foolish enough to bring it up on the mailing list. ;-)

Cannot launch from head node on Big Red due to oob issues

Filing this mainly so we don't forget about it...

On Big Red we cannot always launch jobs from the head node to the remote nodes. This seems to be due to the oob not finding the right communication paths.

The networking on Big Red is a bit confusing. There are 3 networks:

  1. A Myrinet network which is global to all compute nodes
  2. A GigE network which is global
  3. A GigE network which is local to each cabinet.

If the compute nodes are in the same cabinet as the head node, we use the cabinet GigE network and are fine. If we launch on a backend node, we find and use the myrinet network (or force it to use the global GigE network) and are fine.

However, if we launch from the head node to nodes which are not in the same cabinet, we do not automatically find the correct network and simply hang. I can get it to launch correctly if I pass "-mca oob_tcp_include eth3,eth1" (the global GigE interfaces on the head node and the compute nodes, respectively).

This doesn't seem to be an issue for others, and since one normally isn't supposed to launch jobs from the head node of Big Red, I'm putting this to 1.3.

Add FAQ item re: flowchart of which send protocol is used

Per soon-to-be-added items in the openfabrics portion of the FAQ, we have explanations of openib/ob1 behavior regarding which sending protocol is used (i.e., the "tuning long message behavior" items). Pasha suggests that it would be good to have an overall flowchart that shows how the protocols are chosen.

Attached are some images from Voltaire MPI docs that may be good starting points for such a diagram.

Solaris does not work correctly with event port polling

We have observed hangs when running applications on Solaris. It appears that this is because of the use of event ports.

Here is an example of the stack trace when it hangs.

alamodome 43 =>pstack 1964 1966
1964:    IMB-MPI1.trunk barrier
fe6c060c lwp_yield (0, 1, fe25d134, fe25ce58, 4, 0) + 8
fef9e210 opal_progress (ff06f680, 0, ff06f688, 0, ff06f67c, 1) + 12c
fe5150f4 barrier  (0, fe52ce9c, fe52e9b9, fe51ab60, fe51aaa0, ff252c10) + 394
fe887ac0 ompi_mpi_init (1b4, fe2a7568, 0, 408, fee7ca4c, fed18d28) + 7e8
fea19ad4 MPI_Init (ffbff82c, ffbff830, fee8072d, b38, fee7ca4c, 35450) + 160
00012830 main     (2, ffbff84c, ffbff858, 2a800, ff3a0100, ff3a0140) + 10
000123f8 _start   (0, 0, 0, 0, 0, 0) + 108

Here it is running with an env var set so we can see the type of polling being used.

burl-ct-v440-2 140 =>mpirun -x EVENT_SHOW_METHOD -host burl-ct-v440-3 -np 4 -mca btl self,sm,tcp bcast
[msg] libevent using: poll
[msg] libevent using: event ports
[msg] libevent using: event ports
[msg] libevent using: event ports
[msg] libevent using: event ports

And if we change it to use devpoll, poll, or select, it works.

burl-ct-v440-2 141 =>mpirun -x EVENT_SHOW_METHOD -host burl-ct-v440-3 -np 4 -mca opal_event_include poll bcast
[msg] libevent using: poll
[msg] libevent using: poll
[msg] libevent using: poll
[msg] libevent using: poll
[msg] libevent using: poll
Starting MPI_Bcast...
All done.
All done.
All done.
All done. 

And here is a case of disabling event ports and letting the library pick the next available method.

burl-ct-v440-2 147 =>setenv EVENT_NOEVPORT
burl-ct-v440-2 148 =>mpirun -x EVENT_NOEVPORT -x EVENT_SHOW_METHOD -host burl-ct-v440-3 -np 4 bcast
[msg] libevent using: poll
[msg] libevent using: devpoll
[msg] libevent using: devpoll
[msg] libevent using: devpoll
[msg] libevent using: devpoll
Starting MPI_Bcast...
All done.
All done.
All done.
All done.

We only saw this on our debuggable builds. We did not see it with our optimized builds. It is not clear what difference in the configure is triggering this.

Here is the configure line that triggers the problem.

../configure --with-sge --disable-io-romio --enable-orterun-prefix-by-default --enable-heterogeneous --enable-trace --enable-debug --enable-shared --enable-mpi-f90 --with-mpi-f90-size=trivial --without-threads --disable-mpi-threads --disable-progress-threads CFLAGS="-g" FFLAGS="-g" --prefix=/workspace/rolfv/ompi/sparc/trunk/release --libdir=/workspace/rolfv/ompi/sparc/trunk/release/lib --includedir=/workspace/rolfv/ompi/sparc/trunk/release/include --with-wrapper-ldflags="-R/workspace/rolfv/ompi/sparc/trunk/release/lib -R/workspace/rolfv/ompi/sparc/trunk/release/lib/sparcv9" CC=cc CXX=CC F77=f77 F90=f90 --enable-cxx-exceptions

Document sm BTL MCA params

Need to have FAQ entries about the various tuning options for the sm btl.

(I've had this on my personal to-do list for forever; if I move it to a global to-do list, there's at least a slightly smaller chance that someone will have the time/ability to do it...)

Implement a "better" MPI preconnect function

As discussed in https://svn.open-mpi.org/trac/ompi/ticket/1207, implement a "better" MPI preconnect function (https://svn.open-mpi.org/trac/ompi/ticket/1207 encompassed 2 ideas: "print the MPI connection map" and "better MPI preconnect" -- so I'm splitting the preconnect stuff out into its own ticket for clarity). Copied from the old ticket:

= New "preconnect all" functionaliy =

  • Should completely replace old MPI preconnect functionality.
  • Need a new PML interface function: connect_all() that will connect this process to all others that it knows about (i.e., all ompi_proc_t's that it's aware of, which takes care of the MPI-2 dynamics cases). The main idea is to use the new active-message functionality to send an AM message tag to the remote PML peer. The message will cause a no-op function to occur on the other side, but it will force the connection to be made.
    • For BTL-related PMLs: do a btl_alloc() followed by a btl_send(). Loop over the btl_send's until they all complete or fail (i.e., keep checking the ones that return RESOURCE_BUSY).
    • For MTL-related PMLs: the function may be a no-op if there's no way to guarantee that connections are made. Or it may use the same general technique as the BTL-related PMLs: send an AM tag to its remote PML peer that causes a no-op on the remote side, but forces the connection to be made. The MTL may have specific knowledge about what needs to be done to force a connection of its lower layer.

Allow alternate IOF wireup strategies via orterun

Currently, the urm RMGR component has the following IOF setup
hard-coded in it:

  • vpid 0 gets stdin forwarded from orterun
  • orterun's stdout/stderr receives the stdout/stderr from all processes

orterun should grow some options to allow alternate IOF wireup
schemes. Some potentially worthwhile schemes include:

  • Replicating stdin to all processes in the job
  • Display stdout and/or stderr only from selected processes

To avoid scalability problems, this wireup scheme should be encoded in
the app context or some other data that is xcast out to all the
orteds (and yes, this is fine that this is orted-specific
functionality) so that acting on the IOF wireup strategy does not
require any additional control messages in IOF -- if all processes in
the job ''know'' what the wireup strategy is, they can just setup
local data structures to reflect that and be done (assuming that
everyone else is also doing the same).

This would also allow fixing a minor code discrepancy in the ODLS
default component. Currently, it publishes stdin (if relevant),
stdout, and stderr. But it only ''unpublishes'' stdin. The reason
for this is scalability issues: since stdin is only sent to one
process, publishing and unpublishing it only requires one IOF control
message (each). Publishing SOURCE stdout/stderr is actually a no-op
because the proxy ''always'' sends all SOURCE fragments to the svc, so
publishing it is not required. Unpublishing SOURCE endpoints ''does''
require an IOF control message, however, but since the HNP is either
about to or in the process of shutting down when we would have
unpublished, the resource leak that we cause by not unpublishing is
short-lived, and therefore it isn't done (to avoid sending N*2
unpublish requests to the SVC).

Add PGI debugger invocation to orte_base_user_debugger MCA param default value

From the user's mailing list http://www.open-mpi.org/community/lists/users/2006/07/1558.php, Andrew Caird found that the following command line syntax "mostly" works with the PGI debugger:

mpirun --debugger "pgdbg @mpirun@ @mpirun_args@" --debug -np 2 ./cpi 

Hence, we can add "pgdbg @mpirun@ @mpirun_args@" to the default value of orte_base_user_debugger so that it will be found automatically and users don't need to specify it.

However, Andrew noted that the PGI debugger doesn't fully support Open MPI yet (right now, it shows some warning message, which may be indicative of deeper problems). PGI support says that they are [pleasantly] surprised that it works with Open MPI at all, but hope to support Open MPI by the end of the year or so.

This ticket is a placeholder to add the pgdbg value to orte_base_user_debugger once the PGI debugger supports Open MPI. I don't want to add it before then because it could be misleading to users.

Fix corner case of DDT launching

As reported by Allinea:

If you have the Open MPI mpirun in your PATH and DDT is set to use MPICH Standard startup then when you start a program it will continuously launch new instances of the GUI.

This is because Open MPI has support for MPICH's -tv option, but it's broken: it ignores the TOTALVIEW environment variable and launches DDT with "ddt -n NUMPROC -start PROGRAM". This, in turn, runs mpirun and spawns even more copies of DDT.

"mpirun -np 8 -tv user-app-path" needs to translate to "ddt -n 8 user-app-path" if loading DDT --- except when the TOTALVIEW env var is set. In that case you should execute 8 copies of $TOTALVIEW, one per proc on the target hosts.

That should work for everything I can think of! Our default Open MPI / DDT startup doesn't go via the "-tv" option, so it should be unaffected: the fix above is only to handle the case when the user has done something silly, i.e., picked MPICH Standard instead of Open MPI from the available list. From my understanding, the above fix shouldn't break TotalView.

Make MX MTL only open endpoint when selected

Currently, the MX MTL opens mx_endpoints, regardless of whether it is going to be used or not. This causes problems since by default MX has a very low number of available endpoints, and users can run out long before they expect to.

The MX BTL does not have this problem; it only opens endpoints when needed.

OMPI build broken on OpenBSD

Neither the 1.1.1 release nor the 1.2 branch can be built on OpenBSD (3.9).

 gcc -DHAVE_CONFIG_H -I. -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../../ompi/include -I../.. -O3 -DNDEBUG -fno-strict-aliasing -pthread -MT stacktrace.lo -MD -MP -MF .deps/stacktrace.Tpo -c stacktrace.c  -fPIC -DPIC -o .libs/stacktrace.o
stacktrace.c: In function `opal_show_stackframe':
stacktrace.c:232: error: `SI_ASYNCIO' undeclared (first use in this function)
stacktrace.c:232: error: (Each undeclared identifier is reported only once
stacktrace.c:232: error: for each function it appears in.)
stacktrace.c:233: error: `SI_MESGQ' undeclared (first use in this function)
gmake[3]: *** [stacktrace.lo] Error 1
gmake[3]: Leaving directory `/var/tmp/openmpi-1.1.1/opal/util'
gmake[2]: *** [all-recursive] Error 1
gmake[2]: Leaving directory `/var/tmp/openmpi-1.1.1/opal/util'
gmake[1]: *** [all-recursive] Error 1
gmake[1]: Leaving directory `/var/tmp/openmpi-1.1.1/opal'
gmake: *** [all-recursive] Error 1

I can't imagine why one would use OpenBSD for high performance computing (think of the poor OpenBSD performance in general), so we might close this ticket with "wontfix". (just wanted to let you know...)
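
That opinion aside, the fix itself looks small: OpenBSD 3.9 evidently does not define SI_ASYNCIO or SI_MESGQ, so the corresponding cases in opal_show_stackframe() could be guarded with the preprocessor. A sketch (not the actual patch):

    /* Sketch of a portability guard for the si_code cases; not the actual
     * patch that went into the tree. */
    #include <signal.h>
    #include <string.h>

    static void describe_si_code(int si_code, char *buf, size_t len)
    {
        switch (si_code) {
    #ifdef SI_ASYNCIO
        case SI_ASYNCIO:
            strncpy(buf, "SI_ASYNCIO", len);
            break;
    #endif
    #ifdef SI_MESGQ
        case SI_MESGQ:
            strncpy(buf, "SI_MESGQ", len);
            break;
    #endif
        default:
            strncpy(buf, "unknown si_code", len);
            break;
        }
    }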

Allow F90 MPI_BUFFER_DETACH to return correct pointer

The current F77 MPI_BUFFER_DETACH implementation does not return the detached buffer pointer to the caller -- it simply does not make sense to do this in F77 because a) you can't get it, b) pointer implementations between compilers seem to differ, and c) even among the F77 compilers that do support pointers, you can't compare or use the pointer in a meaningful way. There are two precedents that support this interpretation: LAM/MPI and CT6 both do not return the pointer to F77 callers.

Oh, and users of buffered sends should be punished, anyway. :-)

However, this is a problem for the F90 bindings, which are [mostly] layered on top of the F77 bindings. In F90, you can manage memory much like C, so it does make sense to return the detached buffer through the F90 API. Hence, we need to override the default MPI F90 interface for MPI_BUFFER_DETACH and have a specific implementation that returns the buffer pointer to the caller.

Here are some nuggets of information that may be helpful from an e-mail exchange between us and a Fortran expert at Sun (Ian B.):


Terry's e-mail to Ian:

In MPI there is a pair of functions called MPI_Buffer_attach and MPI_Buffer_detach. These are used by the application program to give the MPI library some buffer space to use for the buffered communications functions.

When you call MPI_Buffer_attach you pass it a pointer to a buffer that you want MPI to use. In C when you call MPI_Buffer_detach you pass it a pointer to a pointer in which the MPI library returns to you the pointer to the buffer you passed it via the MPI_Buffer_attach. For C I can see this being used if you don't keep around the pointer to the buffer and you want to free the buffer returned by MPI_Buffer_detach.

My question: is the above applicable to Fortran programs at all? Could one do something similar with Fortran (90-03) pointers?


Ian's response:

Yes, one could do that with f90 pointers. You would need an interface for the MPI_Buffer_* routines, or else the pointer arguments won't be passed correctly. Something like

  interface
    subroutine MPI_Buffer_attach(p)
      integer, pointer, intent(in) :: p(:)
    end subroutine
  end interface
  interface
    subroutine MPI_Buffer_detach(p)
      integer, pointer, intent(out) :: p(:)
    end subroutine
  end interface

(I haven't actually tried compiling that, so caveat emptor.)

Global / local MCA parameters

Suggestion of having "global" and "local" MCA parameters in MCA config files.

  • Local MCA params would be exactly what they are today (and in the absence of a designator, params default to local -- for backwards file format compatibility), meaning that they are not sent to processes on remote nodes.
  • Global MCA params would be bundled up by orterun and sent to all processes in the job, whether or not the user overrode those MCA parameters on the command line (or environment, etc.).

The intent is to be able to have a central set of MCA params that comes from a single location (i.e., you don't need to propagate the MCA params config file to all nodes in the job).

Add support for blocking progress

There was much discussion at the Paris meeting about how to add support for blocking progress. This ticket is a placeholder for that functionality.

A bunch of VxWorks issues

An enterprising Mercury employee (Ken Cain) has been noodling around with getting OMPI to compile on VxWorks. After we talked extensively at a conference, he sent a list of the current issues that he is having:


Hello Jeff,

At the OFA reception tonight you asked me to send the list of porting issues I've seen so far with OMPI for VxWorks PPC. It's just a raw list that reflects a work in progress, sorry for the messiness...

-Ken

  1. configure issues with "checking prefix for global symbol labels"

1a. VxWorks assembler (CCAS=asppc) generates a.out by default (vs. conftest.o that we need subsequently)

there is this fragment to determine the way to assemble conftest.s:

if test "$CC" = "$CCAS" ; then
    ompi_assemble="$CCAS $CCASFLAGS -c conftest.s >conftest.out 2>&1"
else
    ompi_assemble="$CCAS $CCASFLAGS conftest.s >conftest.out 2>&1"
fi

The subsequent link fails because conftest.o does not exist:

   ompi_link="$CC $CFLAGS conftest_c.$OBJEXT conftest.$OBJEXT -o conftest > conftest.link 2>&1"

To work around the problem, I did not set CCAS. This gives me the first
invocation that includes the -c argument to CC=ccppc, generating
conftest.o output.

1b. linker fails because LDFLAGS are not passed

The same linker command line caused problems because $CFLAGS were passed
to the linker

   ompi_link="$CC $CFLAGS conftest_c.$OBJEXT conftest.$OBJEXT -o conftest > conftest.link 2>&1"

In my environment, I set CC/CFLAGS/LDFLAGS as follows:

CC=ccppc

CFLAGS='-ggdb3 -std=c99 -pedantic -mrtp -msoft-float -mstrict-align
-mregnames -fno-builtin -fexceptions'

LDFLAGS=-mrtp -msoft-float -Wl,--start-group -Wl,--end-group
-L/amd/raptor/root/opt/WindRiver/vxworks-6.3/target/usr/lib/ppc/PPC32/sfcommon

The linker flags are not passed because the ompi_link command uses $CFLAGS but not $LDFLAGS, so the link fails:

[xp-kcain1:build_vxworks]  ccppc -ggdb3 -std=c99 -pedantic -mrtp -msoft-float -mstrict-align -mregnames -fno-builtin -fexceptions -o hello hello.c
/amd/raptor/root/opt/WindRiver/gnu/3.4.4-vxworks-6.3/x86-linux2/bin/../lib/gcc/powerpc-wrs-vxworks/3.4.4/../../../../powerpc-wrs-vxworks/bin/ld: 
cannot find -lc_internal
collect2: ld returned 1 exit status
  2. OPAL atomics asm.c:

int versus int32_t (refer to email with Brian Barrett)

  3. OPAL event/event.c: sys/time.h and timercmp() macros not defined by VxWorks; refer to workaround in event.c using #ifdef MCS_VXWORKS
  4. OPAL event/event.c: pipe() syscall not found

workaround:

#ifdef HAVE_UNISTD_H
#include <unistd.h>
#ifdef MCS_VXWORKS
#include <ioLib.h>      /* for pipe() */
#endif
#endif
  5. OPAL event/signal.c
static sig_atomic_t opal_evsigcaught[NSIG];

NSIG is not defined, but _NSIGS is

In Linux, NSIG is defined with -D__USE_MISC

So I added this code fragment to signal.c:

/* VxWorks signal.h defines _NSIGS, not NSIG */
#ifdef MCS_VXWORKS
#define NSIG (_NSIGS+1)
#endif
  6. OPAL event/signal.c: no socketpair()

workaround: use pipe():

#ifdef HAVE_UNISTD_H
#include <unistd.h>
#ifdef MCS_VXWORKS
#include <ioLib.h>      /* for pipe() */
#endif
#endif

and later in void opal_evsignal_init(sigset_t *evsigmask)

#ifdef MCS_VXWORKS
        if (pipe(ev_signal_pair) == -1)
                event_err(1, "%s: pipe", __func__);
#else
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, ev_signal_pair) == -1)
        event_err(1, "%s: socketpair", __func__);
#endif
  7. OPAL util/basename.c: #if HAVE_DIRNAME problem
../../../opal/util/basename.c:23:5: warning: "HAVE_DIRNAME" is not defined
../../../opal/util/basename.c: In function `opal_dirname':

problem: HAVE_DIRNAME is not defined in opal_config.h so the #if HAVE_DIRNAME will fail at preprocessor/compile time

workaround:

change #if HAVE_DIRNAME to #if defined(HAVE_DIRNAME)

  8. OPAL util/basename.c: strncpy_s and _strdup
../../../opal/util/basename.c: In function `opal_dirname':
../../../opal/util/basename.c:153: error: implicit declaration of function `strncpy_s'
../../../opal/util/basename.c:160: error: implicit declaration of function `_strdup'

#ifdef MCS_VXWORKS
        strncpy( ret, filename, p - filename);
#else
                strncpy_s( ret, (p - filename + 1), filename, p - filename );
#endif
#ifdef MCS_VXWORKS
    return strdup(".");
#else
    return _strdup(".");
#endif
  9. opal/util/if.c: socket() prototype not found in VxWorks headers
#ifdef HAVE_SYS_SOCKET_H
#include <sys/socket.h>
#ifdef MCS_VXWORKS
#include <sockLib.h>
#endif
#endif
  10. opal/util/if.c: ioctl()
#ifdef HAVE_SYS_IOCTL_H
#include <sys/ioctl.h>
#ifdef MCS_VXWORKS
#include <ioLib.h>
#endif
#endif
  11. opal/util/os_path.c: MAXPATHLEN change to PATH_MAX
#ifdef MCS_VXWORKS
    if (total_length > PATH_MAX) {  /* path length is too long - reject it */
        return(NULL);
#else
    if (total_length > MAXPATHLEN) {  /* path length is too long - reject it */
        return(NULL);
#endif
  12. opal/util/output.c: gethostname()
#include <hostLib.h>
  13. opal/util/output.c: MAXPATHLEN

same fix as os_path.c above

  14. opal/util/output.c: closelog/openlog/syslog

manually turned off HAVE_SYSLOG_H in opal_config.h, then got a patch from Jeff Squyres that avoids syslog

  15. opal/util/opal_pty.c

complains about mismatched prototype of opal_openpty() between this source file and opal_pty.h

workaround: manually edit build_vxworks_ppc/opal/include/opal_config.h, use the following line (change 1 to 0):

#define OMPI_ENABLE_PTY_SUPPORT 0
  16. opal/util/stacktrace.c

FPE_FLTINV not present in signal.h

workaround: edit opal_config.h to turn off

OMPI_WANT_PRETTY_PRINT_STACKTRACE (this can be explicitly configured out
but I don't want to reconfigure because I hacked item 15 above)

  17. opal/mca/base/mca_base_open.c

gethostname() -- same as opal/util/output.c, must include hostLib.h

  18. opal_progress.c

from opal/event/event.h (that I modified earlier)

cannot find #include <sys/_timeradd.h>

It is in opal/event/compat/sys

workaround: change event.h to include the definitions that are present in _timeradd.h instead of including it.

  19. Link errors for opal_wrapper
strcasecmp
strncasecmp

I rolled my own in mca_base_open.c (temporary fix, since we may come across this problem elsewhere in the code).

  20. dss_internal.h uses a type 'uint'

Not sure if it's depending on something in the headers, or something it
defined on its own.

I changed it to be just like the header I found somewhere under Linux /usr/include:

#ifdef MCS_VXWORKS
typedef unsigned int uint;
#endif
  21. struct iovec definition needed
orte/mca/iof/base/iof_base_fragment.h:45: warning: array type has incomplete element type
#ifdef MCS_VXWORKS
#include <net/uio.h>
#endif

not sure if this is right, or if I should include something like <netBufLib.h> or <ioLib.h>

  22. iof_base_setup.c

struct termios not understood

can only find termios.h header in 'diab' area and I'm not using that compiler.

a variable usepty is set to 0 already when OMPI_ENABLE_PTY_SUPPORT is 0.
So, why are we compiling this fragment of code at all? I hacked the file
so that the struct termios code will not get compiled.

  23. oob_base_send/recv.c, oob_base_send/recv_nb.c. struct iovec not known.
#ifdef MCS_VXWORKS
#include <net/uio.h>
#endif
  24. orte/mca/rmgr/base/rmgr_base_check_context.c:58: error: `MAXHOSTNAMELEN' undeclared (first use in this function)
#ifdef MCS_VXWORKS
#define MAXHOSTNAMELEN 64
#endif
  25. orte/mca/rmgr/base/rmgr_base_check_context.c:58: gethostname()
#ifdef MCS_VXWORKS
#include <hostLib.h>
#endif
  26. Compile problem
orte/mca/iof/proxy/iof_proxy.h:135: warning: array type has incomplete element type
../../../../../orte/mca/iof/proxy/iof_proxy.h:135: error: field `proxy_iov' has incomplete type
#ifdef MCS_VXWORKS
#include <net/uio.h>
#endif
  27. Compile problem
/orte/mca/iof/svc/iof_svc.h:147: warning: array type has incomplete element type
../../../../../orte/mca/iof/svc/iof_svc.h:147: error: field `svc_iov' has incomplete type
#ifdef MCS_VXWORKS
#include <net/uio.h>
#endif
  28. Compile problem
../../../../../orte/mca/oob/tcp/oob_tcp_msg.h:66: warning: array type has incomplete element type
../../../../../orte/mca/oob/tcp/oob_tcp_msg.h:66: error: field `msg_iov' has incomplete type
../../../../../orte/mca/oob/tcp/oob_tcp_msg.h: In function `mca_oob_tcp_msg_iov_alloc':
../../../../../orte/mca/oob/tcp/oob_tcp_msg.h:196: error: invalid application of `sizeof' to incomplete type `iovec'
  29. Compile problem
../../../../../orte/mca/oob/tcp/oob_tcp.c:344: error: implicit declaration of function `accept'
../../../../../orte/mca/oob/tcp/oob_tcp.c: In function `mca_oob_tcp_create_listen':
../../../../../orte/mca/oob/tcp/oob_tcp.c:383: error: implicit declaration of function `socket'
../../../../../orte/mca/oob/tcp/oob_tcp.c:399: error: implicit declaration of function `bind'
../../../../../orte/mca/oob/tcp/oob_tcp.c:407: error: implicit declaration of function `getsockname'
../../../../../orte/mca/oob/tcp/oob_tcp.c:415: error: implicit declaration of function `listen'
../../../../../orte/mca/oob/tcp/oob_tcp.c: In function `mca_oob_tcp_listen_thread':
../../../../../orte/mca/oob/tcp/oob_tcp.c:459: error: implicit declaration of function `bzero'
../../../../../orte/mca/oob/tcp/oob_tcp.c: In function `mca_oob_tcp_recv_probe':
../../../../../orte/mca/oob/tcp/oob_tcp.c:696: error: implicit declaration of function `send'
../../../../../orte/mca/oob/tcp/oob_tcp.c: In function `mca_oob_tcp_recv_handler':
../../../../../orte/mca/oob/tcp/oob_tcp.c:795: error: implicit declaration of function `recv'
../../../../../orte/mca/oob/tcp/oob_tcp.c: In function `mca_oob_tcp_init':
../../../../../orte/mca/oob/tcp/oob_tcp.c:1087: error: implicit declaration of function `usleep'

This gets rid of most (except bzero and usleep)

#ifdef MCS_VXWORKS
#include <sockLib.h>
#endif

Trying to reconfigure the package so CFLAGS will not include -pedantic.
This is because $WIND_HOME/vxworks-6.3/target/h/string.h has protos for
bzero, but only when #if _EXTENSION_WRS is true. So turning off
-ansi/-pedantic gets this? In my dreams?

Add support for ibv_req_notify_cq()

In discussions with Roland this week, it would probably be beneficial for us to add support for ibv_req_notify_cq() -- allowing the openib btl to block waiting for progress.

The standard ways for exploiting this would be:

  • 100% polling, no notifying (current method)
  • Some polling, then falling back to using notify if no activity occurs in a timeout
  • 100% blocking -- no polling (or perhaps only 1 poll)

The obvious questions come up about multi-btl issues, but if we get an fd back, it might not be too terrible to utilize. This could also be combined with directed interrupts -- send interrupt X to core Y to wakeup, etc.
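
For reference, the standard verbs sequence for the blocking cases looks roughly like the sketch below (assuming the CQ was created on a completion channel); the poll-then-fall-back variant would simply call this after some amount of unsuccessful polling.

    /* Sketch of the standard verbs blocking pattern: arm the CQ, poll once
     * more to catch completions that raced with arming, then sleep in
     * ibv_get_cq_event() until the HCA raises an interrupt. */
    #include <infiniband/verbs.h>

    static int wait_for_completion(struct ibv_comp_channel *channel,
                                   struct ibv_cq *cq)
    {
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        struct ibv_wc wc;
        int n;

        if (ibv_req_notify_cq(cq, 0)) {
            return -1;
        }

        /* A completion may have arrived before the notification request
         * took effect, so poll one last time before blocking. */
        n = ibv_poll_cq(cq, 1, &wc);
        if (0 != n) {
            return n;                 /* got a completion (or an error) */
        }

        if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx)) {
            return -1;
        }
        ibv_ack_cq_events(ev_cq, 1);
        return ibv_poll_cq(ev_cq, 1, &wc);
    }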

Have ompi_proc_t destructor call pml.del_procs

The ompi_proc_t destructor (see source:/trunk/ompi/proc/proc.c) really should call the del_procs method on the current PML so that the PML can release all resources associated with that peer process.

This is not currently done, mainly because it involves a lot of untested code paths. It ''should'' work ok, because we do call pml del_procs during MPI_FINALIZE. But there are at least some thread safety issues involved (e.g., ensure that one thread calling del_procs won't hose ongoing pml actions in another thread), and at least some BTLs (incorrectly) treat del_procs as a no-op.

These issues should be investigated and fixed.

Implement MPIX_GREQUEST_START

Rob Latham proposed MPIX_GREQUEST_START as described in "Extending the MPI-2 Generalized Request Interface":http://www-unix.mcs.anl.gov/~thakur/papers/grequest-redesign.pdf (PDF). Prototypes are in the paper (no use reproducing them here).

The main improvement is allowing generalized requests to specify their own progression function that will be invoked by MPI's progress engine. This MPIX_GREQUEST_START function is in MPICH2 and is now used by newer versions of ROMIO.

Allow different btl_openib_receive_queues values

The OMPI v1.3 series disallows specifying different receive_queues values per HCA -- see the thread starting here:

http://www.open-mpi.org/community/lists/devel/2008/05/3896.php

We may want to revisit this topic in future versions.

IPv6 support in rdmacm cpc of openib

Currently, rdmacm cannot handle an IPv6 address being passed in (regardless of whether the adapter can handle it or not). This should be fixed as soon as possible.

When rdmacm is setting up the listener threads (via the cbc_query call), it must verify that the hca/port in question has a valid IP address. To make this check, it calls mca_btl_openib_rdma_get_ipv4addr (in ompi/mca/btl/openib/btl_openib_iwarp.c), which queries the IPv4 address. This needs to be expanded to check for IPv6 addresses.

Also, rdmacm should check to see if IPv6 is supported by the adapter and handle failure of rdma_bind_addr based on that parameter. Based upon the value of the IP address and the value of the attribute max_raw_ipv6_qp, we can set the sin_family to AF_INET or AF_INET6 when creating the listener.
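
A sketch of what the listener setup might look like once the address family is chosen (error handling and the max_raw_ipv6_qp check are omitted; the helper below is illustrative, not the actual rdmacm CPC code):

    /* Sketch: bind an rdmacm listener with either an IPv4 or an IPv6
     * address, depending on what the interface reported.  Real code would
     * also consult the device attributes (e.g. max_raw_ipv6_qp) before
     * choosing AF_INET6. */
    #include <rdma/rdma_cma.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>

    static int bind_listener(struct rdma_cm_id *id, int use_ipv6,
                             const void *addr_bytes, uint16_t port)
    {
        if (use_ipv6) {
            struct sockaddr_in6 sin6;
            memset(&sin6, 0, sizeof(sin6));
            sin6.sin6_family = AF_INET6;
            sin6.sin6_port = htons(port);
            memcpy(&sin6.sin6_addr, addr_bytes, sizeof(sin6.sin6_addr));
            return rdma_bind_addr(id, (struct sockaddr *) &sin6);
        } else {
            struct sockaddr_in sin;
            memset(&sin, 0, sizeof(sin));
            sin.sin_family = AF_INET;
            sin.sin_port = htons(port);
            memcpy(&sin.sin_addr, addr_bytes, sizeof(sin.sin_addr));
            return rdma_bind_addr(id, (struct sockaddr *) &sin);
        }
    }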

Not checking OF device attributes when making QP's, etc.

Galen noticed in a code review of the openib BTL that there are a few places where we are creating WQE's and other items without checking the attributes on the device to see how many it can actually handle. So we may attempt to exceed the limits unintentionally, and then get a generic error back from the creation function. These types of errors should either be avoidable or be able to give better warning/error messages because we can detect exactly what the problem is if we're a little more thorough / defensive in the setup.
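
The kind of defensive setup being asked for is roughly the following sketch (illustrative only; the real openib BTL has many more limits to check than just max_qp_wr):

    /* Sketch of the defensive check described above: query the device
     * limits and clamp the requested work-queue depth rather than letting
     * QP creation fail with an opaque error. */
    #include <infiniband/verbs.h>

    static int clamped_qp_depth(struct ibv_context *ctx, int wanted_depth)
    {
        struct ibv_device_attr attr;

        if (ibv_query_device(ctx, &attr)) {
            return -1;                /* query failed; let the caller cope */
        }
        if (wanted_depth > attr.max_qp_wr) {
            /* a good spot for a descriptive warning before clamping */
            wanted_depth = attr.max_qp_wr;
        }
        return wanted_depth;
    }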

I'm temporarily assigning this ticket to Galen so that he can cite some specifics. Who actually fixes them is a different issue. :-)

Heterogeneous Multi-port OpenIB fails various IMB tests

Between PPC64 and X86-64 using two ports of Open IB we fail a number of IMB benchmarks. This does not occur on the 1.2 branch with heterogeneous patches.


#----------------------------------------------------------------
# Benchmarking Allreduce 
# #processes = 2 
# ( 2 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.07         0.17         0.12
            4         1000        17.23        17.24        17.24
            8         1000        18.78        18.79        18.78
           16         1000        19.19        19.21        19.20
           32         1000        21.32        21.34        21.33
           64         1000        25.07        25.09        25.08
          128         1000        34.13        34.15        34.14
          256         1000        49.14        49.17        49.15
          512         1000        71.47        71.52        71.50
         1024         1000       122.10       122.17       122.14
         2048         1000       219.91       220.04       219.97
         4096         1000       417.00       417.24       417.12
m in the input buffer 2 bytes
unpack there is still room in the input buffer 2 bytes
unpack there is still room in the input buffer 2 bytes
unpack there is still room in the input buffer 2 bytes
unpack there is still room in the input buffer 2 bytes

Routed RML message ordering problem

From Ralph and Jeff: With the current RML retransmission scheme, we have a possible message ordering problem. That is, if the RML queues up a message to transmit later, lots of other messages may (and frequently do) get transmitted successfully before the event timer wakes up and transmits the one message that was queued up.

This does not seem to affect overall functionality -- everything seems to work fine. But I'm wondering if that's a side effect of how we use the RML/OOB and not really because it's bug-free. Specifically, I think that this RML queueing functionality is currently triggered either during an all-to-one or one-to-all kind of communication pattern. So the ordering doesn't really matter.

But consider the following scenario:

  • process A RML sends message 1 to process B
  • the route map is not yet setup, so the message gets queued
  • the route map arrives and process A sets itself up properly
  • process A RML sends message 2 to process B on the same OOB tag as message 1
  • process B receives message 2
  • process A's event timer wakes up and transmits message 1
  • process B receives message 1

Process B clearly gets these messages out of order.

I believe this could be very bad as we move away from an event-driven system to one that is message-driven as message ordering could have a major impact on behavior.

What we would really like to see happen, IMHO, is for the queued message to be delivered -immediately- when the contact info becomes known, and not wait for some arbitrary clock to tick down. The queued message(s) should go to the head of the line when that contact info shows up, in the order in which they were received.

Otherwise, we could get into some significant ordering issues once the next major ORTE update hits since all control logic will be sent via RML.

The current problem only shows up on startup during the allgather in modex, so it isn't a problem as (a) it is the collector that is slow to provide its contact info, and (b) the entire startup blocks on the collector getting all of the required info. I -suspect- we might therefore ride through this problem, but again, it is a "bug" that could easily bite us. Just can't predict where/when at the moment.

Show MPI connectivity map during MPI_INIT

It has long been discussed, and I swear there was a ticket about this
at some point but I can't find it now. So I'm filing a new one --
close this as a dupe if someone can find an older one.


OMPI currently uses a negative ACK system to indicate if high-speed
networks are not used for MPI communications. For example, if you
have the openib BTL available but it can't find any active ports in a
given MPI process, it'll display a warning message.

But some users want a ''positive'' acknowledgement of what networks
are being used for MPI communications (this can also help with
regression testing, per a thread on the MTT mailing list). HP MPI
offers this feature, for example. It would be nice to have a simple
MCA parameter that will cause MCW rank 0 to output a connectivity map
during MPI_INIT.

Complications:

  • In some cases, OMPI doesn't know which networks will be used for
    communications with each MPI process peer; we only know which ones
    we'll try to use when connections are actually established (per
    OMPI's lazy connection model for the OB1 PML). But I think that
    even outputting this information will be useful.
  • Connectivity between MPI processes are likely to be non-uniform.
    E.g., MCW rank 0 may use the sm btl to communicate with some MPI
    processes, but a different btl to communicate with others. This is
    almost certainly a different view than other processes have. The
    connectivity information needs to be conveyed on a process-pair
    basis (e.g., a 2D chart).
  • Since we have to span multiple PMLs, this may require an addition
    to the PML API.

A first cut could display a simple 2D chart of how OMPI thinks it may
send MPI traffic from each process to each process. Perhaps something
like (OB1 6 process job, 2 processes on each of 3 hosts):

MCW rank 0     1     2     3     4     5
0        self  sm    tcp   tcp   tcp   tcp
1        sm    self  tcp   tcp   tcp   tcp
2        tcp   tcp   self  sm    tcp   tcp
3        tcp   tcp   sm    self  tcp   tcp
4        tcp   tcp   tcp   tcp   self  sm
5        tcp   tcp   tcp   tcp   sm    self

Note that the upper and lower triangular portions of the map are the
same, but it's probably more human-readable if both are output.
However, multiple built-in output formats could be useful, such as:

  • Human readable, full map (see above)
  • Human readable, abbreviated (see below for some ideas on this)
  • Machine parsable, full map
  • Machine parsable, abbreviated

It may also be worthwhile to investigate a few heuristics to compress
the graph where possible. Some random ideas in this direction:

  • The above example could be represented as:
MPI connectivity map, listed by process:
X->X: self
X<->X+1, X in {0,2,4}: sm
other: tcp
  • Another example:
MPI connectivity map, listed by process:
X->X: self
other: tcp
  • Another example:
MPI connectivity map, listed by process:
all: CM PML, MX MTL
  • Perhaps something could be done with "exceptions" -- e.g., where
    the openib BTL is being used for inter-node connectivity ''except''
    for one node (where IB is malfunctioning, and OMPI fell back to
    TCP) -- this is a common case that users/sysadmins want to detect.

Another useful concept might be to show some information about each
endpoint in the connectivity map. E.g., show a list of TCP endpoints
on each process, by interface name and/or IP address. Similar for
other transports. This kind of information can show when/if
multi-rail scenarios are active, etc. For example:

MCW rank 0         1         2         3         4         5
0        self      sm        tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0
1        sm        self      tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0
2        tcp:eth0  tcp:eth0  self      sm        tcp:eth0  tcp:eth0
3        tcp:eth0  tcp:eth0  sm        self      tcp:eth0  tcp:eth0
4        tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0  self      sm
5        tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0  sm        self

With more information such as interface names, compression of the
output becomes much more important, such as:

MPI connectivity map, listed by process:
X->X: self
X<->X+1, X in {0,2,4}: sm
other: tcp:eth0,eth1

Note that these ideas can certainly be implemented in stages; there's
no need to do everything at once.

Improve ompi_info's component detection system

ompi_info has a hard-coded list of frameworks that must be manually updated every time a new framework has been added. This has long-since been noted as an abstraction violation and "icky". Some idle chat at yesterday's MPI Forum meeting between George, Josh, and myself resulted in thinking of a way to fix this problem and make ompi_info be much more generic:

  • Have autogen.sh create a C array of strings of all framework names that is instantiated somewhere (probably in OPAL)
  • Add a new MCA base "open all the components" routine (that probably uses much of the same infrastructure as the current "open this framework's components") that does the following:
    • Traverses mca_base_component_paths and opens ''all'' components that it finds (regardless of type)
    • Traverses the autogen.sh-created list of frameworks and lt_dlsym's looking for the framework's symbol of statically linked components
    • Move all framework-level MCA parameter registration out of mca_open() functions to a new function: mca_register_mca(). lt_dlsym for this symbol for each framework, and call it if it exists
  • ompi_info therefore will get a list of ''all'' components which can then be sorted and displayed as appropriate

It may be desirable as part of this process to also separate MCA base component MCA parameter registration from the "open" function (because the "open" function does have a distinct purpose [to be a first-line place to check whether the component wants to run] that is currently munged together with registering component-level MCA parameters). For backwards compatibility, the MCA base can continue to call the component open function if an MCA register function is not available.
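
A rough sketch of the "open everything" step built on libltdl (illustrative only; the real MCA base keeps far more bookkeeping, and the symbol name shown follows the usual mca_<framework>_<component>_component convention):

    /* Sketch: dlopen a component file and look up its well-known component
     * symbol.  Assumes lt_dlinit() has already been called; the real MCA
     * base does considerably more bookkeeping than this. */
    #include <ltdl.h>
    #include <stdio.h>

    static void *open_component(const char *filename, const char *framework,
                                const char *component)
    {
        char symbol[256];
        lt_dlhandle handle;
        void *sym;

        handle = lt_dlopenext(filename);   /* tries the usual extensions */
        if (NULL == handle) {
            fprintf(stderr, "could not open %s: %s\n", filename, lt_dlerror());
            return NULL;
        }

        snprintf(symbol, sizeof(symbol), "mca_%s_%s_component",
                 framework, component);
        sym = lt_dlsym(handle, symbol);
        if (NULL == sym) {
            lt_dlclose(handle);            /* not a component we recognize */
        }
        return sym;
    }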

loopback verbs connections: per adapter setting

Most current iWARP devices (June 2008) cannot make connections between two processes on the same server. As such, btl_openib.c has been set to mark all local peers on iWARP transports as "unreachable". However, there is nothing in the iWARP spec that prevents loopback connections; it's an implementation decision.

The real solution is to add another parameter to the INI file that indicates whether a given adapter can handle loopback connections or not. This is likely not too important to do until an iWARP NIC supports loopback connections or an IB NIC doesn't support loopback connections.

Provide warnings if unknown MCA params used

It has long been a problem that users may supply incorrect or unknown MCA parameters and therefore get incorrect or undesired behavior. For example, a user may misspell an MCA parameter name on the mpirun command line and OMPI effectively ignores it because the name that the user provides is effectively never "seen" by the MCA base. That is, there is no error checking in the MCA base to see if there are MCA parameters supplied that do not exist.

While such consistency checking would be extremely helpful to users, it is a fairly difficult problem to solve. Here's a recent mail that I sent on the topic:


I think we all agree that this is something that would be Very Good to have. The reason that it hasn't been done is because I'm not sure how to do it. :-( Actually, more specifically, I can think of several complex ways to do it, but they're all quite unattractive.

The problem is that we don't necessarily have global knowledge of all MCA parameters. Consider this example:

mpirun --mca pls_tm_foo 1 --mca btl_openib_foo 1 -np 4 a.out

These MCA params are going to be visible to three types of processes:

  • mpirun
  • orted
  • a.out (assumedly an MPI process)

So how do we tell mpirun and orted that they should ignore the btl_openib MCA parameter, and tell a.out that it should ignore the pls_tm MCA parameter? There are other, similar corner cases (e.g., what if some node doesn't have the openib BTL component, but others do?).

There are a few ways to do this that I can think of:

  1. each app registers frameworks that it is and is not interested in -- assuming that all MCA params follow the prefix rule, we can parse out which params in the environment belong to which framework (ugh) and then find a) any that fall outside of that (e.g., mis-typed frameworks), and b) any that are in the frameworks of interest that do not match registered params. This doesn't handle all corner cases, though (e.g., openib on some nodes but not all).
  2. some entity (mpirun, most likely) does an ompi_info-like "open all frameworks" and can directly check all MCA params right away. This is an abstraction violation because orterun will be opening frameworks that it should have no knowledge of (e.g., MPI frameworks).
  3. some entity (mpirun, most likely) fork/exec's ompi_info in a special mode that checks for invalid MCA params in the environment (because it will inherit the params for mpirun). This is nice because then mpirun doesn't have to open all the frameworks, but it's an abstraction violation because orterun doesn't know about ompi_info (different layers).

So the first one is the only one that is actually viable (i.e., doesn't cause abstraction violation). But it's still klunky, awkward, and doesn't handle all cases. If anyone has any better ideas, I'm all ears...


Since writing the above e-mail, I had another idea -- address the common case and provide a workaround for the others. Specifically, do not worry about the case where some nodes have component A and others do not. Hence, in this scenario if a user supplies an MCA param for component A, the processes on some nodes will be ok with it (because they have component A), but others will consider it "unrecognized" (because they do not have component A), and will print a warning/error -- potentially causing the job to fail.

To address this, we can add [yet another] MCA parameter to disable this MCA parameter checking. The default value will be to enable MCA parameter checking, but if a user knows what they're doing, or if they fall into the corner case above, they can disable MCA parameter checking and be "good enough."

It's not perfect and it certainly doesn't cover all cases, but it does cover today's common case (where all nodes are homogeneous) and would probably be a good step forward.
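
The common-case check could be as simple as scanning the environment for OMPI_MCA_-prefixed variables and comparing them against the registered parameter names. A sketch (the real parameter registry is obviously richer than a string list):

    /* Sketch of the common-case check: warn about any OMPI_MCA_* environment
     * variable whose name was never registered.  "registered" is just a
     * NULL-terminated list of known parameter names for illustration. */
    #include <stdio.h>
    #include <string.h>

    extern char **environ;

    static void warn_unknown_mca_params(const char *const registered[])
    {
        static const char prefix[] = "OMPI_MCA_";
        char name[256];
        char **env;
        int i, known;

        for (env = environ; NULL != *env; ++env) {
            const char *eq;

            if (0 != strncmp(*env, prefix, sizeof(prefix) - 1)) {
                continue;
            }
            eq = strchr(*env, '=');
            if (NULL == eq) {
                continue;
            }
            snprintf(name, sizeof(name), "%.*s",
                     (int) (eq - *env - (sizeof(prefix) - 1)),
                     *env + sizeof(prefix) - 1);

            for (known = 0, i = 0; NULL != registered[i]; ++i) {
                if (0 == strcmp(name, registered[i])) {
                    known = 1;
                    break;
                }
            }
            if (!known) {
                fprintf(stderr, "WARNING: unrecognized MCA parameter: %s\n",
                        name);
            }
        }
    }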

PGI compiler f90 issues with -g

This occurs with the upcoming PGI 7.0 compiler (although I think the issue is the same with the 6.2 series as well). It is reported here with a trivial F90 module, although the issue is identical with the MPI F90 bindings:


Short version: if I compile a F90 module with "-g" enabled, symbols can't be found in it when I try to use that module. If I don't use "-g", everything works fine.

It's best shown through example. Here's the compile with "-g" enabled:

[10:45] svbu-mpi:~/tmp/subdir % pgf90 -V

pgf90 7.0-2a 64-bit target on x86-64 Linux
Copyright 1989-2000, The Portland Group, Inc.  All Rights Reserved.
Copyright 2000-2006, STMicroelectronics, Inc.  All Rights Reserved.
[10:45] svbu-mpi:~/tmp/subdir % cat test_module.f90
module OMPI_MOD_FLAG

  type OMPI_MOD_FLAG_TYPE
    integer :: i
  end type OMPI_MOD_FLAG_TYPE

end module OMPI_MOD_FLAG
[10:45] svbu-mpi:~/tmp/subdir % pgf90 -c test_module.f90 -g
[10:45] svbu-mpi:~/tmp/subdir % ls -l
total 16
-rw-rw-r--  1 jsquyres named  386 Feb  2 10:45 ompi_mod_flag.mod
-rw-rw-r--  1 jsquyres named   71 Feb  2 10:42 program.f90
-rw-rw-r--  1 jsquyres named  166 Feb  2 10:40 test_module.f90
-rw-rw-r--  1 jsquyres named 2800 Feb  2 10:45 test_module.o
[10:45] svbu-mpi:~/tmp/subdir % cat program.f90
program f90usemodule
  use OMPI_MOD_FLAG
end program f90usemodule

[10:45] svbu-mpi:~/tmp/subdir % pgf90 program.f90 -I. -g
/tmp/pgf90pGObTVjlE0LC.o(.debug_info+0x87): undefined reference to `..Dm_ompi_mod_flag'
[10:45] svbu-mpi:~/tmp/subdir %

And here's the same stuff without the -g:

[10:48] svbu-mpi:~/tmp/subdir % pgf90 -c test_module.f90 -O2
[10:48] svbu-mpi:~/tmp/subdir % ls -l
total 16
-rw-rw-r--  1 jsquyres named  386 Feb  2 10:48 ompi_mod_flag.mod
-rw-rw-r--  1 jsquyres named   71 Feb  2 10:42 program.f90
-rw-rw-r--  1 jsquyres named  166 Feb  2 10:40 test_module.f90
-rw-rw-r--  1 jsquyres named 1256 Feb  2 10:48 test_module.o
[10:48] svbu-mpi:~/tmp/subdir % pgf90 program.f90 -I. -O2
[10:48] svbu-mpi:~/tmp/subdir %

Just for fun, let's try the other permutations -- (-g in the module / not in the module, and -g in the program / not in the program):

-g in the module -- works fine:

[10:49] svbu-mpi:~/tmp/subdir % pgf90 -c test_module.f90 -O2 -g
[10:49] svbu-mpi:~/tmp/subdir % pgf90 program.f90 -I. -O2
[10:49] svbu-mpi:~/tmp/subdir %

-g in the program -- doesn't work:

[10:49] svbu-mpi:~/tmp/subdir % pgf90 -c test_module.f90 -O2
[10:50] svbu-mpi:~/tmp/subdir % pgf90 program.f90 -I. -O2 -g
/tmp/pgf90uzUb85jXTwvY.o(.debug_info+0x87): undefined reference to `..Dm_ompi_mod_flag'
[10:50] svbu-mpi:~/tmp/subdir %

I'm not setting a milestone on this; I don't know if there's anything that we actually want to do about this (perhaps this is just a FAQ/documentation issue?). But I wanted to record the issue somewhere.

Allow flexible stdin routing to COMM_SPAWN'ed jobs

Related to but slightly different than https://svn.open-mpi.org/trac/ompi/ticket/1050:

When we COMM_SPAWN, where does stdin for the child process come from? I think that there should be [at least] 4 options (selectable via MPI_Info keys):

  1. Get stdin from the HNP. Note that this is a bit weird: any stdin from the HNP will be sent to '''both''' the parent job ''and'' the child job.
  2. Get stdin from the stdout of the single parent process that called orte_spawn. This would be like standard unix pipes, a la "foo | bar", where foo's output is sent to the input of bar.
  3. Get stdin from the stdout of the entire parent job that called orte_spawn. This is similar to the previous option, but note that ''all'' stdout from the entire job will be sent to the stdin of the child.
  4. Have stdin tied to /dev/null. This is effectively what happens in OMPI <=v1.2, so I think that this should be the default.

Note that I didn't mention ''where'' in the child job the stdin flows -- it could be just to vpid 0, or it could be to one or more other processes, or ... I think that's what ticket https://svn.open-mpi.org/trac/ompi/ticket/1050 is about, and is a slightly different issue than this ticket.
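
As a rough illustration of the MPI_Info-based selection, here is a minimal sketch of what the parent side might look like. The key name "ompi_stdin_routing", the value "parent_proc", and the "./child" executable are hypothetical placeholders for whatever we end up defining; only the MPI calls themselves (MPI_Info_create/set, MPI_Comm_spawn) are standard.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Comm child;
        MPI_Info info;
        int errcodes[2];

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        /* Hypothetical key/value: ask that this parent's stdout be routed
         * to the spawned children's stdin (option 2 above). */
        MPI_Info_set(info, "ompi_stdin_routing", "parent_proc");

        /* "./child" is a placeholder executable for the spawned job. */
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 2, info, 0,
                       MPI_COMM_SELF, &child, errcodes);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }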

Make new MCA fw open call to open mandatory components

In talks with various developers, it seems like we have several places in the code base that have ''required'' components. For example:

  • BTL: needs "self"
  • Coll: needs "self" and "basic"
  • RAS: needs "dash_host" and "localhost" (sorta)

We've also seen users screw this up -- most often in the btl or coll cases, where they do something like this:

shell$ mpirun --mca btl openib ...

And then they get confused when they try to send an MPI message to themselves and the PML/BTL barfs because it has no BTL path to send to itself.

One possible way to make this nicer is to either modify the existing mca_base_components_open() function to take another argument, or to leave that interface alone and add a new function -- perhaps named mca_base_components_open_required() -- that is the same as mca_base_components_open() except for the new parameter. This new argument would be a list of components that ''have'' to be opened, and are therefore excluded from the "--mca" selection criteria.

We can add error checking in there such that if someone runs:

shell$ mpirun --mca coll ^basic ...

and the coll base lists "basic" in the "required" list, a friendly error message can be printed, etc. You get the idea.

The point is that this might help a bunch of the code base become simpler if it can be assumed that certain components are always available (it would simplify a bunch of the coll base, for example). It would also satisfy the Law of Least Astonishment for users running:

shell$ mpirun --mca btl openib ...

This kind of scenario would then work as the user expects because the BTL base would silently load "self" in the background (note that this is up to the framework -- so in an MTL situation, we wouldn't be calling btl_base_open() at all, and this mandatory loading of the BTL self component wouldn't apply, etc.).
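
To make the proposal a bit more concrete, here is a self-contained sketch (not the real MCA base code, and deliberately not using the real mca_base_components_open() signature) of the selection rule it implies: required components are always opened, and explicitly excluding one produces a friendly error instead of a silent failure. The helper names (should_open(), listed()) and the hard-coded required[] list are purely illustrative.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Crude comma-separated membership test for a "--mca" spec string. */
    static bool listed(const char *spec, const char *name)
    {
        size_t len = strlen(name);
        for (const char *p = spec; (p = strstr(p, name)) != NULL; p += len) {
            bool starts = (p == spec || p[-1] == ',' || p[-1] == '^');
            bool ends = (p[len] == '\0' || p[len] == ',');
            if (starts && ends) return true;
        }
        return false;
    }

    /* Returns true if "name" should be opened under "spec"; required
     * components are always opened, and excluding one is an error. */
    static bool should_open(const char *spec, const char *name,
                            const char *required[])
    {
        bool is_required = false;
        for (int i = 0; required[i] != NULL; ++i) {
            if (0 == strcmp(required[i], name)) { is_required = true; break; }
        }
        bool negated = (spec[0] == '^');
        bool in_list = listed(spec, name);

        if (is_required) {
            if (negated && in_list) {
                fprintf(stderr, "Error: component \"%s\" is required by this "
                        "framework and cannot be excluded\n", name);
            }
            return true;              /* always open required components */
        }
        return negated ? !in_list : in_list;
    }

    int main(void)
    {
        const char *required[] = { "self", "basic", NULL };
        printf("%d\n", should_open("^basic", "basic", required)); /* error, but opened */
        printf("%d\n", should_open("openib", "self", required));  /* opened anyway */
        printf("%d\n", should_open("openib", "tcp", required));   /* not opened */
        return 0;
    }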

Add several datatypes to Fortran and C++ bindings

MPI-2.1 states:

MPI_LONG_LONG_INT, MPI_LONG_LONG (as synonym), MPI_UNSIGNED_LONG_LONG, MPI_SIGNED_CHAR, and MPI_WCHAR are moved from optional to official and they are therefore defined for all three language bindings.

We have all of these types in mpi.h, but a quick glance shows that we don't have them in mpif.h and some are missing from the C++ bindings.
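
For reference, the C side already works today (the handles are in mpi.h, per the above); a minimal check along these lines should compile and run, and the missing piece is simply the matching handles in mpif.h and the C++ bindings.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int size;

        MPI_Init(&argc, &argv);

        /* These datatypes are official as of MPI-2.1. */
        MPI_Type_size(MPI_UNSIGNED_LONG_LONG, &size);
        printf("MPI_UNSIGNED_LONG_LONG: %d bytes\n", size);

        MPI_Type_size(MPI_SIGNED_CHAR, &size);
        printf("MPI_SIGNED_CHAR: %d bytes\n", size);

        MPI_Type_size(MPI_WCHAR, &size);
        printf("MPI_WCHAR: %d bytes\n", size);

        MPI_Finalize();
        return 0;
    }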

BTL checkpoint friendly

So far only self and tcp seem to be checkpoint-friendly. SM has a small issue, but that might be removed with a little work. All the other BTLs have to be investigated. A testing file will be added shortly once I figure out how to do it ...

Recursive behavior in ompi_mpi_abort()

Per [http://www.open-mpi.org/community/lists/devel/2007/08/2220.php this thread on the devel list], ompi_mpi_abort() may actually be invoked recursively via the progression engine. Additionally it is possible that multiple threads may invoke ompi_mpi_abort() simultaneously in a THREAD_MULTIPLE scenario. Clearly, only one thread should be allowed to do the actual "abort" processing.

Some (not-thread-safe) protection was added near the top of ompi_mpi_abort() a while ago, in the form of logic that looks like this:

    if (have_been_invoked) {
        return OMPI_SUCCESS;
    }
    have_been_invoked = true;

However, this is clearly bad because it violates assumptions elsewhere in the code that ompi_mpi_abort() will not return (i.e., Bad Things can/will happen, like segvs).

Adding protection for the THREAD_MULTIPLE scenario is probably easy enough; looping over sleep (or progress?) is probably fine.

But sleep/progress-looping is ''not'' the right solution for recursive invocations from the thread that is actually doing the abort processing because there are at least some cases where progress will not occur until control pops all the way back to the top of the progress stack.

So - what to do?
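
One possible shape of an answer is sketched below, using C11 atomics and a thread-local flag rather than real OMPI/OPAL primitives: the first thread through the compare-and-swap does the abort processing, late-arriving threads park in a sleep loop so the "does not return" assumption holds, and a recursive call from the aborting thread itself is detected and bails out hard. Whether _exit() is actually an acceptable fallback for the recursive case is exactly the open question above; it is shown only as a placeholder.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <unistd.h>

    static atomic_bool abort_in_progress = false;
    static _Thread_local bool i_am_aborting = false;

    static void sketch_mpi_abort(int errcode)
    {
        if (i_am_aborting) {
            /* Recursive invocation from the thread that is already doing the
             * abort (e.g. re-entered via the progress engine): we cannot wait
             * for ourselves, so bail out hard.  Placeholder only. */
            _exit(errcode);
        }

        bool expected = false;
        if (!atomic_compare_exchange_strong(&abort_in_progress, &expected, true)) {
            /* Another thread won the race and is doing the abort; never
             * return, so callers' "does not return" assumption still holds. */
            while (1) {
                sleep(1);
            }
        }

        i_am_aborting = true;

        /* ... the real abort processing (tearing down the job, etc.) ... */
        exit(errcode);
    }

    int main(void)
    {
        sketch_mpi_abort(1);   /* single-threaded demo: just exits with code 1 */
    }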

BTLs progress should progress up to one real message

The SM BTL component progress currently progresses only one message per connection, even if that message is really a control message (in this case an ACK message). The negative impact is that a program calling MPI_Iprobe may end up having to call it multiple times when, in certain cases, a single call should really be enough.

So for the SM BTL component progress I propose draining all control messages until we've either hit an empty fifo or a "real" message.

The other BTLs will need to be investigated to make sure they adhere to similar rules; otherwise you would end up with inconsistent results depending on which BTL you are using.
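
A sketch of the proposed rule, using toy stand-in types (frag_t, fifo_t) instead of the real sm BTL data structures: control fragments are drained until the fifo is empty or one real message is found, and at most one real message is delivered per progress call.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for sm BTL fragments: a fragment is either a
     * control message (ACK, credit return, ...) or a "real" message. */
    typedef struct { bool control; int payload; } frag_t;

    typedef struct { frag_t *items; int head; int count; } fifo_t;

    static frag_t *fifo_pop(fifo_t *f)
    {
        return (f->head < f->count) ? &f->items[f->head++] : NULL;
    }

    /* Proposed rule: drain control fragments until the fifo is empty or we
     * hit one real message; deliver at most one real message per call. */
    static int progress_one_connection(fifo_t *fifo)
    {
        frag_t *frag;
        while (NULL != (frag = fifo_pop(fifo))) {
            if (frag->control) {
                /* e.g. process the ACK and keep draining */
                continue;
            }
            printf("delivered real message %d\n", frag->payload);
            return 1;                 /* stop after one real message */
        }
        return 0;                      /* fifo drained, nothing real  */
    }

    int main(void)
    {
        frag_t frags[] = { { true, 0 }, { true, 0 }, { false, 42 }, { false, 43 } };
        fifo_t fifo = { frags, 0, 4 };
        progress_one_connection(&fifo);   /* drains two ACKs, delivers 42 */
        progress_one_connection(&fifo);   /* delivers 43 */
        return 0;
    }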

Add FAQ entry about firewalls

Two questions come up about firewalls now and again:

  1. someone mysteriously can't get OMPI to work because the TCP OOB can't connect. The solution is to disable all firewalls (e.g., Linux iptables) on all machines, or at least allow all ports to connect between machines running OMPI.
  2. someone wants to use OMPI through a firewall. Current-generation OMPI doesn't support this, but perhaps UTK will change this someday.

This has been on my to-do list for forever; perhaps by putting this as a ticket, someone will actually get around to adding this to the FAQ.

rework mca_btl_tcp_proc_accept code

In the multicluster case we've seen hanging connections, which is why I changed the acceptance rules in btl_tcp_proc.c (r18169). This change caused weird connection resets on sif and odin, unfortunately not reproducible anywhere else (for me).

JFTR, the discussion: http://www.open-mpi.org/community/lists/devel/2008/04/3711.php

I've reverted the commit in r18255 and taken a look at the code. Let me come up with a two-line fix in a few hours. I'll add tprins to this ticket for documentation purposes. The checkin will also refer to this ticket.
