open-mpi / ompi
Open MPI main development repository
Home Page: https://www.open-mpi.org
License: Other
The current preconnect code is BTL agnostic and uses send/recv from/to each proc. This has the benefit of pre-connecting any BTLs that use lazy connection establishment. The problem is that when multiple BTLs are active for a given process (i.e., there are multiple endpoints), this code only preconnects one of the BTLs and not the others.
One possible solution is to pre-connect using the BTL interface directly, looping through all the available endpoints for a proc and pre-connecting all of them.
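A conceptual sketch of that approach (all of the helper names below are hypothetical placeholders, not the actual BTL/BML API):

    /* Sketch: force connection establishment on every endpoint of every
       peer proc, rather than relying on a single send/recv per peer */
    for (i = 0; i < nprocs; ++i) {
        ompi_proc_t *proc = procs[i];
        for (j = 0; j < num_endpoints(proc); ++j) {      /* hypothetical */
            endpoint_t *ep = get_endpoint(proc, j);       /* hypothetical */
            btl_of(ep)->force_connect(ep);                /* hypothetical */
        }
    }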
APM support should always be in a background thread so that it can be handled ASAP. The async event handler can be a bit lazier.
For v1.3, the async handler and APM handler always run in a separate progression thread. For v1.3.1, we should change this strategy:
The current Fortran wrappers make calls to several f2c/c2f MPI functions. This causes any interposed PMPI library to intercept these calls erroneously (i.e., to think that the user has called these routines). Though the MPI spec http://www.mpi-forum.org/docs/mpi-11-html/node162.html#Node163 does not disallow this, it goes against the general OMPI rule of never calling an MPI function from inside the library. It is also a regression from what Sun did originally.
I've talked with Jeff about this issue, and the following is what would need to be done to fix it:
We can't assume that the PMPI functions are there because there is a --disable-mpi-profile configure switch that will turn off the PMPI layer (it's there for platforms that don't have weak symbols, like OS X -- so the PMPI layer means compiling the entire MPI layer a 2nd time, which takes a lot of time; disabling it means a much faster build [for developers]).
So you just need to convert these functions to ompi_*() functions (vs. PMPI_*() functions) and then call those instead. Then also convert the various C MPI_*_F2C/C2F() functions to call these ompi_*() functions as well -- so everything uniformly calls these functions: the MPI_*_C2F/F2C functions and the Fortran functions.
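A minimal sketch of the intended pattern (ompi_comm_f2c() here is a hypothetical internal helper name, used only for illustration):

    /* Hypothetical internal helper: does the translation but is not part
       of the MPI/PMPI interface, so an interposed profiling library never
       sees an "extra" MPI call */
    MPI_Comm ompi_comm_f2c(MPI_Fint f_comm);

    /* The public C conversion function becomes a thin wrapper... */
    MPI_Comm MPI_Comm_f2c(MPI_Fint comm)
    {
        return ompi_comm_f2c(comm);
    }

    /* ...and the Fortran wrappers call ompi_comm_f2c() directly instead of
       calling MPI_Comm_f2c() or PMPI_Comm_f2c(). */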
Per an e-mail exchange with Michael Kluskens, there appears to be benefit from making several of the MPI constants in F90 be unique types so that we can have unique interfaces that match just that sentinel. Specifically:
See http://www.open-mpi.org/community/lists/users/2006/11/2115.php.
The debugging message queue functionality will not work if the installdirs functionality is used at run-time to change the location of the OMPI installation. This is because the TV message queue functionality requires a hard-coded location that is read before main() to know where the OMPI MQS DLL is located.
It is unknown at this time how to fix this problem; something will have to be worked out with Etnus and Allinea to change how the global symbol is used (e.g., only examine it after some defined point where we have had a chance to change its value)? [shrug]
In reviewing bug https://svn.open-mpi.org/trac/ompi/ticket/176, I have determined that the locking code in source:/trunk/ompi/attribute/attribute.c may not be thread safe in all cases and needs to be audited. It was written with the best of intentions :-) but then never tested and I think there are some obscure race conditions that ''could'' happen.
For example, in ompi_attr_create_keyval(), we have the following:
    OPAL_THREAD_LOCK(&alock);
    ret = CREATE_KEY(key);
    if (OMPI_SUCCESS == ret) {
        ret = opal_hash_table_set_value_uint32(keyval_hash, *key, attr);
    }
    OPAL_THREAD_UNLOCK(&alock);
    if (OMPI_SUCCESS != ret) {
        return ret;
    }

    /* Fill in the list item */
    attr->copy_attr_fn = copy_attr_fn;
    /* ...fill in more attr->values ... */
This could clearly be a problem: we set the empty keyval in the hash, so it is available to any other thread as soon as the lock is released -- potentially ''before'' we finish setting all the values on the attr variable (which is poorly named -- it's a keyval, not an attribute).
This one problem is easily fixed (ensure that attr is fully set up before we assign it to the keyval hash), but it suggests that the rest of the attribute code should really be audited. Hence, this ticket is a placeholder to remember to audit this code because it may not be thread safe.
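For this particular case the fix would look something like the following (a sketch only; the real function fills in more fields):

    /* Fully set up the keyval *before* it becomes visible to other threads */
    attr->copy_attr_fn = copy_attr_fn;
    /* ...fill in the remaining attr-> values... */

    OPAL_THREAD_LOCK(&alock);
    ret = CREATE_KEY(key);
    if (OMPI_SUCCESS == ret) {
        ret = opal_hash_table_set_value_uint32(keyval_hash, *key, attr);
    }
    OPAL_THREAD_UNLOCK(&alock);
    if (OMPI_SUCCESS != ret) {
        return ret;
    }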
When running configure under Cygwin there is no way to force these 3 variables to anything other than the default values. On Windows, LN will not work as expected; "cp -p" should be used instead.
We have a program that tests for the size returned from MPI_Pack_external_size with the external32 data representation. It should return the same value for both 32-bit and 64-bit applications, but it is returning different values.
burl-ct-v40z-0 65 =>mpicc ext32.c -o ext32
"ext32.c", line 105: warning: shift count negative or too big: << 32
burl-ct-v40z-0 66 =>mpirun -np 2 ext32
First test passed
Second test passed
Third test passed
ext32: PASSED
burl-ct-v40z-0 67 =>mpicc -xarch=amd64 ext32.c -o ext32_amd64
burl-ct-v40z-0 68 =>mpirun -np 2 ext32_amd64
First test passed
Second test failed. Got size of 80, expected 40
Third test failed. Got size of 6400, expected 3200
[burl-ct-v40z-0:13864] *** An error occurred in MPI_Pack_external
[burl-ct-v40z-0:13864] *** on communicator MPI_COMM_WORLD
[burl-ct-v40z-0:13864] *** MPI_ERR_TRUNCATE: message truncated
[burl-ct-v40z-0:13864] *** MPI_ERRORS_ARE_FATAL (goodbye)
burl-ct-v40z-0 69 =>
Terry and I were talking about the possibility of having per-job prolog and epilog steps in the orted. That is, an MCA parameter that identifies an argv to run before the first local proc of a job is launched on the node and after the last local proc of a job has completed. Typical argv would usually be a local script (perhaps to perform some site-specific administrative stuff). If the argv for the prolog/epilog is blank (which would be the default), then nothing would be launched for these steps. Hence, these would be hooks available to sysadmins if they want to use them.
I'm guessing/assuming that this would not be difficult to do -- it's mainly a matter of:
It ''might'' be useful to also have the same prolog/epilog hooks for each process in a job on the host as well. [shrug]
I'm initially marking this as a 1.3 milestone, but have no real requirement for it in v1.3 -- it seems like an easy / neat / useful idea, but there is no ''need'' to have it in v1.3. It could be pushed forward.
It was decided a while ago to rename the openib BTL to be "ofrc" (!OpenFabrics using reliable connections).
This is mostly menial labor, but there is one significant problem: we '''must''' have backwards compatibility for all the "openib" MCA parameter names because they've been on our web site and mailing list posts for (literally) years. For example, the following must work:
shell$ mpirun --mca btl openib,self ...
shell$ mpirun --mca btl_some_well_known_mca_param foo ...
This part is likely to be a bit harder than the menial labor of simply renaming the directory and all the symbols from "openib_" to "ofrc_", because it will likely involve adding functionality to the MCA parameter engine. Care must be taken with this, of course, because the MCA parameter engine is kinda central to, well, everything. :-)
Paul Hargrove had some excellent suggestions on the devel list about this kind of stuff; be sure to see http://www.open-mpi.org/community/lists/devel/2007/10/2394.php.
I'm initially assigning this ticket to Andrew Friedley because he was foolish enough to bring it up on the mailing list. ;-)
Filing this mainly so we don't forget about it...
On Big Red we cannot always launch jobs from the head node to the remote nodes. This seems to be due to the oob not finding the right communication paths.
The networking on Big Red is a bit confusing. There are 3 networks:
If the compute nodes are in the same cabinet as the head node, we use the cabinet GigE network and are fine. If we launch on a backend node, we find and use the myrinet network (or force it to use the global GigE network) and are fine.
However, if we launch from the head node to nodes which are not in the same cabinet, we do not automatically find the correct network and simply hang. I can get it to launch correctly if I pass "-mca oob_tcp_include eth3,eth1" (the global GigE interfaces on the head node and the compute nodes, respectively).
This doesn't seem to be an issue for others, and since one normally isn't supposed to launch jobs from the head node of Big Red, I'm putting this to 1.3.
Per soon-to-be-added items in the openfabrics portion of the FAQ, we have explanations of the openib/ob1 behavior regarding which sending protocol is used (i.e., the "tuning long message behavior" items). Pasha suggests that it would be good to have an overall flowchart that shows how the protocols are chosen.
Attached are some images from Voltaire MPI docs that may be good starting points for such a diagram.
We have observed hangs when running applications on Solaris. It appears that this is because of the use of event ports.
Here is an example of the stack trace when it hangs.
alamodome 43 =>pstack 1964 1966
1964: IMB-MPI1.trunk barrier
fe6c060c lwp_yield (0, 1, fe25d134, fe25ce58, 4, 0) + 8
fef9e210 opal_progress (ff06f680, 0, ff06f688, 0, ff06f67c, 1) + 12c
fe5150f4 barrier (0, fe52ce9c, fe52e9b9, fe51ab60, fe51aaa0, ff252c10) + 394
fe887ac0 ompi_mpi_init (1b4, fe2a7568, 0, 408, fee7ca4c, fed18d28) + 7e8
fea19ad4 MPI_Init (ffbff82c, ffbff830, fee8072d, b38, fee7ca4c, 35450) + 160
00012830 main (2, ffbff84c, ffbff858, 2a800, ff3a0100, ff3a0140) + 10
000123f8 _start (0, 0, 0, 0, 0, 0) + 108
Here it is running with an env var set so we can see the type of polling being used.
burl-ct-v440-2 140 =>mpirun -x EVENT_SHOW_METHOD -host burl-ct-v440-3 -np 4 -mca btl self,sm,tcp bcast
[msg] libevent using: poll
[msg] libevent using: event ports
[msg] libevent using: event ports
[msg] libevent using: event ports
[msg] libevent using: event ports
And if we change it to use devpoll, poll, or select, it works.
burl-ct-v440-2 141 =>mpirun -x EVENT_SHOW_METHOD -host burl-ct-v440-3 -np 4 -mca opal_event_include poll bcast
[msg] libevent using: poll
[msg] libevent using: poll
[msg] libevent using: poll
[msg] libevent using: poll
[msg] libevent using: poll
Starting MPI_Bcast...
All done.
All done.
All done.
All done.
And here is the case of disabling event ports and letting the library pick the next available method.
burl-ct-v440-2 147 =>setenv EVENT_NOEVPORT
burl-ct-v440-2 148 =>mpirun -x EVENT_NOEVPORT -x EVENT_SHOW_METHOD -host burl-ct-v440-3 -np 4 bcast
[msg] libevent using: poll
[msg] libevent using: devpoll
[msg] libevent using: devpoll
[msg] libevent using: devpoll
[msg] libevent using: devpoll
Starting MPI_Bcast...
All done.
All done.
All done.
All done.
We only saw this on our debuggable builds. We did not see it with our optimized builds. It is not clear what difference in the configure is triggering this.
Here is the configure line that triggers the problem.
../configure --with-sge --disable-io-romio --enable-orterun-prefix-by-default --enable-heterogeneous --enable-trace --enable-debug --enable-shared --enable-mpi-f90 --with-mpi-f90-size=trivial --without-threads --disable-mpi-threads --disable-progress-threads CFLAGS="-g" FFLAGS="-g" --prefix=/workspace/rolfv/ompi/sparc/trunk/release --libdir=/workspace/rolfv/ompi/sparc/trunk/release/lib --includedir=/workspace/rolfv/ompi/sparc/trunk/release/include --with-wrapper-ldflags="-R/workspace/rolfv/ompi/sparc/trunk/release/lib -R/workspace/rolfv/ompi/sparc/trunk/release/lib/sparcv9" CC=cc CXX=CC F77=f77 F90=f90 --enable-cxx-exceptions
Per the RFC discussed in this thread:
http://www.open-mpi.org/community/lists/devel/2008/05/3845.php
We are suggesting adding "none" and "all" keywords for mca_base_open().
Need to have FAQ entries about the various tuning options for the sm btl.
(I've had this on my personal to-do list for forever; if I move it to a global to-do list, there's at least a slightly smaller chance that someone will have the time/ability to do it...)
As discussed in https://svn.open-mpi.org/trac/ompi/ticket/1207, implement a "better" MPI preconnect function (https://svn.open-mpi.org/trac/ompi/ticket/1207 encompassed 2 ideas: "print the MPI connection map" and "better MPI preconnect" -- so I'm splitting the preconnect stuff out into its own ticket for clarity). Copied from the old ticket:
= New "preconnect all" functionaliy =
Currently, the urm RMGR component has the following IOF setup
hard-coded in it:
orterun should grow some options to allow alternate IOF wireup
schemes. Some potentially worthwhile schemes include:
To avoid scalability problems, this wireup scheme should be encoded in
the app context or some other data that is xcast out to all the
orteds (and yes, this is fine that this is orted-specific
functionality) so that acting on the IOF wireup strategy does not
require any additional control messages in IOF -- if all processes in
the job ''know'' what the wireup strategy is, they can just setup
local data structures to reflect that and be done (assuming that
everyone else is also doing the same).
This would also allow fixing a minor code discrepancy in the ODLS
default component. Currently, it publishes stdin (if relevant),
stdout, and stderr. But it only ''unpublishes'' stdin. The reason
for this is scalability: since stdin is only sent to one
process, publishing and unpublishing it only requires one IOF control
message (each). Publishing SOURCE stdout/stderr is actually a no-op
because the proxy ''always'' sends all SOURCE fragments to the svc, so
publishing it is not required. Unpublishing SOURCE endpoints ''does''
require an IOF control message, however, but since the HNP is either
about to or in the process of shutting down when we would have
unpublished, the resource leak that we cause by not unpublishing is
short-lived, and therefore it isn't done (to avoid sending N*2
unpublish requests to the SVC).
From the user's mailing list http://www.open-mpi.org/community/lists/users/2006/07/1558.php, Andrew Caird found that the following command line syntax "mostly" works with the PGI debugger:
mpirun --debugger "pgdbg @mpirun@ @mpirun_args@" --debug -np 2 ./cpi
Hence, we can add "pgdbg @mpirun@ @mpirun_args@" to the default value of orte_base_user_debugger so that it will be found automatically and users don't need to specify it.
However, Andrew noted that the PGI debugger doesn't fully support Open MPI yet (right now, it shows some warning message, which may be indicative of deeper problems). PGI support says that they are [pleasantly] surprised that it works with Open MPI at all, but hope to support Open MPI by the end of the year or so.
This ticket is a placeholder to add the pgdbg value to orte_base_user_debugger once the PGI debugger supports Open MPI. I don't want to add it before then because it could be misleading to users.
As reported by Allinea:
If you have the Open MPI mpirun in your PATH and DDT is set to use MPICH Standard startup then when you start a program it will continuously launch new instances of the GUI.
This is because Open MPI has support for MPICH's -tv option, but it's broken: it ignores the TOTALVIEW environment variable and launches DDT with: ddt -n NUMPROC -start PROGRAM. This, in turn, runs mpirun and spawns even more copies of DDT.
"mpirun -np 8 -tv user-app-path" needs to translate to "ddt -n 8 user-app-path" if loading DDT --- except when the TOTALVIEW env var is set. In that case you should execute 8 copies of $TOTALVIEW, one per proc on the target hosts.
That should work for everything I can think of! Our default Open MPI / DDT startup doesn't go via the "-tv" option, so it should be unaffected: the fix above is only to handle the case when the user has done something silly, i.e., picked MPICH Standard instead of Open MPI from the available list. From my understanding, the above fix shouldn't break TotalView.
Currently, the MX MTL opens mx_endpoints regardless of whether they are going to be used or not. This causes problems since, by default, MX has a very low number of available endpoints, and users can run out long before they expect to.
The MX BTL does not have this problem; it only opens endpoints when needed.
Neither the 1.1.1 release nor the 1.2 branch can be built on OpenBSD (3.9).
gcc -DHAVE_CONFIG_H -I. -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../../ompi/include -I../.. -O3 -DNDEBUG -fno-strict-aliasing -pthread -MT stacktrace.lo -MD -MP -MF .deps/stacktrace.Tpo -c stacktrace.c -fPIC -DPIC -o .libs/stacktrace.o
stacktrace.c: In function `opal_show_stackframe':
stacktrace.c:232: error: `SI_ASYNCIO' undeclared (first use in this function)
stacktrace.c:232: error: (Each undeclared identifier is reported only once
stacktrace.c:232: error: for each function it appears in.)
stacktrace.c:233: error: `SI_MESGQ' undeclared (first use in this function)
gmake[3]: *** [stacktrace.lo] Error 1
gmake[3]: Leaving directory `/var/tmp/openmpi-1.1.1/opal/util'
gmake[2]: *** [all-recursive] Error 1
gmake[2]: Leaving directory `/var/tmp/openmpi-1.1.1/opal/util'
gmake[1]: *** [all-recursive] Error 1
gmake[1]: Leaving directory `/var/tmp/openmpi-1.1.1/opal'
gmake: *** [all-recursive] Error 1
I can't imagine why one would use OpenBSD for high performance computing (think of the poor OpenBSD performance in general), so we might close this ticket with "wontfix". (just wanted to let you know...)
The current F77 MPI_BUFFER_DETACH implementation does not return the detached buffer pointer to the caller -- it simply does not make sense to do this in F77 because a) you can't get it, b) pointer implementations between compilers seem to differ, and c) even among the F77 compilers that do support pointers, you can't compare or use the pointer in a meaningful way. There are two precedents that support this interpretation: LAM/MPI and CT6 both do not return the pointer to F77 callers.
Oh, and users of buffered sends should be punished, anyway. :-)
However, this is a problem for the F90 bindings, which are [mostly] layered on top of the F77 bindings. In F90, you can manage memory much like C, so it does make sense to return the detached buffer through the F90 API. Hence, we need to override the default MPI F90 interface for MPI_BUFFER_DETACH and have a specific implementation that returns the buffer pointer to the caller.
Here are some nuggets of information that may be helpful, from an e-mail exchange between us and a Fortran expert at Sun (Ian B.):
Terry's e-mail to Ian:
In MPI there is a pair of functions called MPI_Buffer_attach and MPI_Buffer_detach. These are used by the application program to give the MPI library some buffer space to use for the buffered communication functions.
When you call MPI_Buffer_attach you pass it a pointer to a buffer that you want MPI to use. In C when you call MPI_Buffer_detach you pass it a pointer to a pointer in which the MPI library returns to you the pointer to the buffer you passed it via the MPI_Buffer_attach. For C I can see this being used if you don't keep around the pointer to the buffer and you want to free the buffer returned by MPI_Buffer_detach.
My question: is the above applicable to Fortran programs at all? Could one do something similar with Fortran (90-03) pointers?
Ian's response:
Yes, one could do that with f90 pointers. You would need an interface for the MPI_Buffer_* routines, or else the pointer arguments won't be passed correctly. Something like
    interface
       subroutine MPI_Buffer_attach(p)
          integer, pointer, intent(in) :: p(:)
       end subroutine
    end interface

    interface
       subroutine MPI_Buffer_detach(p)
          integer, pointer, intent(out) :: p(:)
       end subroutine
    end interface
(I haven't actually tried compiling that, so caveat emptor.)
Suggestion of having "global" and "local" MCA parameters in MCA config files.
The intent is to be able to have a central set of MCA params that comes from a single location (i.e., you don't need to propagate the MCA params config file to all nodes in the job).
There was much discussion at the Paris meeting about how to add support for blocking progress. This ticket is a placeholder for that functionality.
An enterprising Mercury employee (Ken Cain) has been noodling around with getting OMPI to compile on VxWorks. After I talked extensively with him at a conference, he sent a list of the current issues that he is having:
Hello Jeff,
At the OFA reception tonight you asked me to send the list of porting issues I've seen so far with OMPI for !VxWorks PPC. It's just a raw list that reflects a work in progress, sorry for the messiness...
-Ken
1a. !VxWorks assembler (CCAS=asppc) generates a.out by default (vs. conftest.o that we need subsequently)
there is this fragment to determine the way to assemble conftest.s:
if test "$CC" = "$CCAS" ; then
ompi_assemble="$CCAS $CCASFLAGS -c conftest.s >conftest.out 2>&1"
else
ompi_assemble="$CCAS $CCASFLAGS conftest.s >conftest.out 2>&1"
fi
The subsequent link fails because conftest.o does not exist:
ompi_link="$CC $CFLAGS conftest_c.$OBJEXT conftest.$OBJEXT -o conftest > conftest.link 2>&1"
To work around the problem, I did not set CCAS. This gives me the first
invocation that includes the -c argument to CC=ccppc, generating
conftest.o output.
1b. linker fails because LDFLAGS are not passed
The same linker command line caused problems because $CFLAGS were passed
to the linker
ompi_link="$CC $CFLAGS conftest_c.$OBJEXT conftest.$OBJEXT -o conftest > conftest.link 2>&1"
In my environment, I set CC/CFLAGS/LDFLAGS as follows:
CC=ccppc
CFLAGS='-ggdb3 -std=c99 -pedantic -mrtp -msoft-float -mstrict-align
-mregnames -fno-builtin -fexceptions'
LDFLAGS=-mrtp -msoft-float -Wl,--start-group -Wl,--end-group
-L/amd/raptor/root/opt/WindRiver/vxworks-6.3/target/usr/lib/ppc/PPC32/sfcommon
The linker flags are not passed because the ompi_link command does not include $LDFLAGS; linking without them fails:
[xp-kcain1:build_vxworks] ccppc -ggdb3 -std=c99 -pedantic -mrtp -msoft-float -mstrict-align -mregnames -fno-builtin -fexceptions -o hello hello.c
/amd/raptor/root/opt/WindRiver/gnu/3.4.4-vxworks-6.3/x86-linux2/bin/../lib/gcc/powerpc-wrs-vxworks/3.4.4/../../../../powerpc-wrs-vxworks/bin/ld:
cannot find -lc_internal
collect2: ld returned 1 exit status
int versus int32_t (refer to email with Brian Barrett)
workaround:
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#ifdef MCS_VXWORKS
#include <ioLib.h> /* for pipe() */
#endif
#endif
static sig_atomic_t opal_evsigcaught[NSIG];
NSIG is not defined, but _NSIGS is
In Linux, NSIG is defined with -D__USE_MISC
So I added this code fragment to signal.c:
/* VxWorks signal.h defines _NSIGS, not NSIG */
#ifdef MCS_VXWORKS
#define NSIG (_NSIGS+1)
#endif
workaround: use pipe():
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#ifdef MCS_VXWORKS
#include <ioLib.h> /* for pipe() */
#endif
#endif
and later in void opal_evsignal_init(sigset_t *evsigmask)
#ifdef MCS_VXWORKS
if (pipe(ev_signal_pair) == -1)
event_err(1, "%s: pipe", __func__);
#else
if (socketpair(AF_UNIX, SOCK_STREAM, 0, ev_signal_pair) == -1)
event_err(1, "%s: socketpair", __func__);
#endif
../../../opal/util/basename.c:23:5: warning: "HAVE_DIRNAME" is not defined
../../../opal/util/basename.c: In function `opal_dirname':
problem: HAVE_DIRNAME is not defined in opal_config.h so the #if HAVE_DIRNAME will fail at preprocessor/compile time
workaround:
change #if HAVE_DIRNAME to #if defined(HAVE_DIRNAME)
../../../opal/util/basename.c: In function `opal_dirname':
../../../opal/util/basename.c:153: error: implicit declaration of
function `strncpy_s'
../../../opal/util/basename.c:160: error: implicit declaration of
function `_strdup'
#ifdef MCS_VXWORKS
strncpy( ret, filename, p - filename);
#else
strncpy_s( ret, (p - filename + 1), filename, p - filename );
#endif
#ifdef MCS_VXWORKS
return strdup(".");
#else
return _strdup(".");
#endif
#ifdef HAVE_SYS_SOCKET_H
#include <sys/socket.h>
#ifdef MCS_VXWORKS
#include <sockLib.h>
#endif
#endif
#ifdef HAVE_SYS_IOCTL_H
#include <sys/ioctl.h>
#ifdef MCS_VXWORKS
#include <ioLib.h>
#endif
#endif
#ifdef MCS_VXWORKS
    if (total_length > PATH_MAX) {   /* path length is too long - reject it */
        return(NULL);
    }
#else
    if (total_length > MAXPATHLEN) { /* path length is too long - reject it */
        return(NULL);
    }
#endif
#include <hostLib.h>
same fix as os_path.c above
manually turned off HAVE_SYSLOG_H in opal_config.h, then got a patch from Jeff Squyres that avoids syslog
complains about mismatched prototype of opal_openpty() between this source file and opal_pty.h
workaround: manually edit build_vxworks_ppc/opal/include/opal_config.h, use the following line (change 1 to 0):
#define OMPI_ENABLE_PTY_SUPPORT 0
FPE_FLTINV not present in signal.h
workaround: edit opal_config.h to turn off
OMPI_WANT_PRETTY_PRINT_STACKTRACE (this can be explicitly configured out
but I don't want to reconfigure because I hacked item 15 above)
gethostname() -- same as opal/util/output.c, must include hostLib.h
from opal/event/event.h (that I modified earlier)
cannot find #include <sys/_timeradd.h>
It is in opal/event/compat/sys
workaround: change event.h to include the definitions that are present in _timeradd.h instead of including it.
strcasecmp
strncasecmp
I rolled my own in mca_base_open.c (temporary fix, since we may come across this problem elsewhere in the code).
Not sure if it's depending on something in the headers, or something it
defined on its own.
I changed it to be just like the header I found somewhere under Linux /usr/include:
#ifdef MCS_VXWORKS
typedef unsigned int uint;
#endif
orte/mca/iof/base/iof_base_fragment.h:45: warning: array type has incomplete element type
#ifdef MCS_VXWORKS
#include <net/uio.h>
#endif
not sure if this is right, or if I should include something like <netBufLib.h> or <ioLib.h>
struct termios not understood
can only find termios.h header in 'diab' area and I'm not using that compiler.
a variable usepty is set to 0 already when OMPI_ENABLE_PTY_SUPPORT is 0.
So, why are we compiling this fragment of code at all? I hacked the file
so that the struct termios code will not get compiled.
#ifdef MCS_VXWORKS
#include <net/uio.h>
#endif
`MAXHOSTNAMELEN' undeclared (first use in this function)
#ifdef MCS_VXWORKS
#define MAXHOSTNAMELEN 64
#endif
#ifdef MCS_VXWORKS
#include <hostLib.h>
#endif
orte/mca/iof/proxy/iof_proxy.h:135: warning: array type has incomplete element type
../../../../../orte/mca/iof/proxy/iof_proxy.h:135: error: field `proxy_iov' has incomplete type
#ifdef MCS_VXWORKS
#include <net/uio.h>
#endif
/orte/mca/iof/svc/iof_svc.h:147: warning: array type has incomplete element type
../../../../../orte/mca/iof/svc/iof_svc.h:147: error: field `svc_iov' has incomplete type
#ifdef MCS_VXWORKS
#include <net/uio.h>
#endif
../../../../../orte/mca/oob/tcp/oob_tcp_msg.h:66: warning: array type has incomplete element type
../../../../../orte/mca/oob/tcp/oob_tcp_msg.h:66: error: field `msg_iov' has incomplete type
../../../../../orte/mca/oob/tcp/oob_tcp_msg.h: In function `mca_oob_tcp_msg_iov_alloc':
../../../../../orte/mca/oob/tcp/oob_tcp_msg.h:196: error: invalid application of `sizeof' to incomplete type `iovec'
../../../../../orte/mca/oob/tcp/oob_tcp.c:344: error: implicit declaration of function `accept'
../../../../../orte/mca/oob/tcp/oob_tcp.c: In function `mca_oob_tcp_create_listen':
../../../../../orte/mca/oob/tcp/oob_tcp.c:383: error: implicit declaration of function `socket'
../../../../../orte/mca/oob/tcp/oob_tcp.c:399: error: implicit declaration of function `bind'
../../../../../orte/mca/oob/tcp/oob_tcp.c:407: error: implicit declaration of function `getsockname'
../../../../../orte/mca/oob/tcp/oob_tcp.c:415: error: implicit declaration of function `listen'
../../../../../orte/mca/oob/tcp/oob_tcp.c: In function `mca_oob_tcp_listen_thread':
../../../../../orte/mca/oob/tcp/oob_tcp.c:459: error: implicit declaration of function `bzero'
../../../../../orte/mca/oob/tcp/oob_tcp.c: In function `mca_oob_tcp_recv_probe':
../../../../../orte/mca/oob/tcp/oob_tcp.c:696: error: implicit declaration of function `send'
../../../../../orte/mca/oob/tcp/oob_tcp.c: In function `mca_oob_tcp_recv_handler':
../../../../../orte/mca/oob/tcp/oob_tcp.c:795: error: implicit declaration of function `recv'
../../../../../orte/mca/oob/tcp/oob_tcp.c: In function `mca_oob_tcp_init':
../../../../../orte/mca/oob/tcp/oob_tcp.c:1087: error: implicit declaration of function `usleep'
This gets rid of most (except bzero and usleep)
#ifdef MCS_VXWORKS
#include <sockLib.h>
#endif
Trying to reconfigure the package so CFLAGS will not include -pedantic.
This is because $WIND_HOME/vxworks-6.3/target/h/string.h has protos for
bzero, but only when #if _EXTENSION_WRS is true. So turn off
-ansi/-pedantic gets this? In my dreams?
In discussions with Roland this week, it would probably be beneficial for us to add support for ibv_req_notify_cq() -- allowing the openib btl to block waiting for progress.
The standard ways for exploiting this would be:
The obvious questions come up about multi-btl issues, but if we get an fd back, it might not be too terrible to utilize. This could also be combined with directed interrupts -- send interrupt X to core Y to wakeup, etc.
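For reference, the standard ibverbs pattern for blocking on CQ events looks roughly like this (a sketch only; how this would be integrated with opal_progress and multiple BTLs is exactly the open question above):

    /* Sketch: "cq" was created with a completion channel "channel" */
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    /* Arm the CQ so the next completion generates an event on the channel */
    ibv_req_notify_cq(cq, 0);

    /* Block until the event arrives; the channel's fd could instead be
       handed to libevent/select for a cleaner integration */
    ibv_get_cq_event(channel, &ev_cq, &ev_ctx);
    ibv_ack_cq_events(ev_cq, 1);

    /* Re-arm and drain whatever completions are pending */
    ibv_req_notify_cq(ev_cq, 0);
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
        /* ...hand the completion to the normal openib progress path... */
    }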
The ompi_proc_t destructor (see source:/trunk/ompi/proc/proc.c) really should call the del_procs method on the current PML so that the PML can release all resources associated with that peer process.
This is not currently done, mainly because it involves a lot of untested code paths. It ''should'' work ok, because we do call pml del_procs during MPI_FINALIZE. But there are at least some thread safety issues involved (e.g., ensure that one thread calling del_procs won't hose ongoing pml actions in another thread), and at least some BTLs (incorrectly) treat del_procs as a no-op.
These issues should be investigated and fixed.
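A minimal sketch of what the destructor change might look like (assuming the PML del_procs interface takes an array of proc pointers; this is illustrative, not the final code):

    static void ompi_proc_destruct(ompi_proc_t *proc)
    {
        /* Let the selected PML release its per-peer resources (sketch only;
           the thread-safety issues described above still need auditing) */
        MCA_PML_CALL(del_procs(&proc, 1));

        /* ...existing destructor cleanup continues here... */
    }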
Rob Latham proposed MPIX_GREQUEST_START as described in "Extending the MPI-2 Generalized Request Interface":http://www-unix.mcs.anl.gov/~thakur/papers/grequest-redesign.pdf (PDF). Prototypes are in the paper (no use reproducing them here).
The main improvement is allowing generalized requests to specify their own progression function that will be invoked by MPI's progress engine. This MPIX_GREQUEST_START function is in MPICH2 and is now used by newer versions of ROMIO.
Several openib BTL MCA params can be tuned on a per-device basis:
The variable portion can be a device, a device:port, or device:port:lid.
The OMPI v1.3 series disallows specifying different receive_queues values per HCA -- see the thread starting here:
http://www.open-mpi.org/community/lists/devel/2008/05/3896.php
We may want to revisit this topic in future versions.
Currently, rdmacm cannot handle if an IPv6 address is passed in (regardless of whether the adapter can handle it or not). This should be fixed as soon as possible.
When rdmacm is setting up the listeners threads (via the cbc_query call), it must verify the hca/port in question has a valid IP address. To make this check, it calls mca_btl_openib_rdma_get_ipv4addr (in ompi/mca/btl/openib/btl_openib_iwarp.c) which queries the IPv4 address. This needs to be expanded to check for IPv6 addresses.
Also, rdmacm should check to see if IPv6 is supported by the adapter and handle failure of rdma_bind_addr based on that parameter. Based upon the value of the IP address and the value of the attribute max_raw_ipv6_qp, we can set the sin_family to AF_INET or AF_INET6 when creating the listener.
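A rough sketch of the address-family selection (the variable names and the exact rdmacm integration point are assumptions):

    /* Sketch: pick the socket family for rdma_bind_addr() based on the
       address we were given and whether the adapter advertises IPv6 QPs */
    struct sockaddr_storage addr;
    memset(&addr, 0, sizeof(addr));

    if (peer_addr_is_ipv6 && device_attr.max_raw_ipv6_qp > 0) {  /* assumed checks */
        ((struct sockaddr_in6 *) &addr)->sin6_family = AF_INET6;
        /* ...fill in sin6_addr / sin6_port... */
    } else {
        ((struct sockaddr_in *) &addr)->sin_family = AF_INET;
        /* ...fill in sin_addr / sin_port... */
    }
    /* then: rdma_bind_addr(listen_id, (struct sockaddr *) &addr); */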
Galen noticed in a code review of the openib BTL that there are a few places where we are creating WQE's and other items without checking the attributes on the device to see how many it can actually handle. So we may attempt to exceed the limits unintentionally, and then get a generic error back from the creation function. These types of errors should either be avoidable or be able to give better warning/error messages because we can detect exactly what the problem is if we're a little more thorough / defensive in the setup.
I'm temporarily assigning this ticket to Galen so that he can cite some specifics. Who actually fixes them is a different issue. :-)
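As an illustration of the kind of defensive check being suggested (a sketch, not the actual BTL code; the hca/variable names are assumptions):

    /* Query the device limits up front and clamp our requests to them, so
       we can emit a specific warning instead of a generic creation failure */
    struct ibv_device_attr device_attr;
    if (0 != ibv_query_device(hca->ib_dev_context, &device_attr)) {
        /* ...report the query failure... */
    }
    if (desired_cq_entries > device_attr.max_cqe) {
        /* ...warn that the requested CQ size exceeds the device limit... */
        desired_cq_entries = device_attr.max_cqe;
    }
    cq = ibv_create_cq(hca->ib_dev_context, desired_cq_entries, NULL, NULL, 0);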
Between PPC64 and X86-64 using two ports of Open IB we fail a number of IMB benchmarks. This does not occur on the 1.2 branch with heterogeneous patches.
#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.07 0.17 0.12
4 1000 17.23 17.24 17.24
8 1000 18.78 18.79 18.78
16 1000 19.19 19.21 19.20
32 1000 21.32 21.34 21.33
64 1000 25.07 25.09 25.08
128 1000 34.13 34.15 34.14
256 1000 49.14 49.17 49.15
512 1000 71.47 71.52 71.50
1024 1000 122.10 122.17 122.14
2048 1000 219.91 220.04 219.97
4096 1000 417.00 417.24 417.12
unpack there is still room in the input buffer 2 bytes
unpack there is still room in the input buffer 2 bytes
unpack there is still room in the input buffer 2 bytes
unpack there is still room in the input buffer 2 bytes
unpack there is still room in the input buffer 2 bytes
From Ralph and Jeff: With the current RML retransmission scheme, we have a possible message ordering problem. That is, if the RML queues up a message to transmit later, lots of other messages may (and frequently do) get transmitted successfully before the event timer wakes up and transmits the one message that was queued up.
This does not seem to affect overall functionality -- everything seems to work fine. But I'm wondering if that's a side effect of how we use the RML/OOB and not really because it's bug-free. Specifically, I think that this RML queueing functionality is currently triggered either during an all-to-one or one-to-all kind of communication pattern. So the ordering doesn't really matter.
But consider the following scenario:
Process B clearly gets these messages out of order.
I believe this could be very bad as we move away from an event-driven system to one that is message-driven, since message ordering could have a major impact on behavior.
What we would really like to see happen, IMHO, is for the queued message to be delivered -immediately- when the contact info becomes known, and not wait for some arbitrary clock to tick down. The queued message(s) should go to the head of the line when that contact info shows up, in the order in which they were received.
Otherwise, we could get into some significant ordering issues once the next major ORTE update hits since all control logic will be sent via RML.
The current problem only shows up on startup during the allgather in modex, so it isn't a problem as (a) it is the collector that is slow to provide its contact info, and (b) the entire startup blocks on the collector getting all of the required info. I -suspect- we might therefore ride through this problem, but again, it is a "bug" that could easily bite us. Just can't predict where/when at the moment.
There is some level of support for MPI_THREAD_MULTIPLE in v1.3; it needs to be precisely documented exactly what MPI applications can/cannot do.
The error handling in the openib btl initialization is not very consistent -- sometimes we return NULL, sometimes we goto no_btl, etc. It would be good to clean this up before v1.3.
It has long been discussed, and I swear there was a ticket about this
at some point but I can't find it now. So I'm filing a new one --
close this as a dupe if someone can find an older one.
OMPI currently uses a negative ACK system to indicate if high-speed
networks are not used for MPI communications. For example, if you
have the openib BTL available but it can't find any active ports in a
given MPI process, it'll display a warning message.
But some users want a ''positive'' acknowledgement of what networks
are being used for MPI communications (this can also help with
regression testing, per a thread on the MTT mailing list). HP MPI
offers this feature, for example. It would be nice to have a simple
MCA parameter that will cause MCW rank 0 to output a connectivity map
during MPI_INIT.
Complications:
A first cut could display a simple 2D chart of how OMPI thinks it may
send MPI traffic from each process to each process. Perhaps something
like (OB1 6 process job, 2 processes on each of 3 hosts):
MCW rank 0 1 2 3 4 5
0 self sm tcp tcp tcp tcp
1 sm self tcp tcp tcp tcp
2 tcp tcp self sm tcp tcp
3 tcp tcp sm self tcp tcp
4 tcp tcp tcp tcp self sm
5 tcp tcp tcp tcp sm self
Note that the upper and lower triangular portions of the map are the
same, but it's probably more human-readable if both are output.
However, multiple built-in output formats could be useful, such as:
It may also be worthwhile to investigate a few heuristics to compress
the graph where possible. Some random ideas in this direction:
MPI connectivity map, listed by process:
X->X: self
X<->X+1, X in {0,2,4}: sm
other: tcp
MPI connectivity map, listed by process:
X->X: self
other: tcp
MPI connectivity map, listed by process:
all: CM PML, MX MTL
Another useful concept might be to show some information about each
endpoint in the connectivity map. E.g., show a list of TCP endpoints
on each process, by interface name and/or IP address. Similar for
other transports. This kind of information can show when/if
multi-rail scenarios are active, etc. For example:
MCW rank 0 1 2 3 4 5
0 self sm tcp:eth0 tcp:eth0 tcp:eth0 tcp:eth0
1 sm self tcp:eth0 tcp:eth0 tcp:eth0 tcp:eth0
2 tcp:eth0 tcp:eth0 self sm tcp:eth0 tcp:eth0
3 tcp:eth0 tcp:eth0 sm self tcp:eth0 tcp:eth0
4 tcp:eth0 tcp:eth0 tcp:eth0 tcp:eth0 self sm
5 tcp:eth0 tcp:eth0 tcp:eth0 tcp:eth0 sm self
With more information such as interface names, compression of the
output becomes much more important, such as:
MPI connectivity map, listed by process:
X->X: self
X<->X+1, X in {0,2,4}: sm
other: tcp:eth0,eth1
Note that these ideas can certainly be implemented in stages; there's
no need to do everything at once.
ompi_info has a hard-coded list of frameworks that must be manually updated every time a new framework has been added. This has long-since been noted as an abstraction violation and "icky". Some idle chat at yesterday's MPI Forum meeting between George, Josh, and myself resulted in thinking of a way to fix this problem and make ompi_info be much more generic:
It may be desirable as part of this process to also separate MCA base component parameter registration from the "open" function, because the "open" function does have a distinct purpose (to be a first-line place to check whether the component wants to run) that is currently munged together with registering component-level MCA parameters. For backwards compatibility, the MCA base can continue to call the component open function if an MCA register function is not available.
Most current iWARP devices (June 2008) cannot make connections between two processes on the same server. As such, btl_openib.c has been set to mark all local peers on iWARP transports as "unreachable". However, there is nothing in the iWARP spec that prevents loopback connections; it's an implementation decision.
The real solution is to add another parameter to the INI file that indicates whether a given adapter can handle loopback connections or not. This is likely not too important to do until an iWARP NIC supports loopback connections or an IB NIC doesn't support loopback connections.
It has long been a problem that users may supply incorrect or unknown MCA parameters and therefore get incorrect or undesired behavior. For example, a user may misspell an MCA parameter name on the mpirun command line and OMPI effectively ignores it, because the name that the user provides is never "seen" by the MCA base. That is, there is no error checking in the MCA base to see if MCA parameters were supplied that do not exist.
While such consistency checking would be extremely helpful to users, it is a fairly difficult problem to solve. Here's a recent mail that I sent on the topic:
I think we all agree that this is something that would be Very Good to have. The reason that it hasn't been done is because I'm not sure how to do it. :-( Actually, more specifically, I can think of several complex ways to do it, but they're all quite unattractive.
The problem is that we don't necessarily have global knowledge of all MCA parameters. Consider this example:
mpirun --mca pls_tm_foo 1 --mca btl_openib_foo 1 -np 4 a.out
These MCA params are going to be visible to three types of processes:
So how do we tell mpirun and orted that they should ignore the btl_openib MCA parameter, and tell a.out that it should ignore the pls_tm MCA parameter? There are other, similar corner cases (e.g., what if some node doesn't have the openib BTL component, but others do?).
There are a few ways to do this that I can think of:
So the first one is the only one that is actually viable (i.e., doesn't cause abstraction violation). But it's still klunky, awkward, and doesn't handle all cases. If anyone has any better ideas, I'm all ears...
Since writing the above e-mail, I had another idea -- address the common case and provide a workaround for the others. Specifically, do not worry about the case where some nodes have component A and others do not. Hence, in this scenario if a user supplies an MCA param for component A, the processes on some nodes will be ok with it (because they have component A), but others will consider it "unrecognized" (because they do not have component A), and will print a warning/error -- potentially causing the job to fail.
To address this, we can add [yet another] MCA parameter to disable this MCA parameter checking. The default value will be to enable MCA parameter checking, but if a user knows what they're doing, or if they fall into the corner case above, they can disable MCA parameter checking and be "good enough."
It's not perfect and it certainly doesn't cover all cases, but it does cover today's common case (where all nodes are homogeneous) and would probably be a good step forward.
This happens with the upcoming PGI 7.0 compiler (although I think the issue is the same with the 6.2 series as well). It is reported here with a trivial F90 module, although the issue is identical with the MPI F90 bindings:
Short version: if I compile a F90 module with "-g" enabled, symbols can't be found in it when I try to use that module. If I don't use "-g", everything works fine.
It's best shown through example. Here's the compile with "-g" enabled:
[10:45] svbu-mpi:~/tmp/subdir % pgf90 -V
pgf90 7.0-2a 64-bit target on x86-64 Linux
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2006, STMicroelectronics, Inc. All Rights Reserved.
[10:45] svbu-mpi:~/tmp/subdir % cat test_module.f90
module OMPI_MOD_FLAG
type OMPI_MOD_FLAG_TYPE
integer :: i
end type OMPI_MOD_FLAG_TYPE
end module OMPI_MOD_FLAG
[10:45] svbu-mpi:~/tmp/subdir % pgf90 -c test_module.f90 -g
[10:45] svbu-mpi:~/tmp/subdir % ls -l
total 16
-rw-rw-r-- 1 jsquyres named 386 Feb 2 10:45 ompi_mod_flag.mod
-rw-rw-r-- 1 jsquyres named 71 Feb 2 10:42 program.f90
-rw-rw-r-- 1 jsquyres named 166 Feb 2 10:40 test_module.f90
-rw-rw-r-- 1 jsquyres named 2800 Feb 2 10:45 test_module.o
[10:45] svbu-mpi:~/tmp/subdir % cat program.f90
program f90usemodule
use OMPI_MOD_FLAG
end program f90usemodule
[10:45] svbu-mpi:~/tmp/subdir % pgf90 program.f90 -I. -g
/tmp/pgf90pGObTVjlE0LC.o(.debug_info+0x87): undefined reference to `..Dm_ompi_mod_flag'
[10:45] svbu-mpi:~/tmp/subdir %
And here's the same stuff without the -g:
[10:48] svbu-mpi:~/tmp/subdir % pgf90 -c test_module.f90 -O2
[10:48] svbu-mpi:~/tmp/subdir % ls -l
total 16
-rw-rw-r-- 1 jsquyres named 386 Feb 2 10:48 ompi_mod_flag.mod
-rw-rw-r-- 1 jsquyres named 71 Feb 2 10:42 program.f90
-rw-rw-r-- 1 jsquyres named 166 Feb 2 10:40 test_module.f90
-rw-rw-r-- 1 jsquyres named 1256 Feb 2 10:48 test_module.o
[10:48] svbu-mpi:~/tmp/subdir % pgf90 program.f90 -I. -O2
[10:48] svbu-mpi:~/tmp/subdir %
Just for fun, let's try the other permutations -- (-g in the module / not in the module, and -g in the program / not in the program):
-g in the module -- works fine:
[10:49] svbu-mpi:~/tmp/subdir % pgf90 -c test_module.f90 -O2 -g
[10:49] svbu-mpi:~/tmp/subdir % pgf90 program.f90 -I. -O2
[10:49] svbu-mpi:~/tmp/subdir %
-g in the program -- doesn't work:
[10:49] svbu-mpi:~/tmp/subdir % pgf90 -c test_module.f90 -O2
[10:50] svbu-mpi:~/tmp/subdir % pgf90 program.f90 -I. -O2 -g
/tmp/pgf90uzUb85jXTwvY.o(.debug_info+0x87): undefined reference to `..Dm_ompi_mod_flag'
[10:50] svbu-mpi:~/tmp/subdir %
I'm not setting a milestone on this; I don't know if there's anything that we actually want to do about it (perhaps it is just a FAQ/documentation issue?). But I wanted to record the issue somewhere.
Related to but slightly different than https://svn.open-mpi.org/trac/ompi/ticket/1050:
When we COMM_SPAWN, where does stdin for the child process come from? I think that there should be [at least] 4 options (selectable via MPI_Info keys):
Note that I didn't mention ''where'' in the child job the stdin flows -- it could be just to vpid 0, or it could be to one or more other processes, or ... I think that's what ticket https://svn.open-mpi.org/trac/ompi/ticket/1050 is about, and is a slightly different issue than this ticket.
In talks with various developers, it seems like we have several places in the code base that have ''required'' components. For example:
We've also seen users screw this up -- most often in the btl or coll cases, where they do something like this:
shell$ mpirun --mca btl openib ...
And then get confused when they try to MPI send a message to themselves and have the PML/BTL barf because it has no BTL path to send to itself.
One possible way to make this nicer is to modify the existing mca_base_components_open() function to take another argument (or to leave the mca_base_components_open() interface alone and make a new function, perhaps named mca_base_components_open_required(), that is the same as mca_base_components_open() but has the new parameter). This new argument could be a list of components that ''have'' to be opened, and are therefore excluded from the "--mca <framework>" selection criteria.
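A sketch of what the new entry point might look like (the existing mca_base_components_open() arguments are paraphrased from memory, and the extra argument is purely illustrative):

    /* Same as mca_base_components_open(), plus a NULL-terminated list of
       component names that must always be opened, regardless of any
       "--mca <framework> ..." selection the user supplied */
    int mca_base_components_open_required(const char *framework_name,
                                          int output_id,
                                          const mca_base_component_t **static_components,
                                          opal_list_t *components_available,
                                          bool open_dso_components,
                                          const char **required_components);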
We can add error checking in there such that if someone runs:
shell$ mpirun --mca coll ^basic ...
and the coll base lists "basic" in the "required" list, a friendly error message can be printed, etc. You get the idea.
The point is that this might help a bunch of the code base become simpler if it can be assumed that certain components are always available (it would simplify a bunch of the coll base, for example). It would also allow the Law of Least Astonishment for users running:
shell$ mpirun --mca btl openib ...
This kind of scenario would then work as the user expects because the BTL base will silently be loading "self" in the background (note that this is up to the framework -- so in an MTL situation, we wouldn't be calling the btl_base_open(), so this mandatory loading of the BTL self component wouldn't apply, etc.).
When the shared memory backing file does not get created, the error code generated is the out-of-resource error code, which is not very descriptive or helpful when trying to track down problems. We need to change the code to be a bit more descriptive.
MPI-2.1 states:
MPI_LONG_LONG_INT, MPI_LONG_LONG (as synonym), MPI_UNSIGNED_LONG_LONG, MPI_SIGNED_CHAR, and MPI_WCHAR are moved from optional to official and they are therefore defined for all three language bindings.
We have all of these types in mpi.h, but a quick glance shows that we don't have them in mpif.h and some are missing from the C++ bindings.
So far only self and tcp seem to be check-friendly. SM has a small issue, but that might be removed with a little work. All other BTLs have to be investigated. A testing file will be added shortly, once I figure out how to do it...
Per [http://www.open-mpi.org/community/lists/devel/2007/08/2220.php this thread on the devel list], ompi_mpi_abort() may actually be invoked recursively via the progression engine. Additionally it is possible that multiple threads may invoke ompi_mpi_abort() simultaneously in a THREAD_MULTIPLE scenario. Clearly, only one thread should be allowed to do the actual "abort" processing.
Some (not thread safe) protection was added near the top of ompi_mpi_abort() a while ago, in the form of logic that looks like this:
    if (have_been_invoked) {
        return OMPI_SUCCESS;
    }
    have_been_invoked = true;
However, this is clearly bad because it violates assumptions elsewhere in the code that ompi_mpi_abort() will not return (i.e., Bad Things can/will happen, like segvs).
Adding protection for the THREAD_MULTIPLE scenario is probably easy enough; looping over sleep (or progress?) is probably fine.
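For the THREAD_MULTIPLE part, something along these lines would probably suffice (a sketch; the flag name is made up, and the atomic is whatever OPAL provides):

    /* Only one thread wins the right to do the real abort processing;
       any other thread that calls ompi_mpi_abort() simply parks here
       until the process actually dies */
    static volatile int32_t abort_in_progress = 0;   /* hypothetical flag */

    if (!opal_atomic_cmpset_32(&abort_in_progress, 0, 1)) {
        while (1) {
            sleep(1);   /* or loop over opal_progress(), per the above */
        }
    }
    /* ...the winning thread continues with the actual abort processing... */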
But sleep/progress-looping is ''not'' the right solution for recursive invocations from the thread that is actually doing the abort processing because there are at least some cases where progress will not occur until control pops all the way back to the top of the progress stack.
So - what to do?
The SM BTL component progress currently progresses only one message per connection, even if that message is really a control message (in this case an ACK message). The negative impact of this is that a program doing an MPI_Iprobe can end up needing multiple MPI_Iprobe calls when, in theory, for certain cases only one should be needed.
So for the SM BTL component progress I propose draining all control messages until we've either hit an empty fifo or a "real" message.
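Conceptually, the proposed progress loop would look something like this (hypothetical helper names; the real fifo and fragment types differ):

    /* Keep draining the fifo as long as we only see control (e.g. ACK)
       fragments; stop as soon as we deliver a "real" message or run dry */
    while (NULL != (frag = fifo_read(fifo))) {          /* hypothetical */
        if (is_control_fragment(frag)) {                 /* hypothetical */
            handle_control(frag);
            continue;    /* control traffic: keep draining */
        }
        deliver_to_pml(frag);                            /* hypothetical */
        break;           /* one "real" message per progress call */
    }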
The other BTLs will need to be investigated to make sure they adhere to similar rules otherwise you would end up with inconsistent results depending on the BTL you are using.
Two questions come up about firewalls now and again:
This has been on my to-do list for forever; perhaps by putting this as a ticket, someone will actually get around to adding this to the FAQ.
In the multicluster case we've seen hanging connections; that's why I changed the acceptance rules in btl_tcp_proc.c (r18169). This change caused weird connection resets on sif and odin, unfortunately not reproducible anywhere else (for me).
JFTR, the discussion: http://www.open-mpi.org/community/lists/devel/2008/04/3711.php
I've reverted the commit in r18255 and took a look at the code. Let me come up with a two-line fix in a few hours. I'll add tprins to this ticket for documentation purposes. The checkin will also refer to this ticket.