ashleypittman / padb
Parallel Debugging tool for HPC applications
Home Page: http://padb.pittman.org.uk
License: GNU Lesser General Public License v2.1
The --signal option takes the number of a signal to deliver to processes
within the job. It does not, however, accept symbolic signal names, which
makes it harder to use, particularly for people who don't know the signal
numbers.
The following command could reasonably be expected to work.
ashley@alpha:~$ padb -a --kill --signal STOP
Value "STOP" invalid for option signal (number expected)
ashley@alpha:~$
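One way to accept symbolic names is to resolve them to numbers before the option is validated. The sketch below is illustrative Python, not padb's own Perl source; parse_signal is a hypothetical helper name, and it assumes a POSIX system where Python's standard signal module knows the names.

```python
import signal


def parse_signal(value: str) -> int:
    """Accept '19', 'STOP' or 'SIGSTOP' and return the signal number.

    Hypothetical helper for illustration only; not part of padb.
    """
    text = value.strip()
    if text.isdecimal():
        # Plain numbers are passed through, matching the existing behaviour.
        return int(text)
    name = text.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        # signal.Signals is an enum of the signals known on this platform.
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError(f"unknown signal: {value!r}") from None
```

With a helper like this, both "--signal STOP" and "--signal 19" style values could be handled by the same code path, and unknown names would still produce a clear error.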
Original issue reported on code.google.com by [email protected]
on 9 Jun 2009 at 9:55
Attempting to run padb --stack-trace on nodes which don't have gdb
installed results in an unclear error; this case should be handled better.
ashley@fnarp:~$ padb -a -x -t
Error (55079,stack): Can't use string ("unknown error") as a HASH ref while "strict refs" in use at /home/ashley/padb/padb/src/padb line 5158.
Error (55079,stack): Can't use string ("unknown error") as a HASH ref while "strict refs" in use at /home/ashley/padb/padb/src/padb line 5158.
Error (55079,stack): Can't use string ("unknown error") as a HASH ref while "strict refs" in use at /home/ashley/padb/padb/src/padb line 5158.
Error (55079,stack): Can't use string ("unknown error") as a HASH ref while "strict refs" in use at /home/ashley/padb/padb/src/padb line 5158.
Failed to run parallel command (rc = 9)
Can't use an undefined value as an ARRAY reference at /home/ashley/padb/padb/src/padb line 1899.
Original issue reported on code.google.com by [email protected]
on 15 Jun 2009 at 2:43
When showing variables in stack traces, some variables cannot be, or are not,
shown with their real, native value. Up until r319 these variables were
reported as a string enclosed in angle brackets. This does not play
well with HTML rendering, however, which removes the brackets and everything
inside them (and has other potentially bad consequences).
r319 removes these brackets, leaving just the string, which allows the values
to be printed on web pages, but a better solution is needed for the long
term, perhaps a --format-for-html option?
-----------------
[0-3] (4 processes)
-----------------
main() at MPI_Allgather_c.c:123
params
int argc = '1' [0-3]
char ** argv: <more than 3 distinct values>
locals
int byte_length = '65536' [0-3]
MPI_Comm comm = '<MPI_COMM_NULL>' [0-3]
int comm_count = '2' [0-3]
int comm_index = '2' [0-3]
int comm_type = '-16' [0-3]
int * counts: <more than 3 distinct values>
int * displs: <more than 3 distinct values>
int error = '0' [0-3]
int fail = '0' [0-3]
int i:
'1' [0]
'4' [1-3]
int ierr = '0' [0-3]
char [256] info_buf = '<value too long to display>' [0-3]
int inter_flag = '0' [0-3]
int j = '0' [0-3]
int length = '113' [0-3]
int length_count = '18' [0-3]
int loop_cnt:
'234' [1-3]
'468' [0]
int max_byte_length = '65536' [0-3]
int max_length = '113' [0-3]
void * recv_buffer = '<valid pointer perm=rw-p>' [0-3]
void * send_buffer = '<valid pointer perm=rw-p>' [0-3]
int size = '0' [0-3]
int test_nump = '1' [0-3]
int test_type = '13' [0-3]
char [128] testname = '<value too long to display>' [0-3]
int type_count = '13' [0-3]
struct dataTemplate value = '<value too long to display>' [0-3]
struct dataTemplate * values = '<valid pointer perm=rw-p>' [0-3]
-----------------
[0-3] (4 processes)
-----------------
MPITEST_get_communicator() at libmpitest.c:3956
params
int context = '-16' [0-3]
int index = '2' [0-3]
MPI_Comm * comm = '' [0-3]
locals
MPI_Comm comm1 = '(MPI_Comm) ' [0-3]
int err = '0' [0-3]
int errsize = '-16' [0-3]
char [256] info_buf = '<value too long to display>' [0-3]
int size = '6348352' [0-3]
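A possible shape for an HTML-safe output mode is to escape the value text rather than strip the brackets. The sketch below is illustrative Python, not padb's Perl code; format_value_for_html is a hypothetical name, and the --format-for-html option it would serve is only a suggestion in this report.

```python
import html


def format_value_for_html(value: str) -> str:
    """Escape a variable's display string so a browser shows it literally.

    Hypothetical helper for illustration only; not part of padb.
    """
    # html.escape replaces &, < and > (and quotes when quote=True)
    # with the corresponding character references.
    return html.escape(value, quote=True)


if __name__ == "__main__":
    for text in ("<MPI_COMM_NULL>", "<value too long to display>", "65536"):
        print(format_value_for_html(text))
```

Escaping keeps placeholder values distinguishable from real ones on a web page, which removing the brackets does not.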
Original issue reported on code.google.com by [email protected]
on 4 Nov 2009 at 1:14
What steps will reproduce the problem?
Try and view a stack trace for a program when gdb isn't installed on the nodes.
Can you provide output showing the error?
[ashley@host src]$ ./padb -Ormgr=local -x 28433
Unexpected EOF from Inner stdout (live)
Unexpected EOF from Inner stderr (live)
Unexpected EOF from child socket (live)
Unexpected exit from parallel command (state=live)
Bad exit code from parallel command (exit_code=0)
Please provide any additional information below.
This used to exit cleanly, warning the user of the problem; however, the new
global_attach() function is called outside of the eval {} protecting the
critical parts of the code. The most likely fix is to add an eval around
global_attach and to save any errors, so that the code that later uses $gdb can
call die with the right error in a place where the eval can catch it and
correctly report it.
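The save-and-rethrow idea described above can be sketched as follows. This is illustrative Python using try/except in place of Perl's eval/die; it is not the padb code, and AttachError, the gdb_available flag and run_request are hypothetical names introduced only for the example.

```python
class AttachError(Exception):
    """Hypothetical error type standing in for a failed attach."""


def global_attach(gdb_available: bool):
    """Stand-in for the real attach step; fails when gdb is missing."""
    if not gdb_available:
        raise AttachError("gdb not found on this node")
    return object()  # placeholder for a debugger handle


def run_request(gdb_available: bool) -> str:
    gdb = None
    attach_error = None
    try:
        # Guard the early attach so a failure cannot escape unreported.
        gdb = global_attach(gdb_available)
    except AttachError as exc:
        attach_error = exc  # remember the failure for later

    try:
        # Protected region: the first code that needs the handle re-raises
        # the saved error where the normal reporting path will catch it.
        if gdb is None:
            raise attach_error
        return "ok"
    except AttachError as exc:
        return f"error: {exc}"
```

The point of deferring is that the user-facing message is produced by the existing error handler rather than by an unexpected exit of the process.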
Original issue reported on code.google.com by [email protected]
on 9 Jun 2010 at 10:49
We have an issue at BULL with padb release r463.
The r463 patch was intended to make padb work with INTEL_MPI, but testing with
OpenMPI (OMPI_COMM_WORLD_RANK) shows a regression. When searching for SLURM
processes with OMPI_COMM_WORLD_RANK, the inner process is also selected because
it has PMI_RANK set. This results in errors such as:
einner: Error locating processes, refusing to debug self at /opt/bullxde/debuggers/padb/bin/padb line 9420
einner: main::register_target_process(0, 7791) called at /opt/bullxde/debuggers/padb/bin/padb line 9551
Attached is a new patch on top of the r463 patch. It allows padb to skip the
inner process itself.
The patch is in the attached file: r464_15836.patch.
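The general idea of skipping the scanning process itself can be sketched as below. This is illustrative Python and is not the attached patch or padb's Perl code; find_rank_processes, the candidates structure and the RANK_VARS tuple are hypothetical, and only the two environment variable names come from this report.

```python
import os

# Environment variables named in this report as carrying the MPI rank.
RANK_VARS = ("OMPI_COMM_WORLD_RANK", "PMI_RANK")


def find_rank_processes(candidates, self_pid=None):
    """Map rank -> pid for candidate processes, never selecting ourselves.

    candidates is an iterable of (pid, environment dict) pairs.
    Hypothetical helper for illustration only.
    """
    if self_pid is None:
        self_pid = os.getpid()
    found = {}
    for pid, env in candidates:
        if pid == self_pid:
            # The scanner may inherit a rank variable from the launcher;
            # it must not be treated as a target process.
            continue
        for var in RANK_VARS:
            if var in env:
                found[int(env[var])] = pid
                break
    return found
```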
- This patch has been written by Thipadin Seng-Long -
Please let us know what you think about this issue and the patch.
Thanks,
Original issue reported on code.google.com by [email protected]
on 12 Jan 2015 at 9:52
Hello,
Before the r452 fix, launching padb on 1024+ nodes gave the following errors:
====================
...
Waiting for signon from 2148 hosts.
Waiting for signon from 1560 hosts.
Waiting for signon from 1560 hosts.
einner: Failed to connect to outer at /opt/bullxde/debuggers/padb/bin/padb line 10148
einner: main::inner_loop_for_comms('helios88:46869') called at /opt/bullxde/debuggers/padb/bin/padb line 10285
einner: main::inner_main() called at /opt/bullxde/debuggers/padb/bin/padb line 10585
einner: Failed to connect to outer at /opt/bullxde/debuggers/padb/bin/padb line 10148
einner: main::inner_loop_for_comms('helios88:46869') called at /opt/bullxde/debuggers/padb/bin/padb line 10285
einner: main::inner_main() called at /opt/bullxde/debuggers/padb/bin/padb line 10585
einner: Failed to connect to outer at /opt/bullxde/debuggers/padb/bin/padb line 10148
einner: main::inner_loop_for_comms('helios88:46869') called at /opt/bullxde/debuggers/padb/bin/padb line 10285
einner: main::inner_main() called at /opt/bullxde/debuggers/padb/bin/padb line 10585
...
Waiting for signon from 1560 hosts.
Waiting for signon from 1560 hosts.
Waiting for signon from 1560 hosts.
Waiting for signon from 1560 hosts.
Unexpected EOF from Inner stdout (connecting)
Unexpected EOF from Inner stderr (connecting)
Waiting for signon from 1560 hosts.
Unexpected exit from parallel command (state=connecting)
Bad exit code from parallel command (exit_code=110)
====================
Following the fix you made in r452, there are no more wait issues, but we still
have the "Unexpected EOF" errors:
====================
Unexpected EOF from child socket (live)
Unexpected EOF from Inner stdout (live)
Unexpected EOF from Inner stderr (live)
====================
Reproducing this issue at such a scale might not be easy. Please feel free to
ask me for additional tests to run on our system.
Thanks.
Original issue reported on code.google.com by [email protected]
on 18 Feb 2014 at 3:25
What steps will reproduce the problem?
I've had reports that padb is unable to even find the debugger dll when
running against a 32 bit target on a 64 bit host.
Can you provide output showing the error?
No, it appears fetch_string() wasn't returning a string and hence the dll name
was being used unmodified.
Please provide any additional information below.
In this case the user recompiled minfo.c with -m32 to match their MPI library.
SuSE 10, 64 bit kernel, 32 bit app, 64 bit gdb, 32 bit minfo.
Original issue reported on code.google.com by [email protected]
on 11 Dec 2009 at 9:48
What steps will reproduce the problem?
padb version 3.3 (Revision 426), intelmpi4.0.3/4.1.0
- command line : padb -O rmgr=slurm --stack-trace --tree <jobid>
- frequency : always
Can you provide output showing the error?
Please provide any additional information below.
I have attached a patch that solves the issue.
Thanks.
Original issue reported on code.google.com by [email protected]
on 10 Feb 2014 at 3:26