ashleypittman / padb

Parallel Debugging tool for HPC applications

Home Page: http://padb.pittman.org.uk

License: GNU Lesser General Public License v2.1

Shell 0.01% Makefile 0.10% C 29.74% Perl 70.14%

padb's Issues

Inability to use symbolic names for signals.

The --signal option takes a number for the signal to deliver to processes
within the job. It does not accept symbolic signal names, which makes it
harder to use, particularly for people who don't know the signal numbers.

The following command could reasonably be expected to work:
ashley@alpha:~$ padb -a --kill --signal STOP
Value "STOP" invalid for option signal (number expected)
ashley@alpha:~$
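
As an illustration of the requested behaviour, here is a minimal sketch in Python (not padb's Perl, and not taken from padb's source). The helper name parse_signal is hypothetical; it accepts a number, a bare name such as STOP, or a prefixed name such as SIGSTOP, and relies on the platform's own signal table.

```python
import signal


def parse_signal(value: str) -> int:
    """Accept '19', 'STOP' or 'SIGSTOP' and return the signal number.

    Hypothetical helper for illustration; padb's real option parsing differs.
    """
    text = value.strip()
    if text.isdigit():
        # Numeric form, as --signal accepts today.
        return int(text)
    name = text.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        # signal.Signals maps names to the numbers used on this platform.
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError(f"unknown signal: {value!r}") from None
```

Looking names up at run time matters because signal numbers are platform-dependent, so a hard-coded table could deliver the wrong signal on some systems.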

Original issue reported on code.google.com by [email protected] on 9 Jun 2009 at 9:55

Unclean error report when gdb not installed.

Attempting to run padb --stack-trace on nodes which don't have gdb
installed results in an unclear error; this case should be handled better.

ashley@fnarp:~$ padb -a -x -t
Error (55079,stack): Can't use string ("unknown error") as a HASH ref while
"strict refs" in use at /home/ashley/padb/padb/src/padb line 5158.
Error (55079,stack): Can't use string ("unknown error") as a HASH ref while
"strict refs" in use at /home/ashley/padb/padb/src/padb line 5158.
Error (55079,stack): Can't use string ("unknown error") as a HASH ref while
"strict refs" in use at /home/ashley/padb/padb/src/padb line 5158.
Error (55079,stack): Can't use string ("unknown error") as a HASH ref while
"strict refs" in use at /home/ashley/padb/padb/src/padb line 5158.
Failed to run parallel command (rc = 9)
Can't use an undefined value as an ARRAY reference at
/home/ashley/padb/padb/src/padb line 1899.
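
The messages above suggest that an error string ("unknown error") reached code that expected a structured result. The sketch below, in Python rather than padb's Perl, shows one way to guard against that. The reply layout and the names summarise_reply and collect are assumptions made for illustration, not padb's actual data model.

```python
def summarise_reply(reply):
    """Return (ok, payload) for a per-rank reply.

    Assumption: a successful reply is a dict, a failure is a plain string.
    """
    if isinstance(reply, dict):
        return True, reply
    return False, f"stack trace unavailable: {reply}"


def collect(replies):
    """Split replies into usable traces and readable error lines."""
    traces, errors = {}, []
    for rank, reply in sorted(replies.items()):
        ok, payload = summarise_reply(reply)
        if ok:
            traces[rank] = payload
        else:
            errors.append(f"rank {rank}: {payload}")
    return traces, errors
```

Checking the type before use turns the internal reference error into a per-rank message the user can act on.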

Original issue reported on code.google.com by [email protected] on 15 Jun 2009 at 2:43

Formatting padb output for html.

When showing variables in stack traces, some variables cannot be, or are not,
shown with their real, native value.  Up until r319 these variables were
reported as a string enclosed in angle brackets.  This does not play well
with HTML rendering, however, which removes the brackets and everything
inside them (and has other potentially bad consequences).

r319 removes these brackets, leaving just the string. This allows the values
to be printed on web pages, but a better solution is needed for the long
term, perhaps a --format-for-html option?

-----------------
[0-3] (4 processes)
-----------------
main() at MPI_Allgather_c.c:123
      params
        int     argc = '1' [0-3]
        char ** argv: <more than 3 distinct values>
      locals
        int              byte_length = '65536' [0-3]
        MPI_Comm                comm = '<MPI_COMM_NULL>' [0-3]
        int               comm_count = '2' [0-3]
        int               comm_index = '2' [0-3]
        int                comm_type = '-16' [0-3]
        int *                 counts: <more than 3 distinct values>
        int *                 displs: <more than 3 distinct values>
        int                    error = '0' [0-3]
        int                     fail = '0' [0-3]
        int                        i:
            '1' [0]
            '4' [1-3]
        int                     ierr = '0' [0-3]
        char [256]          info_buf = '<value too long to display>' [0-3]
        int               inter_flag = '0' [0-3]
        int                        j = '0' [0-3]
        int                   length = '113' [0-3]
        int             length_count = '18' [0-3]
        int                 loop_cnt:
            '234' [1-3]
            '468' [0]
        int          max_byte_length = '65536' [0-3]
        int               max_length = '113' [0-3]
        void *           recv_buffer = '<valid pointer perm=rw-p>' [0-3]
        void *           send_buffer = '<valid pointer perm=rw-p>' [0-3]
        int                     size = '0' [0-3]
        int                test_nump = '1' [0-3]
        int                test_type = '13' [0-3]
        char [128]          testname = '<value too long to display>' [0-3]
        int               type_count = '13' [0-3]
        struct dataTemplate    value = '<value too long to display>' [0-3]
        struct dataTemplate * values = '<valid pointer perm=rw-p>' [0-3]
  -----------------
  [0-3] (4 processes)
  -----------------
  MPITEST_get_communicator() at libmpitest.c:3956
        params
          int     context = '-16' [0-3]
          int       index = '2' [0-3]
          MPI_Comm * comm = '' [0-3]
        locals
          MPI_Comm      comm1 = '(MPI_Comm) ' [0-3]
          int             err = '0' [0-3]
          int         errsize = '-16' [0-3]
          char [256] info_buf = '<value too long to display>' [0-3]
          int            size = '6348352' [0-3]
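
One possible behaviour for the suggested --format-for-html option is to escape the markup-significant characters instead of dropping the brackets. This is a minimal Python sketch under that assumption; format_for_html is a hypothetical name and padb does not necessarily work this way.

```python
import html


def format_for_html(line: str) -> str:
    """Escape &, < and > so bracketed placeholders survive HTML rendering.

    quote=False leaves the single quotes around values untouched.
    """
    return html.escape(line, quote=False)
```

With this approach a value such as '<valid pointer perm=rw-p>' is shown in a browser exactly as it appears on a terminal, so the plain-text output would not need to change.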

Original issue reported on code.google.com by [email protected] on 4 Nov 2009 at 1:14

Unclean exit when gdb not installed.

What steps will reproduce the problem?
Try to view a stack trace for a program when gdb isn't installed on the nodes.

Can you provide output showing the error?
[ashley@host src]$ ./padb -Ormgr=local -x 28433
Unexpected EOF from Inner stdout (live)
Unexpected EOF from Inner stderr (live)
Unexpected EOF from child socket (live)
Unexpected exit from parallel command (state=live)
Bad exit code from parallel command (exit_code=0)

Please provide any additional information below.
This used to exit cleanly, warning the user of the problem. However, the new
global_attach() function is called outside of the eval {} protecting the
critical parts of the code.  The most likely fix is to add an eval around
global_attach and to save any errors, so that the code that later uses $gdb can
call die with the right error in a place where the eval can catch it and
report it correctly.
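
The deferred-error idea described above can be sketched as follows, using Python's try/except in place of Perl's eval {} and die. Every name here except global_attach is hypothetical, and the body of global_attach is a stand-in, not padb's implementation.

```python
import shutil


def global_attach(gdb_path):
    """Stand-in for the real attach step: fail if the debugger is missing."""
    if shutil.which(gdb_path) is None:
        raise RuntimeError(f"{gdb_path} not found on this node")
    return {"gdb": gdb_path}


def safe_global_attach(gdb_path):
    """Wrap the attach and save the error instead of letting it escape."""
    try:
        return global_attach(gdb_path), None
    except RuntimeError as exc:
        return None, str(exc)


def stack_trace(handle, saved_error):
    """Later user of the handle: re-raise the saved error where it is caught."""
    if handle is None:
        raise RuntimeError(saved_error or "unknown error")
    return f"tracing with {handle['gdb']}"


def run(gdb_path="gdb"):
    handle, err = safe_global_attach(gdb_path)
    try:
        # Protected region, playing the role of the existing eval {} block.
        return stack_trace(handle, err)
    except RuntimeError as exc:
        return f"Error: {exc}"
```

Saving the message rather than reporting it immediately keeps all user-facing error handling in the one place that already knows how to report failures.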

Original issue reported on code.google.com by [email protected] on 9 Jun 2010 at 10:49

Issue with padb r463: Error locating processes, refusing to debug self at ...

We have an issue at BULL with padb release r463.

The r463 patch was meant to make padb work with INTEL_MPI, but it introduced a
regression when testing with OpenMPI (OMPI_COMM_WORLD_RANK). While searching for
SLURM processes with OMPI_COMM_WORLD_RANK, the inner process is also selected
because it has PMI_RANK set. This produces errors such as:

 einner: Error locating processes, refusing to debug self at /opt/bullxde/debuggers/padb/bin/padb line 9420
 einner:                 main::register_target_process(0, 7791) called at /opt/bullxde/debuggers/padb/bin/padb line 9551


Attached is a new patch on top of the r463 patch. It makes padb skip the
inner process itself.

The patch is in the attached file: r464_15836.patch.
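
The idea of skipping the inner process can be sketched as below. This is a Python illustration only and is not the contents of r464_15836.patch. The function names and the pid-to-environment mapping are assumptions; the two environment variable names come from the report above.

```python
# Rank variables mentioned in this report, checked in order.
RANK_VARS = ("OMPI_COMM_WORLD_RANK", "PMI_RANK")


def find_rank(env):
    """Return the rank found in a process environment, or None."""
    for var in RANK_VARS:
        if var in env:
            return int(env[var])
    return None


def select_targets(processes, self_pid):
    """Map rank -> pid, ignoring the scanning process itself.

    processes: hypothetical mapping of pid -> environment dict.
    """
    targets = {}
    for pid, env in processes.items():
        if pid == self_pid:
            # The inner process inherits PMI_RANK, so never treat it as a target.
            continue
        rank = find_rank(env)
        if rank is not None:
            targets[rank] = pid
    return targets
```

For example, with an application process carrying OMPI_COMM_WORLD_RANK=0 and the inner process carrying PMI_RANK=0, only the application process is selected.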

 - This patch has been written by Thipadin Seng-Long -

Please let us know what you think about this issue and the patch.

Thanks,


Original issue reported on code.google.com by [email protected] on 12 Jan 2015 at 9:52

Unexpected EOF from child socket (live)

Hello,

Before the r452 fix, launching padb on 1024+ nodes gave the following errors:

====================
...
Waiting for signon from 2148 hosts.
Waiting for signon from 1560 hosts.
Waiting for signon from 1560 hosts.
einner: Failed to connect to outer at /opt/bullxde/debuggers/padb/bin/padb line 10148
einner: main::inner_loop_for_comms('helios88:46869') called at /opt/bullxde/debuggers/padb/bin/padb line 10285
einner: main::inner_main() called at /opt/bullxde/debuggers/padb/bin/padb line 10585
einner: Failed to connect to outer at /opt/bullxde/debuggers/padb/bin/padb line 10148
einner: main::inner_loop_for_comms('helios88:46869') called at /opt/bullxde/debuggers/padb/bin/padb line 10285
einner: main::inner_main() called at /opt/bullxde/debuggers/padb/bin/padb line 10585
einner: Failed to connect to outer at /opt/bullxde/debuggers/padb/bin/padb line 10148
einner: main::inner_loop_for_comms('helios88:46869') called at /opt/bullxde/debuggers/padb/bin/padb line 10285
einner: main::inner_main() called at /opt/bullxde/debuggers/padb/bin/padb line 10585
...
Waiting for signon from 1560 hosts.
Waiting for signon from 1560 hosts.
Waiting for signon from 1560 hosts.
Waiting for signon from 1560 hosts.
Unexpected EOF from Inner stdout (connecting)
Unexpected EOF from Inner stderr (connecting)
Waiting for signon from 1560 hosts.
Unexpected exit from parallel command (state=connecting)
Bad exit code from parallel command (exit_code=110)
====================

Following the fix you made in r452, there are no more wait issues, but we still
get the "Unexpected EOF" errors:

====================
Unexpected EOF from child socket (live)
Unexpected EOF from Inner stdout (live)
Unexpected EOF from Inner stderr (live)
====================

Reproducing this issue at such a scale might not be easy. Please feel free to
ask me for additional tests to run on our system.

Thanks.

Original issue reported on code.google.com by [email protected] on 18 Feb 2014 at 3:25

MPI message queue plugin not working for 32bit MPI on 64bit host

What steps will reproduce the problem?
I've had reports that padb is unable to even find the debugger dll when
running against a 32 bit target on a 64 bit host.

Can you provide output showing the error?
No, it appears fetch_string() wasn't returning a string and hence dll was
being used unmodified.

Please provide any additional information below.
In this case the user recompiled minfo.c with -m32 to match their MPI library.
SuSE 10, 64 bit kernel, 32 bit app, 64 bit gdb, 32 bit minfo.

Original issue reported on code.google.com by [email protected] on 11 Dec 2009 at 9:48

padb not working with intelmpi when mpirun is used

What steps will reproduce the problem?

padb version 3.3 (Revision 426), intelmpi4.0.3/4.1.0
- command line : padb -O rmgr=slurm --stack-trace --tree <jobid>
- frequency : always

Can you provide output showing the error?


Please provide any additional information below.

I have attached a patch that solves the issue.

Thanks.

Original issue reported on code.google.com by [email protected] on 10 Feb 2014 at 3:26
