Comments (13)

sokolval commented on June 3, 2024

Thank you, I will catch the situation and try to check connection either with nc or telnet. Will get back to you once I have logs/facts!

from gpdb.

sokolval commented on June 3, 2024

Hello!
We investigated on our side and found that we had hit an overflow of the kernel's connection-tracking table, whose size is set by the kernel parameter net.netfilter.nf_conntrack_max. Since we run a heavy load on GP with 100+ segments, the table capacity was not enough to meet the demand for IP connections.
We have now raised this parameter to 5898240 and the errors are gone.
When it was set to 2097152, that was not enough.
Under heavy load this error may leave traces in syslog similar to: w13 kernel: [1012462.328789] nf_conntrack: table full, dropping packet
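
A minimal sketch of how this condition could be checked on a segment host, assuming bash and a kernel with the nf_conntrack module loaded. The function names are made up for illustration; the /proc paths are the standard locations of the current entry count and the configured maximum.

```shell
# Print table usage as an integer percentage, given the current count and the maximum.
conntrack_usage_pct() {
    local count=$1 max=$2
    echo $(( count * 100 / max ))
}

# Read the live values from /proc (present only when nf_conntrack is loaded).
conntrack_report() {
    local dir=/proc/sys/net/netfilter
    if [ ! -r "$dir/nf_conntrack_count" ] || [ ! -r "$dir/nf_conntrack_max" ]; then
        echo "nf_conntrack is not loaded on this host" >&2
        return 1
    fi
    local count max
    count=$(cat "$dir/nf_conntrack_count")
    max=$(cat "$dir/nf_conntrack_max")
    echo "conntrack entries: $count / $max ($(conntrack_usage_pct "$count" "$max")%)"
}

# Count "table full" messages in kernel log text supplied on stdin,
# e.g.  dmesg | count_table_full
count_table_full() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the exit status clean.
    grep -c 'nf_conntrack: table full' || true
}
```

Running conntrack_report on each segment host while the load is active shows how close the table is to its limit. The limit itself can be raised at runtime with sysctl -w net.netfilter.nf_conntrack_max=5898240 (the value that worked for this cluster, not a general recommendation).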

Perhaps it would be good to add this parameter, with an elevated value, to the "Recommended OS Parameters" page.

One other option to propose is to create a diagnostic tool which could check all kernel parameters and user limits and give a report on which parameters could be adjusted for GP to perform best.
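
A rough sketch of what such a check might look like, assuming bash and the standard sysctl utility. The function names and the "name minimum" input format are hypothetical, no official list of minimums is implied, and only single-integer parameters are handled (user limits are not covered).

```shell
# Compare one numeric value against a minimum and print a one-line verdict.
check_min() {
    local name=$1 current=$2 minimum=$3
    if [ "$current" -ge "$minimum" ]; then
        echo "OK   $name = $current (minimum $minimum)"
    else
        echo "LOW  $name = $current (minimum $minimum)"
    fi
}

# Read "name minimum" pairs from stdin and check each against the running kernel.
# Only parameters with a single integer value are supported in this sketch.
check_sysctl_minimums() {
    local name minimum current
    while read -r name minimum; do
        [ -z "$name" ] && continue
        # sysctl -n prints only the value; skip parameters the kernel does not expose.
        if ! current=$(sysctl -n "$name" 2>/dev/null); then
            echo "SKIP $name (not available on this host)"
            continue
        fi
        check_min "$name" "$current" "$minimum"
    done
}
```

For example, echo "net.netfilter.nf_conntrack_max 5898240" | check_sysctl_minimums would report whether a host is below the value that resolved the problem here.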

@interma thank you so much for quickly jumping in and pointing me to where to look for a solution. We now understand that this is definitely not a GP bug but an OS settings issue.

kainwen commented on June 3, 2024

Do you have a core dump that can be shared here?
Or can you please use gdb to read the error data in the core?

sokolval commented on June 3, 2024

I tried examining a process that was close to failing with the perf utility, but it did not give much detail. Everything looks standard, except that while looking at the main MPPEXEC select process I received the messages below:

         ? (         ):  ... [continued]: select())                                           = 0 (Timeout)
                                       select (/lib/x86_64-linux-gnu/libc-2.23.so)
                                       format_sockaddr (/usr/local/gpdb/bin/postgres)
     0.020 (500.339 ms): select(n: 202, inp: 0x7ffd4c6bd530, tvp: 0x7ffd4c6bd520)              = 0 (Timeout)
                                       select (/lib/x86_64-linux-gnu/libc-2.23.so)
                                       format_sockaddr (/usr/local/gpdb/bin/postgres)
   500.377 (500.582 ms): select(n: 202, inp: 0x7ffd4c6bd530, tvp: 0x7ffd4c6bd520)              = 0 (Timeout)
                                       select (/lib/x86_64-linux-gnu/libc-2.23.so)
                                       format_sockaddr (/usr/local/gpdb/bin/postgres)
  1000.978 (500.376 ms): select(n: 202, inp: 0x7ffd4c6bd530, tvp: 0x7ffd4c6bd520)              = 0 (Timeout)
                                       select (/lib/x86_64-linux-gnu/libc-2.23.so)
                                       format_sockaddr (/usr/local/gpdb/bin/postgres)
  1501.371 (500.587 ms): select(n: 202, inp: 0x7ffd4c6bd530, tvp: 0x7ffd4c6bd520)              = 0 (Timeout)
                                       select (/lib/x86_64-linux-gnu/libc-2.23.so)
                                       format_sockaddr (/usr/local/gpdb/bin/postgres)
  2001.979 (500.340 ms): select(n: 202, inp: 0x7ffd4c6bd530, tvp: 0x7ffd4c6bd520)              = 0 (Timeout)
                                       select (/lib/x86_64-linux-gnu/libc-2.23.so)
                                       format_sockaddr (/usr/local/gpdb/bin/postgres)

In terms of using gdb - could you recommend any specific arguments to run it with and investigate efficiently?

kainwen commented on June 3, 2024

@sokolval

"ERRORDATA_STACK_SIZE exceeded" raises a PANIC, which should generate a core dump if core dumps are enabled in your OS.
The core dump path is recorded in the kernel parameter /proc/sys/kernel/core_pattern. You can read it with cat /proc/sys/kernel/core_pattern and then check that directory for a core dump.

In addition, the log of the segment that hit the error should contain LOG entries with call stacks, which can be used to confirm that the core is the right one (using the pid in the LOG).

Now suppose you have got the core dump:

  1. source greenplum_path.sh in your shell
  2. which postgres, which prints the full path of the postgres binary
  3. gdb <full path of postgres binary> -c <core_dump_file_path>, which opens the core dump in gdb
  4. at this point you should be inside gdb:
    • bt to show the backtrace
    • p errordata[0] to show the top element in error stack
    • p errordata[1], p errordata[2], ... p errordata[6] to show some more

With all of the above information (logs and gdb core dump analysis output) we might better understand the problem.
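
The gdb steps above could also be run non-interactively. This is a sketch assuming bash and gdb's -batch, -ex and -c options; the function names and the GDB_CMD variable are made up for illustration.

```shell
# Build the gdb command line for a binary and a core file into the global array GDB_CMD.
build_gdb_cmd() {
    local binary=$1 core=$2 i
    GDB_CMD=(gdb -batch -ex bt)
    # errordata[0] .. errordata[6], the same elements as in the interactive steps
    for i in 0 1 2 3 4 5 6; do
        GDB_CMD+=(-ex "p errordata[$i]")
    done
    GDB_CMD+=("$binary" -c "$core")
}

# Run gdb in batch mode; the backtrace and error stack go to stdout.
analyze_core() {
    build_gdb_cmd "$1" "$2"
    "${GDB_CMD[@]}"
}
```

After sourcing greenplum_path.sh, a call such as analyze_core "$(which postgres)" <core_dump_file_path> > core_report.txt would produce a text file that can be attached to the issue. Printing errordata only works if the binary carries symbols.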

If your core file is shareable, you can use packcore to pack it and send it to us with the GPDB logs.

sokolval commented on June 3, 2024

this is an output from some of the gdb commands:

(gdb) info stack
#0  0x00007f035e72c6b3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000000000b5fdee in ?? ()
#2  0x0000000000b632ac in TeardownTCPInterconnect ()
#3  0x0000000000b5dad9 in TeardownInterconnect ()
#4  0x00000000008923ff in mppExecutorFinishup ()
#5  0x000000000087f848 in standard_ExecutorEnd ()
#6  0x0000000000830a88 in PortalCleanup ()
#7  0x0000000000b270ea in PortalDrop ()
#8  0x00000000009f22a7 in ?? ()
#9  0x00000000009f57b8 in PostgresMain ()
#10 0x00000000006b583e in ?? ()
#11 0x0000000000990309 in PostmasterMain ()
#12 0x00000000006b839b in main ()

(gdb) info program
	Using the running image of attached Thread 0x7f0361606740 (LWP 678676).
Program stopped at 0x7f035e72c6b3.
Type "info stack" or "info registers" for more information.

(gdb) info registers
rax            0xfffffffffffffdfe	-514
rbx            0x4	4
rcx            0x7f035e72c6b3	139652446209715
rdx            0x0	0
rsi            0x7fff9565d320	140735699866400
rdi            0x8b	139
rbp            0x1	0x1
rsp            0x7fff9565b2d8	0x7fff9565b2d8
r8             0x7fff9565b300	140735699858176
r9             0x7fff9565b2b0	140735699858096
r10            0x0	0
r11            0x246	582
r12            0x3480cf8	55053560
r13            0x3480cf0	55053552
r14            0x80	128
r15            0x8a	138
rip            0x7f035e72c6b3	0x7f035e72c6b3 <select+19>
eflags         0x246	[ PF ZF IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
k0             0x0	0
k1             0x0	0
k2             0x0	0
k3             0x0	0
k4             0x0	0
k5             0x0	0
k6             0x0	0
k7             0x0	0

(gdb) info frame
Stack level 0, frame at 0x7fff9565b2e0:
 rip = 0x7f035e72c6b3 in select; saved rip = 0xb5fdee
 called by frame at 0x7fff9565f370
 Arglist at 0x7fff9565b2d0, args: 
 Locals at 0x7fff9565b2d0, Previous frame's sp is 0x7fff9565b2e0
 Saved registers:
  rip at 0x7fff9565b2d8

kainwen commented on June 3, 2024

Can you print errordata as I suggested in my previous comment?

sokolval commented on June 3, 2024

I tried printing errordata as well as errordata[0], errordata[1], etc.
However, the result is as follows:

(gdb) p errordata
No symbol table is loaded.  Use the "file" command.
(gdb) p errordata[0]
No symbol table is loaded.  Use the "file" command.

Could you advise whether I need to apply some flag when configuring or building gpdb so that symbols are included?
Another possibility: maybe I can get a symbols file from somewhere and supply it to gdb separately (with --symbols=SYMFILE Read symbols from SYMFILE.)?

In fact we tried to build gpdb with core files enabled, and it caused gpdb to produce a core dump and restart the cluster on any insignificant error, which was not acceptable even for a TEST environment.
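
One way to see whether a given postgres binary carries symbols is to look at its ELF section headers. The sketch below assumes bash and readelf from binutils; the function name and the three labels it prints are made up for illustration. PostgreSQL's configure script has an --enable-debug switch; whether your GPDB build uses the same switch is an assumption to verify with ./configure --help.

```shell
# Classify a binary from `readelf -S` output on stdin:
#   debug-info   .debug_info present (gdb can print typed variables such as errordata)
#   symtab-only  only .symtab present (function names, but no variable types)
#   stripped     neither section present
symbol_level() {
    local sections
    sections=$(cat)
    if grep -q '\.debug_info' <<<"$sections"; then
        echo "debug-info"
    elif grep -q '\.symtab' <<<"$sections"; then
        echo "symtab-only"
    else
        echo "stripped"
    fi
}
```

Usage would be readelf -S "$(which postgres)" | symbol_level, run after sourcing greenplum_path.sh.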

sokolval commented on June 3, 2024

I was able to build the GP postgres executable with symbols and am now waiting for a suitable process to examine.

sokolval commented on June 3, 2024

What I was able to discover further:

  1. In most cases the situation develops as follows: at one of the calculation steps the master starts processing the next statement, e.g.: 2024-03-15 12:37:34.172218 ,"gpadmin","test",p381664,th1973114688,"client_ip","55084",2024-03-15 12:37:18 ,4058903,con165168,cmd989,seg-1,,dx1191684,x4058903,sx1,"LOG","00000","statement: CREATE TEMPORARY TABLE ""tmp_windows_for_join""
  2. The segments try to establish connections to each other: 2024-03-15 12:37:19.737536,"gpadmin","test",p1028240,th-1930782912,"master_ip","16174",2024-03-15 12:37:19,0,con165168,,seg19,,,,sx1,"LOG","00000","connection authorized: user=gpadmin database=test",,,,,,,0,,"postinit.c",329,
    2024-03-15 12:37:38.207942,"gpadmin","test",p1028215,th-1930782912,"master_ip","16150",2024-03-15 12:37:19,3670769,con165168,cmd990,seg19,slice3,dx1191684,x3670769,sx1,"LOG","58M01","Interconnect timeout: Connection to seg78 [seg_n_IP]:41127 from local port was not complete after 4000ms 4020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
    2024-03-15 12:37:38.207976,"gpadmin","test",p1028224,th-1930782912,"master_ip","16162",2024-03-15 12:37:19,3670769,con165168,cmd990,seg19,slice2,dx1191684,x3670769,sx1,"LOG","58M01","Interconnect timeout: Connection to seg104 [seg_m_IP]:54311 from local port was not complete after 4000ms 4020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
    2024-03-15 12:37:38.208025,"gpadmin","test",p1028240,th-1930782912,"master_ip","16174",2024-03-15 12:37:19,3670769,con165168,cmd990,seg19,slice1,dx1191684,x3670769,sx1,"LOG","58M01","Interconnect timeout: Connection to seg78 [seg_p_IP]:35825 from local port was not complete after 4000ms 4020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
    2024-03-15 12:37:42.208697,"gpadmin","test",p47265,th-63346880,"master_ip","16434",2024-03-15 12:37:19,3672943,con165168,cmd990,seg86,slice2,dx1191684,x3672943,sx1,"LOG","58M01","Interconnect timeout: Connection to seg114 [seghost_e_ip]:12007 from local port was not complete after 8000ms 8020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
    2024-03-15 12:37:42.208740,"gpadmin","test",p47265,th-63346880,"master_ip","16434",2024-03-15 12:37:19,3672943,con165168,cmd990,seg86,slice2,dx1191684,x3672943,sx1,"LOG","58M01","Interconnect timeout: Connection to seg119 [seghost_g_ip]:53997 from local port was not complete after 8000ms 8020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
    2024-03-15 12:37:46.207520,"gpadmin","test",p47256,th-63346880,"master_ip","16432",2024-03-15 12:37:19,3672943,con165168,cmd990,seg86,slice3,dx1191684,x3672943,sx1,"LOG","58M01","Interconnect timeout: Connection to seg108 [seghost_h_ip]:32049 from local port was not complete after 12000ms 12020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
  3. After that, nothing is logged, and 1 hour later:
    2024-03-15 13:38:00.370391,"gpadmin","test",p47224,th-63346880,"master_ip","16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,dx1191684,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
    2024-03-15 13:38:00.370453,"gpadmin","test",p47224,th-63346880,"master_ip","16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,dx1191684,x3672943,sx1,"LOG","00000","An exception was encountered during the execution of statement: CREATE TEMPORARY TABLE ""tmp_windows_for_join""
    2024-03-15 13:38:00.370510,"gpadmin","test",p47224,th-63346880,"master_ip","16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
    2024-03-15 13:38:00.370518,"gpadmin","test",p47224,th-63346880,"master_ip","16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
    2024-03-15 13:38:00.370544,"gpadmin","test",p47224,th-63346880,"master_ip","16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
    2024-03-15 13:38:00.370586,"gpadmin","test",p47224,th-63346880,"master_ip","16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
    2024-03-15 13:38:00.370628,"gpadmin","test",p47224,th-63346880,"master_ip",,"16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
    2024-03-15 13:38:00.370699,"gpadmin","test",p47224,th-63346880,"master_ip",,"16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
    2024-03-15 13:38:00.370759,"gpadmin","test",p47224,th-63346880,"master_ip",,"16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
    2024-03-15 13:38:00.370811,"gpadmin","test",p47224,th-63346880,"master_ip",,"16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
    2024-03-15 13:38:00.384248,"gpadmin","test",p47224,th-63346880,"master_ip",,"16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"PANIC","XX000","ERRORDATA_STACK_SIZE exceeded (elog.c:1679)",,,,,,,0,,"elog.c",1679,"Stack trace:
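
To see whether the timeouts concentrate on particular target segments, log lines like the ones above can be summarised. This is a sketch assuming bash with standard grep, sed, sort, uniq and awk; the function name is made up, and the only thing it relies on is the "Interconnect timeout: Connection to segNN" text shown in these logs.

```shell
# Read segment log text on stdin and print "segNN count" for every target segment
# named in an "Interconnect timeout" message, most frequent first.
timeout_targets() {
    grep -o 'Interconnect timeout: Connection to seg[0-9]*' \
        | sed 's/.*Connection to //' \
        | sort | uniq -c | sort -rn \
        | awk '{print $2, $1}'
}
```

Feeding the segment log files of the affected hosts into timeout_targets gives a quick list of the segments that other segments could not reach, which narrows down the hosts on which to check network and kernel settings.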

I attached gdb to all of the involved processes and got errordata[0] from them:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f9958cf181d in recv () from /lib/x86_64-linux-gnu/libpthread.so.0

and backtrace:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f30f011581d in recv () from /lib/x86_64-linux-gnu/libpthread.so.0
#0 0x00007f30f011581d in recv () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00000000008cd414 in secure_read ()
#2 0x00000000008d69cb in ?? ()
#3 0x00000000008d7b45 in pq_getbyte ()
#4 0x00000000009f4150 in PostgresMain ()
#5 0x00000000006b583e in ?? ()
#6 0x0000000000990309 in PostmasterMain ()
#7 0x00000000006b839b in main ()

For some other cases errordata[i] returned the following:
0x00007f02b575481d in ?? ()
$1 = {elevel = 0, output_to_server = 0 '\000', output_to_client = 0 '\000', show_funcname = 0 '\000', omit_location = 0 '\000', fatal_return = 0 '\000', hide_stmt = 0 '\000', filename = 0x0, lineno = 0, funcname = 0x0, domain = 0x0, context_domain = 0x0, sqlerrcode = 0, message = 0x0, detail = 0x0, detail_log = 0x0, hint = 0x0, context = 0x0, schema_name = 0x0, table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name = 0x0, cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 0, stacktracearray = {0x0 <repeats 30 times>}, stacktracesize = 0, printstack = 0 '\000', assoc_context = 0x0, hide_ctx = 0 '\000'}

sokolval commented on June 3, 2024

One more, different behaviour that was observed:
One of the segments reports this error: 2024-03-15 15:49:11.488368,"gpadmin","test",p379573,th-63346880,"master_ip","62582",2024-03-15 15:47:23,3687423,con176373,cmd659,seg86,slice4,dx1244463,x3687423,sx1,"LOG","00000","TeardownTCPInterconnect: waitOnOutbound recv: Connection reset by peer",,,,,,"CREATE TEMPORARY TABLE ""tmp_normalizing_windows""

The other segments log the same kind of messages as before: they try to connect to other segments, which do not respond, then "ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources." and then "PANIC","XX000","ERRORDATA_STACK_SIZE exceeded (elog.c:1679)",

interma commented on June 3, 2024

Hi @sokolval
Some of the logs you provided are notable:

2024-03-15 12:37:38.207942,"gpadmin","test",p1028215,th-1930782912,"master_ip","16150",2024-03-15 12:37:19,3670769,con165168,cmd990,seg19,slice3,dx1191684,x3670769,sx1,"LOG","58M01","Interconnect timeout: Connection to seg78 [seg_n_IP]:41127 from local port was not complete after 4000ms 4020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""

// still timing out after 8000/12000ms

The related code is in SetupTCPInterconnect() and setupOutgoingConnection(), and the logic is very simple:
it just attempts a TCP connect.

So, Interconnect timeout: Connection to seg78 [seg_n_IP]:41127 from local port was not complete after 4000/8000/12000/...ms means that the segment writing the log (seg19 in the line above) cannot establish a TCP connection to seg78.

Is it possible that there is some network failure in your environment? When it reproduces, you can immediately try a manual connection (e.g. using the nc command) between the host pair to see whether a TCP connection can be established.
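
As an alternative to nc, a manual connection attempt can be sketched with bash's built-in /dev/tcp redirection and the coreutils timeout command. The function name is made up, and the host and port to probe are placeholders to fill in from the failing log line.

```shell
# Try to open a TCP connection to host:port, giving up after a timeout in seconds.
tcp_probe() {
    local host=$1 port=$2 secs=${3:-5}
    # bash's /dev/tcp pseudo-device performs the connect(); `timeout` bounds the wait.
    if timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo "open    $host:$port"
    else
        echo "FAILED  $host:$port"
        return 1
    fi
}
```

Run it on the host that logged the timeout, against the IP and port from the message. The interconnect port in the log belongs to one query and may already be closed by the time of the test, so a port that is known to be listening on the target host is a more reliable choice.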

interma commented on June 3, 2024

Currently we understand this is definitely not a GP bug, but OS settings issue.

Close this Issue.
