Comments (13)
Thank you, I will catch the situation and try to check connection either with nc or telnet. Will get back to you once I have logs/facts!
from gpdb.
Hello!
We investigated on our side and found that we had overflowed the kernel's connection tracking table, whose size is set by the kernel parameter net.netfilter.nf_conntrack_max. Since we have a heavy load on GP, with 100+ segments, this limit was apparently too low to handle the demand for IP connections.
We have therefore raised this parameter to 5898240, and the errors are gone.
With it set to 2097152, it was not enough.
Under heavy load this error may leave traces in syslog similar to: w13 kernel: [1012462.328789] nf_conntrack: table full, dropping packet
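To spot this condition before packets are dropped, the current table usage can be compared with the limit. The snippet below is only a minimal sketch: it assumes the nf_conntrack module is loaded so that the standard procfs files nf_conntrack_count and nf_conntrack_max exist, and the 0.8 warning threshold is an arbitrary illustration rather than a Greenplum recommendation.

```python
from pathlib import Path

# Standard procfs location of the netfilter connection tracking counters.
# These files exist only when the nf_conntrack kernel module is loaded.
CONNTRACK_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(base_dir=CONNTRACK_DIR):
    """Return (count, maximum, fraction_used) for the conntrack table."""
    base = Path(base_dir)
    count = int((base / "nf_conntrack_count").read_text().strip())
    maximum = int((base / "nf_conntrack_max").read_text().strip())
    return count, maximum, count / maximum


def is_near_limit(count, maximum, threshold=0.8):
    """True when the table is at or above `threshold` of its capacity.

    The 0.8 default is an assumption made for this example only.
    """
    return count >= threshold * maximum


if __name__ == "__main__":
    try:
        count, maximum, used = conntrack_usage()
    except FileNotFoundError:
        print("nf_conntrack does not appear to be loaded on this host")
    else:
        print(f"conntrack entries: {count}/{maximum} ({used:.1%})")
        if is_near_limit(count, maximum):
            print("warning: close to net.netfilter.nf_conntrack_max")
```

Running this periodically on each segment host during heavy load would show how close the table gets to the limit.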
Perhaps it would be good to add this parameter, with an elevated value, to the "Recommended OS Parameters" page.
Another option would be to create a diagnostic tool that checks all kernel parameters and user limits and reports which ones could be adjusted for GP to perform best.
@interma thank you so much for quickly jumping in and pointing me to where to look for a solution. We now understand this is definitely not a GP bug but an OS settings issue.
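As a rough illustration of the diagnostic tool proposed above, the sketch below reads integer-valued kernel parameters from /proc/sys and reports those that are missing or below a given minimum. The function names are hypothetical, and the only baseline value shown is the nf_conntrack_max setting reported in this thread to have resolved the errors on one cluster; it is not an official recommendation.

```python
from pathlib import Path


def read_sysctl(name, proc_sys=Path("/proc/sys")):
    """Read a sysctl such as 'net.core.somaxconn' via /proc/sys; None if absent."""
    path = Path(proc_sys).joinpath(*name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


def check_minimums(minimums, reader=read_sysctl):
    """Compare integer-valued sysctls against minimum values.

    Returns a list of (name, current, minimum) for every parameter that is
    missing (current is None) or below its minimum.
    """
    findings = []
    for name, minimum in minimums.items():
        raw = reader(name)
        current = int(raw) if raw is not None else None
        if current is None or current < minimum:
            findings.append((name, current, minimum))
    return findings


if __name__ == "__main__":
    # Hypothetical baseline: the value is the one from this thread only.
    baseline = {"net.netfilter.nf_conntrack_max": 5898240}
    for name, current, minimum in check_minimums(baseline):
        print(f"{name}: current={current}, suggested minimum={minimum}")
```

Passing the reader as a parameter keeps the comparison logic testable without touching the real /proc/sys. User limits (ulimit values) would need a separate check.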
from gpdb.
Do you have a core dump that can be shared here?
Or could you please use gdb to read the error data in the core?
from gpdb.
I tried examining a process that was close to failing with the perf utility, but it did not give much detail. Everything looks standard, except that while looking at the main MPPEXEC select process I received the messages below:
? ( ): ... [continued]: select()) = 0 (Timeout)
select (/lib/x86_64-linux-gnu/libc-2.23.so)
format_sockaddr (/usr/local/gpdb/bin/postgres)
0.020 (500.339 ms): select(n: 202, inp: 0x7ffd4c6bd530, tvp: 0x7ffd4c6bd520) = 0 (Timeout)
select (/lib/x86_64-linux-gnu/libc-2.23.so)
format_sockaddr (/usr/local/gpdb/bin/postgres)
500.377 (500.582 ms): select(n: 202, inp: 0x7ffd4c6bd530, tvp: 0x7ffd4c6bd520) = 0 (Timeout)
select (/lib/x86_64-linux-gnu/libc-2.23.so)
format_sockaddr (/usr/local/gpdb/bin/postgres)
1000.978 (500.376 ms): select(n: 202, inp: 0x7ffd4c6bd530, tvp: 0x7ffd4c6bd520) = 0 (Timeout)
select (/lib/x86_64-linux-gnu/libc-2.23.so)
format_sockaddr (/usr/local/gpdb/bin/postgres)
1501.371 (500.587 ms): select(n: 202, inp: 0x7ffd4c6bd530, tvp: 0x7ffd4c6bd520) = 0 (Timeout)
select (/lib/x86_64-linux-gnu/libc-2.23.so)
format_sockaddr (/usr/local/gpdb/bin/postgres)
2001.979 (500.340 ms): select(n: 202, inp: 0x7ffd4c6bd530, tvp: 0x7ffd4c6bd520) = 0 (Timeout)
select (/lib/x86_64-linux-gnu/libc-2.23.so)
format_sockaddr (/usr/local/gpdb/bin/postgres)
As for using gdb: could you recommend any specific arguments to run it with so I can investigate efficiently?
from gpdb.
ERRORDATA_STACK_SIZE should PANIC and generate a core dump on your OS if core dumps are enabled.
The core dump path is recorded in the kernel parameter /proc/sys/kernel/core_pattern. You can read the path with cat /proc/sys/kernel/core_pattern and then go to that directory to see whether there is a core dump.
In addition, in the log of the segment that hit the error, you should find LOG entries containing call stacks. These can be used to double-check the core (using the pid in the LOG).
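Note that core_pattern does not always hold a plain directory path. A minimal sketch of how its value could be interpreted follows; it assumes the usual Linux convention that a leading "|" means cores are piped to a helper program and that a pattern without a leading "/" is relative to the crashing process's working directory. The function name is hypothetical.

```python
from pathlib import Path


def describe_core_pattern(pattern):
    """Classify a kernel core_pattern value.

    Returns ("pipe", command) when cores are piped to a helper program,
    ("absolute", directory) when the pattern is an absolute path, and
    ("relative", None) when cores land relative to the working directory
    of the crashing process.
    """
    pattern = pattern.strip()
    if pattern.startswith("|"):
        return "pipe", pattern[1:].strip()
    if pattern.startswith("/"):
        return "absolute", str(Path(pattern).parent)
    return "relative", None


if __name__ == "__main__":
    value = Path("/proc/sys/kernel/core_pattern").read_text()
    kind, detail = describe_core_pattern(value)
    print(f"core_pattern is {kind}: {detail}")
```

In the "pipe" case the cores are handled by the named program, so its own configuration decides where they end up.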
Now suppose you have the core dump:
- Run source greenplum_path.sh in your OS.
- Run which postgres; this command will tell you the full path of the postgres binary.
- Run gdb <full path of postgres binary> -c <core_dump_file_path> to look into the core dump with gdb.
- At this point you should be inside gdb:
  - bt to show the backtrace
  - p errordata[0] to show the top element in the error stack
  - p errordata[1], p errordata[2], ..... p errordata[6] to show some more
With all of the above information (LOGs and gdb core dump analysis output) we might better understand the problem.
If your core file is shareable, you can use packcore to pack it and send it to us with the GPDB LOGs.
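The gdb steps above can also be run non-interactively. The sketch below only builds the command line, using gdb's -batch, -ex and -c options; the helper name and the example core file path are hypothetical, and the postgres path is the one that appears in the traces in this thread.

```python
import shlex


def gdb_batch_command(postgres_path, core_path, depth=7):
    """Build a gdb command that prints the backtrace and errordata[0..depth-1]."""
    cmd = ["gdb", "-batch", "-ex", "bt"]
    for i in range(depth):
        cmd += ["-ex", f"p errordata[{i}]"]
    cmd += [postgres_path, "-c", core_path]
    return cmd


if __name__ == "__main__":
    # Hypothetical core file path, for illustration only.
    cmd = gdb_batch_command("/usr/local/gpdb/bin/postgres", "/tmp/core.12345")
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell after running source greenplum_path.sh, and its output redirected to a file to attach to the issue.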
from gpdb.
This is the output from some of the gdb commands:
(gdb) info stack
#0 0x00007f035e72c6b3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x0000000000b5fdee in ?? ()
#2 0x0000000000b632ac in TeardownTCPInterconnect ()
#3 0x0000000000b5dad9 in TeardownInterconnect ()
#4 0x00000000008923ff in mppExecutorFinishup ()
#5 0x000000000087f848 in standard_ExecutorEnd ()
#6 0x0000000000830a88 in PortalCleanup ()
#7 0x0000000000b270ea in PortalDrop ()
#8 0x00000000009f22a7 in ?? ()
#9 0x00000000009f57b8 in PostgresMain ()
#10 0x00000000006b583e in ?? ()
#11 0x0000000000990309 in PostmasterMain ()
#12 0x00000000006b839b in main ()
(gdb) info program
Using the running image of attached Thread 0x7f0361606740 (LWP 678676).
Program stopped at 0x7f035e72c6b3.
Type "info stack" or "info registers" for more information.
(gdb) info registers
rax 0xfffffffffffffdfe -514
rbx 0x4 4
rcx 0x7f035e72c6b3 139652446209715
rdx 0x0 0
rsi 0x7fff9565d320 140735699866400
rdi 0x8b 139
rbp 0x1 0x1
rsp 0x7fff9565b2d8 0x7fff9565b2d8
r8 0x7fff9565b300 140735699858176
r9 0x7fff9565b2b0 140735699858096
r10 0x0 0
r11 0x246 582
r12 0x3480cf8 55053560
r13 0x3480cf0 55053552
r14 0x80 128
r15 0x8a 138
rip 0x7f035e72c6b3 0x7f035e72c6b3 <select+19>
eflags 0x246 [ PF ZF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
k0 0x0 0
k1 0x0 0
k2 0x0 0
k3 0x0 0
k4 0x0 0
k5 0x0 0
k6 0x0 0
k7 0x0 0
(gdb) info frame
Stack level 0, frame at 0x7fff9565b2e0:
rip = 0x7f035e72c6b3 in select; saved rip = 0xb5fdee
called by frame at 0x7fff9565f370
Arglist at 0x7fff9565b2d0, args:
Locals at 0x7fff9565b2d0, Previous frame's sp is 0x7fff9565b2e0
Saved registers:
rip at 0x7fff9565b2d8
from gpdb.
Can you print errordata as I suggested in the previous comments?
from gpdb.
I tried printing errordata as well as errordata[0], errordata[1], etc.
However, the result is as follows:
(gdb) p errordata
No symbol table is loaded. Use the "file" command.
(gdb) p errordata[0]
No symbol table is loaded. Use the "file" command.
Could you advise whether I need to apply some flag when configuring or building gpdb so that symbols are included?
Another possibility: maybe I can get a symbols file from somewhere and supply it to gdb separately (with --symbols=SYMFILE, "Read symbols from SYMFILE")?
In fact, we tried to build gpdb with core files enabled, and it caused gpdb to produce a core dump and restart the cluster on any insignificant error, which was not acceptable even for a TEST environment.
from gpdb.
I was able to build the GP postgres executable with symbols and am now waiting for a suitable process to examine.
from gpdb.
What I was able to discover further:
- In most cases the situation develops as follows: at one of the calculation steps the master processes the next step, e.g.:
2024-03-15 12:37:34.172218 ,"gpadmin","test",p381664,th1973114688,"client_ip","55084",2024-03-15 12:37:18 ,4058903,con165168,cmd989,seg-1,,dx1191684,x4058903,sx1,"LOG","00000","statement: CREATE TEMPORARY TABLE ""tmp_windows_for_join""
- The segments try to establish connections to each other:
2024-03-15 12:37:19.737536,"gpadmin","test",p1028240,th-1930782912,"master_ip","16174",2024-03-15 12:37:19,0,con165168,,seg19,,,,sx1,"LOG","00000","connection authorized: user=gpadmin database=test",,,,,,,0,,"postinit.c",329,
2024-03-15 12:37:38.207942,"gpadmin","test",p1028215,th-1930782912,"master_ip","16150",2024-03-15 12:37:19,3670769,con165168,cmd990,seg19,slice3,dx1191684,x3670769,sx1,"LOG","58M01","Interconnect timeout: Connection to seg78 [seg_n_IP]:41127 from local port was not complete after 4000ms 4020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
2024-03-15 12:37:38.207976,"gpadmin","test",p1028224,th-1930782912,"master_ip","16162",2024-03-15 12:37:19,3670769,con165168,cmd990,seg19,slice2,dx1191684,x3670769,sx1,"LOG","58M01","Interconnect timeout: Connection to seg104 [seg_m_IP]:54311 from local port was not complete after 4000ms 4020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
2024-03-15 12:37:38.208025,"gpadmin","test",p1028240,th-1930782912,"master_ip","16174",2024-03-15 12:37:19,3670769,con165168,cmd990,seg19,slice1,dx1191684,x3670769,sx1,"LOG","58M01","Interconnect timeout: Connection to seg78 [seg_p_IP]:35825 from local port was not complete after 4000ms 4020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
2024-03-15 12:37:42.208697,"gpadmin","test",p47265,th-63346880,"master_ip","16434",2024-03-15 12:37:19,3672943,con165168,cmd990,seg86,slice2,dx1191684,x3672943,sx1,"LOG","58M01","Interconnect timeout: Connection to seg114 [seghost_e_ip]:12007 from local port was not complete after 8000ms 8020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
2024-03-15 12:37:42.208740,"gpadmin","test",p47265,th-63346880,"master_ip","16434",2024-03-15 12:37:19,3672943,con165168,cmd990,seg86,slice2,dx1191684,x3672943,sx1,"LOG","58M01","Interconnect timeout: Connection to seg119 [seghost_g_ip]:53997 from local port was not complete after 8000ms 8020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
2024-03-15 12:37:46.207520,"gpadmin","test",p47256,th-63346880,"master_ip","16432",2024-03-15 12:37:19,3672943,con165168,cmd990,seg86,slice3,dx1191684,x3672943,sx1,"LOG","58M01","Interconnect timeout: Connection to seg108 [seghost_h_ip]:32049 from local port was not complete after 12000ms 12020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
- After that, nothing, and then 1 hour later:
2024-03-15 13:38:00.370391,"gpadmin","test",p47224,th-63346880,"master_ip","16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,dx1191684,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
2024-03-15 13:38:00.370453,"gpadmin","test",p47224,th-63346880,"master_ip","16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,dx1191684,x3672943,sx1,"LOG","00000","An exception was encountered during the execution of statement: CREATE TEMPORARY TABLE ""tmp_windows_for_join""
2024-03-15 13:38:00.370510,"gpadmin","test",p47224,th-63346880,"master_ip","16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
2024-03-15 13:38:00.370518,"gpadmin","test",p47224,th-63346880,"master_ip","16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
2024-03-15 13:38:00.370544,"gpadmin","test",p47224,th-63346880,"master_ip","16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
2024-03-15 13:38:00.370586,"gpadmin","test",p47224,th-63346880,"master_ip","16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
2024-03-15 13:38:00.370628,"gpadmin","test",p47224,th-63346880,"master_ip",,"16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
2024-03-15 13:38:00.370699,"gpadmin","test",p47224,th-63346880,"master_ip",,"16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
2024-03-15 13:38:00.370759,"gpadmin","test",p47224,th-63346880,"master_ip",,"16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
2024-03-15 13:38:00.370811,"gpadmin","test",p47224,th-63346880,"master_ip",,"16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources.",,,,,,,0,,"ic_tcp.c",2041,
2024-03-15 13:38:00.384248,"gpadmin","test",p47224,th-63346880,"master_ip",,"16400",2024-03-15 12:37:18,3672943,con165168,cmd990,seg86,slice4,,x3672943,sx1,"PANIC","XX000","ERRORDATA_STACK_SIZE exceeded (elog.c:1679)",,,,,,,0,,"elog.c",1679,"Stack trace:
I attached gdb to all of the involved processes and got errordata[0] from them:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f9958cf181d in recv () from /lib/x86_64-linux-gnu/libpthread.so.0
and backtrace:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f30f011581d in recv () from /lib/x86_64-linux-gnu/libpthread.so.0
#0 0x00007f30f011581d in recv () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00000000008cd414 in secure_read ()
#2 0x00000000008d69cb in ?? ()
#3 0x00000000008d7b45 in pq_getbyte ()
#4 0x00000000009f4150 in PostgresMain ()
#5 0x00000000006b583e in ?? ()
#6 0x0000000000990309 in PostmasterMain ()
#7 0x00000000006b839b in main ()
For some other cases errordata[i] returned the following:
0x00007f02b575481d in ?? ()
$1 = {elevel = 0, output_to_server = 0 '\000', output_to_client = 0 '\000', show_funcname = 0 '\000', omit_location = 0 '\000', fatal_return = 0 '\000', hide_stmt = 0 '\000', filename = 0x0, lineno = 0, funcname = 0x0, domain = 0x0, context_domain = 0x0, sqlerrcode = 0, message = 0x0, detail = 0x0, detail_log = 0x0, hint = 0x0, context = 0x0, schema_name = 0x0, table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name = 0x0, cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 0, stacktracearray = {0x0 <repeats 30 times>}, stacktracesize = 0, printstack = 0 '\000', assoc_context = 0x0, hide_ctx = 0 '\000'}
from gpdb.
Another, different behaviour that was observed:
One of the segments gives this error: 2024-03-15 15:49:11.488368,"gpadmin","test",p379573,th-63346880,"master_ip","62582",2024-03-15 15:47:23,3687423,con176373,cmd659,seg86,slice4,dx1244463,x3687423,sx1,"LOG","00000","TeardownTCPInterconnect: waitOnOutbound recv: Connection reset by peer",,,,,,"CREATE TEMPORARY TABLE ""tmp_normalizing_windows""
Other segments produce the same kind of messages: they try to connect to other segments, which do not respond, and then "ERROR","58M01","Interconnect Error: Unexpected Motion Node Id: 3 (size 10). This means a motion node that wasn't setup is requesting interconnect resources." and then "PANIC","XX000","ERRORDATA_STACK_SIZE exceeded (elog.c:1679)",
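To see which target segments fail to accept connections most often, the "Interconnect timeout" retry messages can be tallied from the segment logs. This is a minimal sketch that assumes the message wording shown in the logs in this thread; the function name is hypothetical, and other GPDB versions may word the message differently.

```python
import re
from collections import Counter

# Matches the retry message as it appears in the logs quoted in this thread.
TIMEOUT_RE = re.compile(
    r"Interconnect timeout: Connection to (seg\d+) .*? "
    r"was not complete after (\d+)ms"
)


def summarize_timeouts(lines):
    """Return (retries per target segment, longest wait in ms per target segment)."""
    retries = Counter()
    longest = {}
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue
        target, waited_ms = match.group(1), int(match.group(2))
        retries[target] += 1
        longest[target] = max(longest.get(target, 0), waited_ms)
    return retries, longest


if __name__ == "__main__":
    import sys

    # Usage: pass one or more segment log files as arguments.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            retries, longest = summarize_timeouts(fh)
        for target, count in retries.most_common():
            print(f"{path}: {target} retried {count}x, up to {longest[target]}ms")
```

If the retries are spread across many target segments and hosts rather than concentrated on one, that would point to a host-wide or network-wide limit rather than a single bad segment.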
from gpdb.
Hi @sokolval
Based on the logs you provided, some entries are notable:
2024-03-15 12:37:38.207942,"gpadmin","test",p1028215,th-1930782912,"master_ip","16150",2024-03-15 12:37:19,3670769,con165168,cmd990,seg19,slice3,dx1191684,x3670769,sx1,"LOG","58M01","Interconnect timeout: Connection to seg78 [seg_n_IP]:41127 from local port was not complete after 4000ms 4020 elapsed. Will retry.",,,,,,"CREATE TEMPORARY TABLE ""tmp_windows_for_join""
// still timeout after 8000/12000ms
The related code is in SetupTCPInterconnect() and setupOutgoingConnection(). The logic is very simple: just try a tcp connect.
So, "Interconnect timeout: Connection to seg78 [seg_n_IP]:41127 from local port was not complete after 4000/8000/12000/...ms" means master cannot tcp connect to seg78.
Is it possible that there is some network failure in your env? When it reproduces, you can immediately try manually (e.g. using the nc command) between the host pair to see whether a tcp connection can be established.
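If nc is not available on the hosts, the same manual check can be done with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the 4-second default timeout is only chosen to mirror the first retry interval seen in the logs.

```python
import socket


def can_connect(host, port, timeout=4.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


if __name__ == "__main__":
    import sys

    # Usage: python check_tcp.py <host> <port>
    host, port = sys.argv[1], int(sys.argv[2])
    print("connected" if can_connect(host, port) else "connection failed")
```

Run it from the source host against the target host and port taken from the timeout message while the problem is occurring.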
from gpdb.
"We now understand this is definitely not a GP bug but an OS settings issue."
Closing this issue.
from gpdb.