Comments (9)
@mohammaddawoodshaik thanks for the detailed debugging and the logs for the issue.
Just opened gluster/glusterfs#4224, which should hopefully fix the issue.
Let's wait for comments from the experts; once it is found to be fine, we will merge it into our base branch and make a kadalu release.
from kadalu.
@amarts a quick question, now that simple-quota is in gluster release branches, can we directly take gluster builds and containerize them for Kadalu?
We are actually doing almost the same thing. The main reason we decided to maintain our own branch was that glusterfs's release timelines didn't match our requirements, and considering Red Hat (now IBM) has reduced the frequency of releases too, we had to manage the patches ourselves to provide the best possible experience to kadalu users.
If you notice, all the PRs added to our branch have already been submitted and merged in the glusterfs devel branch.
Any update on this issue? This is blocking our promotions. Any help would be appreciated.
Some more info collected from the FUSE client logs:
[2023-08-18 05:45:52.101240 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-1: ping timeout is 0, returning
[2023-08-18 05:45:52.101751 +0000] D [write-behind.c:1742:wb_process_queue] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x7fe9)[0x7fda0dfeffe9] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0xabb0)[0x7fda0dff2bb0] (--> /opt/lib/glusterfs/2023.04.17/xlator/debug/io-stats.so(+0x5efb)[0x7fda0dfbbefb] (--> /opt/lib/libglusterfs.so.0(default_stat+0xac)[0x7fda138325ec] ))))) 0-common-storage-pool-write-behind: processing queues
[2023-08-18 05:45:52.101780 +0000] D [MSGID: 0] [write-behind.c:1689:__wb_pick_winds] 0-common-storage-pool-write-behind: (unique=50, fop=STAT, gfid=df1f4e82-89e1-4a20-9985-b6ddc28651f4, gen=0): picking the request for winding
[2023-08-18 05:45:52.101801 +0000] D [MSGID: 0] [afr-read-txn.c:448:afr_read_txn] 0-common-storage-pool-replicate-0: df1f4e82-89e1-4a20-9985-b6ddc28651f4: generation now vs cached: 2, 2
[2023-08-18 05:45:52.101860 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-0: ping timeout is 0, returning
[2023-08-18 05:45:52.102037 +0000] D [write-behind.c:411:__wb_request_unref] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x76b1)[0x7fda0dfef6b1] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x7ead)[0x7fda0dfefead] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x8ca8)[0x7fda0dff0ca8] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0xabb0)[0x7fda0dff2bb0] ))))) 0-common-storage-pool-write-behind: (unique = 50, fop=STAT, gfid=df1f4e82-89e1-4a20-9985-b6ddc28651f4, gen=0): destroying request, removing from all queues
[2023-08-18 05:45:52.102368 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-0: ping timeout is 0, returning
[2023-08-18 05:45:52.102473 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-1: ping timeout is 0, returning
[2023-08-18 05:45:52.103284 +0000] W [MSGID: 108027] [afr-common.c:2888:afr_attempt_readsubvol_set] 0-common-storage-pool-replicate-0: no read subvols for /subvol/73/6d/pvc-b966a7fd-c00b-446c-9818-99fae5794aee
[2023-08-18 05:45:52.103324 +0000] D [MSGID: 0] [afr-common.c:3082:afr_lookup_done] 0-stack-trace: stack-address: 0x7fd9fc000d08, common-storage-pool-replicate-0 returned -1 [Transport endpoint is not connected]
[2023-08-18 05:45:52.103372 +0000] D [MSGID: 0] [utime.c:218:gf_utime_set_mdata_lookup_cbk] 0-stack-trace: stack-address: 0x7fd9fc000d08, common-storage-pool-utime returned -1 [Transport endpoint is not connected]
[2023-08-18 05:45:52.103394 +0000] D [MSGID: 0] [write-behind.c:2371:wb_lookup_cbk] 0-stack-trace: stack-address: 0x7fd9fc000d08, common-storage-pool-write-behind returned -1 [Transport endpoint is not connected]
[2023-08-18 05:45:52.103406 +0000] D [MSGID: 0] [io-stats.c:2284:io_stats_lookup_cbk] 0-stack-trace: stack-address: 0x7fd9fc000d08, common-storage-pool returned -1 [Transport endpoint is not connected]
[2023-08-18 05:45:52.103421 +0000] W [fuse-bridge.c:1052:fuse_entry_cbk] 0-glusterfs-fuse: 51: LOOKUP() /subvol/73/6d/pvc-b966a7fd-c00b-446c-9818-99fae5794aee => -1 (Transport endpoint is not connected)
[2023-08-18 05:45:52.103611 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-0: ping timeout is 0, returning
[2023-08-18 05:45:52.103705 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-1: ping timeout is 0, returning
[2023-08-18 05:45:52.104244 +0000] W [MSGID: 108027] [afr-common.c:2888:afr_attempt_readsubvol_set] 0-common-storage-pool-replicate-0: no read subvols for /subvol/73/6d/pvc-b966a7fd-c00b-446c-9818-99fae5794aee
[2023-08-18 05:45:52.104278 +0000] D [MSGID: 0] [afr-common.c:3082:afr_lookup_done] 0-stack-trace: stack-address: 0x7fd9fc000d08, common-storage-pool-replicate-0 returned -1 [Transport endpoint is not connected]
[2023-08-18 05:45:52.104295 +0000] D [MSGID: 0] [utime.c:218:gf_utime_set_mdata_lookup_cbk] 0-stack-trace: stack-address: 0x7fd9fc000d08, common-storage-pool-utime returned -1 [Transport endpoint is not connected]
[2023-08-18 05:45:52.104307 +0000] D [MSGID: 0] [write-behind.c:2371:wb_lookup_cbk] 0-stack-trace: stack-address: 0x7fd9fc000d08, common-storage-pool-write-behind returned -1 [Transport endpoint is not connected]
[2023-08-18 05:45:52.104324 +0000] D [MSGID: 0] [io-stats.c:2284:io_stats_lookup_cbk] 0-stack-trace: stack-address: 0x7fd9fc000d08, common-storage-pool returned -1 [Transport endpoint is not connected]
[2023-08-18 05:45:52.104332 +0000] W [fuse-bridge.c:1052:fuse_entry_cbk] 0-glusterfs-fuse: 52: LOOKUP() /subvol/73/6d/pvc-b966a7fd-c00b-446c-9818-99fae5794aee => -1 (Transport endpoint is not connected)
[2023-08-18 05:45:53.922949 +0000] D [name.c:171:client_fill_address_family] 0-common-storage-pool-client-2: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: server-common-storage-pool-2-0.common-storage-pool)
[2023-08-18 05:45:53.924929 +0000] E [MSGID: 101073] [name.c:254:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=0}, {ret=Name or service not known}]
[2023-08-18 05:45:53.924969 +0000] E [name.c:383:af_inet_client_get_remote_sockaddr] 0-common-storage-pool-client-2: DNS resolution failed on host server-common-storage-pool-2-0.common-storage-pool
[2023-08-18 05:45:53.925123 +0000] D [MSGID: 0] [client.c:2235:client_rpc_notify] 0-common-storage-pool-client-2: got RPC_CLNT_DISCONNECT
[2023-08-18 05:45:53.925634 +0000] D [rpc-clnt-ping.c:90:rpc_clnt_remove_ping_timer_locked] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/libgfrpc.so.0(+0x6aae)[0x7fda1375caae] (--> /opt/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xd8)[0x7fda13763978] (--> /opt/lib/libgfrpc.so.0(+0xe8b8)[0x7fda137648b8] (--> /opt/lib/libgfrpc.so.0(rpc_transport_notify+0x26)[0x7fda1375fbf6] ))))) 0-: : ping timer event already removed
[2023-08-18 05:45:56.925800 +0000] D [name.c:171:client_fill_address_family] 0-common-storage-pool-client-2: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: server-common-storage-pool-2-0.common-storage-pool)
[2023-08-18 05:45:56.927391 +0000] E [MSGID: 101073] [name.c:254:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=0}, {ret=Name or service not known}]
[2023-08-18 05:45:56.927424 +0000] E [name.c:383:af_inet_client_get_remote_sockaddr] 0-common-storage-pool-client-2: DNS resolution failed on host server-common-storage-pool-2-0.common-storage-pool
[2023-08-18 05:45:56.927581 +0000] D [MSGID: 0] [client.c:2235:client_rpc_notify] 0-common-storage-pool-client-2: got RPC_CLNT_DISCONNECT
[2023-08-18 05:45:56.927895 +0000] D [rpc-clnt-ping.c:90:rpc_clnt_remove_ping_timer_locked] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/libgfrpc.so.0(+0x6aae)[0x7fda1375caae] (--> /opt/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xd8)[0x7fda13763978] (--> /opt/lib/libgfrpc.so.0(+0xe8b8)[0x7fda137648b8] (--> /opt/lib/libgfrpc.so.0(rpc_transport_notify+0x26)[0x7fda1375fbf6] ))))) 0-: : ping timer event already removed
[2023-08-18 05:45:57.346944 +0000] D [write-behind.c:1742:wb_process_queue] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x7fe9)[0x7fda0dfeffe9] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0xc1d0)[0x7fda0dff41d0] (--> /opt/lib/glusterfs/2023.04.17/xlator/debug/io-stats.so(+0x5c95)[0x7fda0dfbbc95] (--> /opt/lib/libglusterfs.so.0(default_lookup+0xb4)[0x7fda138326e4] ))))) 0-common-storage-pool-write-behind: processing queues
[2023-08-18 05:45:57.347005 +0000] D [MSGID: 0] [write-behind.c:1689:__wb_pick_winds] 0-common-storage-pool-write-behind: (unique=53, fop=LOOKUP, gfid=00000000-0000-0000-0000-000000000001, gen=0): picking the request for winding
[2023-08-18 05:45:57.347159 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-0: ping timeout is 0, returning
[2023-08-18 05:45:57.347312 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-1: ping timeout is 0, returning
[2023-08-18 05:45:57.347533 +0000] D [write-behind.c:411:__wb_request_unref] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x76b1)[0x7fda0dfef6b1] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x7ead)[0x7fda0dfefead] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x8ca8)[0x7fda0dff0ca8] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0xc1d0)[0x7fda0dff41d0] ))))) 0-common-storage-pool-write-behind: (unique = 53, fop=LOOKUP, gfid=00000000-0000-0000-0000-000000000001, gen=0): destroying request, removing from all queues
[2023-08-18 05:45:57.348530 +0000] D [write-behind.c:1742:wb_process_queue] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x7fe9)[0x7fda0dfeffe9] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0xc1d0)[0x7fda0dff41d0] (--> /opt/lib/glusterfs/2023.04.17/xlator/debug/io-stats.so(+0x5c95)[0x7fda0dfbbc95] (--> /opt/lib/libglusterfs.so.0(default_lookup+0xb4)[0x7fda138326e4] ))))) 0-common-storage-pool-write-behind: processing queues
[2023-08-18 05:45:57.348552 +0000] D [MSGID: 0] [write-behind.c:1689:__wb_pick_winds] 0-common-storage-pool-write-behind: (unique=54, fop=LOOKUP, gfid=99c3c687-2955-4ddc-9468-bf8ce5dbc2f3, gen=0): picking the request for winding
[2023-08-18 05:45:57.348629 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-0: ping timeout is 0, returning
[2023-08-18 05:45:57.348728 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-1: ping timeout is 0, returning
[2023-08-18 05:45:57.348900 +0000] D [write-behind.c:411:__wb_request_unref] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x76b1)[0x7fda0dfef6b1] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x7ead)[0x7fda0dfefead] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x8ca8)[0x7fda0dff0ca8] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0xc1d0)[0x7fda0dff41d0] ))))) 0-common-storage-pool-write-behind: (unique = 54, fop=LOOKUP, gfid=99c3c687-2955-4ddc-9468-bf8ce5dbc2f3, gen=0): destroying request, removing from all queues
[2023-08-18 05:45:57.349811 +0000] D [write-behind.c:1742:wb_process_queue] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x7fe9)[0x7fda0dfeffe9] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0xc1d0)[0x7fda0dff41d0] (--> /opt/lib/glusterfs/2023.04.17/xlator/debug/io-stats.so(+0x5c95)[0x7fda0dfbbc95] (--> /opt/lib/libglusterfs.so.0(default_lookup+0xb4)[0x7fda138326e4] ))))) 0-common-storage-pool-write-behind: processing queues
[2023-08-18 05:45:57.349847 +0000] D [MSGID: 0] [write-behind.c:1689:__wb_pick_winds] 0-common-storage-pool-write-behind: (unique=55, fop=LOOKUP, gfid=a0d03120-a76d-4ec4-afda-64ef29befdda, gen=0): picking the request for winding
[2023-08-18 05:45:57.349958 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-0: ping timeout is 0, returning
[2023-08-18 05:45:57.350059 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-1: ping timeout is 0, returning
[2023-08-18 05:45:57.350240 +0000] D [write-behind.c:411:__wb_request_unref] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x76b1)[0x7fda0dfef6b1] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x7ead)[0x7fda0dfefead] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x8ca8)[0x7fda0dff0ca8] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0xc1d0)[0x7fda0dff41d0] ))))) 0-common-storage-pool-write-behind: (unique = 55, fop=LOOKUP, gfid=a0d03120-a76d-4ec4-afda-64ef29befdda, gen=0): destroying request, removing from all queues
[2023-08-18 05:45:57.350854 +0000] D [write-behind.c:1742:wb_process_queue] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x7fe9)[0x7fda0dfeffe9] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0xc1d0)[0x7fda0dff41d0] (--> /opt/lib/glusterfs/2023.04.17/xlator/debug/io-stats.so(+0x5c95)[0x7fda0dfbbc95] (--> /opt/lib/libglusterfs.so.0(default_lookup+0xb4)[0x7fda138326e4] ))))) 0-common-storage-pool-write-behind: processing queues
[2023-08-18 05:45:57.350878 +0000] D [MSGID: 0] [write-behind.c:1689:__wb_pick_winds] 0-common-storage-pool-write-behind: (unique=56, fop=LOOKUP, gfid=df1f4e82-89e1-4a20-9985-b6ddc28651f4, gen=0): picking the request for winding
[2023-08-18 05:45:57.350964 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-0: ping timeout is 0, returning
[2023-08-18 05:45:57.351058 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-1: ping timeout is 0, returning
[2023-08-18 05:45:57.351511 +0000] D [write-behind.c:411:__wb_request_unref] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x76b1)[0x7fda0dfef6b1] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x7ead)[0x7fda0dfefead] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0x8ca8)[0x7fda0dff0ca8] (--> /opt/lib/glusterfs/2023.04.17/xlator/performance/write-behind.so(+0xc1d0)[0x7fda0dff41d0] ))))) 0-common-storage-pool-write-behind: (unique = 56, fop=LOOKUP, gfid=df1f4e82-89e1-4a20-9985-b6ddc28651f4, gen=0): destroying request, removing from all queues
[2023-08-18 05:45:57.351760 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-0: ping timeout is 0, returning
[2023-08-18 05:45:57.351873 +0000] D [rpc-clnt-ping.c:290:rpc_clnt_start_ping] 0-common-storage-pool-client-1: ping timeout is 0, returning
[2023-08-18 05:45:57.352404 +0000] W [MSGID: 108027] [afr-common.c:2888:afr_attempt_readsubvol_set] 0-common-storage-pool-replicate-0: no read subvols for /subvol/73/6d/pvc-b966a7fd-c00b-446c-9818-99fae5794aee
[2023-08-18 05:45:57.352446 +0000] D [MSGID: 0] [afr-common.c:3082:afr_lookup_done] 0-stack-trace: stack-address: 0x7fd9fc000d08, common-storage-pool-replicate-0 returned -1 [Transport endpoint is not connected]
[2023-08-18 05:45:57.352511 +0000] D [MSGID: 0] [utime.c:218:gf_utime_set_mdata_lookup_cbk] 0-stack-trace: stack-address: 0x7fd9fc000d08, common-storage-pool-utime returned -1 [Transport endpoint is not connected]
[2023-08-18 05:45:57.352537 +0000] D [MSGID: 0] [write-behind.c:2371:wb_lookup_cbk] 0-stack-trace: stack-address: 0x7fd9fc000d08, common-storage-pool-write-behind returned -1 [Transport endpoint is not connected]
[2023-08-18 05:45:57.352574 +0000] D [MSGID: 0] [io-stats.c:2284:io_stats_lookup_cbk] 0-stack-trace: stack-address: 0x7fd9fc000d08, common-storage-pool returned -1 [Transport endpoint is not connected]
[2023-08-18 05:45:57.352600 +0000] W [fuse-bridge.c:1052:fuse_entry_cbk] 0-glusterfs-fuse: 57: LOOKUP() /subvol/73/6d/pvc-b966a7fd-c00b-446c-9818-99fae5794aee => -1 (Transport endpoint is not connected)
[2023-08-18 05:45:59.928025 +0000] D [name.c:171:client_fill_address_family] 0-common-storage-pool-client-2: address-family not specified, marking it as unspec for getaddrinfo to resolve from (remote-host: server-common-storage-pool-2-0.common-storage-pool)
[2023-08-18 05:45:59.929565 +0000] E [MSGID: 101073] [name.c:254:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=0}, {ret=Name or service not known}]
[2023-08-18 05:45:59.929592 +0000] E [name.c:383:af_inet_client_get_remote_sockaddr] 0-common-storage-pool-client-2: DNS resolution failed on host server-common-storage-pool-2-0.common-storage-pool
[2023-08-18 05:45:59.929748 +0000] D [MSGID: 0] [client.c:2235:client_rpc_notify] 0-common-storage-pool-client-2: got RPC_CLNT_DISCONNECT
[2023-08-18 05:45:59.930056 +0000] D [rpc-clnt-ping.c:90:rpc_clnt_remove_ping_timer_locked] (--> /opt/lib/libglusterfs.so.0(_gf_log_callingfn+0x182)[0x7fda137c12f2] (--> /opt/lib/libgfrpc.so.0(+0x6aae)[0x7fda1375caae] (--> /opt/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xd8)[0x7fda13763978] (--> /opt/lib/libgfrpc.so.0(+0xe8b8)[0x7fda137648b8] (--> /opt/lib/libgfrpc.so.0(rpc_transport_notify+0x26)[0x7fda1375fbf6] ))))) 0-: : ping timer event already removed
^C
root@kadalu-csi-nodeplugin-lcc25:/var/log/gluster# ls /mnt/common-storage-pool_dawood/subvol/73/6d/pvc-b966a7fd-c00b-446c-9818-99fae5794aee
ls: cannot access '/mnt/common-storage-pool_dawood/subvol/73/6d/pvc-b966a7fd-c00b-446c-9818-99fae5794aee': Transport endpoint is not connected
root@kadalu-csi-nodeplugin-lcc25:/var/log/gluster# ls /mnt/common-storage-pool_dawood/subvol/73/6d/pvc-b966a7fd-c00b-446c-9818-99fae5794aee
ls: cannot access '/mnt/common-storage-pool_dawood/subvol/73/6d/pvc-b966a7fd-c00b-446c-9818-99fae5794aee': Transport endpoint is not connected
root@kadalu-csi-nodeplugin-lcc25:/var/log/gluster#
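The "DNS resolution failed" errors above come from `getaddrinfo()` on the brick pod's headless-service name (`gf_resolve_ip6` passes AF_UNSPEC, as the `client_fill_address_family` line shows). A minimal sketch to reproduce the same check from inside the client pod; the hostname is taken from the logs and only resolves inside the cluster network:

```python
import socket

def resolve(host: str):
    """Mimic gluster's gf_resolve_ip6: getaddrinfo with AF_UNSPEC (family=0)."""
    try:
        infos = socket.getaddrinfo(host, None, socket.AF_UNSPEC)
        return [info[4][0] for info in infos]
    except socket.gaierror:
        # Corresponds to the log line "ret=Name or service not known"
        return None

# Hostname from the FUSE log above; outside the cluster this returns None.
print(resolve("server-common-storage-pool-2-0.common-storage-pool"))
```

If this returns None from within the nodeplugin pod while the server pod is running, the problem is the headless service record for the StatefulSet pod, not gluster itself.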
From further debugging on the issue, we found some more info:
- When subVol directories are stuck in the healing state, we observed the following:
root@maglev-master-10-104-241-73:/data/srv/data/brick/subvol/a1/cb# getfattr -m . -d -e hex pvc-36830367-6b27-4d50-baf1-9c00e0c5d805/
# file: pvc-36830367-6b27-4d50-baf1-9c00e0c5d805/
trusted.afr.common-storage-pool-client-0=0x0000000000010acb00000000
trusted.afr.common-storage-pool-client-1=0x000000000000ea0600000000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xce671f3b681640f2948a709dc9694875
trusted.gfs.squota.limit=0x31303733373431383234
trusted.glusterfs.mdata=0x0100000000000000000000000064ea4c0c000000001a5f9a940000000064ea4bfd000000000c43bd360000000064ea4bfd000000000c43bd36
trusted.glusterfs.namespace=0x74727565
- Here, the heal is yet to be done on client-0 and client-1.
- Now client-2 picks up this entry as heal-pending and tries to heal the other two bricks.
- On the bricks, the heal is failing with the error:
Remove-Xattr: operation is not supported
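The getfattr output above can be decoded: each trusted.afr.* value is three big-endian 32-bit counters (data, metadata, entry pending operations), and trusted.gfs.squota.limit is the limit in ASCII digits encoded as hex. A decoding sketch (the helper name is ours; the counter layout is standard gluster AFR):

```python
def parse_afr(hexval: str):
    """Split an AFR pending xattr into (data, metadata, entry) counters."""
    raw = bytes.fromhex(hexval.removeprefix("0x"))
    return tuple(int.from_bytes(raw[i:i + 4], "big") for i in (0, 4, 8))

# Values from the brick above: both replicas have metadata heals pending,
# which matches the repeating "metadata selfheal" log lines.
for client, val in {
    "client-0": "0x0000000000010acb00000000",
    "client-1": "0x000000000000ea0600000000",
}.items():
    data, meta, entry = parse_afr(val)
    print(client, f"data={data} metadata={meta} entry={entry}")

# The simple-quota limit is plain ASCII digits:
print(bytes.fromhex("31303733373431383234").decode())  # -> 1073741824 (1 GiB)
```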
Client-2 logs:
The message "I [MSGID: 108026] [afr-self-heal-common.c:1758:afr_log_selfheal] 0-common-storage-pool-replicate-0: Completed metadata selfheal on ce671f3b-6816-40f2-948a-709dc9694875. sources=[2] sinks=" repeated 117 times between [2023-08-28 09:12:58.468222 +0000] and [2023-08-28 09:14:57.207600 +0000]
[2023-08-28 09:14:58.215228 +0000] I [MSGID: 108026] [afr-self-heal-metadata.c:50:__afr_selfheal_metadata_do] 0-common-storage-pool-replicate-0: performing metadata selfheal on ce671f3b-6816-40f2-948a-709dc9694875
[2023-08-28 09:14:58.218331 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:1057:client4_0_removexattr_cbk] 0-common-storage-pool-client-0: remote operation failed. [{errno=95}, {error=Operation not supported}]
[2023-08-28 09:14:58.220900 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:1057:client4_0_removexattr_cbk] 0-common-storage-pool-client-1: remote operation failed. [{errno=95}, {error=Operation not supported}]
[2023-08-28 09:14:58.224025 +0000] I [MSGID: 108026] [afr-self-heal-common.c:1758:afr_log_selfheal] 0-common-storage-pool-replicate-0: Completed metadata selfheal on ce671f3b-6816-40f2-948a-709dc9694875. sources=[2] sinks=
The message "I [MSGID: 108026] [afr-self-heal-metadata.c:50:__afr_selfheal_metadata_do] 0-common-storage-pool-replicate-0: performing metadata selfheal on ce671f3b-6816-40f2-948a-709dc9694875" repeated 117 times between [2023-08-28 09:14:58.215228 +0000] and [2023-08-28 09:16:56.952424 +0000]
In the other bricks' logs:
[2023-08-28 09:22:13.429601 +0000] I [MSGID: 115058] [server-rpc-fops_v2.c:777:server4_removexattr_cbk] 0-common-storage-pool-server: REMOVEXATTR info [{frame=613931}, {path=/subvol/a1/cb/pvc-36830367-6b27-4d50-baf1-9c00e0c5d805}, {uuid_utoa=ce671f3b-6816-40f2-948a-709dc9694875}, {name=}, {client=CTX_ID:322bad7c-cb23-4ae1-b717-8533410f2326-GRAPH_ID:0-PID:18-HOST:server-common-storage-pool-0-0-PC_NAME:common-storage-pool-client-1-RECON_NO:-0}, {error-xlator=-}, {errno=95}, {error=Operation not supported}]
[2023-08-28 09:22:13.430435 +0000] I [MSGID: 0] [simple-quota.c:260:sq_update_hard_limit] 0-common-storage-pool-simple-quota: hardlimit update: ce671f3b-6816-40f2-948a-709dc9694875 1073741824 0
[2023-08-28 09:22:14.444170 +0000] E [MSGID: 0] [server-rpc-fops_v2.c:2846:server4_removexattr_resume] 0-/bricks/common-storage-pool/data/brick: /subvol/a1/cb/pvc-36830367-6b27-4d50-baf1-9c00e0c5d805: removal of namespace is not allowed [Operation not supported]
[2023-08-28 09:22:14.444220 +0000] I [MSGID: 115058] [server-rpc-fops_v2.c:777:server4_removexattr_cbk] 0-common-storage-pool-server: REMOVEXATTR info [{frame=613940}, {path=/subvol/a1/cb/pvc-36830367-6b27-4d50-baf1-9c00e0c5d805}, {uuid_utoa=ce671f3b-6816-40f2-948a-709dc9694875}, {name=}, {client=CTX_ID:322bad7c-cb23-4ae1-b717-8533410f2326-GRAPH_ID:0-PID:18-HOST:server-common-storage-pool-0-0-PC_NAME:common-storage-pool-client-1-RECON_NO:-0}, {error-xlator=-}, {errno=95}, {error=Operation not supported}]
[2023-08-28 09:22:14.444679 +0000] I [MSGID: 0] [simple-quota.c:260:sq_update_hard_limit] 0-common-storage-pool-simple-quota: hardlimit update: ce671f3b-6816-40f2-948a-709dc9694875 1073741824 0
As the two other server pods are running fine, FUSE should not be showing an error while accessing the data.
- You mentioned bringing down the network interface on that node; are you sure components other than Kadalu came back up fine?
- Along with the server pod, the nodeplugin would also have gone down; after they came back up, did you restart your application pod?
- Please run heal info or trigger heals from the provisioner pod, not directly on the server pods.
- Server pods are StatefulSets; when the network was brought down, maybe k8s interfered in some unknown way 🤔
- Having said the above, I can't really decode the glusterfs logs and provide a way forward.
- I'm inclined not to try this scenario, as it isn't at the integration layer.
@leelavg
Forgot to mention some more info here.
The info I shared above is from a system where all server and node-plugin pods are running fine. Despite having tried a full heal multiple times, we still see the above subVol stuck in the HealPending state.
Coming back to the "Transport endpoint is not connected" issue:
We are seeing this issue as a consequence of the above one. Because the subVol is stuck in the heal-pending state, after a couple of network flaps on different nodes (one at a time) the client is left with no brick holding correct data for it to access, and thus hits the "Transport endpoint is not connected" error.
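The "no read subvols" condition follows from the pending counters: a brick only qualifies as a read source if it is up and not blamed by another brick's pending xattrs. A toy model of that selection (our simplification for illustration, not gluster's actual AFR source-selection code):

```python
def readable_sources(up, blames):
    """up[j]: brick j reachable; blames[i][j]: brick i holds pending ops for brick j.
    A brick is a read source only if it is up and no brick blames it."""
    n = len(up)
    blamed = {j for i in range(n) for j in range(n) if i != j and blames[i][j]}
    return [j for j in range(n) if up[j] and j not in blamed]

# Scenario from this issue: brick 2 blames bricks 0 and 1 (heal pending),
# so brick 2 is the only good copy.
blames = [
    [0, 0, 0],  # brick 0 blames nobody
    [0, 0, 0],  # brick 1 blames nobody
    [1, 1, 0],  # brick 2 blames bricks 0 and 1
]
print(readable_sources([True, True, True], blames))   # -> [2]
# After a network flap takes brick 2 away, no valid source remains,
# which surfaces to FUSE as "Transport endpoint is not connected":
print(readable_sources([True, True, False], blames))  # -> []
```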
Hello @amarts,
I have patched my clusters with the latest fix you provided, but despite our efforts the issue still persists.
Following is the error log I see in the brick:
[2024-02-26 10:10:40.901386 +0000] I [MSGID: 115036] [server.c:494:server_rpc_notify] 0-common-storage-pool-server: disconnecting connection [{client-uid=CTX_ID:5365c705-8706-4c1e-b4d9-706543c870c0-GRAPH_ID:0-PID:217-HOST:server-common-storage-pool-2-0-PC_NAME:common-storage-pool-client-0-RECON_NO:-0}]
[2024-02-26 10:10:40.901536 +0000] I [MSGID: 101054] [client_t.c:374:gf_client_unref] 0-common-storage-pool-server: Shutting down connection CTX_ID:5365c705-8706-4c1e-b4d9-706543c870c0-GRAPH_ID:0-PID:217-HOST:server-common-storage-pool-2-0-PC_NAME:common-storage-pool-client-0-RECON_NO:-0
The message "I [MSGID: 108026] [afr-self-heal-metadata.c:50:__afr_selfheal_metadata_do] 0-common-storage-pool-replicate-0: performing metadata selfheal on e51affac-4d26-4649-94dc-2eb7d765dc7b" repeated 173 times between [2024-02-26 10:09:35.004196 +0000] and [2024-02-26 10:11:02.194570 +0000]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:1057:client4_0_removexattr_cbk] 0-common-storage-pool-client-2: remote operation failed. [{errno=95}, {error=Operation not supported}]" repeated 173 times between [2024-02-26 10:09:35.006733 +0000] and [2024-02-26 10:11:02.195794 +0000]
The message "I [MSGID: 108026] [afr-self-heal-common.c:1758:afr_log_selfheal] 0-common-storage-pool-replicate-0: Completed metadata selfheal on e51affac-4d26-4649-94dc-2eb7d765dc7b. sources=[0] 1 sinks=" repeated 173 times between [2024-02-26 10:09:35.009663 +0000] and [2024-02-26 10:11:02.197650 +0000]
[2024-02-26 10:11:03.200718 +0000] I [MSGID: 108026] [afr-self-heal-metadata.c:50:__afr_selfheal_metadata_do] 0-common-storage-pool-replicate-0: performing metadata selfheal on e51affac-4d26-4649-94dc-2eb7d765dc7b
[2024-02-26 10:11:03.201815 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:1057:client4_0_removexattr_cbk] 0-common-storage-pool-client-2: remote operation failed. [{errno=95}, {error=Operation not supported}]
[2024-02-26 10:11:03.203594 +0000] I [MSGID: 108026] [afr-self-heal-common.c:1758:afr_log_selfheal] 0-common-storage-pool-replicate-0: Completed metadata selfheal on e51affac-4d26-4649-94dc-2eb7d765dc7b. sources=[0] 1 sinks=
Also, I have two subVols (PVCs) created, and I see that these directories are stuck in the HealPending state, though the data inside them is healing properly.
List of files needing a heal on common-storage-pool:
Brick server-common-storage-pool-0-0.common-storage-pool:/bricks/common-storage-pool/data/brick
/subvol/3a/74/pvc-bac13d60-444a-400d-a7e3-e343fd8b45c6
/subvol/aa/84/pvc-01833d52-a37d-4ded-b455-9a454d11343a
cc: @leelavg @aravindavk @vatsa287