Comments (8)
Some observations.
First, whatever exception is being thrown isn't being logged. This is a bug, and one that hampers the ability to figure out what's going on.
Second, the client (claims) it sends two vector reads: the first is 2885 bytes (2 KiB) and succeeds.
The second vector read fails. The domain log file and client logs agree that the failing vector read has two components: 55165 bytes (at offset 144200122), and 2482 bytes (at offset 145201649). The combined number of bytes is 57647 (~ 56 KiB).
The non-vector reads leading up to the failure are generally smaller; but there are some reads that are relatively big: 213447 bytes (208 KiB) and 13265 bytes (12 KiB), and 5680 bytes (5 KiB).
The maximum size of a TLS frame is 16 KiB.
My suspicion is that the first vector read succeeds because it is smaller than the maximum size of a TLS frame (2 KiB < 16 KiB). I've excluded the overhead from the xroot protocol (headers, etc); however, I would imagine that this would be (order-of-magnitude) a few bytes.
The result from the second vector read is clearly too big to fit in a single TLS frame (56 KiB > 16 KiB). Therefore, the response must be split into several TLS frames before it may be sent. If this isn't happening (for whatever reason) then I could imagine the socket throws an exception, which could trigger closing of the channel. Unfortunately, we don't have the details of why the channel was closed.
Interestingly, there are non-vector reads larger than the TLS limit (e.g., 208 KiB). If the problem is with attempting to send too much data over TLS then it is curious that non-vector reads succeed while a vector read doesn't. This suggests that vector reads are somehow handled differently from non-vector reads, when sending application-data over TLS.
That said, this is largely guess-work. It would be helpful if the pool were fixed, so the exception that closes the channel were logged.
from xrootd4j.
Also, since the problem is with xrootd4j, it would be interesting if the problem were to be confirmed using the stand-alone server, and not using dCache.
from xrootd4j.
So, first spend several weeks bringing the standalone server up to date (it has not been touched since I began working on xroot), then solve this problem? :-)
While I had overlooked the two-part vector read, I am not sure it means the buffer size is involved (though it could be), given your last observation. This is a placeholder issue, I haven't had time to explore more fully, but when I come back to it, I will begin with these useful observations, thanks.
How to get logging? That's tough, because it is probably coming from the SSLHandler, and that handler belongs to an independent class hierarchy. The handler's call stack thus probably bypasses dCache/xroot altogether.
It is possible to activate the SSL debug level, but I was having trouble originally making this work inside our pipeline without constantly slowing down the transfers. Certainly for one-off debugging, though, it is the thing to do next.
from xrootd4j.
I turned on javax.net, added Netty logging for the SslHandler, and added some more logging to the dCache xroot code.
Attached: pool debug output at TRACE level for no-tls (successful), and tls (stalls).
I also grabbed the large buffer/packet after which the tls version stalls in both. If you diff on the two files, you will see there is no difference.
There are no errors reported anywhere, just a stall on the part of the client.
I will now try to do this via remote debugging to see what that gets me.
pool-no-tls.log
pool-tls.log
notls-packet.txt
tls-packet.txt
So I am not even sure that the socket error is being generated on our side.
from xrootd4j.
This looks like the client talking to itself:
[2021-10-25 14:38:33.007003 -0500][Debug ][ExDbgMsg ] [dcatest04.fnal.gov:33127] MsgHandler created: 0x982b190 (message: kXR_readv (handle: 0x00000000, chunks: 2, total size: 57647) ).
[2021-10-25 14:38:33.007059 -0500][Debug ][ExDbgMsg ] [dcatest04.fnal.gov:33127] Moving MsgHandler: 0x982b190 (message: kXR_readv (handle: 0x00000000, chunks: 2, total size: 57647) ) from out-queu to in-queue.
[2021-10-25 14:39:33.070807 -0500][Debug ][AsyncSock ] [dcatest04.fnal.gov:33127.0] Closing the socket
[2021-10-25 14:39:33.072113 -0500][Debug ][Poller ] <[::ffff:131.225.67.36]:37782><--><[::ffff:131.225.13.6]:33127> Removing socket from the poller
[2021-10-25 14:39:33.072754 -0500][Debug ][PostMaster ] [dcatest04.fnal.gov:33127] Recovering error for stream #0: [ERROR] Socket error.
[2021-10-25 14:39:33.072973 -0500][Debug ][PostMaster ] [dcatest04.fnal.gov:33127] Reporting disconnection to queued message handlers.
[2021-10-25 14:39:33.073136 -0500][Debug ][XRootD ] [dcatest04.fnal.gov:33127] Handling error while processing kXR_readv (handle: 0x00000000, chunks: 2, total size: 57647): [ERROR] Socket error.
[2021-10-25 14:39:33.073215 -0500][Error ][XRootD ] [dcatest04.fnal.gov:33127] Unable to get the response to request kXR_readv (handle: 0x00000000, chunks: 2, total size: 57647)
[2021-10-25 14:39:33.073340 -0500][Debug ][ExDbgMsg ] [dcatest04.fnal.gov:33127] Passing to the thread-pool MsgHandler: 0x982b190 (message: kXR_readv (handle: 0x00000000, chunks: 2, total size: 57647) ).
[2021-10-25 14:39:33.073466 -0500][Debug ][ExDbgMsg ] [dcatest04.fnal.gov:33127] Calling MsgHandler: 0x982b190 (message: kXR_readv (handle: 0x00000000, chunks: 2, total size: 57647) ) with status: [ERROR] Socket error.
We are not sending back an error, as far as I can tell.
So the client evidently does not like something, but a comparison of what is going on on the pool currently reveals no difference that I can see between readv with and without TLS on.
from xrootd4j.
It could be that the xrootd client is expecting the TLS wrapping to work a particular way for vector reads (e.g., the boundary between two successful components of the vector read is never within the same TLS frame). However, that's just a guess.
So, two possible ways forward:
- get more information out of the xrootd client on what it believes is going wrong. This probably involves deep-diving into the code and/or asking Andy-et-al.
- run the xrootd server and compare the network traffic between what it provides and what dCache provides.
from xrootd4j.
Michal found a bug. From the looks of it, the client code was simply cutting out too soon in processing the response (it was not iterating).
Actually, testing this will be hard to do. I am using the experiment's environment to run the ROOT client. I would need to swap out Michal's fix for the currently installed client.
from xrootd4j.
With a standalone test provided by Michal
#include "XrdCl/XrdClFile.hh"
#include "XrdCl/XrdClFileSystem.hh"
#include <iostream>
int main( int argc, char *argv[] )
{
std::cout << "There we go!" << std::endl;
XrdCl::File f;
XrdCl::XRootDStatus s = f.Open( "xroots://fndcatemp2.fnal.gov:1094//pnfs/fs/usr/test/arossi/volatile/a048e67f-4397-4bb8-85eb-8d7e40d90763.dat", XrdCl::OpenFlags::Write );
if( !s.IsOK() ) return 1;
XrdCl::ChunkList chunks;
char buf1[2], buf2[4];
chunks.emplace_back( 0, 2, buf1 );
chunks.emplace_back( 5, 4, buf2 );
XrdCl::VectorReadInfo *vReadInfo = nullptr;
s = f.VectorRead( chunks, nullptr, vReadInfo );
if( !s.IsOK() ) return 2;
delete vReadInfo;
s = f.Close();
if( !s.IsOK() ) return 3;
std::cout << "The End!" << std::endl;
return 0;
}
compiled against xrootd 5.4/dev tools 7 (g++ 7.3.1)
we get success. See attached log.
This was a readv of the same file being read by our experiment (mu2e).
We can close this issue.
from xrootd4j.
Related Issues (15)
- handling stat request HOT 24
- malicious client can crash a pool on write HOT 4
- optimization of buffer usage on pools HOT 1
- Implement kXR_fattr and extended info for xKR_stat HOT 8
- issue with dCache 2.10.8 HOT 8
- Update parser so all paths may contain CGI/query elements HOT 5
- XrootD client implementation HOT 11
- more fun with AES cipher pad block errors, XrootD 4.9 HOT 6
- Stack-traced logged by mistake HOT 8
- Support TPC for ALICE tokens HOT 2
- Support kXR_query kXR_Qprep
- Support TLS HOT 1
- Error during decrypting/server-side key exchange: pad block corrupted HOT 35
- possible race in get checksum HOT 13
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from xrootd4j.