Giter Site home page Giter Site logo

TLS with readv about xrootd4j HOT 8 CLOSED

dcache avatar dcache commented on May 30, 2024
TLS with readv

from xrootd4j.

Comments (8)

paulmillar avatar paulmillar commented on May 30, 2024

Some observations.

First, whatever exception is being thrown isn't being logged. This is a bug, and one that hampers the ability to figure out what's going on.

Second, the client (claims) it sends two vector reads: the first is 2885 bytes (2 KiB) and succeeds.

The second vector read fails. The domain log file and client logs agree that the failing vector read has two components: 55165 bytes (at offset 144200122), and 2482 bytes (at offset 145201649). The combined number of bytes is 57647 (~ 56 KiB).

The non-vector reads leading up to the failure are generally smaller; but there are some reads that are relatively big: 213447 bytes (208 KiB) and 13265 bytes (12 KiB), and 5680 bytes (5 KiB).

The maximum size of a TLS frame is 16 KiB.

My suspicion is that the first vector read succeeds because it is smaller than the maximum size of a TLS frame (2 KiB < 16 KiB). I've excluded the overhead from the xroot protocol (headers, etc); however, I would imagine that this would be (order-of-magnitude) a few bytes.

The result from the second vector read is clearly too big to fit in a single TLS frame (56 KiB > 16 KiB). Therefore, the response must be split into several TLS frames before it may be sent. If this isn't happening (for whatever reason) then I could imagine the socket throws an exception, which could trigger closing of the channel. Unfortunately, we don't have the details of why the channel was closed.

Interestingly, there are non-vector reads larger than the TLS limit (e.g., 208 KiB). If the problem is with attempting to send too much data over TLS then it is curious that non-vector reads succeed while a vector read doesn't. This suggests that vector reads are somehow handled differently from non-vector reads, when sending application-data over TLS.

That said, this is largely guess-work. It would be helpful if the pool were fixed, so the exception that closes the channel were logged.

from xrootd4j.

paulmillar avatar paulmillar commented on May 30, 2024

Also, since the problem is with xrootd4j, it would be interesting if the problem were to be confirmed using the stand-alone server, and not using dCache.

from xrootd4j.

alrossi avatar alrossi commented on May 30, 2024

So, first spend several weeks bringing the standalone server up to date (it has not been touched since I began working on xroot), then solve this problem? :-)

While I had overlooked the two-part vector read, I am not sure it means the buffer size is involved (though it could be), given your last observation. This is a placeholder issue, I haven't had time to explore more fully, but when I come back to it, I will begin with these useful observations, thanks.

How to get logging? That's tough, because it is probably coming from the SSLHandler, and that handler belongs to an independent class hierarchy. The handler's call stack thus probably bypasses dCache/xroot altogether.

It is possible to activate the SSL debug level, but I was having trouble originally making this work inside our pipeline without constantly slowing down the transfers. Certainly for one-off debugging, though, it is the thing to do next.

from xrootd4j.

alrossi avatar alrossi commented on May 30, 2024

I turned on javax.net, added Netty logging for the SslHandler, and added some more logging to the dCache xroot code.

Attached: pool debug output at TRACE level for no-tls (successful), and tls (stalls).

I also grabbed the large buffer/packet after which the tls version stalls in both. If you diff on the two files, you will see there is no difference.

There are no errors reported anywhere, just a stall on the part of the client.

I will now try to do this via remote debugging to see what that gets me.

pool-no-tls.log
pool-tls.log
notls-packet.txt
tls-packet.txt

So I am not even sure that the socket error is being generated on our side.

from xrootd4j.

alrossi avatar alrossi commented on May 30, 2024

This looks like the client talking to itself:

[2021-10-25 14:38:33.007003 -0500][Debug  ][ExDbgMsg          ] [dcatest04.fnal.gov:33127] MsgHandler created: 0x982b190 (message: kXR_readv (handle: 0x00000000, chunks: 2, total size: 57647) ).
[2021-10-25 14:38:33.007059 -0500][Debug  ][ExDbgMsg          ] [dcatest04.fnal.gov:33127] Moving MsgHandler: 0x982b190 (message: kXR_readv (handle: 0x00000000, chunks: 2, total size: 57647) ) from out-queu to in-queue.
[2021-10-25 14:39:33.070807 -0500][Debug  ][AsyncSock         ] [dcatest04.fnal.gov:33127.0] Closing the socket
[2021-10-25 14:39:33.072113 -0500][Debug  ][Poller            ] <[::ffff:131.225.67.36]:37782><--><[::ffff:131.225.13.6]:33127> Removing socket from the poller
[2021-10-25 14:39:33.072754 -0500][Debug  ][PostMaster        ] [dcatest04.fnal.gov:33127] Recovering error for stream #0: [ERROR] Socket error.
[2021-10-25 14:39:33.072973 -0500][Debug  ][PostMaster        ] [dcatest04.fnal.gov:33127] Reporting disconnection to queued message handlers.
[2021-10-25 14:39:33.073136 -0500][Debug  ][XRootD            ] [dcatest04.fnal.gov:33127] Handling error while processing kXR_readv (handle: 0x00000000, chunks: 2, total size: 57647): [ERROR] Socket error.
[2021-10-25 14:39:33.073215 -0500][Error  ][XRootD            ] [dcatest04.fnal.gov:33127] Unable to get the response to request kXR_readv (handle: 0x00000000, chunks: 2, total size: 57647)
[2021-10-25 14:39:33.073340 -0500][Debug  ][ExDbgMsg          ] [dcatest04.fnal.gov:33127] Passing to the thread-pool MsgHandler: 0x982b190 (message: kXR_readv (handle: 0x00000000, chunks: 2, total size: 57647) ).
[2021-10-25 14:39:33.073466 -0500][Debug  ][ExDbgMsg          ] [dcatest04.fnal.gov:33127] Calling MsgHandler: 0x982b190 (message: kXR_readv (handle: 0x00000000, chunks: 2, total size: 57647) ) with status: [ERROR] Socket error.

We are not sending back an error, as far as I can tell.

So the client evidently does not like something, but a comparison of what is going on on the pool currently reveals no difference that I can see between readv with and without TLS on.

from xrootd4j.

paulmillar avatar paulmillar commented on May 30, 2024

It could be that the xrootd client is expecting the TLS wrapping to work a particular way for vector reads (e.g., the boundary between two successful components of the vector read is never within the same TLS frame). However, that's just a guess.

So, two possible ways forward:

  1. get more information out of the xrootd client on what it believes is going wrong. This probably involves deep-diving into the code and/or asking Andy-et-al.
  2. run the xrootd server and compare the network traffic between what it provides and what dCache provides.

from xrootd4j.

alrossi avatar alrossi commented on May 30, 2024

Michal found a bug. From the looks of it, the client code was simply cutting out too soon in processing the response (it was not iterating).

xrootd/xrootd@38180cc

Actually, testing this will be hard to do. I am using the experiment's environment to run the ROOT client. I would need to swap out Michal's fix for the currently installed client.

from xrootd4j.

alrossi avatar alrossi commented on May 30, 2024

With a standalone test provided by Michal

#include "XrdCl/XrdClFile.hh"
#include "XrdCl/XrdClFileSystem.hh"

#include <iostream>

int main( int argc, char *argv[] )
{
  std::cout << "There we go!" << std::endl;

  XrdCl::File f;
  XrdCl::XRootDStatus s = f.Open( "xroots://fndcatemp2.fnal.gov:1094//pnfs/fs/usr/test/arossi/volatile/a048e67f-4397-4bb8-85eb-8d7e40d90763.dat", XrdCl::OpenFlags::Write );
  if( !s.IsOK() ) return 1;
  XrdCl::ChunkList chunks;
  char buf1[2], buf2[4];
  chunks.emplace_back( 0, 2, buf1 );
  chunks.emplace_back( 5, 4, buf2 );
  XrdCl::VectorReadInfo *vReadInfo = nullptr;
  s = f.VectorRead( chunks, nullptr, vReadInfo );
  if( !s.IsOK() ) return 2;
  delete vReadInfo;
  s = f.Close();
  if( !s.IsOK() ) return 3;

  std::cout << "The End!" << std::endl;
  return 0;
}

compiled against xrootd 5.4/dev tools 7 (g++ 7.3.1)

we get success. See attached log.

This was a readv of the same file being read by our experiment (mu2e).

We can close this issue.

main.log

from xrootd4j.

Related Issues (15)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.