
Comments (27)

timurhai commented on June 13, 2024

Hello, Jan!
This function already runs in a cycle:

while (AFRunning)
{
    // Update machine resources:
    if ((cycle % ResourcesUpdatePeriod) == 0)
        render->getResources();
    // Let tasks do their work:
    render->refreshTasks();
    // Update server (send info and receive an answer):
    af::Msg * answer = render->updateServer();
    // ... (rest of the loop body omitted)
}

A cycle inside a cycle is not needed.

from cgru.

ultra-sonic commented on June 13, 2024

Hi Timur,

here are the most common log entries that we found on frozen afrender clients:

Mon 25 Sep 09:34.06: INFO    Failed to connect to server. Last success connect time: Mon 25 Sep 09:33.23
Mon 25 Sep 09:34.06: INFO    Last connect was: 43 seconds ago, Zombie time: 180s
AFERROR: msgsendtoaddress: connect failure for msgType 'TRenderUpdate':
10.10.8.150:51000: Connection refused

and

D=19970: Exit Code=0 Status=0
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Tue 26 Sep 10:15.31: INFO    Failed to connect to server. Last success connect time: Tue 26 Sep 10:15.18
Tue 26 Sep 10:15.31: INFO    Last connect was: 13 seconds ago, Zombie time: 180s
Tue 26 Sep 10:45.21: ERROR   RenderHost::stopTask: No such task: TP: j171 b2 t422 (now running 0 tasks).
Tue 26 Sep 10:45.21: ERROR   RenderHost::stopTask: No such task: TP: j963 b3 t17 (now running 0 tasks).

Is there any way to get more verbose logging, especially about what is going on with our sockets?

It would be interesting to get log output every time RCVTIMEO or SNDTIMEO is reached (on both ends, afserver and afrender). I suspect this is the cause of the "Resource temporarily unavailable" error.
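For reference, on Linux a blocking read on a socket whose SO_RCVTIMEO has expired returns -1 with errno set to EAGAIN (the same value as EWOULDBLOCK), and perror prints that as "Resource temporarily unavailable". The following is a minimal sketch of how such a timeout could be detected and logged explicitly. It is not Afanasy code: the function names setReceiveTimeout, readAndLogTimeout and demoReceiveTimeout are made up for illustration, and a local socket pair stands in for the real server connection.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

// Hypothetical helper: sets SO_RCVTIMEO on a socket.
// A timeout of 0 means "no timeout" (the read blocks forever).
bool setReceiveTimeout(int fd, int timeoutMs)
{
    assert(timeoutMs >= 0);
    timeval tv;
    tv.tv_sec = timeoutMs / 1000;
    tv.tv_usec = (timeoutMs % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0;
}

// Hypothetical helper: reads once and logs whether a failure was a timeout.
// Returns 0 if the read succeeded, otherwise the errno of the failed read.
int readAndLogTimeout(int fd, char * buffer, size_t size)
{
    ssize_t bytes = read(fd, buffer, size);
    if (bytes >= 0)
        return 0;
    int err = errno;
    if (err == EAGAIN || err == EWOULDBLOCK)
        fprintf(stderr, "read: receive timeout reached (%s)\n", strerror(err));
    else
        fprintf(stderr, "read: %s\n", strerror(err));
    return err;
}

// Demonstration: a connected socket pair whose peer never writes,
// so the read can only end by hitting the receive timeout.
int demoReceiveTimeout(int timeoutMs)
{
    assert(timeoutMs > 0);
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return errno;
    int err = 0;
    if (!setReceiveTimeout(fds[0], timeoutMs))
    {
        err = errno;
    }
    else
    {
        char header[16];
        err = readAndLogTimeout(fds[0], header, sizeof(header));
    }
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

Note that a non-blocking socket with no data available reports the same errno, so this check only identifies a timeout when the socket is in blocking mode.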

To debug AFERROR: msgsendtoaddress: connect failure for msgType 'TRenderUpdate': 10.10.8.150:51000: Connection refused
it would be great to know whether the socket was in a TIME_WAIT state and therefore refused our connection attempt because the previous connection was not yet properly terminated. Is there a way to log more verbose info on connect failure as well?
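As a sketch of what more verbose connect logging could look like: the errno left by connect() already separates several cases. Per the Linux connect(2) manual page, ECONNREFUSED means nobody was listening on the remote address, while running out of local ephemeral ports (which a large number of TIME_WAIT sockets can contribute to) is reported as EADDRNOTAVAIL. The function names explainConnectError and connectAndLog below are hypothetical and are not part of Afanasy, and the hint texts are only rough guidance.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <netinet/in.h>
#include <string>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: maps a connect() errno to a hint about the likely cause.
std::string explainConnectError(int err)
{
    switch (err)
    {
    case ECONNREFUSED:
        return "no process accepted the connection on the remote port";
    case EADDRNOTAVAIL:
        return "no free local port (possibly many sockets in TIME_WAIT)";
    case ETIMEDOUT:
        return "no reply from the remote host before the timeout";
    case ENETUNREACH:
    case EHOSTUNREACH:
        return "no route to the remote host";
    default:
        return "no specific hint, see the system error text";
    }
}

// Hypothetical helper: tries a TCP connect to an IPv4 address and logs failures.
// Returns 0 on success, otherwise the errno describing the failure.
int connectAndLog(const char * ip, uint16_t port)
{
    assert(ip != nullptr);
    sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return EINVAL; // not a valid dotted IPv4 address
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return errno;
    int err = 0;
    if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0)
    {
        err = errno;
        fprintf(stderr, "connect %s:%u failed: %s (errno %d): %s\n",
                ip, static_cast<unsigned>(port), strerror(err), err,
                explainConnectError(err).c_str());
    }
    close(fd);
    return err;
}
```

A call such as connectAndLog("10.10.8.150", 51000) would then print both the system error text and the hint whenever the connection fails.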

I have a feeling that in our network we should increase RCVTIMEO and SNDTIMEO, but I am not sure by how much. Will this setting interfere with other timeouts, such as zombietime?

Also, please remind me where we can configure the delay that afrender waits between its connection attempts to the server. I think this should be longer than 120 seconds, because it may be that:

connections will hold the TCP port in the TIME_WAIT state for 30-120 seconds
from https://stackoverflow.com/questions/3229860/what-is-the-meaning-of-so-reuseaddr-setsockopt-option-linux

Cheers
Oli

@sebastianelsner


timurhai commented on June 13, 2024

Hi, Oliver!
Afanasy uses macros to output system call errors:
https://github.com/CGRU/cgru/blob/master/afanasy/src/libafanasy/name_afnet.cpp#L247
The macros are defined here:
https://github.com/CGRU/cgru/blob/master/afanasy/src/include/macrooutput.h
They simply use perror to print the system error message:
https://man7.org/linux/man-pages/man3/perror.3.html
For now I don't know a way to get a more detailed error explanation.
But maybe someone who is more familiar with Linux system calls can help.


timurhai commented on June 13, 2024

You can monitor socket states with commands such as netstat or ss.
TIME-WAIT can be a problem:
https://cgru.readthedocs.io/en/latest/afanasy/server.html#time-wait
If so, maybe you can reduce the Maximum Segment Lifetime?
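The socket states that netstat and ss report can also be read programmatically on Linux from /proc/net/tcp, where the fourth column ("st") holds the kernel's TCP state code in hex (06 is TIME_WAIT, 0A is LISTEN). The sketch below counts sockets per state from text in that format. The names countTcpStates and countTimeWait are made up for this example, and only IPv4 sockets are covered (IPv6 sockets are listed in /proc/net/tcp6).

```cpp
#include <cassert>
#include <cstdlib>
#include <fstream>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Linux kernel TCP state codes as printed in the "st" column.
const int kTcpTimeWait = 0x06;
const int kTcpListen = 0x0A;

// Counts sockets per state code from text in /proc/net/tcp format.
std::map<int, int> countTcpStates(std::istream & in)
{
    std::map<int, int> counts;
    std::string line;
    std::getline(in, line); // skip the column header line
    while (std::getline(in, line))
    {
        std::istringstream fields(line);
        std::string slot, local, remote, state;
        if (!(fields >> slot >> local >> remote >> state))
            continue; // malformed or empty line
        char * end = nullptr;
        long code = std::strtol(state.c_str(), &end, 16);
        if (end == state.c_str() || *end != '\0')
            continue; // state column is not a hex number
        counts[static_cast<int>(code)]++;
    }
    return counts;
}

// Counts TIME_WAIT sockets listed in the given file, or returns -1
// if the file cannot be opened.
int countTimeWait(const char * path)
{
    assert(path != nullptr);
    std::ifstream file(path);
    if (!file)
        return -1;
    std::map<int, int> counts = countTcpStates(file);
    return counts[kTcpTimeWait];
}
```

Calling countTimeWait("/proc/net/tcp") periodically on the server and on a render host would show whether the number of TIME_WAIT sockets grows around the time the connection errors appear.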


ultra-sonic commented on June 13, 2024

Hi Timur,
We are still trying to figure out what is going on in our network, but there is one thing where you may be able to help us.

Today we had this behaviour:

  1. task finished
  2. client wants to inform the server that it is done
  3. the message times out
  4. client takes on a new job after it has reconnected a bit later
  5. server still "thinks" the client is rendering the task that has already been completed and there is no way to get rid of the "ghost" task

see this log:

Thu 07 Dec 14:40.12: INFO    Finished PID=20050: Exit Code=256 Status=1
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:40.52: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:40.38
Thu 07 Dec 14:40.52: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:41.18: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:41.04
Thu 07 Dec 14:41.18: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:43.24: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:43.10
Thu 07 Dec 14:43.24: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:45.11: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:44.57
Thu 07 Dec 14:45.11: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:46.26: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:46.12
Thu 07 Dec 14:46.26: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:50.31: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:50.16
Thu 07 Dec 14:50.31: INFO    Last connect was: 15 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:59.38: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:59.24
Thu 07 Dec 14:59.38: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.

Can you add a mechanism that detects such ghost tasks? Or is there already something in place that just does not work in this case?
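To make the request concrete, one possible shape for such a check is sketched below. It is purely illustrative and does not reflect how afserver stores tasks: it assumes the server could compare the tasks it believes a render is running against the tasks the render reports in an update, and treat anything the render no longer reports as a ghost. The type TaskPos (job, block and task numbers, as in "TP: j171 b2 t422" from the logs) and the function findGhostTasks are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <tuple>
#include <vector>

// Hypothetical task identifier: job, block and task numbers.
struct TaskPos
{
    int job;
    int block;
    int task;
};

// Ordering so that TaskPos can be stored in a std::set.
bool operator<(const TaskPos & a, const TaskPos & b)
{
    return std::tie(a.job, a.block, a.task) < std::tie(b.job, b.block, b.task);
}

// Returns the tasks the server still lists as running on a render
// but which the render itself did not report in its update.
std::vector<TaskPos> findGhostTasks(const std::set<TaskPos> & serverView,
                                    const std::set<TaskPos> & renderReport)
{
    std::vector<TaskPos> ghosts;
    std::set_difference(serverView.begin(), serverView.end(),
                        renderReport.begin(), renderReport.end(),
                        std::back_inserter(ghosts));
    assert(ghosts.size() <= serverView.size());
    return ghosts;
}
```

A real implementation would probably need a grace period before acting on the result, so that a task the server has just assigned but the render has not yet reported is not treated as a ghost.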

cheers
Oli


timurhai commented on June 13, 2024

Hello, Oli!
Do you mean that the new task generates a message that the server can read and answer, while the old task still can't connect to the server? This can't happen, because the render connects to the server once per cycle and sends and receives a single block of data. Each task fills in its own data and then reads its own portion of the answer. So, if the render can connect to the server, every task can connect to the server. Or do you mean that this is not the case?


lithorus commented on June 13, 2024

I just wanted to add that we're seeing the exact same thing. It tends to happen more often when the server is under heavy load and tasks are fast to exit. My theory is that the task starts and stops before the server acknowledges that it has started, but I have not yet been able to reproduce the issue in a non-production system.
