Comments (27)
Hello, Jan!
This function already runs in a cycle:
cgru/afanasy/src/render/main.cpp
Lines 166 to 176 in 11e78f2
A cycle inside a cycle is not needed.
from cgru.
Hi Timur,
here are the most common log entries that we found on frozen afrender clients:
Mon 25 Sep 09:34.06: INFO Failed to connect to server. Last success connect time: Mon 25 Sep 09:33.23
Mon 25 Sep 09:34.06: INFO Last connect was: 43 seconds ago, Zombie time: 180s
AFERROR: msgsendtoaddress: connect failure for msgType 'TRenderUpdate':
10.10.8.150:51000: Connection refused
and
D=19970: Exit Code=0 Status=0
AFERROR: readdata: read: Resource temporarily unavailable
AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Tue 26 Sep 10:15.31: INFO Failed to connect to server. Last success connect time: Tue 26 Sep 10:15.18
Tue 26 Sep 10:15.31: INFO Last connect was: 13 seconds ago, Zombie time: 180s
Tue 26 Sep 10:45.21: ERROR RenderHost::stopTask: No such task: TP: j171 b2 t422 (now running 0 tasks).
Tue 26 Sep 10:45.21: ERROR RenderHost::stopTask: No such task: TP: j963 b3 t17 (now running 0 tasks).
Is there any way to get more verbose logging, especially about what is going on with our sockets?
To me it would be interesting to get a log output every time RCVTIMEO and SNDTIMEO have been reached (on both ends, afserver and afrender). I suspect this to be the cause of the Resource temporarily unavailable error.
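To illustrate the suspicion above: on Linux, a blocking read on a socket that has SO_RCVTIMEO set fails with EAGAIN once the timeout expires, and that errno is what gets printed as "Resource temporarily unavailable". Below is a minimal Python sketch, not Afanasy code. The helper name recv_or_errno and the 0.2 s timeout are made up for illustration, and the timeval layout is an assumption for 64-bit Linux. It uses a local socket pair whose peer never sends anything:

```python
import errno
import os
import socket
import struct


def recv_or_errno(sock, timeout_usec):
    """Set SO_RCVTIMEO, try one blocking read; return (data, errno or None)."""
    # struct timeval {tv_sec, tv_usec}; the "ll" layout is an assumption for 64-bit Linux.
    timeval = struct.pack("ll", 0, timeout_usec)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, timeval)
    try:
        return sock.recv(16), None
    except OSError as exc:
        # A receive timeout surfaces here as EAGAIN / EWOULDBLOCK.
        return None, exc.errno


if __name__ == "__main__":
    left, right = socket.socketpair()  # 'right' never writes, so the read times out
    data, err = recv_or_errno(left, 200_000)
    print(errno.errorcode.get(err), os.strerror(err) if err else data)
    left.close()
    right.close()
```

If this is what happens between afrender and afserver, the message would mean "no answer arrived within the receive timeout" rather than a broken socket, but that remains a guess until it is confirmed on the affected hosts.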
To debug AFERROR: msgsendtoaddress: connect failure for msgType 'TRenderUpdate': 10.10.8.150:51000: Connection refused, it would be great to know if the socket was in a TIME_WAIT state and therefore refused our connection attempt because the previous connection was still not properly terminated. Is there a way to log more verbose info on connect failure as well?
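One way to get more detail on a connect failure is to log the symbolic errno name instead of only the message text. According to connect(2), ECONNREFUSED means that nothing was listening on the remote address, while running out of local ephemeral ports (which many TIME_WAIT sockets can contribute to) is reported as EADDRNOTAVAIL. The following Python sketch only shows the idea; describe_connect_failure is a hypothetical helper, not an Afanasy function:

```python
import errno
import socket


def describe_connect_failure(host, port, timeout=1.0):
    """Try one TCP connect; return 'OK', 'TIMEOUT' or the symbolic errno name."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "OK"
    except socket.timeout:
        return "TIMEOUT"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, "UNKNOWN")


if __name__ == "__main__":
    # Find a local port with no listener: bind to port 0, note the port, close again.
    probe = socket.socket()
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()
    print(describe_connect_failure("127.0.0.1", port))
```

Logging the result next to the server address on every failed attempt would make it possible to tell a refused connection apart from a timeout or from local port exhaustion.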
I have a feeling that in our network we should increase RCVTIMEO and SNDTIMEO, but I am not sure by how much. Will this setting interfere with other timeouts like zombietime etc.?
Also, please remind me where we can configure the delay that afrender waits between its connection attempts to the server. I think this should be longer than 120 secs, because it may be that:
connections will hold the TCP port in the TIME_WAIT state for 30-120 seconds
from https://stackoverflow.com/questions/3229860/what-is-the-meaning-of-so-reuseaddr-setsockopt-option-linux
Cheers
Oli
from cgru.
Hi, Oliver!
Afanasy uses macros to output system call errors:
https://github.com/CGRU/cgru/blob/master/afanasy/src/libafanasy/name_afnet.cpp#L247
Macros are defined there:
https://github.com/CGRU/cgru/blob/master/afanasy/src/include/macrooutput.h
They just use perror to print a system error message:
https://man7.org/linux/man-pages/man3/perror.3.html
For now I don't know a way to get a more detailed error explanation.
But maybe someone who is more familiar with Linux system calls can help.
from cgru.
You can monitor socket states with commands such as netstat, ss and so on.
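As a rough sketch of such monitoring, the output of `ss -tan` can be summarised per TCP state to see whether TIME-WAIT sockets pile up on the server port. The function name count_tcp_states is hypothetical, and the sample text is illustrative only, not captured from a real host:

```python
from collections import Counter


def count_tcp_states(ss_output):
    """Count the first column (State) of `ss -tan` style output, skipping the header."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "State":
            continue
        counts[fields[0]] += 1
    return counts


# Illustrative sample only; not captured from a real host.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
LISTEN     0      128    0.0.0.0:51000        0.0.0.0:*
ESTAB      0      0      10.0.0.1:51000       10.0.0.2:40112
TIME-WAIT  0      0      10.0.0.1:51000       10.0.0.3:40790
TIME-WAIT  0      0      10.0.0.1:51000       10.0.0.4:51234
"""

if __name__ == "__main__":
    print(count_tcp_states(SAMPLE))
```

On a real host the text could come from `subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout`, sampled every few seconds while the problem occurs.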
TIME-WAIT can be a problem:
https://cgru.readthedocs.io/en/latest/afanasy/server.html#time-wait
If so, maybe you can reduce the Maximum Segment Lifetime?
from cgru.
Hi Timur,
we are still trying to figure out what is going on in our network, but I have one thing where you may be able to help us.
Today we had this behaviour:
- task finished
- wants to inform the server that it is done
- message time-out
- client takes on a new job after it has reconnected a bit later
- server still "thinks" the client is rendering the task that has already been completed and there is no way to get rid of the "ghost" task
see this log:
Thu 07 Dec 14:40.12: INFO Finished PID=20050: Exit Code=256 Status=1
AFERROR: readdata: read: Resource temporarily unavailable
AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:40.52: INFO Failed to connect to server. Last success connect time: Thu 07 Dec 14:40.38
Thu 07 Dec 14:40.52: INFO Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable
AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:41.18: INFO Failed to connect to server. Last success connect time: Thu 07 Dec 14:41.04
Thu 07 Dec 14:41.18: INFO Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable
AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:43.24: INFO Failed to connect to server. Last success connect time: Thu 07 Dec 14:43.10
Thu 07 Dec 14:43.24: INFO Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable
AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:45.11: INFO Failed to connect to server. Last success connect time: Thu 07 Dec 14:44.57
Thu 07 Dec 14:45.11: INFO Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable
AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:46.26: INFO Failed to connect to server. Last success connect time: Thu 07 Dec 14:46.12
Thu 07 Dec 14:46.26: INFO Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable
AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:50.31: INFO Failed to connect to server. Last success connect time: Thu 07 Dec 14:50.16
Thu 07 Dec 14:50.31: INFO Last connect was: 15 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable
AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:59.38: INFO Failed to connect to server. Last success connect time: Thu 07 Dec 14:59.24
Thu 07 Dec 14:59.38: INFO Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable
AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Can you add a mechanism that detects such ghost tasks? Or is there already something in place that just does not work in this case?
cheers
Oli
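The kind of ghost-task check asked about here could, in principle, be a reconciliation step: compare the tasks the server believes are running on a render with the tasks that render actually reports. The sketch below is purely hypothetical. The function name, the (job, block, task) tuples and the idea that the render reports its running tasks in this form are assumptions for illustration, and the thread does not say whether Afanasy already does something similar:

```python
def find_ghost_tasks(server_view, render_report):
    """Return tasks the server believes are running that the render did not report."""
    return sorted(set(server_view) - set(render_report))


if __name__ == "__main__":
    # Hypothetical (job, block, task) identifiers.
    server_view = {(171, 2, 422), (963, 3, 17)}
    render_report = {(963, 3, 17)}
    print(find_ghost_tasks(server_view, render_report))
```

Any task returned by such a check could then be marked as finished with an error or restarted on the server side, instead of staying attached to the render forever.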
from cgru.
Hello, Oli!
Do you mean that a new task generates a message that the server can read and answer, while the old task still can't connect to the server? This can't happen, as the render connects to the server once per cycle and sends and receives a single block of data. Different tasks fill in that data and then each reads its own portion of the answer. So if the render can connect to the server, every task can connect to the server. Or do you mean that this is not the case?
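The once-per-cycle exchange described above can be pictured with a small sketch. This is not the afrender implementation: the class name RenderUpdate, its methods, and in particular the choice to keep unsent events for the next cycle when sending fails are assumptions made for illustration only:

```python
class RenderUpdate:
    """Collects per-task events and sends them to the server in one message per cycle."""

    def __init__(self):
        self.pending = []  # task events not yet delivered to the server

    def add_event(self, task_id, event):
        self.pending.append((task_id, event))

    def cycle(self, send):
        """Send all pending events in one message; keep them if sending fails."""
        if not self.pending:
            return True
        try:
            send(list(self.pending))
        except OSError:
            return False  # assumption: retry the same events on the next cycle
        self.pending.clear()
        return True


if __name__ == "__main__":
    update = RenderUpdate()
    update.add_event("task-a", "finished")
    update.add_event("task-b", "started")
    update.cycle(print)  # stand-in for the real network send
```

In this picture a "task finished" event is lost only if it is dropped after a failed send, which is one place where a ghost task on the server could originate.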
from cgru.
I just wanted to add that we're seeing the exact same thing. It tends to happen more often when the server is under heavy load and tasks are fast to exit. My theory is that the task starts and stops before the server acknowledges the start of the task, but I have not yet been able to reproduce the issue in a non-production system.
from cgru.