Giter Site home page Giter Site logo

Comments (9)

caio avatar caio commented on August 25, 2024

Interesting!

Assuming everything is running the same version this looks like it's either a problem with:

  • Your Identity and/or Codec implementation (since foca doesn't emit Option)- are you able to share those?; Or
  • Something in the network. Are the bytes you send the exact bytes you receive? Foca doesn't do any sort of checksum, so it's up to you to add crc/crypto/etc (see #17)

It's hard to diagnose Decode errors from afar- what you can do is store the data that caused the error and try to inspect it manually (it's a simple format, described here)

Capturing the debug logs (at least RUST_LOG=foca=debug) might help figuring this out, but I'm not sure it will as it doesn't sound like it's from within foca- But maybe your scenario is triggering a bug on send_message and making foca craft a broken packet somehow.


Now, a high rate of Notification::Rename is interesting in itself. If you're not expecting it to happen (i.e.: you're not restarting nodes, you're not on a flaky network, nodes are not overloaded all the time), it may be a sign of a problem.

I would try to figure out if it's a particular node that is constantly changing its identity or if it's generalized throughout the cluster. A generalized problem could mean you need to tune the foca config (maybe probe period is too short, maybe you need to give more time on suspect-to-down, maybe enable periodic gossiping)

from foca.

Manphil avatar Manphil commented on August 25, 2024

My Identity struct is like

struct FatIdentity {
  pub addr: SocketAddrV4,
  pub node: Option<Node>
  pub version: u64,
}

struct Node {
  pub addr: String,
  pub id: u64,
  pub extra: Vec<String>
}

My Codec is PostcardCodec.
The decode error is occurs on somewhere like

tracing::error!(error = tracing::field::debug(error), "Ignored error");

what you can do is store the data that caused the error and try to inspect it manually (it's a simple format, described here)

It's good idea, i will debug it.

I runs the cluster in k8s, i am not sure if the network is stable. I will confirm whether it is caused by packet loss.


I am a bit confused about periodic_gossip, which speed the cluster update. If i enable that and when Rename occured, would the Rename be faster and more frequent in cluster?

The probe period is a bit misunderstanding, Intuitively, a smaller period means more frequent keepalives, a lower probability of false positive downtime(Rename or Down)? I have a question, if node1 send probe1 to node2, and node1 not receive the reponse from node2 at the end of the period, the node1 mark node2 as suspect, at the next period, node1 send probe2 to node2 again, and node1 receive the response of probe2 from node2, at this moment, will node1 still mark node2 as suspect? Or change to Up?

from foca.

caio avatar caio commented on August 25, 2024

Thanks for the info! As I suspected, the Option is within your identity. I'd try some easy to craft tests to figure out where the issue lies. Things like:

  • Does it still happen if you use a simpler id?
  • What about changing the codec?

The fact that it only happens occasionally is really strange which is why I leaned on this being something to do with the network (assuming udp). Not so much packet loss (you wouldn't see errors), but corruption: the checksum udp gives is just 16bits, can be turned off and gets messed with at borders all the time, you should have some for the data.

But then... if it were corruption you wouldn't be seeing errors only about decoding an option heh. Quite puzzling without seeing the whole code and setup


The probe period is a bit misunderstanding, Intuitively, a smaller period means more frequent keepalives, a lower probability of false positive downtime(Rename or Down)?

I'd say it really depends on what's the cause for these false positives and how you're looking at the numbers. If there's always a chance for false positives, asking more often will mean that you'll see false positives more often.

Tuning probe period is a trade-off between cluster chatter/noise and how fast you detect that a node went down. What needs to happen once a node goes down varies heavily based on the use-case so there's no clear right approach: if you want uncoordinated, eventually consistent properties at the swim level, you're gonna have to add coordination on your side (ask other peers? try a tcp connection? wait a few seconds? it really depends on what's your goal)

Generally speaking, I'd recommend setting up the cluster with a generous probe period/rtt, ensuring things are as stable as you want (it's reasonable to expect 10-node LAN cluster to have zero Rename/Down events in a large time window if the nodes are sitting idle; It's unreasonable to expect anything good from the cloud :P). Once things are stable, then you start tuning for speed of failure detection otherwise you'll be tweaking many variables at once.

I am a bit confused about periodic_gossip, which speed the cluster update. If i enable that and when Rename occured, would the Rename be faster and more frequent in cluster?

Period gossip just changes how often a node sends its knowledge around. It speeds up how fast the whole cluster converges to the same membership status. Serf has a little simulator that helps understand how gossip fan out and frequency influence the convergence speed: https://www.serf.io/docs/internals/simulator.html

have a question, if node1 send probe1 to node2, and node1 not receive the reponse from node2 at the end of the period, the node1 mark node2 as suspect, at the next period, node1 send probe2 to node2 again, and node1 receive the response of probe2 from node2, at this moment, will node1 still mark node2 as suspect? Or change to Up?

The probe message contains an identifier and replies must match it. Every cycle the identifier changes, so replies to old ones are ignored. I wrote foca with the ability to ask these kind of questions in mind as I was learning the protocol too :) You might find it useful to read and play with the probe_ tests. Starting here: https://github.com/caio/foca/blob/main/src/lib.rs#L2866

from foca.

Manphil avatar Manphil commented on August 25, 2024

I captured the data from Foca::handle_data when it returns decode error, and compared with the data logged in the node which send the data. The data is same, no data pocket loss. Then, i create a new foca instance, and pass the data into Foca::handle_data to test it. Everything is ok, no errors occured, which is strange. The message type of decoded header is Message::TurnUndead.

I will change the codec to bincode, and see if the errors exists.

from foca.

caio avatar caio commented on August 25, 2024

Oh! Is it possible one of your nodes (the one with the decode error) is running an older foca version? That would 100% explain the decode error being strange and only happening occasionally

One of the many things foca doesn't do is version the protocol and TurnUndead was added somewhat recently- nodes that don't know about it will work kind of ok because not much else has changed in the header

from foca.

Manphil avatar Manphil commented on August 25, 2024

The foca version is unlikely not same. I run the cluster in k8s, and each container uses the same image. But the decode error is indeed occasional, so it's strange. It is more likely to occur when there is a certain amount of the network traffic, so I initially suspected it was caused by udp packet loss.

from foca.

caio avatar caio commented on August 25, 2024

Managed to trigger it here, knowing the message type helped- thanks a lot!

The error is being caused by turnundead messages containing custom broadcasts when they shouldn't

The reason it didn't break when you tested on another node is because the header decodes correctly then, since the message is not to the same identity, it gets discarded. With the correct identity it tries to interpret the custom broadcast as a u16+sequence-of-identities 💣

I'll ship a fix for it tomorrow

from foca.

Manphil avatar Manphil commented on August 25, 2024

Oh, you are right, sorry for that! Yesterday I new the foca instance with wrong identity, so it returns from

foca/src/lib.rs

Lines 902 to 906 in 1e8c322

if !self.accept_payload(&header) {
#[cfg(feature = "tracing")]
tracing::trace!("Payload not accepted");
return Ok(());

I renew a foca instance with the dst identity of the captrued data, the decode error reproduced. The error returns from the first call of decode_member:

foca/src/lib.rs

Lines 920 to 923 in 1e8c322

self.codec
.decode_member(&mut data)
.map_err(anyhow::Error::msg)
.map_err(Error::Decode)?,

from foca.

caio avatar caio commented on August 25, 2024

I've released v0.17.2 with a fix. Closing this issue, feel free to reopen.

Thank you very much for the report and assistance debugging!

from foca.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.