Giter Site home page Giter Site logo

Handlers hang forever about xtra HOT 16 CLOSED

restioson avatar restioson commented on May 23, 2024
Handlers hang forever

from xtra.

Comments (16)

Restioson avatar Restioson commented on May 23, 2024 1

Could you show the lazy statics? Also, is it not more accurate to call this issue "spawns in spaad" hang forever, since it doesn't even get to the get_config_value? I'll look at this in the morning since it's 1am right now in my timezone and I'm heading to bed, but I wonder if it can be reduced to a simplified reproducing example?

from xtra.

wbrickner avatar wbrickner commented on May 23, 2024

Update: interestingly enough, if I call the identical method in an identical way like this:

#[tokio::main(flavor = "multi_thread")]
async fn main() {
  let c = block_on(config::CONFIG.get_config_value());
  log::debug!("Got config: {:#?}", c);
}

Then this works 100% of the time. I am stumped.

from xtra.

wbrickner avatar wbrickner commented on May 23, 2024

Oh, correction: CONFIG's get_config_value() handler works inside spawn bodies and regular async functions. It doesn't seem to work inside the new fn as I need it to.

It also doesn't work if I spawn a task inside the new fn.

from xtra.

wbrickner avatar wbrickner commented on May 23, 2024

Thanks so much for replying so late, make sure you get your rest! This library is great, I appreciate the hard work OSS community members like you put in, I am very grateful.

I believe the real issue is that inside ActorA's spawn fn, it uses a method on the lazy_static, causing ActorB to spawn, I think the spawn-in-spawn being the issue.

The lazy_static impl for all my actors is the same, ex:

lazy_static::lazy_static! {
  pub static ref CONFIG: Config = {
    Config::new(&mut xtra::spawn::Tokio::Global)
  };
}

I am guessing that this spawn-in-spawn is either an edge case in the library or it's a limitation of the library, not sure.

I have figured out a work around. Mimicking the example above:

#[spaad::entangled]
pub struct Agent {
  my_state: HashMap<String, ConfigValue>
}

#[spaad::entangled]
#[async_trait::async_trait]
impl Actor for Agent {
  async fn started(&mut self, _ctx: &mut Context<Self>) {
    let c = CONFIG.get_config_value().await;
    // modify `self.my_state` here
  }
}

#[spaad::entangled]
impl Agent {
  #[spaad::spawn]
  pub fn new() -> Self {
    Self {
      my_state: HashMap::new()
    }
  }
}

This workaround works because (I think) started is guaranteed to be run before any other handler, so the behavior is the same as what I had written originally.

from xtra.

Restioson avatar Restioson commented on May 23, 2024

This is a strange issue, and I want to look into it more. Potentially it could be an edge case of xtra, spaad, lazy static, or even tokio. Would you mind making an illustrative minimum reproducing example so I can test it out a bit?

from xtra.

wbrickner avatar wbrickner commented on May 23, 2024

I was wrong, that is not the issue, or it's not so simple unfortunately. The following does not hang:

use futures::executor::block_on;
use xtra::prelude::*;
use lazy_static::lazy_static;

#[spaad::entangled]
struct ActorB { }

#[spaad::entangled]
impl Actor for ActorB {}

#[spaad::entangled]
impl ActorB {
  #[spaad::spawn]
  pub fn new() -> Self { Self { } }

  #[spaad::handler]
  pub fn get_info(&self) -> usize { 137 }
}

#[spaad::entangled]
struct ActorA { info: usize }

#[spaad::entangled]
impl Actor for ActorA {}

#[spaad::entangled]
impl ActorA {
  #[spaad::spawn]
  pub fn new() -> Self {
    // `.get_info` causes ActorB to spawn
    let info = block_on(ACTOR_B.get_info());
    println!("Got info: {}", info);

    Self { info }
  }

  #[spaad::handler]
  pub fn do_something(&self) {
    println!("ActorA doing something");
  }
}

lazy_static! {
  static ref ACTOR_A: ActorA = {
    ActorA::new(&mut xtra::spawn::Tokio::Global)
  };

  static ref ACTOR_B: ActorB = {
    ActorB::new(&mut xtra::spawn::Tokio::Global)
  };
}

#[tokio::main(flavor = "multi_thread")]
async fn main() {
  // `.do_something` causes ActorA to spawn, and in turn spawn ActorB.
  ACTOR_A.do_something().await;
}

Yields

Got info: 137
ActorA doing something

from xtra.

wbrickner avatar wbrickner commented on May 23, 2024

Actually not only that, but my workaround does not always work.

What I know:

  • it seems like started is always reached
  • inside started, CONFIG.get_config_value() hangs around half the time.

I don't know if it's my implementation inside Config, but it's dead simple, it just reads a file from disk in initialization and then clones a part of the deserialized struct when you call get_config_value.

from xtra.

wbrickner avatar wbrickner commented on May 23, 2024

I wrote a mock implementation of my real project's Config, which does no setup work
and only returns an empty hashmap (unlike the real Config which causes issues).

The mock version seems to never hang.

Something inside the Config initialization is causing everything to hang (but unpredictably).

It synchronously reads from disk, could be the problem, although I have no good theoretical reason to believe that yet.

from xtra.

wbrickner avatar wbrickner commented on May 23, 2024

Wrong again.

Stalling the thread for 5 seconds inside the new function on the Mock struct doesn't cause failures.

Also, the CONFIG is initialized long (like 3000ms) before the AGENT ever gets initialized,
and the handlers on CONFIG must be used successfully numerous times (to get credentials etc) before we can even get to the initialization of AGENT. My above theories don't even make sense and I can't reproduce it outside my project.

I am so confused at this point.

  • CONFIG replies immediately, just not when AGENT is asking.
  • It's not spawn-in-spawn
  • It's not long-running blocking tasks in new
  • It's not something about the initialization of CONFIG, that works every single time.
  • It's not something about the handlers of CONFIG, those work all over the codebase before AGENT comes into the picture.

from xtra.

Restioson avatar Restioson commented on May 23, 2024

Hmm, the inconsistency here suggests to me they it could be some kind of race with a lock somewhere. Poking around in a debugger (GDB etc) is probably your best bet, to see where it's stalled. With GDB you'd just run your process until it deadlocks, connect to it with the debugger after it's deadlocked, and then run a backtrace and poke around.

from xtra.

wbrickner avatar wbrickner commented on May 23, 2024

I have tried the forwards and backwards approach to that (follow it to the deadlock, trace from the deadlock), but it is a giant mess of Tokio threads and lots of Tokio internals, and I could not get either debugging strategy to work.

So when it deadlocks, I don't know what it's doing.

Are there many locks in this system that exist at the xtra level? There must be a soundness issue with their use somewhere, if there are few it could be easier to carefully examine their usage.

There are also deadlock sanitizers available for rust that might help.

from xtra.

Restioson avatar Restioson commented on May 23, 2024

Are there many locks in this system that exist at the xtra level?

xtra itself does not bring in any locks - by this I mean in the code I do not explicitly lock anything. However, it is built on several things which do use locks:

  1. tokio itself
  2. catty, the oneshot channel libary
  3. flume, the MPMC channel library
  4. barrage, the broadcast channel library
  5. futures

There must be a soundness issue with their use somewhere, if there are few it could be easier to carefully examine their usage.

Although this is possible, I highly doubt this as the vast majority of xtra's dependencies do not use unsafe at all, or are highly vetted (although it's still possible for those crates):

  1. tokio and futures are much-used crates and it's not likely that this is the first time it's come up, if the issue does exist somewhere in them, although it's possible.
  2. catty, flume, and barrage all use no unsafe.

It might be possible that some dependencies of these are unsound, although this is somewhat doubtful to me. It seems like a deadlock rather than unsoundness. Note that deadlocking is completely safe and sound, as it doesn't create any memory safety errors, data races, etc. I think the issue is far more likely

  1. accidental misuse of some API in lazy_static due to some initialisation order issue leading to a deadlock
  2. a bug in xtra somewhere

Is the project that you're working on and experiencing this bug in open source? If so, could I clone it and test it out? If not, there might be a couple things to try with debugging it:

  1. manage to get a reproducing example (this appears to be hard since in these situations it's hard to know exactly what interaction is causing an issue)
  2. try swap out tokio for another, simpler futures runtime (maybe even just pollster::block_on) to reduce the backtrace complexity
  3. Rewrite the offending code into vanilla xtra temporarily so there are less moving parts (this is probably going to be the lowest hanging fruit with the largest return). Then we can isolate the issue to xtra or spaad, and debugging through proc macros is a bit of a pain.
  4. read the code especially carefully and see if there's anything you might have missed. This is probably the hardest and least consistent option but it might work.

from xtra.

Restioson avatar Restioson commented on May 23, 2024

By the way, I'm a little suspicious of the block_on(CONFIG.get_config_value()). Where is the caller running? You might be blocking the runtime here.

from xtra.

wbrickner avatar wbrickner commented on May 23, 2024

Unfortunately the system is proprietary, which is why I have been a bit cagey / have been posting example-like code here, so I can't share the actual source with you. Suffice to say, I wanted help, so I made sure the structure and any important details were mirrored. Looks like there's something about my system that either creates / teases out this bug that is not present in my example code.

I have had to move on. I realized that my use case could replace all uses of xtra with a bunch of tokio::sync::Mutex,
and it has resolved all the hanging issues.

  • So, block_on is not the issue, even doing CONFIG.get_my_config().await also fails inside the async fn started life cycle method.
  • Recall that it is not the initialization of CONFIG either, initialization occurs long before and the handlers work in other places, but as soon as Agent is initialized, the handlers hang indefinitely.
  • The method handler I declare does not even get called, the deadlock (if that is the root cause) occurs as soon as I call .await, but before the method actually executes any code.

I agree, I don't think it is an issue with unsafe or memory crimes, it is instead probably a concurrency unsoundness thing like you mention.

Everything points to mystery, my bet is that there is a flaw in the code generation in spaad.

from xtra.

Restioson avatar Restioson commented on May 23, 2024

Unfortunately the system is proprietary, which is why I have been a bit cagey / have been posting example-like code here

No problem at all, I understand.

So, block_on is not the issue, even doing CONFIG.get_my_config().await also fails inside the async fn started life cycle method.

This implies to me that there is a dependency between ActorA and ActorB forming some kind of loop where each is waiting for the other to finish some work to continue.

The method handler I declare does not even get called, the deadlock (if that is the root cause) occurs as soon as I call .await, but before the method actually executes any code.

I assume you mean do_something here? Two possibilities:

  1. new returns the actor correctly and the issue is in sending the message and waiting for a response
  2. new doesn't ever return before the await is called

2 seems more likely to me.

Expanding the example that you sent yields the following for the inner new method:

    #[allow(unused_imports)]
    use __ActorAActor::ActorA;
    impl __ActorAActor::ActorA {
        pub fn new() -> Self {
            let info = block_on(ACTOR_B.get_info());
            {
                ::std::io::_print(::core::fmt::Arguments::new_v1(
                    &["Got info: ", "\n"],
                    &match (&info,) {
                        (arg0,) => [::core::fmt::ArgumentV1::new(
                            arg0,
                            ::core::fmt::Display::fmt,
                        )],
                    },
                ));
            };
            Self { info }
        }
    // ...
}

So, this is exactly what you wrote, as expected, since this part of the code simply wraps the given method.

And this is the wrapping new method as generated:

impl ActorA {
    pub fn new<ActorSpawner: ::spaad::export::xtra::spawn::Spawner>(
        actor_spawner: &mut ActorSpawner,
    ) -> Self {
        use ::spaad::export::xtra::prelude::*;
        let act = __ActorAActor::ActorA::new();
        let addr = act.create(::std::option::Option::None).spawn(actor_spawner);
        ActorA { addr }
    }
    // ...
}

So, this just invokes the user code, calls Actor::create on the returned actor from the user code, and then spawns it using the given actor spawner (tokio in your case). Then, it returns the wrapper.

If what you're saying is indeed true (that it hangs on the call to ACTOR_B.get_info()), I do not see any way how spaad codegen could have caused this (as the spaad code is just so simple), or indeed xtra code. This seems to be completely down to the specifics of the new method and potential interdependencies between ActorA and ActorB (for instance, see #13). Does ACTOR_B ever try to get or wait on ACTOR_A?

Here is the rest of the generated code.

from xtra.

Restioson avatar Restioson commented on May 23, 2024

Closing as unable to reproduce.

from xtra.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.