Comments (5)

leculver commented on June 10, 2024

Thanks for reaching out!

(0x8000ffff is "catastrophic failure". Is this being loaded into WinDbg? It's the only API I know of that generates catastrophic failure HRESULTs, but it might be coming from somewhere I don't know about.)

These errors in the log are there to find problems (I mean the failed ones, not the extra spam we removed in 3.1 you mentioned in the other thread). That message does indicate that we could not walk the stack for a particular thread. So, if that thread is alive and has interesting/relevant stack frames or roots, then that message indicates you may be missing them and there may be a problem somewhere.

The causes of these failures are unfortunately varied:

  1. A bug in ClrMD (definitely possible but unlikely in this case; this code has been working well for years).
  2. A bug in your program. (Are you calling ClrThread.IsAlive before walking roots or the stack, like we do here? I should update documentation to call that out...)
  3. A bug in the CLR debugging layer (mscordaccore.dll).
  4. A problem with the debugger. WinDbg returns a catastrophic failure when you over-release its interfaces or put the debugger in an invalid state.
  5. "Normal" places where we cannot walk the stack due to pausing the runtime at a non-walkable stack location. (The CLR stack walker actually is designed to only work at certain "safe points". It's possible, but rare, that an application crashed while a thread was in an unwalkable state, pausing the process where our debugging layer/stackwalker can't walk that stack. This is normal and unfixable, but we still log an error to Trace because we don't have a way to detect that scenario.)
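For item 2, a minimal sketch of the IsAlive guard. It assumes the ClrMD 2.x/3.x API names (DataTarget.LoadDump, ClrRuntime.Threads, ClrThread.EnumerateStackTrace, ClrThread.EnumerateStackRoots); the default dump path is a hypothetical placeholder, and this is not the code referenced in that item:

```csharp
using System;
using System.Linq;
using Microsoft.Diagnostics.Runtime;

class StackDump
{
    static void Main(string[] args)
    {
        // Hypothetical default; pass the crash dump or coredump path as the first argument.
        string dumpPath = args.Length > 0 ? args[0] : "core.dump";

        using DataTarget dataTarget = DataTarget.LoadDump(dumpPath);
        if (dataTarget.ClrVersions.Length == 0)
        {
            Console.Error.WriteLine("No CLR found in the dump.");
            return;
        }

        using ClrRuntime runtime = dataTarget.ClrVersions[0].CreateRuntime();

        foreach (ClrThread thread in runtime.Threads)
        {
            // Skip threads the runtime has marked dead: walking them is allowed,
            // but it typically yields nothing and may log a failed stack walk.
            if (!thread.IsAlive)
                continue;

            Console.WriteLine($"Thread {thread.OSThreadId:x} (managed id {thread.ManagedThreadId})");

            foreach (ClrStackFrame frame in thread.EnumerateStackTrace())
                Console.WriteLine($"    {frame}");

            // Roots are enumerated under the same guard.
            int rootCount = thread.EnumerateStackRoots().Count();
            Console.WriteLine($"    {rootCount} stack root(s)");
        }
    }
}
```

With this guard, a failed-stack-walk message that still appears refers to a thread the runtime considers alive, which is the case worth investigating.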

Unfortunately, it's not possible for me to know what's wrong without debugging it. I can only take a look at what's going on if you have a crash dump or coredump where I can reproduce the problem. I'm always happy to take a look if you do have a shareable dump where you see those messages, otherwise I'll (eventually) have to close this bug as unactionable...I am about to clean up old bugs and actually work on the ones that are fixable. :)

from clrmd.

loop-evgeny commented on June 10, 2024

I've previously seen a similar error with SOS: dotnet/runtime#9790

loop-evgeny commented on June 10, 2024

> Is this being loaded into WinDbg?

No, this is entirely on Ubuntu Linux, no WinDbg involved, but as we saw in dotnet/runtime#9790 SOS seems to generate that, too.

> Are you calling ClrThread.IsAlive before walking roots or the stack?

No, I wasn't doing that. Adding a check for IsAlive seems to have fixed it (at least in my basic test) - thanks! I'll monitor it in prod.

That's quite a "gotcha", though. First, if EnumerateStackTrace() cannot get the stack trace I'd expect it to throw an exception, not return an empty enumerable and log something indecipherable to syslog. Second, if it knows that it cannot work on dead threads I'd expect it to throw a good exception that plainly says so.

leculver commented on June 10, 2024

> That's quite a "gotcha", though.

It's actually not meant to be a gotcha at all. It's perfectly fine to call those APIs and get nothing out of them. It's not illegal from a diagnostics standpoint to try to get the stack trace or gc roots, there just aren't any. (Though the error in the log is confusing, I agree on that.)

One thing to keep in mind about CLR debugging is that much of mscordaccore.dll is pretty sensitive to the state of the .NET runtime. When you debug in a managed debugger using ICorDebug (e.g. Visual Studio), the runtime is always in a "good" state as far as the debugger is concerned. If you are debugging with LLDB or even just have a crash that paused the process at a random time (not at a debugger safepoint), then you can end up with "inconsistent state" as far as the runtime/debugging API is concerned.

One simple example of an inconsistent state: your code ran new object[1204]. Imagine the GC is halfway through allocating that object when another thread crashes and pauses the process. The result might look like heap corruption: a section of the heap isn't walkable when you run !dumpheap, and !verifyheap might report errors...but the issue is actually that we got unlucky and paused the process at a time when our debugging layer can't make sense of the heap. This kind of thing is not fixable with our architecture and design, but that's OK; we get benefits in other areas.

This is why we don't throw exceptions when !ClrThread.IsAlive and you try to enumerate roots or a stack trace. Maybe the thread was just marked dead, but the native thread is still briefly alive...we could technically still walk the stack. ClrMD is meant to just give you the answers it can give you in that case.

> log something indecipherable to syslog

To be honest, I never considered where these messages go on Linux. They show up in Visual Studio's trace events when you are debugging something, and it was useful to have them show up there. I typically use Trace statements for "this is really weird, I need to capture this state so that folks can diagnose what's going on later".

I will change them to not end up in the syslog if !clrThread.IsAlive. If the thread is alive, and you do see that in syslog though, it does indicate a possible failure that the clr team might need to investigate...or it could just be "inconsistent state" in one particular dump file. In any case, chasing that down is what the logging is for.

loop-evgeny commented on June 10, 2024

I see, thanks for the detailed explanation!

> Maybe the thread was just marked dead, but the native thread is still briefly alive...we could technically still walk the stack. ClrMD is meant to just give you the answers it can give you in that case.

Fair enough, I guess I see the trade-off.

> If the thread is alive, and you do see that in syslog though, it does indicate a possible failure that the clr team might need to investigate...or it could just be "inconsistent state" in one particular dump file. In any case, chasing that down is what the logging is for.

It would be really difficult to chase down with the current amount of information in that log message, even if we do happen to see it. It doesn't include the thread ID or even the process ID. To me it would still make sense to throw an exception for such an error at least when IsAlive=true, and include the thread ID in the exception.
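As a caller-side stopgap for the missing identifiers (not a ClrMD feature), the calling code could record the thread IDs itself whenever an alive thread yields no frames. A sketch assuming the ClrMD 2.x/3.x API; the class and method names are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Diagnostics.Runtime;

static class StackWalkDiagnostics
{
    // Hypothetical helper: returns the frames for a thread and reports
    // alive threads whose stack walk produced no frames.
    public static List<ClrStackFrame> GetFramesOrReport(ClrThread thread)
    {
        // Dead threads are expected to yield nothing, so they are not reported.
        if (!thread.IsAlive)
            return new List<ClrStackFrame>();

        List<ClrStackFrame> frames = thread.EnumerateStackTrace().ToList();
        if (frames.Count == 0)
        {
            // An empty result is not proof of a failed walk (the thread may
            // simply have no managed frames), but the IDs make it possible to
            // correlate this thread with messages in the trace log.
            Console.Error.WriteLine(
                $"No frames for alive thread: OS id 0x{thread.OSThreadId:x}, " +
                $"managed id {thread.ManagedThreadId}");
        }

        return frames;
    }
}
```

This only narrows down which thread to look at; it cannot tell a genuine stack-walk failure apart from a thread that has no managed frames.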
