CreateStackWalk failed error messages in syslog about clrmd HOT 5 CLOSED

loop-evgeny commented on June 10, 2024

CreateStackWalk failed error messages in syslog

from clrmd.

Comments (5)

leculver commented on June 10, 2024 1

Thanks for reaching out!

(0x8000ffff is "catastrophic failure". Is this being loaded into WinDbg? It's the only API I know of that generates catastrophic failure HRESULTs, but it might be coming from somewhere I don't know about.)
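
For readers unfamiliar with the value: an HRESULT packs a severity bit, a facility, and a code into 32 bits, and 0x8000FFFF is the value the Windows headers name E_UNEXPECTED ("Catastrophic failure"). A minimal sketch of the decoding follows; the `HResultInfo` helper is hypothetical and not part of ClrMD.

```csharp
using System;

// Hypothetical helper for illustration only; not part of ClrMD.
public static class HResultInfo
{
    // Bit 31 is the severity bit: set means failure.
    public static bool IsFailure(uint hr) => (hr & 0x80000000u) != 0;

    // Bits 16-26 hold the facility (0 is FACILITY_NULL).
    public static uint Facility(uint hr) => (hr >> 16) & 0x7FFu;

    // Bits 0-15 hold the facility-specific code.
    public static uint Code(uint hr) => hr & 0xFFFFu;

    public static string Describe(uint hr) =>
        $"0x{hr:X8}: failure={IsFailure(hr)}, facility={Facility(hr)}, code=0x{Code(hr):X4}";
}
```

Calling `HResultInfo.Describe(0x8000FFFF)` yields `0x8000FFFF: failure=True, facility=0, code=0xFFFF`, which shows that the value carries no facility-specific detail about what went wrong.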

These errors in the log are there to find problems (I mean the failed ones, not the extra spam we removed in 3.1 you mentioned in the other thread). That message does indicate that we could not walk the stack for a particular thread. So, if that thread is alive and has interesting/relevant stack frames or roots, then that message indicates you may be missing them and there may be a problem somewhere.

The causes of these failures are unfortunately varied:

A bug in ClrMD (definitely possible but unlikely in this case, this code has been working well for years).
A bug in your program. (Are you calling ClrThread.IsAlive before walking roots or the stack, like we do here? I should update documentation to call that out...)
A bug in the CLR debugging layer (mscordaccore.dll).
A problem with the debugger. WinDbg returns a catastrophic failure error when you over-release its interfaces or put the debugger in an invalid state.
"Normal" places where we cannot walk the stack due to pausing the runtime at a non-walkable stack location. (The CLR stack walker actually is designed to only work at certain "safe points". It's possible, but rare, that an application crashed while a thread was in an unwalkable state, pausing the process where our debugging layer/stackwalker can't walk that stack. This is normal and unfixable, but we still log an error to Trace because we don't have a way to detect that scenario.)
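
The `ClrThread.IsAlive` check mentioned in the list above might look roughly like this. It is a sketch that assumes ClrMD 2.x or later; `StackDumper`, `PrintLiveThreadStacks`, and `dumpPath` are hypothetical names, and the code has not been run against any particular dump.

```csharp
using System;
using Microsoft.Diagnostics.Runtime;

// Hypothetical wrapper class; only the ClrMD types and members are real API.
public static class StackDumper
{
    // dumpPath is a placeholder: pass the path to your own crash dump or core dump.
    public static void PrintLiveThreadStacks(string dumpPath)
    {
        using DataTarget dataTarget = DataTarget.LoadDump(dumpPath);
        foreach (ClrInfo clrInfo in dataTarget.ClrVersions)
        {
            using ClrRuntime runtime = clrInfo.CreateRuntime();
            foreach (ClrThread thread in runtime.Threads)
            {
                // Skip dead threads before walking the stack; they have
                // no stack trace or roots to report.
                if (!thread.IsAlive)
                    continue;

                Console.WriteLine(
                    $"Thread OS id 0x{thread.OSThreadId:x}, managed id {thread.ManagedThreadId}");
                foreach (ClrStackFrame frame in thread.EnumerateStackTrace())
                    Console.WriteLine($"    {frame}");
            }
        }
    }
}
```

The same guard applies before calling `EnumerateStackRoots()` on a thread.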

Unfortunately, it's not possible for me to know what's wrong without debugging it. I can only take a look at what's going on if you have a crash dump or coredump where I can reproduce the problem. I'm always happy to take a look if you do have a shareable dump where you see those messages, otherwise I'll (eventually) have to close this bug as unactionable...I am about to clean up old bugs and actually work on the ones that are fixable. :)

loop-evgeny commented on June 10, 2024

I've previously seen a similar error with SOS: dotnet/runtime#9790

loop-evgeny commented on June 10, 2024

> Is this being loaded into WinDbg?

No, this is entirely on Ubuntu Linux, no WinDbg involved, but as we saw in dotnet/runtime#9790 SOS seems to generate that, too.

> Are you calling ClrThread.IsAlive before walking roots or the stack?

No, I wasn't doing that. Adding a check for IsAlive seems to have fixed it (at least in my basic test) - thanks! I'll monitor it in prod.

That's quite a "gotcha", though. First, if EnumerateStackTrace() cannot get the stack trace I'd expect it to throw an exception, not return an empty enumerable and log something indecipherable to syslog. Second, if it knows that it cannot work on dead threads I'd expect it to throw a good exception that plainly says so.

leculver commented on June 10, 2024

> That's quite a "gotcha", though.

It's actually not meant to be a gotcha at all. It's perfectly fine to call those APIs and get nothing out of them. It's not illegal from a diagnostics standpoint to try to get the stack trace or gc roots, there just aren't any. (Though the error in the log is confusing, I agree on that.)

One thing to keep in mind about CLR debugging is that much of mscordaccore.dll is pretty sensitive to the state of the .NET runtime. When you debug in a managed debugger using ICorDebug (e.g. Visual Studio), the runtime is always in a "good" state as far as the debugger is concerned. If you are debugging with LLDB or even just have a crash that paused the process at a random time (not at a debugger safepoint), then you can end up with "inconsistent state" as far as the runtime/debugging API is concerned.

One simple example of an inconsistent state might be your code ran new object[1204]. Imagine the GC is halfway through allocating that object when another thread crashes and pauses the process. The result might look like heap corruption: a section of the heap isn't walkable when you run !dumpheap, and !verifyheap might report errors...but the issue is actually that we got unlucky and paused the process at a time when our debugging layer can't make sense of the heap. This kind of thing is not fixable with our architecture and design, but that's ok, we get benefits in other areas.

This is why we don't throw exceptions when !ClrThread.IsAlive and you try to enumerate roots or a stack trace. Maybe the thread was just marked dead, but the native thread is still briefly alive...we could technically still walk the stack. ClrMD is meant to just give you the answers it can give you in that case.

> log something indecipherable to syslog

To be honest, I never considered where these messages go on Linux. They show up in Visual Studio's trace events when you are debugging something, and it was useful to have them show up there. I typically use Trace statements for "this is really weird, I need to capture this state so that folks can diagnose what's going on later".

I will change them to not end up in the syslog if !clrThread.IsAlive. If the thread is alive, and you do see that in syslog though, it does indicate a possible failure that the clr team might need to investigate...or it could just be "inconsistent state" in one particular dump file. In any case, chasing that down is what the logging is for.
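
Since these messages are described above as Trace statements, an application that does not want them handled by the default listener could, as a sketch, install its own `System.Diagnostics.TraceListener`. The `CollectingTraceListener` class below is hypothetical. It rests on the assumption that the messages reach the process through `System.Diagnostics.Trace` and that the default listener is what forwards them to syslog on Linux.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical listener that keeps Trace output in memory so the
// application can decide where (or whether) to log it.
public sealed class CollectingTraceListener : TraceListener
{
    private readonly List<string> _messages = new List<string>();
    private readonly object _lock = new object();

    // Snapshot of everything captured so far.
    public IReadOnlyList<string> Messages
    {
        get { lock (_lock) return _messages.ToArray(); }
    }

    // For simplicity, partial writes are recorded as separate entries.
    public override void Write(string? message) => WriteLine(message);

    public override void WriteLine(string? message)
    {
        if (message is null) return;
        lock (_lock) _messages.Add(message);
    }

    // Replaces the default listener (named "Default") with a collecting one.
    public static CollectingTraceListener Install()
    {
        Trace.Listeners.Remove("Default");
        var listener = new CollectingTraceListener();
        Trace.Listeners.Add(listener);
        return listener;
    }
}
```

Calling `CollectingTraceListener.Install()` once at startup, before loading the dump, lets the application inspect `Messages` afterwards and attach its own context, such as which dump was being processed.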

loop-evgeny commented on June 10, 2024

I see, thanks for the detailed explanation!

> Maybe the thread was just marked dead, but the native thread is still briefly alive...we could technically still walk the stack. ClrMD is meant to just give you the answers it can give you in that case.

Fair enough, I guess I see the trade-off.

> If the thread is alive, and you do see that in syslog though, it does indicate a possible failure that the clr team might need to investigate...or it could just be "inconsistent state" in one particular dump file. In any case, chasing that down is what the logging is for.

It would be really difficult to chase down with the current amount of information in that log message, even if we do happen to see it. It doesn't include the thread ID or even the process ID. To me it would still make sense to throw an exception for such an error at least when IsAlive=true, and include the thread ID in the exception.
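
Until something like that exists in the library, a caller-side wrapper could attach the thread identifiers itself. The sketch below assumes ClrMD 2.x or later; `StackWalkDiagnostics` and `GetFramesOrReport` are hypothetical names. Treating "alive but zero frames" as suspicious is only a heuristic, since a live thread may legitimately have no managed frames, and it is not how ClrMD itself detects a failed stack walk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Diagnostics.Runtime;

// Hypothetical helper; only the ClrMD types and members are real API.
public static class StackWalkDiagnostics
{
    // Returns the frames for a thread and reports, with identifying details,
    // when a thread marked alive yields no frames at all (a heuristic only).
    public static IReadOnlyList<ClrStackFrame> GetFramesOrReport(
        ClrThread thread, Action<string> report)
    {
        List<ClrStackFrame> frames = thread.EnumerateStackTrace().ToList();
        if (frames.Count == 0 && thread.IsAlive)
        {
            report($"No stack frames for live thread: OS id 0x{thread.OSThreadId:x}, " +
                   $"managed id {thread.ManagedThreadId}");
        }
        return frames;
    }
}
```

Passing `Console.Error.WriteLine` (or a logger method) as `report` keeps the decision about where the message goes with the application.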
