
fix: split inproc handler thread #1446

Open
supervacuus wants to merge 136 commits into master from
fix/split_inproc_handler_thread

Collaborator

@supervacuus supervacuus commented Nov 10, 2025

This is a more elaborate, long-term fix to getsentry/sentry-java#4830 than #1444.

It also finishes the work done here: #1088
And fixes the issues raised here:

So, while the driver for this PR is a downstream issue that exposes the signal-unsafety of some parts of the current inproc implementation, it also addresses a much broader range of concerns that regularly affect inproc users on all platforms.

At a high level, it introduces a separate handler thread for inproc, which the signal handler (or UEF on Windows) wakes after it exchanges crash context data.

The idea is to keep the signal handler/UEF minimal, so that it issues as few syscalls as possible (or at least stays within the subset documented in the signal-safety man page), while the handler thread can execute functions outside that range (with limitations, since thread synchronization and heap allocations are still problematic). This allows us to reuse stdio functionality like formatters without running squarely into UB territory or having to rewrite all utilities to async-signal-safe versions, as in #1444.
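To illustrate the split, here is a minimal POSIX sketch, not the code in this PR: the signal handler only stores into preallocated memory and calls `write()` on a pipe, and the handler thread blocks in `read()` until it is woken. All names (`crash_context_t`, `install_crash_handler`, `handler_thread_wait`) are hypothetical, and the pipe is just one possible wake mechanism. Thread creation and the crashed thread waiting for the report to finish are left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical crash context handed from the signal handler to the handler
 * thread. A real implementation would also copy siginfo_t and the ucontext. */
typedef struct {
    volatile sig_atomic_t signum;
} crash_context_t;

static crash_context_t g_crash_context;
static int g_wake_pipe[2] = { -1, -1 };

/* Signal handler: stores into preallocated memory and calls write(), which
 * is on the POSIX list of async-signal-safe functions. */
static void
crash_signal_handler(int signum, siginfo_t *info, void *ucontext)
{
    (void)info;
    (void)ucontext;
    int saved_errno = errno;
    g_crash_context.signum = signum;
    const char wake = 1;
    ssize_t rv = write(g_wake_pipe[1], &wake, 1);
    (void)rv;
    errno = saved_errno;
}

/* Setup code, runs outside of any signal handler. */
int
install_crash_handler(int signum)
{
    assert(g_wake_pipe[0] == -1 && "install only once");
    if (pipe(g_wake_pipe) != 0) {
        return -1;
    }
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = crash_signal_handler;
    sa.sa_flags = SA_SIGINFO;
    return sigaction(signum, &sa, NULL);
}

/* Meant to run on a dedicated thread: blocks until the signal handler wakes
 * us, then returns the signal number (or -1 if the read failed). Event
 * construction would happen after this returns. */
int
handler_thread_wait(void)
{
    char wake;
    ssize_t n;
    do {
        n = read(g_wake_pipe[0], &wake, 1);
    } while (n < 0 && errno == EINTR);
    if (n != 1) {
        return -1;
    }
    return (int)g_crash_context.signum;
}
```

In a real setup, `handler_thread_wait()` would be the top of the thread function passed to `pthread_create()` during initialization.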

There are a few considerable changes to mention:

  • since we run the event construction in a separate handler thread, the use of backtrace() or any unwinder that runs from the "current" instruction address is entirely useless (ignoring the fact that backtrace() was always signal-unsafe to begin with, which itself was the source of crashes, hangs or just empty stack traces).
  • this means we require a "user context"-based stack walker in inproc, which we already partially acknowledged in Using libunwind for mac, since backtrace do not expect thread context… #1088 and fix: support musl on Linux #1233.
  • on Linux, this PR requires libunwind (the nongnu implementation, not the llvm one, which is a pure C++ exception unwinder), which is a breaking change (at least in the sense that users now require an additional dependency at build and runtime). This means that the "general" Linux usage is now the same as with the musl libc environments.
  • on macOS, we provide a user context stack-walker based on frame pointer records for arm64 and x86-64, and use the system-provided libunwind for the default stack-trace from a call-site. It turned out that the system-provided libunwind wasn't safe enough to use in the context of the signal handler (either led to hangs or had issues with escaping the trampoline). This means users can now use inproc on macOS again (if their code is compiled without omitting frame pointers, which is always the case by default on macOS).
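As a rough illustration of what a "user context"-based walker over frame pointer records does, here is a sketch, not the walker from this PR. It assumes the usual record layout (saved caller frame pointer followed by the return address), takes the starting frame pointer as it would be read from the crashed thread's register context, and validates every record against hypothetical stack bounds. The name `walk_frame_records` and the bounds parameters are made up for the example; PAC stripping and the special handling of the first frame are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frame record as laid out by the frame-pointer convention: the saved
 * caller frame pointer, followed by the return address. */
typedef struct {
    uintptr_t caller_fp;
    uintptr_t return_address;
} frame_record_t;

static_assert(sizeof(frame_record_t) == 2 * sizeof(uintptr_t),
    "a frame record must be exactly two pointers wide");

/* Walks frame records starting at `fp` and stores the return addresses in
 * `frames`. `stack_lo` and `stack_hi` (one past the end) are the assumed
 * bounds of the crashed thread's stack. Returns the number of frames. */
size_t
walk_frame_records(uintptr_t fp, uintptr_t stack_lo, uintptr_t stack_hi,
    uintptr_t *frames, size_t max_frames)
{
    size_t count = 0;
    if (stack_hi < stack_lo
        || stack_hi - stack_lo < sizeof(frame_record_t)) {
        return 0;
    }
    while (count < max_frames) {
        /* The record must be pointer-aligned and lie fully inside the
         * stack before we dereference it. */
        if (fp < stack_lo || fp > stack_hi - sizeof(frame_record_t)
            || fp % sizeof(uintptr_t) != 0) {
            break;
        }
        const frame_record_t *record = (const frame_record_t *)fp;
        if (record->return_address == 0) {
            break;
        }
        frames[count++] = record->return_address;
        /* The stack grows down, so the caller's record must sit at a
         * strictly higher address. This also rejects cycles. */
        if (record->caller_fp <= fp) {
            break;
        }
        fp = record->caller_fp;
    }
    return count;
}
```

The instruction pointer from the register context would be reported as the first frame before this loop runs, since the records only contain return addresses.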

Further improvements/fixes (summarizing the 30 commits, which I didn't want to squash):

  • the libunwind-based unwinder modules now also validate retrieved ucontext pointers against memory mapping (for Linux and macOS)
  • got rid of all remaining __sync functions and replaced them with __atomic (especially the signal handler blocking logic and the spinlock)
  • rectified the inconsistent usage of C++ new with std::nothrow throughout the affected backend code (including the initialization of crashpad_state_t, which still used malloc and memset although it has std::atomic members)
  • cleaned up the CMake configure phase of the integration test suite.
  • ensures that test fixtures do not end up in macOS bundles
  • fixes build issues with by-default PIE and LTO builds
  • musl is no longer a special case "Linux" in the build script
  • fixes a couple of warnings and test-case instabilities
  • introduce macos-26 build config
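For readers unfamiliar with the `__sync` to `__atomic` move, the following sketch shows what a spinlock looks like with the `__atomic` builtins of GCC and Clang. It is an illustration only; the type and function names are hypothetical and do not match the ones in the code base.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical spinlock type: 0 means unlocked, 1 means locked. */
typedef struct {
    int locked;
} demo_spinlock_t;

/* A lock used near signal handlers should not fall back to a hidden
 * library lock, so require lock-free int atomics at compile time. */
static_assert(__atomic_always_lock_free(sizeof(int), 0),
    "int atomics must be lock-free");

/* Previously: __sync_lock_test_and_set(&lock->locked, 1) == 0 */
bool
demo_spinlock_try_lock(demo_spinlock_t *lock)
{
    return __atomic_exchange_n(&lock->locked, 1, __ATOMIC_ACQUIRE) == 0;
}

void
demo_spinlock_lock(demo_spinlock_t *lock)
{
    while (!demo_spinlock_try_lock(lock)) {
        /* Spin on a plain load until the lock looks free, so that waiting
         * threads do not keep writing to the contended cache line. */
        while (__atomic_load_n(&lock->locked, __ATOMIC_RELAXED) != 0) {
        }
    }
}

/* Previously: __sync_lock_release(&lock->locked) */
void
demo_spinlock_unlock(demo_spinlock_t *lock)
{
    __atomic_store_n(&lock->locked, 0, __ATOMIC_RELEASE);
}
```

The main practical difference is that the `__atomic` builtins take an explicit memory order, which makes the acquire/release pairing visible at the call site.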

TODOs:

  • finish the x86-64 stackwalker for macOS (and clean up the code)
  • Figure out if we need the libbacktrace fallback at all and how to handle it.
  • provide a module-level description of the new mechanism in inproc
  • Decide on having the change
  • Update documentation
    • Advanced usage might be outdated with respect to the signal handling of inproc
    • Remove mentions of inproc not working on macOS
    • Clarify the new libunwind dependency on Linux

* use `std::nothrow` `new` consistently to keep exception-free semantics for allocation
* rename static crashpad_handler to have no module-public prefix
* use `nullptr` for arguments where we previously used 0 to clarify that those are pointers
* eliminate the `memset()` of the `crashpad_state_t` initialization since it now contains non-trivially constructable fields (`std::atomic`) and replace it with `new` and an empty value initializer.
…ld, since libraries like libunwind.a might be packaged without PIC.
…ms with architecture prefixes (32-bit Linux)
…stack

also ensure to get the first frame
harmonize libunwind usage

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


…m is not enough, because the android tests run on macOS runners)
…sed unwinder to recover a frame when the callee has no frame

 record. So, let's limit that particular test-case to aarch64.
Collaborator Author

supervacuus commented Feb 13, 2026

Summary of all technical changes since the last review (@mujacica, @JoshuaMoelans, @jpnurmi)

  • I added signal-safe logging macros to inproc. This allows us to perform basic logging in signal handlers without introducing additional stdio formatting. This is currently only in inproc because it doesn't align with the other logging capabilities, but we can move it later if needed.
  • I introduced a minimal state machine for entering the signal handler from multiple threads. It acts orthogonally to the thread blocker we get with sentry__enter_signal_handler() and runs on all platforms. It acts as a policy that removes interference from other crashing threads, as follows:
    • The first actual crash that enters wins the race
    • All other threads will be blocked
    • until the first thread has finished reporting and unregistered our signal handlers (at which point the blocked threads return and, at worst, reenter the default handler)
    • this also allows us to safely transfer sentry__enter_signal_handler() from the crashed thread to the handler thread
  • The handler thread also gets a sigaltstack if it didn't get one from the libc implementation. This prevents stack overflow in the hooks from exhausting a recursive report attempt.
  • sentry__enter_signal_handler() now also tracks how often a thread has entered the signal handler and returns that value. This allows us to cleanly cut handler recursion where we want it, rather than deferring that decision to however a particular environment does it (or how abort() resets the signal mask).
  • With the above, I also introduce a recursion policy that is currently only implemented for inproc (and obviously open for discussion):
    • on first entry, the handler processes as it normally should (I still hope this is beyond the 95th percentile of scenarios)
    • on first re-entry (meaning we had a crash during handling), we disable the hooks (because those are often the cause of crashes and the code we run there is entirely out of our control) but still attempt to report
    • A second re-entry means we bail out and do not attempt to create a report (this prevents stupid endless cycles of handler invocations)
    • In sentry_init(), we wait for the handler thread to transition to its ready state, and fail and clean up if that times out.
  • inproc now also has an abort() handler on Windows, analogous to how crashpad did it.
  • inproc now reuses the logger_disable whenever it falls back to creating a crash report from within the signal handler (dramatically reducing our exposure to stdio formatting in the signal handler)
  • inproc now acts more defensively if any of the signal handler to handler thread sync mechanisms fail.
  • I introduced a separate integration test suite (still uses pytest as a runner) that doesn't fill example.c with additional cases, which is entirely focused on triggering edge cases in the inproc backend. This was hugely helpful, and I recommend continuing down this route for any testing that doesn't involve a user-facing API. I also restructured the pytest build configuration so that tests such as the inproc stress-test can reuse configuration that isn't specific to building example.c.
  • I made the inproc stress-test also work on Android, meaning we now have an integration test setup that allows us to run more complex things than unit tests can cover against bionic. This is not entirely generic; feel free to extend and reuse as needed.
  • I introduced the signal-safe address formatter from fix: use a signal-safe address formatter #1444 to this PR, because the split is only a best effort, and we should still reduce our exposure to stdio formatting as much as possible (not gonna fix all formatters in this PR, but this one already exists).
  • We now test abort() handling in CI across backends. This was previously deactivated due to stuck Windows runners (aborting leads to dialogs that require interaction). The pop-up is deactivated. abort() is sufficiently distinct from other terminal signals that it should be tested everywhere.
  • We have a macOS arm64e build config in CI now that actually checks the code-base for PAC handling (breakpad is excluded!). As part of this, I found that GHA runner configs documented for arm64 machines are often scheduled on x86_64 machines. To track this, I added the "Debug runner architecture" step to the CI workflow.
  • Improved PAC handling in the macOS stack walker, and also added some specifics on how aarch64 handles stack-walking after the first frame (the inline comments go into more detail on what is handled, and there are tests now that actually check the sequencing of stack frames for given top-frame scenarios).
  • SENTRY_WITH_UNWINDER_LIBBACKTRACE configurations that fall back on backtrace() get a warning log on initialization.
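The "first crash wins" state machine and the recursion policy from the list above can be sketched roughly as follows. This is a simplified illustration with hypothetical names, not the actual implementation; in particular, it assumes that the entry count reported by sentry__enter_signal_handler() is 1 on the first entry, which the summary does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states of the crash-handling state machine. */
enum {
    CRASH_STATE_IDLE = 0,
    CRASH_STATE_HANDLING = 1,
    CRASH_STATE_DONE = 2
};

static int g_crash_state = CRASH_STATE_IDLE;

static_assert(__atomic_always_lock_free(sizeof(int), 0),
    "the state must be a lock-free atomic to be usable in a signal handler");

/* Returns true only for the first crashing thread. Every other thread
 * loses the race and would block until crash_handling_done() is true. */
bool
claim_crash_handling(void)
{
    int expected = CRASH_STATE_IDLE;
    return __atomic_compare_exchange_n(&g_crash_state, &expected,
        CRASH_STATE_HANDLING, false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}

/* Called by the winner after reporting and unregistering the handlers. */
void
finish_crash_handling(void)
{
    __atomic_store_n(&g_crash_state, CRASH_STATE_DONE, __ATOMIC_RELEASE);
}

bool
crash_handling_done(void)
{
    return __atomic_load_n(&g_crash_state, __ATOMIC_ACQUIRE)
        == CRASH_STATE_DONE;
}

typedef enum {
    HANDLER_REPORT_WITH_HOOKS,
    HANDLER_REPORT_WITHOUT_HOOKS,
    HANDLER_BAIL_OUT
} handler_action_t;

/* Maps the per-thread entry count (assumed to start at 1) to an action:
 * first entry reports normally, the first re-entry skips the hooks, and
 * any further re-entry gives up on reporting. */
handler_action_t
recursion_policy(int entry_count)
{
    if (entry_count <= 1) {
        return HANDLER_REPORT_WITH_HOOKS;
    }
    if (entry_count == 2) {
        return HANDLER_REPORT_WITHOUT_HOOKS;
    }
    return HANDLER_BAIL_OUT;
}
```

Keeping the policy in a pure function like this makes it easy to test independently of any signal delivery.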

@supervacuus supervacuus requested review from JoshuaMoelans, jpnurmi and mujacica and removed request for jpnurmi February 13, 2026 08:43
Collaborator

@jpnurmi jpnurmi left a comment

Are the crashpad and breakpad changes related?

* This is a replacement for `snprintf` in signal handlers:
* - signal-safe: uses no stdio, malloc, locks, or thread-local state.
* - reentrant: only stack locals; no writable globals.
* Returns 0 on success.
Collaborator

"Returns 0 on success" but returns bool

Collaborator Author

True, I only copied the code from the old branch and didn't check whether the inline docs survived the code changes. Thanks. The function should likely not return anything.
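For context on this thread, a signal-safe address formatter with the properties listed in the doc comment (no stdio, no malloc, no locks, only stack locals) could look like the sketch below. It is not the function under review: the name `format_address`, the buffer-length constant, and the choice to return the number of characters written are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "0x" + up to 16 hex digits + NUL terminator. */
enum { ADDRESS_BUF_LEN = 2 + 16 + 1 };

static_assert(sizeof(uintptr_t) <= 8,
    "ADDRESS_BUF_LEN assumes addresses of at most 16 hex digits");

/* Formats `addr` as "0x<lowercase hex>" into `buf` without leading zeros.
 * Returns the number of characters written (excluding the terminator), or
 * 0 if the buffer is too small. Uses only stack locals and a read-only
 * digit table, so it is reentrant and safe to call in a signal handler. */
size_t
format_address(uintptr_t addr, char *buf, size_t buf_len)
{
    static const char digits[] = "0123456789abcdef";

    /* Count the hex digits; at least one, so that 0 formats as "0x0". */
    size_t num_digits = 1;
    for (uintptr_t rest = addr >> 4; rest != 0; rest >>= 4) {
        num_digits++;
    }

    size_t len = 2 + num_digits;
    if (buf == NULL || buf_len < len + 1) {
        if (buf != NULL && buf_len > 0) {
            buf[0] = '\0';
        }
        return 0;
    }

    buf[0] = '0';
    buf[1] = 'x';
    /* Fill the digits from the least significant nibble backwards. */
    for (size_t i = 0; i < num_digits; i++) {
        buf[len - 1 - i] = digits[addr & 0xf];
        addr >>= 4;
    }
    buf[len] = '\0';
    return len;
}
```

A caller would typically pass a stack buffer of `ADDRESS_BUF_LEN` bytes and hand the result to `write()`.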

Collaborator Author

Are the crashpad and breakpad changes related?

Only in the sense that I was getting compiler errors with Werror on one of my test setups, and I thought: Why not fix it. The changes are not behavioral at all: the initial code was already written as if new could return a nullptr.
