
Fix spurious "thread death" TOCTOU race in python_worker #8

Merged
ctrueden merged 1 commit into apposed:main from NicoKiaru:fix/thread-death-toctou-issue-15
Jun 4, 2026

Conversation

@NicoKiaru
Contributor

Summary

Fixes the sporadic spurious "thread death" task failures reported in apposed/appose#15.

Root cause

In Worker._process_input(), non-main tasks were wired up as:

task._thread = Thread(target=task._run, name=f"Appose-{uuid}")
task._thread.start()

Here task._thread is assigned before start() is called. The janitor thread, _cleanup_threads(), flags any task where task._thread is not None and not task._thread.is_alive(). A thread that has been constructed but not yet started reports is_alive() as False, so if the janitor's 0.05s poll lands in the window between Thread() construction and start(), it wrongly fails the task with "thread death", even though the thread is about to run fine. This is a time-of-check to time-of-use (TOCTOU) race: the janitor checks liveness at a moment when the reference is already published but the thread has not yet started.
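The key fact behind the race is that Thread.is_alive() is False both before start() and after the thread exits, so a liveness check alone cannot tell "not started yet" from "died". A minimal sketch of that lifecycle follows; is_flagged_dead is a hypothetical helper that mirrors the janitor condition described above, not the actual Appose code.

```python
import threading


def is_flagged_dead(thread):
    """Hypothetical mirror of the janitor check: referenced but not alive."""
    return thread is not None and not thread.is_alive()


# An event keeps the thread alive until we explicitly release it.
release = threading.Event()
t = threading.Thread(target=release.wait)

# Constructed but not started: is_alive() is False, so the check fires.
flagged_before_start = is_flagged_dead(t)

t.start()
# Started and blocked on the event: is_alive() is True, so the check is quiet.
flagged_while_running = is_flagged_dead(t)

release.set()
t.join()
# Exited: is_alive() is False again, indistinguishable from the first state.
flagged_after_exit = is_flagged_dead(t)

print(flagged_before_start, flagged_while_running, flagged_after_exit)
```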

Fix

Build the thread into a local, start it, and assign task._thread only after start() returns, so the janitor can never observe a referenced thread that isn't yet alive:

t = Thread(target=task._run, name=f"Appose-{uuid}")
t.start()
task._thread = t
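To make the difference between the two orderings concrete, the sketch below simulates a janitor poll landing exactly in the window between Thread() construction and start(). The Task class, janitor_would_flag, and the two launch helpers are hypothetical stand-ins written for illustration; only the ordering of the three statements is taken from this PR.

```python
import threading


class Task:
    """Hypothetical stand-in for the worker's task object."""

    def __init__(self):
        self._thread = None

    def _run(self):
        pass  # a tiny task that cannot legitimately die


def janitor_would_flag(task):
    """Mirror of the janitor condition described in the PR text."""
    t = task._thread
    return t is not None and not t.is_alive()


def launch_assign_first(task, poll):
    """Old ordering: publish the reference, then start."""
    task._thread = threading.Thread(target=task._run)
    observed = poll(task)  # simulated janitor poll inside the window
    task._thread.start()
    return observed


def launch_start_first(task, poll):
    """New ordering: start via a local, publish the reference afterwards."""
    t = threading.Thread(target=task._run)
    observed = poll(task)  # same window, but task._thread is still None
    t.start()
    task._thread = t
    return observed


old_task, new_task = Task(), Task()
flagged_old = launch_assign_first(old_task, janitor_would_flag)
flagged_new = launch_start_first(new_task, janitor_would_flag)
old_task._thread.join()
new_task._thread.join()

print("assign-first flagged:", flagged_old)
print("start-first flagged:", flagged_new)
```

With the old ordering the simulated poll sees a referenced, not-yet-alive thread; with the new ordering it sees None and skips the task.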

Reproduction & verification

Added test_thread_death_stress, which floods the worker with 16 threads × 200 tiny tasks (none of which can legitimately die), so any "thread death" is the bug.

  • Before the fix: failed on every run (e.g. 5/3200, 8/3200, 4/3200 tasks failing, all with "thread death").
  • After the fix: passes consistently (6/6 consecutive runs green).
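The shape of such a stress test can be sketched in a self-contained form. This is not test_thread_death_stress itself: the Task class, the done flag (assumed here so that normally finished tasks are not counted), the janitor loop, the 1 ms poll interval, and the smaller 4 × 50 load are all assumptions chosen to keep the example short. It uses the fixed ordering, so no task should ever be flagged.

```python
import threading
import time


class Task:
    """Hypothetical task; `done` marks normal completion (an assumption)."""

    def __init__(self):
        self._thread = None
        self.done = False
        self.spurious = False

    def _run(self):
        self.done = True  # set before the thread exits


def janitor(tasks, lock, stop, poll=0.001):
    """Flag tasks whose thread is referenced, not alive, and not done."""
    while not stop.is_set():
        with lock:
            snapshot = list(tasks)
        for task in snapshot:
            t = task._thread
            if t is not None and not t.is_alive() and not task.done:
                task.spurious = True
        time.sleep(poll)


def submit(tasks, lock, count):
    """Submit tiny tasks using the fixed ordering: start, then assign."""
    for _ in range(count):
        task = Task()
        with lock:
            tasks.append(task)
        t = threading.Thread(target=task._run)
        t.start()
        task._thread = t


def stress(submitters=4, per_submitter=50):
    tasks, lock, stop = [], threading.Lock(), threading.Event()
    j = threading.Thread(target=janitor, args=(tasks, lock, stop))
    j.start()
    workers = [
        threading.Thread(target=submit, args=(tasks, lock, per_submitter))
        for _ in range(submitters)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    for task in tasks:
        task._thread.join()
    stop.set()
    j.join()
    return len(tasks), sum(task.spurious for task in tasks)


total, spurious = stress()
print(f"{spurious}/{total} tasks flagged")
```

Swapping the last two statements of submit back to the assign-first ordering would reopen the window, though how often a poll lands in it depends on timing and is not guaranteed on any single run.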

Full tests/test_service.py Python suite passes with no regressions, including test_python_sys_exit (a genuine sys.exit in a task is now caught and reported as SystemExit: 123, which the test allows).

In Worker._process_input(), non-main tasks assigned task._thread before
calling start(). The janitor thread (_cleanup_threads) flags any task
where task._thread is set and the thread is not alive. A
just-constructed, not-yet-started thread is not alive, so if the
janitor's 0.05s poll landed in the window between Thread() construction
and start(), it would wrongly fail the task with "thread death" even
though the thread was about to run fine.

Assign task._thread only after start() returns, closing the window.

Also add test_thread_death_stress, which floods the worker with many
concurrent tiny tasks (none of which can legitimately die) to surface
the race. It failed reliably before the fix and passes consistently
after.

Fixes apposed/appose#15.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ctrueden force-pushed the fix/thread-death-toctou-issue-15 branch from b3a97fa to fbc25e0 on June 4, 2026 18:32
@ctrueden merged commit e44d688 into apposed:main on Jun 4, 2026
6 checks passed