Skip to content

fix(tables): surface real error causes on cell-execution failures (diagnostics)#4868

Merged
TheodoreSpeaks merged 6 commits into
stagingfrom
fix/cell-db-failure
Jun 3, 2026
Merged

fix(tables): surface real error causes on cell-execution failures (diagnostics)#4868
TheodoreSpeaks merged 6 commits into
stagingfrom
fix/cell-db-failure

Conversation

@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator

@TheodoreSpeaks TheodoreSpeaks commented Jun 3, 2026

Problem

workflow-group-cell runs intermittently fail on trivial DB reads/writes under heavy fan-out, leaving cells stranded in running. Investigation showed the PlanetScale and ElastiCache backends were healthy at the time, and the cell doesn't checkpoint — so these are transient connection-level faults. But we could never identify the actual cause: the catch blocks log toError(error).message, which for a Drizzle error is the wrapper "Failed query: ..." — the real driver cause (ECONNRESET / statement timeout / AbortError/…) lives in error.cause, which is never logged (and @sim/logger drops the non-enumerable .cause).

Approach: diagnostics-first (no retry)

We deliberately do not add in-process retry. It's the wrong layer here — the task is maxAttempts: 1 by design, retrying on a possibly-degraded worker may not help, and it masks the very signal we're trying to capture before the root cause is understood. Instead: catch, log the real cause, fail fast. Once the logs tell us what's actually happening, we decide the real fix from evidence (e.g. task-level maxAttempts, or addressing the worker-side cause).

Changes

  • describeError() in lib/core/errors/retryable-infrastructure.ts — walks the .cause chain (reuses the existing getErrorChain) and always returns the underlying driver cause (code/errno/syscall + causeChain), including for unclassified errors like AbortError.
  • Added cause: describeError(err) + a retryable: isRetryableInfrastructureError(err) classification flag across the cell + finalization catch blocks (workflow-column-execution.ts, logging-session.ts, pause-persistence.ts), mirroring the existing schedule-execution.ts pattern. Logging-only; no behavior change.

Testing

  • retryable-infrastructure.test.ts (5) — cause-chain walk, AbortError, non-Error input, depth cap. All pass.
  • biome clean; full non-incremental tsc clean (incremental tsc had previously masked a scope error — verified with --incremental false).

Notes

  • Diagnostics-only by design: this does not stop a transient blip from stranding a cell — it makes the cause visible so the right fix can be chosen next. Zombie cleanup of already-stranded cells is a separate follow-up.

🤖 Generated with Claude Code

@vercel
Copy link
Copy Markdown

vercel Bot commented Jun 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
docs Ready Ready Preview, Comment Jun 3, 2026 9:55pm

Request Review

@cursor
Copy link
Copy Markdown

cursor Bot commented Jun 3, 2026

PR Summary

Low Risk
Observability-only logging changes; no control-flow, auth, or retry policy changes.

Overview
Adds describeError() in @sim/utils/errors to walk .cause chains and expose the deepest driver fields (code, errno, syscall) plus an optional causeChain, because loggers only see wrapper messages like Drizzle’s "Failed query: ...".

Logging-only updates across schedule execution, workflow group cell execution, execution logging session, and pause persistence: error logs now include structured cause: describeError(err) and, where relevant, retryable: isRetryableInfrastructureError(err). No retry or execution behavior changes.

Vitest coverage for describeError (wrapped driver errors, AbortError, cycles, depth cap).

Reviewed by Cursor Bugbot for commit 332223b. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread apps/sim/background/workflow-column-execution.ts Outdated
Comment thread apps/sim/lib/table/retry-transient.ts Outdated
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

This PR adds diagnostics-only logging for transient DB/network failures in cell-execution paths. The core addition is describeError(), which walks the .cause chain and surfaces the deepest driver-level error fields (code/errno/syscall) that @sim/logger previously dropped because Error.prototype.cause is non-enumerable. No retry logic is added; catch blocks now emit the richer structured fields alongside the existing error message string.

  • retryable-infrastructure.ts: adds describeError() (reuses the existing getErrorChain helper) and the DescribedError interface; cycles are bounded by the pre-existing depth-10 guard.
  • retryable-infrastructure.test.ts: five new tests cover single errors, wrapped driver errors, AbortError, non-Error inputs, and cyclic cause chains.
  • workflow-column-execution.ts, logging-session.ts, pause-persistence.ts: each catch block now passes cause: describeError(err) and retryable: isRetryableInfrastructureError(err) as explicit structured log fields.

Confidence Score: 5/5

Safe to merge — all changes are purely additive to log payloads with no behaviour modifications.

The change is logging-only: describeError cannot throw (the test confirms this for non-Error inputs and cycles), and its output is always a plain object. Every call site passes the result as an extra field in an existing log statement; none of the surrounding error handling or control flow is altered. The five new unit tests cover the cases that matter for correctness.

No files require special attention.

Important Files Changed

Filename Overview
apps/sim/lib/core/errors/retryable-infrastructure.ts New describeError() function walks the .cause chain via the existing getErrorChain helper, reports the deepest link's diagnostics fields, and guards against cycles with a depth-10 cap. Logic is correct for all tested paths.
apps/sim/lib/core/errors/retryable-infrastructure.test.ts New test file covers single error, wrapped driver error (ECONNRESET), AbortError, non-Error input, and cyclic cause chain — all five cases are meaningful and distinct.
apps/sim/background/workflow-column-execution.ts Three catch blocks enriched with cause: describeError(err) and retryable: isRetryableInfrastructureError(err) fields; purely additive to log payloads, no behaviour change.
apps/sim/lib/logs/execution/logging-session.ts Four catch blocks in LoggingSession now emit cause/retryable alongside the existing error string; pattern is consistent across all sites.
apps/sim/lib/workflows/executor/pause-persistence.ts Two catch blocks in handlePostExecutionPauseState enriched with the same cause/retryable diagnostic fields; no behaviour change.

Reviews (5): Last reviewed commit: "refactor(tables): drop in-process retry,..." | Re-trigger Greptile

…surface error causes

Workflow-group-cell runs intermittently failed on trivial DB reads/writes
under heavy fan-out, stranding cells in `running`. Investigation showed the
PlanetScale and ElastiCache backends were healthy at the time — the failures
are transient connection-level faults that the cell (maxAttempts: 1) had no
tolerance for, and the real cause was never logged (Drizzle wraps it as
"Failed query: ..." and the driver cause lives in error.cause).

Resilience:
- Add retryTransient (lib/table/retry-transient.ts): retries only transient
  infra errors (reuses isRetryableInfrastructureError; adds an ioredis
  command-timeout match) with jittered backoff, then rethrows. Fail-fast for
  everything else.
- Wrap the cell's getTableById/getRowById reads, the terminal write
  (cell-write updateRow — idempotent via the executionId guard), and the
  Redis cascade-lock acquire.

Diagnostics:
- Add describeError (lib/core/errors/retryable-infrastructure.ts): walks the
  .cause chain and always returns the underlying driver cause (code/errno/
  syscall + causeChain), including for unclassified errors like AbortError.
- Log `cause` + a `retryable` flag (and aborted/timedOut in the cell's main
  catch) across the cell + finalization error paths, mirroring the existing
  schedule-execution pattern. Logging-only; no behavior change. This lets the
  next recurrence reveal the real cause and whether the retry applies.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread apps/sim/lib/table/retry-transient.ts Outdated
Comment thread apps/sim/lib/table/retry-transient.ts Outdated
@TheodoreSpeaks TheodoreSpeaks changed the base branch from main to staging June 3, 2026 20:02
Comment thread apps/sim/background/workflow-column-execution.ts Outdated
Comment thread apps/sim/lib/table/retry-transient.ts Outdated
Comment thread apps/sim/lib/table/retry-transient.ts Outdated
Comment thread apps/sim/lib/table/cascade-lock.ts Outdated
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

@greptile review

Comment thread apps/sim/lib/table/cascade-lock.ts Outdated
- retryTransient: re-check the abort signal after the backoff sleep so a
  cancellation during sleep stops the next attempt (don't run/return work for
  an already-cancelled task).
- isRetryableRedisError: walk the .cause chain (mirroring the infra
  classifier) so wrapped Redis timeouts are recognized; drop "Connection is in
  subscriber mode" — that's a connection-state programming error, not a
  transient drop, and would just fail identically every retry.
- cascade-lock: stop wrapping acquireLock in retryTransient. acquireLock is a
  non-idempotent SET NX, so retrying after a timed-out-but-applied first SET
  returns false (key already ours) and yields a false `contended` that skips
  the cascade. A transient Redis blip here just fails the run before pickup
  (no stranded cell); the dispatcher re-drives it.
- Tests: cause-chain Redis match, subscriber-mode exclusion, abort-during-sleep.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

Thanks for the reviews — addressed in 8812154:

Fixed

  • Signal not re-checked after backoff sleep (Greptile P1 / Cursor): now re-checks signal.aborted after sleep, so a cancellation mid-backoff stops the next attempt. Added a test.
  • "Connection is in subscriber mode" retried (Greptile P2): removed — it's a connection-state programming error, not a transient drop. Added an exclusion test.
  • Redis matcher skips cause chain (Cursor): isRetryableRedisError now walks the .cause chain (depth-bounded), mirroring the infra classifier. Added a wrapped-cause test.
  • Lock retry → false contention (Cursor): stopped wrapping acquireLock in retryTransient. It's a non-idempotent SET NX, so retrying after a timed-out-but-applied first SET returns false (key already ours) → a false contended that skips the cascade. A transient blip there now just fails the run before pickup (no stranded cell); the dispatcher re-drives it.

Not a bug (won't change)

  • Catch references uninitialized abortSignal/timeoutController (Cursor, High): the catch that logs aborted/timedOut closes the inner try (opened at L663, around executeWorkflow), which begins after both bindings are declared (L657–658). Failures before that point are handled by the enclosing try, not this catch. tsc confirms the bindings are in scope and definitely assigned at that catch (no TS2304/TS2454), so there's no reachable path where they're uninitialized.

@greptile review

The main catch logged `aborted`/`timedOut` from `abortSignal`/`timeoutController`,
but those are declared inside the outer try block (the inner try around
executeWorkflow is try/finally, so this catch belongs to the outer try) and are
not in scope in the catch — `next build`'s type-check failed with "Cannot find
name 'abortSignal'". Local incremental `tsc --noEmit` had skipped the file and
falsely passed; the Cursor/Greptile reviewers flagged this correctly.

Removed the two fields. Abort/timeout is still surfaced via `cause:
describeError(err)` (an aborted run shows `name: 'AbortError'` / the timeout
message), so no diagnostic signal is lost.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

Correction on the abort-vars finding — you were right, I was wrong. I claimed it was a false positive based on a local tsc --noEmit that passed, but that was incremental tsc skipping the file. CI's next build type-check correctly failed with Cannot find name 'abortSignal': the inner try around executeWorkflow is try/finally, so the logging catch belongs to the outer try, where abortSignal/timeoutController (declared inside it) are out of scope.

Fixed in the latest commit by removing aborted/timedOut from that catch. Abort/timeout is still captured via cause: describeError(err) (name: 'AbortError' / the timeout message), so no signal is lost. Thanks for catching it.

@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

Greptile's P1 on acquireLock (non-idempotent SET NX wrapped in retry) was already resolved in 8812154 — the retryTransient wrapper was removed, so it's now a plain acquireLock call. @greptile review

In-process retry is the wrong layer for this path: the cell task is
maxAttempts:1 by design, retrying on a possibly-degraded worker may not help,
and it masks the very transient-failure signal we're trying to capture before
we understand the root cause. Removed retryTransient entirely (file + all
wrapping in cell-write, the cascade reads, and the lock acquire) and kept only
the diagnostic logging.

- Deleted lib/table/retry-transient.ts (+ test); cell-write and the cascade
  reads call getTableById/getRowById/updateRow directly again, fail-fast.
- Kept describeError + `cause`/`retryable` fields across the cell + finalization
  catch blocks; the cell-path `retryable` flag now sources from
  isRetryableInfrastructureError (the canonical classifier) for consistency.

Diagnostics-first: surface the real driver cause on the next recurrence, then
decide the actual fix (e.g. task-level maxAttempts, or addressing the worker-
side cause) from evidence rather than a speculative in-process retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@TheodoreSpeaks TheodoreSpeaks changed the title fix(tables): retry transient DB/Redis failures in cell execution and surface error causes fix(tables): surface real error causes on cell-execution failures (diagnostics) Jun 3, 2026
@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

Simplified per discussion: removed the in-process retry entirely (wrong layer for a maxAttempts: 1 task, and premature before we know the root cause) and kept only the cause diagnostics. −240 lines. @greptile review

Comment thread apps/sim/background/workflow-column-execution.ts
The scheduled-job failure paths logged the raw error (.message/stack only) —
its `.cause` (the real driver error behind a Drizzle "Failed query: ..."
wrapper) was never recorded, and the classified-only
`describeRetryableInfrastructureError` returns undefined for unrecognized
errors. A real failed run (same incident window as the cell failures) failed in
`applyScheduleUpdate` with exactly this unrecorded cause.

Added `cause: describeError(error)` (always-on, walks the cause chain) to the
applyScheduleUpdate catch, the early-failure catch, and the unhandled-error
catch — passed as a second arg so the existing message+stack still emit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`describeError` is a general-purpose error/cause-chain helper — it didn't
belong in `lib/core/errors/retryable-infrastructure.ts` (that module is
specifically about classifying retryable infra errors, and the name read wrong
for a generic diagnostic). Moved it to `@sim/utils/errors` alongside `toError`/
`getErrorMessage`/`getPostgresErrorCode`, with its own cycle-safe cause walk.

- Added describeError + DescribedError + tests to packages/utils/src/errors.ts.
- Reverted the describeError addition from retryable-infrastructure.ts (it keeps
  only isRetryableInfrastructureError / describeRetryableInfrastructureError,
  which are accurately named and still used by the schedule retry path).
- Re-pointed all consumers (cell, logging-session, pause-persistence, schedule)
  to import describeError from @sim/utils/errors. The `retryable` classification
  flag still sources from isRetryableInfrastructureError where used.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 332223b. Configure here.

typeof value === 'string' ? value : undefined
const code = asString(deepest.code)
const errno = asString(deepest.errno)
const syscall = asString(deepest.syscall)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numeric errno omitted from logs

Low Severity

describeError only copies errno when it is a string, but Node SystemError values typically expose errno as a number. Structured logs can then omit errno even when the driver set it, weakening the diagnostics this change is meant to add.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 332223b. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant