fix(core): define RunEvent schema and update ApiClient validation by deepshekhardas · Pull Request #3815 · triggerdotdev/trigger.dev

deepshekhardas · 2026-06-03T05:56:44Z

Add RunEvent, TaskEventLevel, ListRunEventsResponse schemas. Closes #3220

…t build server failures (triggerdotdev#2913)

)

- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913) - Include PR body drafts for consolidated tracking

When the underlying logical-replication client errored (e.g. after a Postgres failover), the runs and sessions replication services logged the error and left the stream stopped. The host process kept running, the WAL backed up, and ClickHouse silently fell behind. Both services now run a configurable recovery strategy on stream errors, defaulting to in-process reconnect with exponential backoff so a fresh self-hosted setup heals on its own: - "reconnect" (default) re-subscribes via the existing subscribe(lastLsn) path with exponential backoff (1s -> 60s cap, unlimited attempts), which re-validates the publication, re-acquires the leader lock, and resumes from the last acknowledged LSN. - "exit" calls process.exit after a short flush window so a host's supervisor (Docker restart=always, systemd, k8s, etc.) can replace the process. - "log" preserves the historical behaviour. Per-service strategy + exit knobs are env-driven via RUN_REPLICATION_ERROR_STRATEGY / SESSION_REPLICATION_ERROR_STRATEGY plus matching *_EXIT_DELAY_MS / *_EXIT_CODE. Reconnect tuning is shared across both services via REPLICATION_RECONNECT_INITIAL_DELAY_MS / _MAX_DELAY_MS / _MAX_ATTEMPTS (0 = unlimited).

Addresses PR review feedback: - LogicalReplicationClient.subscribe() can throw before its internal "error" listener is wired up (notably when pg client.connect() fails mid-failover). The reconnect strategy's catch block only logged, so recovery silently stopped. Now also calls scheduleReconnect(err) — the pendingReconnect guard makes it idempotent if an error event was also emitted. - Reject negative values for the new replication-recovery env vars and cap exit codes at 255. - Convert the new ReplicationErrorRecovery{Deps,} interfaces to type aliases to match the repo's TypeScript style. - Tighten the reconnect dep comment to drop a stale "lastAcknowledgedLsn" reference (the wrapper-tracked resume LSN is what callers actually pass). - Restore process.exit after service.shutdown() in the exit-strategy test so a delayed exit timer can't terminate the test worker.

LogicalReplicationClient.subscribe() can resolve without throwing or emitting an "error" event when leader-lock acquisition fails — it just calls this.stop() and returns. The reconnect callback now checks isStopped after subscribe() and throws so the recovery handler can schedule the next attempt instead of silently giving up.

…rough handle() The previous post-subscribe() isStopped check was always true on the happy path: subscribe() calls stop() up front (setting _isStopped=true) and only resets the flag inside the replicationStart event, which fires asynchronously after subscribe() returns. So the check threw on every successful reconnect, the catch rescheduled, the next attempt tore down the just-built client, and the cycle continued — replication briefly worked between teardowns, which is why the integration test passed. Replace it with the correct nudge: subscribe to leaderElection and call the recovery handler on isLeader=false. That's the only subscribe() exit path that doesn't either throw or emit an "error" event (the other silent-return paths emit "error" first via createPublication/createSlot failures).

The previous commit routed leaderElection(false) through handle(), which under the exit strategy schedules process.exit. In a multi-instance deployment that turns lost leader election — a normal operational state — into a restart loop: exit, supervisor restarts, election fails again, exit, and so on. Add a dedicated notifyLeaderElectionLost() on ReplicationErrorRecovery that the reconnect strategy treats as another retry trigger, while exit and log strategies no-op. Wire the wrapper services through the new method.

fix(webapp): auto-recover replication services after stream errors

…riggerdotdev#3220) - Add RunEvent, TaskEventLevel schemas to api.ts - Add ListRunEventsResponse schema - Add RunEvent import to ApiClient - Add changeset for define-runevent-schema Closes triggerdotdev#3220

changeset-bot · 2026-06-03T05:56:50Z

🦋 Changeset detected

Latest commit: ab7e522

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

github-actions · 2026-06-03T05:57:00Z

Hi @deepshekhardas, thanks for your interest in contributing!

This project requires that pull request authors are vouched, and you are not in the list of vouched users.

This PR will be closed automatically. See https://github.com/triggerdotdev/trigger.dev/blob/main/CONTRIBUTING.md for more details.

coderabbitai · 2026-06-03T05:57:00Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 47d26656-c410-4314-bd5d-7b4b0b2bf17b

📥 Commits

Reviewing files that changed from the base of the PR and between a9f756b and ab7e522.

📒 Files selected for processing (33)

.changeset/define-runevent-schema.md
.changeset/fix-console-interceptor-2900.md
.changeset/fix-docker-hub-rate-limit-2911.md
.changeset/fix-github-install-node-version-2913.md
.changeset/fix-orphaned-workers-2909.md
.changeset/fix-sentry-oom-2920.md
.server-changes/replication-error-recovery.md
apps/webapp/app/env.server.ts
apps/webapp/app/services/replicationErrorRecovery.server.ts
apps/webapp/app/services/runsReplicationInstance.server.ts
apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/app/services/sessionsReplicationInstance.server.ts
apps/webapp/app/services/sessionsReplicationService.server.ts
apps/webapp/test/runsReplicationService.errorRecovery.test.ts
consolidated_pr_body.md
packages/cli-v3/src/cli/common.ts
packages/cli-v3/src/commands/deploy.ts
packages/cli-v3/src/commands/dev.ts
packages/cli-v3/src/commands/login.ts
packages/cli-v3/src/commands/update.test.ts
packages/cli-v3/src/commands/update.ts
packages/cli-v3/src/deploy/buildImage.ts
packages/cli-v3/src/entryPoints/dev-index-worker.ts
packages/cli-v3/src/entryPoints/dev-run-worker.ts
packages/cli-v3/src/entryPoints/managed-index-worker.ts
packages/cli-v3/src/entryPoints/managed-run-worker.ts
packages/cli-v3/src/utilities/sourceMaps.test.ts
packages/cli-v3/src/utilities/sourceMaps.ts
packages/core/src/v3/apiClient/index.ts
packages/core/src/v3/consoleInterceptor.ts
packages/core/src/v3/realtimeStreams/streamsWriterV2.ts
packages/core/src/v3/schemas/api-type.test.ts
packages/core/src/v3/schemas/api.ts

Walkthrough

This PR introduces replication error recovery for logical streams with three configurable strategies (reconnect via exponential backoff, process exit for external supervision, or log-only), defines RunEvent schemas for API event streaming with validation, improves console log interception to preserve interceptor chains (e.g., Sentry), and implements multiple CLI stability fixes including signal-based process cleanup for the dev command, Docker Hub pre-build authentication, strict engine checks during deployment, and centralized source map support management across worker entrypoints.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

⚔️ Resolve merge conflicts

Resolve merge conflict in branch feat/run-event-schema

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

devin-ai-integration

Devin Review found 4 potential issues.

View 3 additional findings in Devin Review.

devin-ai-integration · 2026-06-03T06:00:43Z

@@ -260,8 +261,7 @@ export async function updateTriggerPackages(
    await installDependencies({ cwd: projectPath, silent: true });


🔴 ignoreEngines option is declared and passed but never used in installDependencies call

The ignoreEngines option is added to CommonCommandOptions and UpdateCommandOptions, and deploy.ts:262 passes ignoreEngines: true to updateTriggerPackages. However, the actual installDependencies call at packages/cli-v3/src/commands/update.ts:261 is still await installDependencies({ cwd: projectPath, silent: true }) — it never reads options.ignoreEngines and never passes the engine-ignoring args (e.g. --no-engine-strict for npm, --config.engine-strict=false for pnpm, --ignore-engines for yarn). The tests in update.test.ts assert that these args are passed, but the production code doesn't implement it, so the stated fix for issue #2913 (deployment failures due to Node version engine checks) is completely inert.

Prompt for agents

In packages/cli-v3/src/commands/update.ts, the `installDependencies` call on line 261 needs to be updated to pass engine-ignoring args when `options.ignoreEngines` is true. The package manager needs to be detected first (it already is, on line 254), and then the appropriate flag should be passed: - npm: args: ["--no-engine-strict"] - pnpm: args: ["--config.engine-strict=false"] - yarn: args: ["--ignore-engines"] - default: args: [] The logic should compute an `args` array based on `options.ignoreEngines` and `packageManager.name`, then pass it as `await installDependencies({ cwd: projectPath, silent: true, args })`. The tests in update.test.ts already expect this behavior. The `detectPackageManager` call on line 254 should be moved before the `installDependencies` call if it isn't already (it is), and the args should be constructed between lines 254 and 261.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-06-03T06:00:45Z

+        switch (severityNumber) {
+          case SeverityNumber.INFO:
+            this.originalConsole.log(...args);
+            break;
+          case SeverityNumber.WARN:
+            this.originalConsole.warn(...args);
+            break;
+          case SeverityNumber.ERROR:
+            this.originalConsole.error(...args);
+            break;
+          case SeverityNumber.DEBUG:
+            this.originalConsole.debug(...args);
+            break;
+          default:
+            this.originalConsole.log(...args);
+            break;
+        }


🔴 console.info() is routed to originalConsole.log() instead of originalConsole.info()

In the new ConsoleInterceptor delegation logic, both log() and info() use SeverityNumber.INFO. The switch statement at line 97-113 only has a single case SeverityNumber.INFO: which delegates to this.originalConsole.log(...). This means calls to console.info() are forwarded to the saved console.log instead of the saved console.info. This breaks the log chain for interceptors like Sentry that specifically hook console.info — the exact scenario this fix (#2900) is supposed to address. The info severity text ("Info") is distinct from log's ("Log"), so the method could differentiate, or separate severity numbers should be used.

Prompt for agents

In packages/core/src/v3/consoleInterceptor.ts, the switch statement in #handleLog (lines 97-113) dispatches based on SeverityNumber, but both log() and info() use SeverityNumber.INFO. This means console.info() calls get routed through originalConsole.log() instead of originalConsole.info(), defeating the purpose of preserving the log chain for Sentry. The fix needs to differentiate between log and info calls. Options: 1. Pass the severityText along and use it to disambiguate within the SeverityNumber.INFO case (e.g., if severityText === 'Info' use originalConsole.info, otherwise originalConsole.log). 2. Use a different approach like passing the original method name through #handleLog. The key insight is that SeverityNumber is not granular enough to distinguish log from info — both map to SeverityNumber.INFO — so additional context is needed.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-06-03T06:00:46Z

  const [readSessionError, readSession] = await tryCatch(
    stream.readSession(
      {
-        start: { from: { seqNum: 0 }, clamp: true },
-        stop: { waitSecs: 60 * 20 }, // 20 minutes
+        start: {
+          from: { seqNum: 0 },
+        },
      },
      { signal: abortController.signal }
    )


🚩 S2 readSession: clamp and stop options removed silently

The readSession call in deploy.ts:1148-1157 previously used clamp: true and stop: { waitSecs: 60 * 20 } (20-minute timeout). Both were removed. The clamp option told S2 to clamp the start position to the earliest available sequence number if it no longer exists; without it, reading from seqNum 0 on a trimmed stream may error. The stop.waitSecs provided a hard timeout so the CLI wouldn't hang forever if the build server never emits a finalized event. Without it, the read session relies solely on the abortController.abort() in the finalized handler to terminate. If the server never sends a finalized event (e.g. crash), the CLI could hang indefinitely. This may be intentional if the S2 SDK now handles these defaults, but it warrants verification.

(Refers to lines 1148-1157)

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-06-03T06:00:47Z

+  runId: z.string(),
+  message: z.string(),
+  style: TaskEventStyle,
+  startTime: z.coerce.date(),


🚩 RunEvent schema uses z.coerce.date() which accepts any numeric string as a date

The RunEvent schema at api.ts:2005 uses z.coerce.date() for startTime. While the tests demonstrate this works for ISO strings, Date objects, and nanosecond strings, z.coerce.date() also accepts arbitrary numeric strings (e.g. '42' becomes a valid Date). The test at line 192-197 in api-type.test.ts validates 19-digit nanosecond strings, but z.coerce.date() would interpret '1710374400000000000' as year 1710374400000000000 via new Date('1710374400000000000') which yields Invalid Date in most JS engines — the test may actually fail. The test claims it should produce 2024-03-14T00:00:00.000Z but z.coerce.date() doesn't have special nanosecond parsing logic. This should be verified.

Was this helpful? React with 👍 or 👎 to provide feedback.

Deploy Bot and others added 20 commits February 2, 2026 16:16

fix(cli-v3): allow disabling source-map-support to prevent OOM with S…

ed41f0a

…entry (triggerdotdev#2920)

fix(cli-v3): ignore engine checks during deployment install to preven…

023c3fd

…t build server failures (triggerdotdev#2913)

fix(core): delegate to original console in ConsoleInterceptor to pres…

93aa053

…erve log chain (triggerdotdev#2900)

fix(cli-v3): authenticate to Docker Hub to prevent rate limits (trigg…

8b684e1

…erdotdev#2911)

fix(cli-v3): ensure worker cleanup on SIGINT/SIGTERM (triggerdotdev#2909

737ad56

)

verify: add reproduction scripts and PR details for all major fixes

c97cbcc

- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913) - Include PR body drafts for consolidated tracking

verify: add reproduction scripts and PR details for all major fixes

aa90db9

- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913) - Include PR body drafts for consolidated tracking

docs: add consolidated PR body description

f5ce2bc

chore: remove reproduction scripts and temporary files

8c986db

Merge remote-tracking branch 'remotes/origin/fix/sentry-oom-2920'

82f198f

Merge branch 'fix/issue-2909-orphaned-workers'

9a3e8d0

chore: remove reproduction scripts after verification

e101f8e

fix: resolve typecheck errors after merge

d01d438

Merge pull request #10 from deepshekhardas/pr/3613-replication-fix

d35bf04

fix(webapp): auto-recover replication services after stream errors

fix(core): define RunEvent schema and update ApiClient validation (PR t…

ab7e522

…riggerdotdev#3220) - Add RunEvent, TaskEventLevel schemas to api.ts - Add ListRunEventsResponse schema - Add RunEvent import to ApiClient - Add changeset for define-runevent-schema Closes triggerdotdev#3220

github-actions Bot closed this Jun 3, 2026

devin-ai-integration Bot reviewed Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(core): define RunEvent schema and update ApiClient validation#3815

fix(core): define RunEvent schema and update ApiClient validation#3815
deepshekhardas wants to merge 20 commits into
triggerdotdev:mainfrom
deepshekhardas:feat/run-event-schema

deepshekhardas commented Jun 3, 2026

Uh oh!

changeset-bot Bot commented Jun 3, 2026

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading

Review failed

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot Jun 3, 2026

Uh oh!

devin-ai-integration Bot Jun 3, 2026

Uh oh!

devin-ai-integration Bot Jun 3, 2026

Uh oh!

devin-ai-integration Bot Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -260,8 +261,7 @@ export async function updateTriggerPackages(
		await installDependencies({ cwd: projectPath, silent: true });

Uh oh!

Conversation

deepshekhardas commented Jun 3, 2026

Uh oh!

changeset-bot Bot commented Jun 3, 2026

🦋 Changeset detected

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

coderabbitai Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Estimated code review effort

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading