
perf(tracing): skip span-start upsert by default (end-only ingest) #394

Draft
NiteshDhanpal wants to merge 2 commits into main from tracing-skip-span-start


NiteshDhanpal commented Jun 4, 2026

Summary

Move the "end-only span ingest" optimization (originally done on the EGP backend in scaleapi#145863) into the SDK producer, so the wasted span-start write is never sent on the wire in the first place.

Tracing writes each span twice — once on start (no end_time) and once on end. The start row is only ever overwritten by the end write moments later, so persisting it:

  • doubles span-ingest write volume (HTTP + DB), and
  • on the SGP backend costs a non-HOT UPDATE per span (the joined_data_tsvector GIN index is recomputed when output arrives on the end write) plus a dead tuple.

This change makes SGPSyncTracingProcessor / SGPAsyncTracingProcessor skip the span-start upsert, so each span is persisted once, on end (a single INSERT server-side). Doing it in the SDK also eliminates the wasted start HTTP call entirely — not just the DB write.

Behavior

  • Default ON (skip span-start). on_span_start (sync) and on_spans_start (async) become no-ops by default.
  • Set AGENTEX_TRACING_SKIP_SPAN_START=0 (false/no/off also work) to restore the start write — e.g. if you need in-flight spans visible before they complete, or spans that never end (process crash) to still be persisted.
  • on_span_end / on_spans_end are unchanged; the end write already carries the full span (start_time + end_time + input/output), so nothing is lost in the default path for spans that complete normally.

Parsing mirrors the SDK's existing AGENTEX_TRACING_METRICS convention (raw not in ("0","false","no","off")).

Trade-offs

  • In-flight spans are not visible until they complete.
  • Spans that never end (e.g. a crash before on_span_end) are not persisted.

Both are reversible per-deployment via the env var.
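
To make the default path and both trade-offs concrete, here is a toy model of the guard. `Span`, `FakeBackend`, `ToyProcessor`, and `simulate` are hypothetical stand-ins written for illustration; they are not the real SGPSyncTracingProcessor or its client, and the env parsing is the same assumed rule as in the description.

```python
import os
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

ENV_VAR = "AGENTEX_TRACING_SKIP_SPAN_START"


def skip_span_start_enabled() -> bool:
    # Assumed parsing: unset means skip; 0/false/no/off restore the start write.
    raw = os.environ.get(ENV_VAR, "1")
    return raw.strip().lower() not in ("0", "false", "no", "off")


@dataclass
class Span:
    id: str
    start_time: float
    end_time: Optional[float] = None


@dataclass
class FakeBackend:
    """Hypothetical backend that records (span id, has_end_time) per write."""
    writes: List[Tuple[str, bool]] = field(default_factory=list)

    def upsert(self, span: Span) -> None:
        self.writes.append((span.id, span.end_time is not None))


class ToyProcessor:
    def __init__(self, backend: FakeBackend) -> None:
        self.backend = backend

    def on_span_start(self, span: Span) -> None:
        if skip_span_start_enabled():
            return  # early return: nothing is sent for the start event
        self.backend.upsert(span)

    def on_span_end(self, span: Span) -> None:
        # Unchanged path: the end write carries the full span.
        self.backend.upsert(span)


def simulate(finish: bool) -> List[Tuple[str, bool]]:
    """Run one span through the processor; finish=False models a crash."""
    backend = FakeBackend()
    processor = ToyProcessor(backend)
    span = Span(id="s1", start_time=0.0)
    processor.on_span_start(span)
    if finish:
        span.end_time = 1.0
        processor.on_span_end(span)
    return backend.writes
```

With the variable unset, `simulate(True)` records one write and `simulate(False)` records none, which is the "never-ending span is not persisted" trade-off. With the variable set to 0, the completed span produces two writes and the crashed span still leaves its start row.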

Tests

  • New: on_span_start/on_spans_start are no-ops by default; emit when skip is disabled; _skip_span_start_enabled() env parsing (default + falsy/other values).
  • Updated the existing two-write lifecycle / start-batching / start-metrics tests to set AGENTEX_TRACING_SKIP_SPAN_START=0 so they continue to exercise the start path.
  • All 31 tests in test_sgp_tracing_processor.py pass; ruff check clean.
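
The test pattern described above could look roughly like the following pytest-style sketch. It is not copied from test_sgp_tracing_processor.py: `RecordingProcessor` and the test names are hypothetical, and the helper repeats the assumed parsing rule. The only pytest feature used is the built-in `monkeypatch` fixture (`setenv` / `delenv`).

```python
import os

ENV_VAR = "AGENTEX_TRACING_SKIP_SPAN_START"


def skip_span_start_enabled() -> bool:
    # Assumed parsing rule, mirroring the PR description.
    raw = os.environ.get(ENV_VAR, "1")
    return raw.strip().lower() not in ("0", "false", "no", "off")


class RecordingProcessor:
    """Hypothetical processor that counts start writes instead of sending them."""

    def __init__(self) -> None:
        self.start_writes = 0

    def on_span_start(self, span: object) -> None:
        if skip_span_start_enabled():
            return
        self.start_writes += 1


def test_on_span_start_is_noop_by_default(monkeypatch):
    monkeypatch.delenv(ENV_VAR, raising=False)
    processor = RecordingProcessor()
    processor.on_span_start(object())
    assert processor.start_writes == 0


def test_on_span_start_emits_when_skip_disabled(monkeypatch):
    # Same opt-out the backfilled lifecycle tests use to reach the start path.
    monkeypatch.setenv(ENV_VAR, "0")
    processor = RecordingProcessor()
    processor.on_span_start(object())
    assert processor.start_writes == 1


def test_env_parsing_falsy_values_disable_skip(monkeypatch):
    for raw in ("0", "false", "no", "off"):
        monkeypatch.setenv(ENV_VAR, raw)
        assert skip_span_start_enabled() is False
```

Because `monkeypatch` restores the environment after each test, the opt-out set in one test does not leak into tests that rely on the default.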

Note

Branched off v0.11.8 per request; rebased onto main (0.12.0) for a clean PR — the processor file and its tests are byte-identical between the tag and main, so the code change is exactly the same.

🤖 Generated with Claude Code

Greptile Summary

This PR moves the "end-only span ingest" optimization into the SDK producer itself, so the wasted span-start HTTP write is never sent on the wire. Each span is now persisted exactly once — on end — carrying the full payload (start_time, end_time, input, output), eliminating the non-HOT UPDATE / dead-tuple cost on the SGP backend.

  • _skip_span_start_enabled() reads AGENTEX_TRACING_SKIP_SPAN_START on every call (mirroring the existing AGENTEX_TRACING_METRICS convention), defaults to ON, and can be reverted per-deployment by setting the var to 0/false/no/off.
  • Both SGPSyncTracingProcessor.on_span_start and SGPAsyncTracingProcessor.on_spans_start gain an early-return guard; on_span_end / on_spans_end are untouched.
  • 6 new tests cover the default-skip and opt-in paths, plus env-var parsing edge cases; 5 existing tests are backfilled with AGENTEX_TRACING_SKIP_SPAN_START=0 to keep exercising the start-write path.

Confidence Score: 5/5

Safe to merge — the change is a simple early-return guard on the span-start path with no effect on span-end writes, and the opt-out env var is documented and tested.

The implementation is minimal: two early-return guards, one helper function reading an env var, and startup log lines. The end-write path (which carries the full span payload) is completely unchanged. Tests cover the default behavior, the opt-in restore path, and env-var parsing including case/whitespace variants. The existing cross-pod (end-without-start) and batching tests continue to pass. There are no data-loss risks for spans that complete normally, and the trade-offs for in-flight or crashed spans are documented and reversible.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/core/tracing/processors/sgp_tracing_processor.py | Adds _skip_span_start_enabled() env-var guard (default ON) that makes on_span_start / on_spans_start early-return no-ops; on_span_end / on_spans_end are unchanged. Startup log added to both processor __init__s. |
| tests/lib/core/tracing/processors/test_sgp_tracing_processor.py | Adds 6 new tests covering default-skip, explicit-enable, and env-var parsing; backfills monkeypatch.setenv("AGENTEX_TRACING_SKIP_SPAN_START", "0") on the 5 existing tests that exercise the start-write path. All 31 tests remain valid. |

Sequence Diagram

sequenceDiagram
    participant SDK as SDK Processor
    participant Env as os.environ
    participant SGP as SGP Backend

    Note over SDK: Span starts

    SDK->>Env: _skip_span_start_enabled()?
    alt AGENTEX_TRACING_SKIP_SPAN_START unset or truthy (default)
        Env-->>SDK: True — early return (no-op)
        Note over SGP: No HTTP write for span start
    else "AGENTEX_TRACING_SKIP_SPAN_START=0/false/no/off"
        Env-->>SDK: False — proceed
        SDK->>SGP: upsert span (no end_time)
    end

    Note over SDK: Span ends (always)
    SDK->>SGP: upsert span (start_time + end_time + input/output)
    Note over SGP: Single INSERT (clean row, no dead tuple)

Reviews (2): Last reviewed commit: "observability: log resolved span-start m..."

Tracing writes each span twice — once on start (no end_time) and once on
end — so the start row is only ever overwritten by the end write moments
later. Persisting it doubles span-ingest write volume and, on the SGP
backend, costs a non-HOT UPDATE (tsvector/GIN recompute + index churn) plus
a dead tuple per span.

Skip the span-start upsert by default so each span is persisted once, on end
(a single INSERT). Set AGENTEX_TRACING_SKIP_SPAN_START=0/false/no/off to
restore the start write when in-flight or never-ending spans must be visible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot commented Jun 4, 2026

This PR is targeting main, but PRs should target the next branch by default.

The main branch is reserved for release-please and Stainless automation. To resolve, pick one of:

  • Re-target the PR to next (recommended). On the PR page, click Edit next to the title and change the base branch to next.
  • Add the target-main label if this is an intentional exception (e.g. an urgent hotfix). The check will re-run and pass.

See CONTRIBUTING.md for the full branch model.

The end-only skip is governed by AGENTEX_TRACING_SKIP_SPAN_START (default ON)
but was silent — an operator could only infer it from the absence of
start-export metrics. Emit a one-time INFO at processor init stating whether
span-start upsert is enabled or skipped, so the deployment's tracing mode is
visible in logs. Off the hot path (once per construction).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
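
The startup log from the commit above might be wired in roughly as follows. This is a sketch under assumptions: the logger name, the message wording, and `SketchProcessor` are all hypothetical, and only the idea of one INFO line per construction is taken from the commit message.

```python
import logging
import os

ENV_VAR = "AGENTEX_TRACING_SKIP_SPAN_START"
logger = logging.getLogger("tracing_sketch")  # hypothetical logger name


def _skip_span_start_enabled() -> bool:
    # Assumed parsing rule, mirroring the PR description.
    raw = os.environ.get(ENV_VAR, "1")
    return raw.strip().lower() not in ("0", "false", "no", "off")


class SketchProcessor:
    """Hypothetical processor that reports its span-start mode at init."""

    def __init__(self) -> None:
        # Logged once per construction, so it stays off the per-span hot path.
        if _skip_span_start_enabled():
            logger.info(
                "span-start upsert skipped (end-only ingest); set %s=0 to restore",
                ENV_VAR,
            )
        else:
            logger.info("span-start upsert enabled (%s is falsy)", ENV_VAR)
```

Logging the resolved mode rather than the raw variable means an operator sees the effective behavior even when the variable is unset.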