
perf(tracing): raise span-queue batch defaults and make batch_size env-tunable #395

Open
NiteshDhanpal wants to merge 1 commit into main from span-queue-batch-tuning

NiteshDhanpal commented Jun 4, 2026

Summary

Tune the async span-queue's batching so high-volume span ingest sends fewer, fuller batches, and make the batch size configurable per-deploy.

Defaults raised (span_queue.py):

  • _DEFAULT_BATCH_SIZE 50 → 200
  • _DEFAULT_LINGER_MS 100 → 250

At 50 spans / 100ms, the queue ships many small upsert_batch PUTs — each a separate HTTP round trip and a separate INSERT ... ON CONFLICT statement on the backend, and (downstream) more, smaller ClickHouse parts to merge. Spans typically arrive a few ms apart, so a 100ms linger rarely fills a batch. Raising to 200 / 250ms lets batches fill before flushing, amortizing per-request and per-statement overhead. 200 stays well under the backend's 1000-row batch cap.
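
To make the amortization argument concrete, here is a rough sketch that counts flushes for an evenly spaced stream of spans. It assumes a simplified model that this PR does not confirm: a batch opens when its first span arrives and is flushed once it reaches `batch_size` or once `linger_ms` has elapsed since it opened. `count_flushes` and its parameters are illustrative names, not part of span_queue.py.

```python
def count_flushes(n_spans: int, gap_ms: float, batch_size: int, linger_ms: float) -> int:
    """Count flushes for n_spans arriving gap_ms apart (simplified size-or-linger model)."""
    flushes = 0
    in_batch = 0
    batch_opened_at = 0.0
    for i in range(n_spans):
        now = i * gap_ms
        # Assumption: the linger timer starts when the batch receives its first span.
        if in_batch and now - batch_opened_at >= linger_ms:
            flushes += 1
            in_batch = 0
        if in_batch == 0:
            batch_opened_at = now
        in_batch += 1
        # A full batch is flushed immediately, without waiting for the linger.
        if in_batch >= batch_size:
            flushes += 1
            in_batch = 0
    if in_batch:
        flushes += 1  # final partial batch
    return flushes


if __name__ == "__main__":
    for gap_ms in (5, 1):
        old = count_flushes(10_000, gap_ms, batch_size=50, linger_ms=100)
        new = count_flushes(10_000, gap_ms, batch_size=200, linger_ms=250)
        print(f"gap={gap_ms}ms: old defaults -> {old} flushes, new defaults -> {new} flushes")
```

Under this model, 10,000 spans arriving 5 ms apart take 500 flushes with the old defaults and 200 with the new ones; at 1 ms apart the counts are 200 and 50. Real traffic is burstier, so these figures only indicate the direction of the change.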

batch_size is now env-tunable. It was the only queue knob without an AGENTEX_SPAN_QUEUE_* override — every other parameter (linger_ms, max_size, max_retries, concurrency) already reads one. Added AGENTEX_SPAN_QUEUE_BATCH_SIZE with the same _read_int_env pattern, so the batch size can be tuned per-deploy without an SDK release.

Resolution order (matches the other knobs): explicit constructor arg > AGENTEX_SPAN_QUEUE_BATCH_SIZE env > default, clamped to a minimum of 1.

Trade-offs

Larger batches and a longer linger slightly increase worst-case in-memory dwell time and widen the loss window if a producer crashes before a flush (bounded by the linger interval and queue semantics). The new defaults still keep worst-case ingest latency sub-second.

Tests

TestAsyncSpanQueueBatchSizeConfig: default, explicit-arg override, min-1 clamp, env override, explicit-arg-beats-env, and invalid-env-falls-back-to-default. All 37 tests in test_span_queue.py pass; ruff check clean.

Note

Independent of #394 (skip span-start upsert) — that changes whether a write happens; this changes how writes are batched. Branched off main.

🤖 Generated with Claude Code

Greptile Summary

This PR raises the default batch_size (50→200) and linger_ms (100→250) for the async span queue to reduce the number of small upsert_batch HTTP round trips under high span volume. It also adds AGENTEX_SPAN_QUEUE_BATCH_SIZE, so batch_size is no longer the only knob without an AGENTEX_SPAN_QUEUE_* env override. The new override follows the same resolution pattern (explicit arg > env > default, clamped to a minimum of 1) used by every other parameter.

  • span_queue.py: Default constants raised with detailed rationale comments; batch_size constructor parameter changed from int = _DEFAULT_BATCH_SIZE to int | None = None to support env-driven resolution.
  • test_span_queue.py: New TestAsyncSpanQueueBatchSizeConfig class covers default, explicit override, min-1 clamp, env override, explicit-beats-env, and invalid-env-fallback scenarios, all consistent with patterns used for other queue knobs.

Confidence Score: 5/5

Safe to merge — the changes are limited to tuning constants and adding a missing env-var override that follows an established, well-tested pattern already used by every other queue knob.

Both changes are narrow and mechanical: the constant bumps are well-justified and bounded below the backend's documented 1000-row cap, and the new env-override path mirrors the identical pattern applied to linger_ms, max_retries, and concurrency. Six dedicated tests cover all resolution-order branches, and asyncio_mode=auto ensures they actually run.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| src/agentex/lib/core/tracing/span_queue.py | Default constants raised (batch_size 50→200, linger_ms 100→250); batch_size constructor parameter made nullable to support env-var resolution, fully consistent with the existing pattern for all other queue knobs. |
| tests/lib/core/tracing/test_span_queue.py | New TestAsyncSpanQueueBatchSizeConfig class with 6 focused tests; asyncio_mode=auto is configured so async def methods run correctly; covers all resolution-order branches for the new env knob. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[AsyncSpanQueue.__init__ called] --> B{batch_size arg provided?}
    B -- "None (default)" --> C["Read AGENTEX_SPAN_QUEUE_BATCH_SIZE env"]
    C --> D{Env var set?}
    D -- "Yes, valid int" --> E["max(1, int(env)) → _batch_size"]
    D -- "Yes, invalid" --> F["Log warning → _DEFAULT_BATCH_SIZE (200)"]
    D -- "Not set" --> G["_DEFAULT_BATCH_SIZE (200)"]
    B -- "Explicit int" --> H["max(1, batch_size) → _batch_size"]
    E --> Z[_batch_size resolved]
    F --> Z
    G --> Z
    H --> Z
    Z --> I["Drain loop uses _batch_size as batch fill cap"]

Reviews (1): Last reviewed commit: "perf(tracing): raise span-queue batch de..."

…v-tunable

The async span queue batched at 50 spans / 100ms linger. For high-volume
span ingest that means many small upsert_batch PUTs — each a separate HTTP
round trip and a separate INSERT statement on the backend. Raise the defaults
to 200 spans / 250ms so batches fill before flushing, amortizing the per-request
and per-statement overhead (still well under the backend's 1000-row cap).

Also make batch_size resolvable from AGENTEX_SPAN_QUEUE_BATCH_SIZE, matching the
existing env-override pattern for linger_ms / max_size / max_retries / concurrency
(batch_size was the only queue knob not tunable without an SDK release).

Resolution order: explicit arg > AGENTEX_SPAN_QUEUE_BATCH_SIZE env > default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot commented Jun 4, 2026

This PR is targeting main, but PRs should target the next branch by default.

The main branch is reserved for release-please and Stainless automation. To resolve, pick one of:

  • Re-target the PR to next (recommended). On the PR page, click Edit next to the title and change the base branch to next.
  • Add the target-main label if this is an intentional exception (e.g. an urgent hotfix). The check will re-run and pass.

See CONTRIBUTING.md for the full branch model.
