add(ci): benchmark real-world migration workloads #107

Merged: mrjf merged 7 commits into main from codex/real-world-migration-benchmarks on Jun 4, 2026

@mrjf mrjf commented Jun 4, 2026

add(ci): benchmark real-world migration workloads

TL;DR

This replaces the migration benchmark’s help-heavy command set with startup baselines plus fixture-backed APM workloads. The new harness synthesizes realistic offline project state, runs both Python and Go against the same command matrix, and reports fixture/workload metadata alongside median timing and return-code parity.

Note

This remains intentionally bounded: it checks performance and return-code parity for offline fixtures, not full stdout parity or live network integration behavior.

Problem (WHY)

  • The previous benchmark mostly measured cold startup and help/version rendering, so it could not show whether migrated commands still behave quickly when reading APM project state.
  • The Markdown artifact did not name the fixture or workload behind each row, which made the result easy to overread as broader workflow coverage.
  • [!] A realistic benchmark still needs to stay deterministic and CI-safe, so live package downloads and network-backed install paths are out of scope for this gate.

Why these matter: benchmark evidence should be grounded in repeatable execution, because “Grounding outputs in deterministic tool execution transforms probabilistic generation into verifiable action.” The fixture scope is deliberately bounded because “Context arrives just-in-time, not just-in-case.”

Approach (WHAT)

| # | Fix | Principle |
|---|---|---|
| 1 | Replace the tuple command matrix with typed benchmark commands carrying fixture and workload metadata. | “agents pattern-match well against concrete structures” |
| 2 | Generate an installed-project fixture per sample with manifest, lockfile, installed packages, local primitives, target directories, deployed prompts, and source files. | “Grounding outputs in deterministic tool execution transforms probabilistic generation into verifiable action.” |
| 3 | Keep return-code and ratio gates intact while making the Markdown/JSON artifacts explain what each row exercised. | “Add what the agent lacks, omit what it knows” |
| 4 | Update README language so it describes fixture-backed benchmark evidence instead of startup/help-only smoke evidence. | “Cite-or-omit.” |
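
For readers who have not opened the diff, fix 1 can be pictured roughly as below. This is a minimal sketch, not the code in `scripts/ci/migration_cli_benchmark.py`: the PR only names a `BenchmarkCommand` model, so the field names, the `COMMANDS` tuple, and the `markdown_row` helper are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCommand:
    """One benchmark row. Field names are hypothetical, not taken from the PR diff."""

    name: str              # label shown in the Benchmark column
    args: tuple[str, ...]  # CLI arguments passed to both the Python and Go binaries
    fixture: str           # fixture the command runs against, or "none"
    workload: str          # one-line description of what the command exercises


# Two rows copied from the report in this PR; the real matrix is larger.
COMMANDS = (
    BenchmarkCommand(
        "startup help", ("--help",), "none",
        "Cold CLI startup and top-level help rendering.",
    ),
    BenchmarkCommand(
        "deps list", ("deps", "list"), "installed-project",
        "Scans apm_modules package directories and apm.lock.yaml metadata.",
    ),
)


def markdown_row(cmd: BenchmarkCommand) -> str:
    """Render the metadata columns of a report row (timings omitted)."""
    return f"| {cmd.name} | `{' '.join(cmd.args)}` | {cmd.fixture} |"


for cmd in COMMANDS:
    print(markdown_row(cmd))
```

Compared with a bare tuple of arguments, a typed record keeps the fixture and workload text next to the command, so the report can name them without a second lookup table.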

Implementation (HOW)

  • scripts/ci/migration_cli_benchmark.py — Adds a BenchmarkCommand model, fixture writers for empty and installed projects, and a broader command matrix covering startup, init, targets, list, deps, install --dry-run, compile --dry-run, pack --dry-run, and audit --file. The report now includes fixture names, workload descriptions, and an explicit note that the benchmark is not stdout/stderr parity.
  • README.md — Replaces the stale startup/help speed claim with bounded wording that names the fixture-backed project state now used by the workflow artifact.
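
As a rough picture of what a fixture writer for the installed-project layout might look like, here is a minimal sketch. The function name, package names, and all file contents are placeholders assumed for illustration; only the paths `apm.yml`, `apm.lock.yaml`, `apm_modules`, and `.apm/instructions/bench-00.instructions.md` come from this PR, and the real APM file schemas are not reproduced.

```python
import tempfile
from pathlib import Path


def write_installed_project(root: Path, packages: int = 2) -> Path:
    """Write a small offline project tree (hypothetical helper, placeholder contents)."""
    # Manifest and lockfile: contents are stand-ins, not the real APM schema.
    (root / "apm.yml").write_text("name: bench-project\n", encoding="utf-8")
    (root / "apm.lock.yaml").write_text("dependencies: []\n", encoding="utf-8")

    # Installed packages under apm_modules (package names are made up).
    for index in range(packages):
        pkg = root / "apm_modules" / f"bench-pkg-{index:02d}"
        pkg.mkdir(parents=True)
        (pkg / "apm.yml").write_text(f"name: bench-pkg-{index:02d}\n", encoding="utf-8")

    # Local primitive matching the path used by the audit benchmark row.
    instructions = root / ".apm" / "instructions"
    instructions.mkdir(parents=True)
    (instructions / "bench-00.instructions.md").write_text(
        "# Bench instruction\n", encoding="utf-8"
    )

    # A sample source file so the tree is not manifest-only.
    (root / "src").mkdir()
    (root / "src" / "main.py").write_text("print('bench')\n", encoding="utf-8")
    return root


with tempfile.TemporaryDirectory() as tmp:
    project = write_installed_project(Path(tmp))
    print(sorted(p.relative_to(project).as_posix() for p in project.rglob("*") if p.is_file()))
```

Writing the tree into a fresh temporary directory per sample keeps one run's side effects from leaking into the next.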

Diagrams

No diagram is included. The change is a linear two-file benchmark/report update, and I avoided shipping unvalidated Mermaid after the local mmdc package startup did not complete in this environment.

Trade-offs

  • Offline fixtures over live installs. Chose deterministic fixture-backed commands; rejected live package downloads because the CI benchmark should not depend on network availability or upstream repositories.
  • Return-code parity over stdout parity. Preserved the existing gate shape; detailed byte counts remain in JSON, but exact output comparison belongs in parity tests.
  • Generated reports stay untracked. The benchmark writes Markdown/JSON evidence under tmp/ locally or runner temp in CI; those artifacts are not committed.
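
To make the return-code trade-off concrete, a parity check of this kind could look like the sketch below. It assumes the report's `{'python': [...], 'go': [...]}` column holds the sorted distinct exit codes seen across repeats; the helper names are hypothetical and the harness may compute this differently.

```python
def summarize_return_codes(samples: dict[str, list[int]]) -> dict[str, list[int]]:
    """Collapse per-run exit codes into sorted distinct values per implementation."""
    return {impl: sorted(set(codes)) for impl, codes in samples.items()}


def has_parity(summary: dict[str, list[int]]) -> bool:
    """True when Python and Go produced the same set of exit codes."""
    return summary["python"] == summary["go"]


# Hypothetical samples: five repeats per implementation.
matching = summarize_return_codes({"python": [0, 0, 0, 0, 0], "go": [0, 0, 0, 0, 0]})
diverging = summarize_return_codes({"python": [2, 2, 2, 2, 2], "go": [0, 0, 0, 0, 0]})

print(matching, has_parity(matching))
print(diverging, has_parity(diverging))
```

Nothing here inspects stdout or stderr, which is the point of the trade-off: exit codes are cheap to compare, while byte-level output comparison is left to the parity tests.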

Benefits

  1. The migration benchmark now exercises 11 commands: 2 startup baselines, `init` against the empty-project fixture, and 8 workflows against the installed-project fixture.
  2. Each benchmark row names the fixture and workload, reducing ambiguity in job summaries and PR comments.
  3. The installed-project fixture covers apm.yml, apm.lock.yaml, apm_modules, .apm primitives, target directories, deployed prompt files, and source files.
  4. README benchmark wording now matches the evidence the workflow uploads.

Validation

`python3 -m py_compile scripts/ci/migration_cli_benchmark.py`:

<no output; exit 0>

`.venv/bin/ruff check scripts/ci/migration_cli_benchmark.py`:

All checks passed!

`go build -o ./dist/apm-go ./cmd/apm`:

<no output; exit 0>

`git diff --check`:

<no output; exit 0>

Five-repeat migration benchmark output:

## Migration CLI Benchmark

Includes startup baselines plus fixture-backed real-world commands. The installed-project fixture contains apm.yml, apm.lock.yaml, apm_modules packages, local .apm primitives, target directories, deployed prompt files, and sample source files.
The harness checks return-code parity for each command. Detailed stdout/stderr byte counts are kept in the JSON samples, but this is not an output-parity test.

Max allowed Go/Python median ratio: `5.00`

| Benchmark | Command | Fixture | Python median | Go median | Go/Python | Result | Return codes |
|---|---|---|---:|---:|---:|---|---|
| startup help | `--help` | none | 0.6356s | 0.0084s | 0.01x | 76.02x faster | {'python': [0], 'go': [0]} |
| startup version | `--version` | none | 0.5809s | 0.0069s | 0.01x | 84.36x faster | {'python': [0], 'go': [0]} |
| init scaffold | `init --yes` | empty-project | 0.5745s | 0.0067s | 0.01x | 85.96x faster | {'python': [0], 'go': [0]} |
| targets json | `targets --json` | installed-project | 0.5276s | 0.0093s | 0.02x | 56.53x faster | {'python': [0], 'go': [0]} |
| script list | `list` | installed-project | 0.5148s | 0.0133s | 0.03x | 38.74x faster | {'python': [0], 'go': [0]} |
| deps list | `deps list` | installed-project | 0.5947s | 0.0078s | 0.01x | 76.47x faster | {'python': [0], 'go': [0]} |
| deps tree | `deps tree` | installed-project | 0.6075s | 0.0185s | 0.03x | 32.86x faster | {'python': [0], 'go': [0]} |
| install dry-run | `install --dry-run --no-policy` | installed-project | 0.6683s | 0.0142s | 0.02x | 46.90x faster | {'python': [0], 'go': [0]} |
| compile dry-run | `compile --dry-run --all --local-only` | installed-project | 0.6154s | 0.0079s | 0.01x | 77.73x faster | {'python': [0], 'go': [0]} |
| pack dry-run | `pack --dry-run --offline --marketplace none` | installed-project | 0.5261s | 0.0080s | 0.02x | 65.45x faster | {'python': [0], 'go': [0]} |
| audit file scan | `audit --file .apm/instructions/bench-00.instructions.md` | installed-project | 0.6463s | 0.0192s | 0.03x | 33.67x faster | {'python': [0], 'go': [0]} |
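
The Go/Python and Result columns can be derived from the two medians. The sketch below shows one way to do that under stated assumptions: the function name and the returned keys are hypothetical, and the "slower" wording is a guess, since the report in this PR only ever shows "faster".

```python
from statistics import median

MAX_RATIO = 5.00  # "Max allowed Go/Python median ratio" from the report


def compare(python_samples: list[float], go_samples: list[float],
            max_ratio: float = MAX_RATIO) -> dict[str, object]:
    """Compute median timings, the Go/Python ratio, and whether the gate holds."""
    py_median = median(python_samples)
    go_median = median(go_samples)
    ratio = go_median / py_median
    if ratio <= 1:
        result = f"{py_median / go_median:.2f}x faster"
    else:
        result = f"{ratio:.2f}x slower"  # assumed wording for the slow case
    return {
        "python_median": py_median,
        "go_median": go_median,
        "ratio": ratio,
        "result": result,
        "within_gate": ratio <= max_ratio,
    }


# Hypothetical timings in seconds, three repeats each.
print(compare([2.0, 1.0, 3.0], [0.5, 0.25, 1.0]))
```

Because the speedups in the table are computed from unrounded medians, dividing the rounded values shown in a row will not always reproduce the printed figure exactly.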

Scenario Evidence

| # | Scenario (user promise) | Principle(s) | Test(s) proving it | Type |
|---|---|---|---|---|
| 1 | Maintainers can compare Python and Go CLI latency on fixture-backed APM project commands, not only help/version paths. | DevX, Governed by policy | `scripts/ci/migration_cli_benchmark.py --repeats 5` | e2e |
| 2 | Benchmark artifacts explain which fixture and workload each timing row represents. | DevX, OSS / community-driven | `tmp/migration-cli-benchmark.md` generated by `scripts/ci/migration_cli_benchmark.py` | e2e |
| 3 | README benchmark guidance matches the evidence uploaded by the workflow. | OSS / community-driven | README diff plus generated benchmark artifact | docs |

How to test

  • Run `go build -o ./dist/apm-go ./cmd/apm` and expect the Go binary to build successfully.
  • Run the five-repeat benchmark (`scripts/ci/migration_cli_benchmark.py --repeats 5`) and expect the Python and Go return-code sets to match for every command.
  • Open `tmp/migration-cli-benchmark.md` and expect each row to include Benchmark, Command, and Fixture columns, with a matching entry in the Workloads list.
  • Confirm `README.md` describes fixture-backed benchmark coverage rather than startup/help-only smoke coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot commented Jun 4, 2026

Migration Benchmark Results

Migration CLI Benchmark

Includes startup baselines plus fixture-backed real-world commands. The installed-project fixture contains apm.yml, apm.lock.yaml, apm_modules packages, local .apm primitives, target directories, deployed prompt files, and sample source files.
The harness checks return-code parity for each command. Detailed stdout/stderr byte counts are kept in the JSON samples, but this is not an output-parity test.

Max allowed Go/Python median ratio: 5.00

| Benchmark | Command | Fixture | Python median | Go median | Go/Python | Result | Return codes |
|---|---|---|---:|---:|---:|---|---|
| startup help | `--help` | none | 0.4419s | 0.0012s | 0.00x | 376.47x faster | {'python': [0], 'go': [0]} |
| startup version | `--version` | none | 0.4410s | 0.0012s | 0.00x | 371.55x faster | {'python': [0], 'go': [0]} |
| init scaffold | `init --yes` | empty-project | 0.4466s | 0.0013s | 0.00x | 339.83x faster | {'python': [0], 'go': [0]} |
| targets json | `targets --json` | installed-project | 0.4351s | 0.0012s | 0.00x | 350.54x faster | {'python': [0], 'go': [0]} |
| script list | `list` | installed-project | 0.4310s | 0.0013s | 0.00x | 339.01x faster | {'python': [0], 'go': [0]} |
| deps list | `deps list` | installed-project | 0.4388s | 0.0013s | 0.00x | 328.39x faster | {'python': [0], 'go': [0]} |
| deps tree | `deps tree` | installed-project | 0.4350s | 0.0013s | 0.00x | 332.90x faster | {'python': [0], 'go': [0]} |
| install dry-run | `install --dry-run --no-policy` | installed-project | 0.4381s | 0.0012s | 0.00x | 351.83x faster | {'python': [0], 'go': [0]} |
| compile dry-run | `compile --dry-run --all --local-only` | installed-project | 0.5402s | 0.0013s | 0.00x | 427.15x faster | {'python': [0], 'go': [0]} |
| pack dry-run | `pack --dry-run --offline --marketplace none` | installed-project | 0.4381s | 0.0012s | 0.00x | 353.87x faster | {'python': [0], 'go': [0]} |
| audit file scan | `audit --file .apm/instructions/bench-00.instructions.md` | installed-project | 0.4261s | 0.0012s | 0.00x | 344.92x faster | {'python': [0], 'go': [0]} |

Workloads

  • startup help: Cold CLI startup and top-level help rendering.
  • startup version: Cold CLI startup and version rendering.
  • init scaffold: Creates a new apm.yml in an otherwise empty project directory.
  • targets json: Reads configured project targets from apm.yml and emits machine output.
  • script list: Reads apm.yml scripts and renders the runnable script inventory.
  • deps list: Scans apm_modules package directories and apm.lock.yaml metadata.
  • deps tree: Builds a dependency tree from apm.lock.yaml and installed package metadata.
  • install dry-run: Builds an offline install preview from manifest dependencies.
  • compile dry-run: Discovers local primitives and plans compilation for all targets without writes.
  • pack dry-run: Resolves local package contents and bundle metadata without writing artifacts.
  • audit file scan: Scans a real prompt instruction file for hidden Unicode content.
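
For context on the last workload, a hidden-Unicode scan can be approximated as below. This is only a sketch: the PR does not say which characters `audit` flags, so treating every Unicode format character (general category `Cf`, which includes zero-width spaces and bidirectional overrides) as hidden is an assumption, and the function name is hypothetical.

```python
import unicodedata


def find_hidden_unicode(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, code point) for each format character in the text.

    Assumption: "hidden" means Unicode general category Cf; the real audit
    command may use a different or broader rule.
    """
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, char in enumerate(line, start=1):
            if unicodedata.category(char) == "Cf":
                findings.append((line_no, col, f"U+{ord(char):04X}"))
    return findings


# A zero-width space planted in the second line of a hypothetical file.
sample = "ok\nhid\u200bden\n"
print(find_hidden_unicode(sample))
```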

github-actions Bot commented Jun 4, 2026

Migration Benchmark Results

Migration CLI Benchmark

Includes fixture-backed commands that must read, write, execute, or fail against real project state. The installed-project fixture contains apm.yml, apm.lock.yaml, apm_modules packages, local .apm primitives, target directories, deployed prompt files, and sample source files.
The harness checks return-code parity for each command. Detailed stdout/stderr byte counts are kept in the JSON samples, but this is not an output-parity test.

Max allowed Go/Python median ratio: 5.00

| Benchmark | Command | Fixture | Python median | Go median | Go/Python | Result | Return codes |
|---|---|---|---:|---:|---:|---|---|
| init scaffold | `init --yes` | empty-project | 0.4463s | 0.0013s | 0.00x | 332.55x faster | {'python': [0], 'go': [0]} |
| targets json | `targets --json` | installed-project | 0.4359s | 0.0013s | 0.00x | 323.48x faster | {'python': [0], 'go': [0]} |
| script list | `list` | installed-project | 0.4429s | 0.0014s | 0.00x | 317.22x faster | {'python': [0], 'go': [0]} |
| deps list | `deps list` | installed-project | 0.4471s | 0.0014s | 0.00x | 326.95x faster | {'python': [0], 'go': [0]} |
| deps tree | `deps tree` | installed-project | 0.4412s | 0.0014s | 0.00x | 312.35x faster | {'python': [0], 'go': [0]} |
| install local package | `install --no-policy ./packages/local-tools` | local-install-project | 0.4847s | 0.0013s | 0.00x | 363.70x faster | {'python': [0], 'go': [0]} |
| compile copilot target | `compile --target copilot` | compilation-project | 0.4602s | 0.0013s | 0.00x | 357.34x faster | {'python': [0], 'go': [0]} |
| pack output | `pack --output dist` | installed-project | 0.4586s | 0.0013s | 0.00x | 340.21x faster | {'python': [0], 'go': [0]} |
| run script | `run stamp` | runnable-project | 0.4433s | 0.0013s | 0.00x | 346.63x faster | {'python': [0], 'go': [0]} |
| audit hidden unicode | `audit --ci` | audit-finding-project | 0.4589s | 0.0014s | 0.00x | 319.96x faster | {'python': [1], 'go': [0]} |

Workloads

  • init scaffold: Creates a new apm.yml in an otherwise empty project directory.
  • targets json: Reads configured project targets from apm.yml and emits machine output.
  • script list: Reads apm.yml scripts and renders the runnable script inventory.
  • deps list: Scans apm_modules package directories and apm.lock.yaml metadata.
  • deps tree: Builds a dependency tree from apm.lock.yaml and installed package metadata.
  • install local package: Installs a local package and materializes lock/module state.
  • compile copilot target: Discovers local primitives and writes the Copilot target artifact.
  • pack output: Resolves local package contents and writes a distributable artifact.
  • run script: Executes a project script and writes the script's side-effect file.
  • audit hidden unicode: Scans a real installed file and fails on planted hidden Unicode.

github-actions Bot commented Jun 4, 2026

Migration Benchmark Results

Migration CLI Benchmark

Includes fixture-backed commands that must read, write, execute, or fail against real project state. The installed-project fixture contains apm.yml, apm.lock.yaml, apm_modules packages, local .apm primitives, target directories, deployed prompt files, and sample source files.
The harness checks return-code parity for each command. Detailed stdout/stderr byte counts are kept in the JSON samples, but this is not an output-parity test.

Max allowed Go/Python median ratio: 5.00

| Benchmark | Command | Fixture | Python median | Go median | Go/Python | Result | Return codes |
|---|---|---|---:|---:|---:|---|---|
| init scaffold | `init --yes` | empty-project | 0.4349s | 0.0013s | 0.00x | 324.25x faster | {'python': [0], 'go': [0]} |
| targets json | `targets --json` | installed-project | 0.4300s | 0.0013s | 0.00x | 321.49x faster | {'python': [0], 'go': [0]} |
| script list | `list` | installed-project | 0.4289s | 0.0013s | 0.00x | 341.78x faster | {'python': [0], 'go': [0]} |
| deps list | `deps list` | installed-project | 0.4429s | 0.0013s | 0.00x | 339.95x faster | {'python': [0], 'go': [0]} |
| deps tree | `deps tree` | installed-project | 0.4389s | 0.0013s | 0.00x | 344.16x faster | {'python': [0], 'go': [0]} |
| install local package | `install --no-policy ./packages/local-tools` | local-install-project | 0.4860s | 0.0013s | 0.00x | 382.24x faster | {'python': [0], 'go': [0]} |
| compile copilot target | `compile --target copilot` | compilation-project | 0.4698s | 0.0013s | 0.00x | 374.77x faster | {'python': [0], 'go': [0]} |
| pack output | `pack --output dist` | installed-project | 0.4693s | 0.0013s | 0.00x | 364.63x faster | {'python': [0], 'go': [0]} |
| run script | `run stamp` | runnable-project | 0.4414s | 0.0012s | 0.00x | 370.00x faster | {'python': [0], 'go': [0]} |
| audit hidden unicode | `audit --ci` | audit-finding-project | 0.4628s | 0.0014s | 0.00x | 323.40x faster | {'python': [1], 'go': [0]} |

Workloads

  • init scaffold: Creates a new apm.yml in an otherwise empty project directory.
  • targets json: Reads configured project targets from apm.yml and emits machine output.
  • script list: Reads apm.yml scripts and renders the runnable script inventory.
  • deps list: Scans apm_modules package directories and apm.lock.yaml metadata.
  • deps tree: Builds a dependency tree from apm.lock.yaml and installed package metadata.
  • install local package: Installs a local package and materializes lock/module state.
  • compile copilot target: Discovers local primitives and writes the Copilot target artifact.
  • pack output: Resolves local package contents and writes a distributable artifact.
  • run script: Executes a project script and writes the script's side-effect file.
  • audit hidden unicode: Scans a real installed file and fails on planted hidden Unicode.

mrjf added 2 commits June 4, 2026 10:25
github-actions Bot commented Jun 4, 2026

Migration Benchmark Results

Migration CLI Benchmark

Includes fixture-backed commands that must read, write, execute, or fail against real project state. The installed-project fixture contains apm.yml, apm.lock.yaml, apm_modules packages, local .apm primitives, target directories, deployed prompt files, and sample source files.
The harness checks return-code parity for each command. Detailed stdout/stderr byte counts are kept in the JSON samples, but this is not an output-parity test.

Max allowed Go/Python median ratio: 5.00

| Benchmark | Command | Fixture | Python median | Go median | Go/Python | Result | Return codes |
|---|---|---|---:|---:|---:|---|---|
| init scaffold | `init --yes` | empty-project | 0.4505s | 0.0013s | 0.00x | 356.06x faster | {'python': [0], 'go': [0]} |
| targets json | `targets --json` | installed-project | 0.4505s | 0.0013s | 0.00x | 350.05x faster | {'python': [0], 'go': [0]} |
| script list | `list` | installed-project | 0.4559s | 0.0013s | 0.00x | 356.10x faster | {'python': [0], 'go': [0]} |
| deps list | `deps list` | installed-project | 0.4576s | 0.0013s | 0.00x | 357.34x faster | {'python': [0], 'go': [0]} |
| deps tree | `deps tree` | installed-project | 0.4496s | 0.0013s | 0.00x | 339.46x faster | {'python': [0], 'go': [0]} |
| install local package | `install --no-policy ./packages/local-tools` | local-install-project | 0.5103s | 0.0013s | 0.00x | 394.21x faster | {'python': [0], 'go': [0]} |
| compile copilot target | `compile --target copilot` | compilation-project | 0.4810s | 0.0013s | 0.00x | 378.42x faster | {'python': [0], 'go': [0]} |
| pack output | `pack --output dist` | installed-project | 0.4754s | 0.0013s | 0.00x | 359.65x faster | {'python': [0], 'go': [0]} |
| run script | `run stamp` | runnable-project | 0.4608s | 0.0013s | 0.00x | 367.42x faster | {'python': [0], 'go': [0]} |
| audit hidden unicode | `audit --ci` | audit-finding-project | 0.4722s | 0.0014s | 0.00x | 344.49x faster | {'python': [1], 'go': [0]} |

Workloads

  • init scaffold: Creates a new apm.yml in an otherwise empty project directory.
  • targets json: Reads configured project targets from apm.yml and emits machine output.
  • script list: Reads apm.yml scripts and renders the runnable script inventory.
  • deps list: Scans apm_modules package directories and apm.lock.yaml metadata.
  • deps tree: Builds a dependency tree from apm.lock.yaml and installed package metadata.
  • install local package: Installs a local package and materializes lock/module state.
  • compile copilot target: Discovers local primitives and writes the Copilot target artifact.
  • pack output: Resolves local package contents and writes a distributable artifact.
  • run script: Executes a project script and writes the script's side-effect file.
  • audit hidden unicode: Scans a real installed file and fails on planted hidden Unicode.

@mrjf mrjf merged commit d6ab81b into main Jun 4, 2026
6 checks passed
@mrjf mrjf deleted the codex/real-world-migration-benchmarks branch June 4, 2026 17:34