feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703
feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703st-gr wants to merge 5 commits into
Conversation
|
All contributors have signed the DCO ✍️ ✅ |
|
I have read the DCO document and I hereby sign the DCO. |
|
recheck |
|
/ok to test 8689967 |
8689967 to
464018b
Compare
|
Pushed a
|
NVIDIA/OpenShell's Cargo.toml has
[workspace.lints.clippy]
pedantic = "warn"
nursery = "warn"
all = "warn"
so `cargo clippy --workspace -- -D warnings` escalates pedantic lints
(including `doc_markdown`) to errors. Their CI uses this as part of
`mise run pre-commit`. The fork's existing `release-gateway.yml` only
runs `docker build` (which calls `cargo build --release` — no clippy),
so the same lints weren't enforced on the fork.
PR NVIDIA#1703 caught us with two `doc_markdown` violations
in `crates/openshell-core/src/config.rs:56` (`compute_driver.proto` and
`OPENSHELL_COMPUTE_DRIVER_SOCKET` missing backticks in the External
variant's doc comment). The fork's CI didn't catch it; NVIDIA's did.
This workflow runs on push to main + `feat/**` branches and on
workflow_dispatch. It pins to Rust 1.95.0 (matches upstream mise.toml)
and installs the same native deps `Dockerfile.gateway` needs
(libclang-dev + libz3-dev for transitive bindgen + z3-sys).
Workflow is light: ubuntu-latest runner, ~10-15 min cold, ~3-5 min
with the Swatinem/rust-cache hot. Path-filtered so doc-only changes
don't trigger it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Carries the UDS path supplied by --compute-driver-socket. Drops Copy from the enum derive (PathBuf is not Copy); existing callers use Clone or owned values. FromStr accepts 'external:<path>' and rejects bare 'external' with a message pointing at the CLI flag. Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Adds an opt-in `--compute-driver-socket=PATH` flag (env `OPENSHELL_COMPUTE_DRIVER_SOCKET`) on the gateway's `RunArgs`. When set, the gateway pins `ComputeDriverKind::External(<path>)` and skips both the `--drivers` list and the auto-detection probe. This lets out-of-tree driver binaries (Kyma, custom backends) connect to a stock gateway without a rebuild. `effective_single_driver` and the `Config.compute_drivers` payload both honour the new flag so pre-runtime checks and the runtime factory dispatch agree on the configured driver. The companion dispatch arm in `lib.rs::build_compute_runtime` is wired in the follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
`build_compute_runtime` now connects a tonic `Channel` to the configured Unix domain socket using `hyper-util`'s `TokioIo` connector and wraps it in `RemoteComputeDriver` -- the same proxy used by the VM driver. Replaces the placeholder `External(_) => Err(...)` arm. Adds two helpers in `compute/mod.rs`: * `connect_external_compute_driver(socket_path)` -- a small tonic-Endpoint + tower::service_fn + UnixStream connector, parallel to the one in `compute::vm` but with no VM-specific logging or capability probing. Out-of-tree drivers manage their own readiness; the gateway just dials. * `ComputeRuntime::new_remote_external(channel, ...)` -- mirrors `new_remote_vm` but takes no `ManagedDriverProcess`. The external driver's lifecycle is the operator's responsibility (systemd unit, sidecar container, etc.). Smoke-tested: `--compute-driver-socket /tmp/nonexistent.sock` now starts the gateway, logs "Connecting to external compute driver", and fails with a clear "failed to connect to external compute driver socket '<path>': transport error" message that points at the new arm. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
…uteDriverKind Adds three short inline comments at the points where the non-`Copy`-ness of `ComputeDriverKind` becomes visible to readers, so reviewers don't have to ask "why was Copy removed?" or "why clone() instead of *deref?": - `crates/openshell-core/src/config.rs`: comment above the enum derive explaining the `Copy` and `const fn as_str(self)` revert was forced by the `External(PathBuf)` variant, and that all in-tree call sites already handle `Clone`. - `crates/openshell-server/src/cli.rs`: comment at the `effective_single_driver` match arm noting why we clone instead of deref. - `crates/openshell-server/src/lib.rs`: same comment shape at the `configured_compute_driver` match arm. Squash into the corresponding earlier commit if preferred — split out as its own commit to make the rationale grep-able. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Adds a row to the Runtime Summary table describing the new
`--compute-driver-socket=<path>` flag and how the External arm relates
to the existing in-tree drivers (operator-managed driver process,
Unix domain socket, identical proto contract to the VM driver, same
trust boundary).
Keeps the rest of the document untouched — the per-runtime
implementation notes (e.g. which driver crate README to read) are
intentionally NOT extended for `External` because external drivers
ship their own documentation in their own repos. The reference
implementation lives at:
https://github.com/st-gr/openshell-driver-kyma
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
464018b to
1c6663d
Compare
|
Force-pushed once more — caught a second pedantic clippy lint locally that the previous run didn't get to ( New HEAD: 1c6663d. Verified clean with the full-workspace check NVIDIA's CI runs: Could I get a re-vet with Apologies for the round-trip — I've added a |
elezar
left a comment
There was a problem hiding this comment.
Thinking ahead to changes such as #1589 and support for multiple drivers in future, does it make sense to formulate this as a "named driver endpoint" where a user would supply a driver name and a socket path? Especially given that the in-tree vm driver is already an example of such a driver and we would only be extending the existing functionality.
In practice, the gateway would select a driver by name (instead of a kind enum) and one could use GetCapabilities.driver_name to validate the user-provided name if required.
|
@elezar A. Minimal add. Keep B. Restructured Replacing I'd lean (A) because it doesn't pre-commit the data model to a shape #1589 might want to rewrite, and it costs ~30 LOC. Either is fine, though. Let me know which you'd prefer to see here. |
Summary
Adds an opt-in
--compute-driver-socket=<path>flag (envOPENSHELL_COMPUTE_DRIVER_SOCKET) to the gateway. When set, the gateway dispatches sandbox lifecycle to an out-of-tree compute driver speaking the existingcompute_driver.protocontract over a Unix domain socket, instead of one of the four built-in drivers (Kubernetes / Podman / Docker / VM).Related Issue
Supersedes the closed draft #1604 (same author, same patch shape — re-opened as a non-draft after rebasing onto current
main, with all commits DCO-signed and st-gr-attributed). Same direction ascheese-head's vouch in #1345 ("extend or provide new out-of-tree compute drivers to OpenShell").No upstream tracking issue filed — happy to file one if reviewers prefer.
Changes
crates/openshell-core/src/config.rs: addsExternal(PathBuf)variant toComputeDriverKind. Drops theCopyderive andconst fn as_str(self)(both incompatible withPathBuf).FromStracceptsexternal:<path>(case-insensitive prefix) and rejects bareexternalwith a message pointing at the CLI flag.crates/openshell-server/src/cli.rs: adds--compute-driver-socket=<path>flag (envOPENSHELL_COMPUTE_DRIVER_SOCKET) onRunArgs. When set, pinsComputeDriverKind::External(<path>)and skips both--driversand the auto-detection probe.crates/openshell-server/src/config_file.rs: surfaces the new field through the config-file path.crates/openshell-server/src/lib.rs/crates/openshell-server/src/compute/mod.rs: inbuild_compute_runtime, theExternal(<path>)arm connects atonic::Channelto the UDS viahyper-util::TokioIoand wraps it inRemoteComputeDriver— the same proxy used by the in-tree VM driver.architecture/compute-runtimes.md: adds an "External" row to the Runtime Summary table, describing the trust boundary (operator-owned UDS file permissions) and the activation flag.Testing
mise run pre-commitpasses — not run end-to-end (misenot on author's dev environment), but the equivalent rust pieces verified independently:cargo fmt --all -- --checkclean;cargo clippy --no-deps -p openshell-core -p openshell-server --all-targets -- -D warningsclean.compute_driver_kind_*tests incrates/openshell-core/src/config.rs::testscover the new variant: parsesexternal:<path>, rejects bareexternaland empty path, displays asexternal:<path>, round-trips throughFromStr+Display, and is case-insensitive on the prefix. Plus 3 CLI-arg tests for the new flag (presence, override-of-drivers, env-var binding).Externaldriver in CI would require shipping a stub driver binary; deferring until at least one in-tree consumer wants to test against it. The patch is exercised in production at SAP via thest-gr/openshell-driver-kymareference driver running against a Gardener-managed Kyma cluster since 2026-05-28 (sandbox CR creation,openshell sandbox exec,openshell sandbox upload/downloadall work).Checklist
feat(core):,feat(server):,docs(core,server):,docs(arch):)architecture/compute-runtimes.md::Runtime Summary. Per-runtime implementation-notes section is intentionally NOT extended forExternalbecause external drivers ship their own documentation; the reference implementation lives atst-gr/openshell-driver-kyma.Notes for reviewers
Out of scope (intentional):
External(PathBuf)+ UDS dispatch). Operators bring their own driver process and point--compute-driver-socketat its socket path.compute_driver.protochanges. The existing protocol is reused verbatim — the same one already used by the in-tree VM driver to talk to its tonic peer.--drivers vmposture.hyper-utilis already in the workspaceCargo.toml; the patch reuses it via the sameTokioIoconnector path the VM driver uses.Operator context:
The forked gateway image at
ghcr.io/st-gr/openshell-gateway(carrying these patches plus an in-treeDockerfile.gateway) has been driving production-shaped Kubernetes clusters with thest-gr/openshell-driver-kymacompute driver since 2026-05-28. The flag has needed zero changes since it landed — the API shape is stable. Merging this PR lets the driver consume the unmodified upstream gateway image, retiring the fork.