v4.3.0 #9971

mudler · 2026-05-24T20:25:27Z

mudler
May 24, 2026
Maintainer

🎉 LocalAI 4.3.0 Release! 🚀

LocalAI 4.3.0 is out!

This release hardens the trust boundary and improves defaults for speed. Backend OCI images now ship with keyless cosign signatures and a per-gallery verification: policy, with an opt-in strict mode that fails closed.
The llama-cpp server-side prompt cache now works by default: repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) collapse from minutes to seconds without touching YAML. Distributed mode gets another round of fixes and optimizations. Usage tracking grows a per-API-key + per-user Sources view so admins can finally answer "who is burning the GPU?". And for everyone on a Jetson/DGX box, the L4T13 (cu130/aarch64) backends are back.

📌 TL;DR

Feature	Summary
🔐 Signed Backends	Keyless cosign + sigstore-go verification for backend OCI images, OCI 1.1 referrers, `not_before` revocation, opt-in strict mode.
⚡ Prompt Cache by Default	`llama-cpp` server-side prompt cache works out of the box. Repeated system prompts go from 5-8 min to seconds.
📊 Usage per API Key	New Sources tab attributes traffic to keys and users. Revoked keys stay readable in history.
🛰️ Distributed v3	Per-request replica routing, cached `probeHealth`, async per-node installs with streaming progress, unified backend-logs entry point.
🩺 Traces UI Stays Snappy	`LOCALAI_TRACING_MAX_BODY_BYTES` caps API + backend trace payloads. Admin Traces page stops drowning in 40 MB embeddings.
🧊 Nix Flake	Dockerless setup for NixOS users via `flake.nix` + dev shell.
🦾 Jetson Thor Restored	`vllm` / `sglang` / `vllm-omni` L4T13 backends switched to PyPI aarch64+cu130 wheels (torch 2.10 ABI fix).

🚀 New Features & Major Enhancements

🔐 Signed Backends with Keyless Cosign

LocalAI now verifies that backend OCI images came from our CI, not a compromised registry or MITM. This closes a real trust gap: the gallery YAML told LocalAI which image to pull, but nothing checked the bytes.

The producer side (.github/workflows/backend_merge.yml) signs every merged backend image (and every per-arch entry under the manifest list) with sigstore/cosign keyless via Fulcio + Rekor, using OCI 1.1 referrers (no legacy :tag.sig). The consumer side (pkg/oci/cosignverify, built on sigstore-go) verifies signatures against a per-gallery verification: policy:

verification:
  issuer_regex: "^https://token\\.actions\\.githubusercontent\\.com$"
  identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@.*$"
  not_before: "2026-05-22T00:00:00Z"

The TUF trusted root is cached process-wide, so N backends from one gallery trigger one fetch, not N.
not_before is the revocation lever: keyless Fulcio certs are ephemeral, so revocation is policy-side. Advance the date in the gallery YAML and every signature predating the cutoff is invalidated.
Digest pinning closes the TOCTOU window between verify and pull.
Strict mode: --require-backend-integrity (or LOCALAI_REQUIRE_BACKEND_INTEGRITY=true) escalates missing policy / empty SHA256 from warn to hard-fail.
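
The policy semantics above can be sketched in a few lines. This is only an illustration of how the three fields interact, not the sigstore-go based verifier in pkg/oci/cosignverify: the `Policy` class, `policy_accepts`, and the `signed_at` argument are hypothetical names, and comparing `not_before` against a signing timestamp is an assumption about which time the real verifier uses. Real verification also checks the signature itself, the Fulcio certificate chain and the Rekor entry.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Policy:
    # Field names mirror the gallery YAML keys; the class itself is hypothetical.
    issuer_regex: str
    identity_regex: str
    not_before: datetime


def policy_accepts(policy: Policy, issuer: str, identity: str, signed_at: datetime) -> bool:
    """Return True when an already cryptographically verified signature satisfies the policy."""
    if not re.search(policy.issuer_regex, issuer):
        return False
    if not re.search(policy.identity_regex, identity):
        return False
    # Revocation lever: anything signed before the cutoff is rejected.
    return signed_at >= policy.not_before


policy = Policy(
    issuer_regex=r"^https://token\.actions\.githubusercontent\.com$",
    identity_regex=r"^https://github\.com/mudler/LocalAI/\.github/workflows/backend_merge\.yml@.*$",
    not_before=datetime(2026, 5, 22, tzinfo=timezone.utc),
)
```

Advancing `not_before` in this model rejects every older signature without touching any key material, which is the point of doing revocation policy-side.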

Rollout is backward-compatible: until a gallery ships a verification: block, installs proceed with a warning. The default backend/index.yaml will be populated next, and strict mode is opt-in. See .agents/backend-signing.md for the full producer + consumer story.

🔗 PRs: #9823 (consumer + producer + plumbing), #9957 (fix for current cosign releases).

⚡ Prompt Cache: On by Default

llama-cpp ships with a server-side prompt cache, but until now LocalAI was not enabling it by default. Repeated system prompts (agents, Claude-Code-style coding assistants, OpenAI-compatible CLIs with long instructions) were re-prefilled on every call. With this release, the same workload collapses to seconds with no specific configuration on your side.
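
To see why a prompt cache helps this workload, here is a conceptual model of prefix reuse. It is a sketch under simplifying assumptions, not llama.cpp's implementation: the function names and token ids are made up, and the real cache works on KV state rather than plain token lists.

```python
def shared_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Count how many leading tokens the cached and incoming prompts share."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


def tokens_to_prefill(cached: list[int], incoming: list[int]) -> int:
    """Tokens that still need prompt processing once the shared prefix is reused."""
    return len(incoming) - shared_prefix_len(cached, incoming)


# Hypothetical token ids: a long system prompt followed by short user turns.
system_prompt = list(range(4000))
first_request = system_prompt + [9001, 9002, 9003]
second_request = system_prompt + [9101, 9102]

without_cache = tokens_to_prefill([], second_request)
with_cache = tokens_to_prefill(first_request, second_request)
```

In this model the second request only processes the tokens after the shared system prompt, while without a cache it processes the whole prompt again.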

Two changes, one default flip each:

kv_unified=true by default in grpc-server.cpp. The previous false was silently force-disabling cache_idle_slots at server init (the host prompt cache was being allocated but never written across requests).
prompt_cache_all defaults to true at the YAML layer, matching upstream llama.cpp's own common.h default. The per-request cache_prompt knob is now on out of the box.

You can still opt out with options: ["kv_unified:false"] or prompt_cache_all: false, and there are new option keys (cache_idle_slots, checkpoint_every_nt) for tuning. Docs in docs/content/advanced/model-configuration.md got a worked example for the repeated-system-prompt workload and a proper explanation of how kv_unified, cache_ram, and cache_idle_slots interact.
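
The default flips and opt-outs above can be modeled as follows. This is a minimal sketch, assuming that an unset `prompt_cache_all` is represented as "no value" (the tristate mentioned in #9951) and that options are plain `key:value` strings; the helper names are hypothetical and the actual parsing in LocalAI may differ.

```python
from typing import Optional


def resolve_prompt_cache_all(configured: Optional[bool]) -> bool:
    """None means 'not set in the model YAML'; it now resolves to True."""
    return True if configured is None else configured


def parse_options(options: list[str]) -> dict[str, str]:
    """Split 'key:value' strings such as options: ["kv_unified:false"]."""
    parsed = {}
    for opt in options:
        key, _, value = opt.partition(":")
        parsed[key.strip()] = value.strip()
    return parsed


def kv_unified_enabled(options: list[str]) -> bool:
    # New default is true; in this sketch only an explicit "false" opts out.
    return parse_options(options).get("kv_unified", "true").lower() != "false"
```

An explicit `false` still wins in both cases, so existing configs that disabled the cache keep their behavior.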

🔗 PRs: #9925 (kv_unified + cache_idle_slots defaults + docs), #9951 (prompt_cache_all tristate default).

📊 Per-API-Key Usage Tracking

Closes #9862. The usage page now answers "who spent these tokens?", not just "how many tokens were spent".

usage_records gained Source (apikey / web / legacy), APIKeyID, APIKeyName, plus an idempotent backfill of pre-feature rows on InitDB.
Auth middleware plumbs the resolved *UserAPIKey and the request source through the Echo context. Usage middleware snapshots the key id + name, so revoked keys stay readable in history (rendered as (revoked)).
New endpoints: GET /api/auth/usage/sources (self, no legacy) and GET /api/auth/admin/usage/sources (admin, with user_id / api_key_id filters, 200-key truncation).
React Usage page gains a Sources tab with a source-mix ribbon, a top-7 + Other time chart, and a searchable/sortable table with drill-in chip.
Admin view (follow-up in feat(usage): attribute Sources rows to user accounts in admin view #9935) also rolls up (source, user_id, user_name) so Web UI session traffic is split per user instead of lumped into one global "Web UI" row, and every named-key row shows the owning account.
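
The "top-7 + Other" grouping used by the Sources chart can be sketched like this. It is an illustration only: the record field names (`source`, `api_key_name`, `tokens`) and both helper functions are assumptions, not the actual usage_records schema or the React implementation.

```python
from collections import defaultdict


def tokens_by_label(records: list[dict]) -> dict[str, int]:
    """Sum tokens per API key name, falling back to the source for keyless traffic."""
    totals: dict[str, int] = defaultdict(int)
    for record in records:
        label = record.get("api_key_name") or record["source"]
        totals[label] += record["tokens"]
    return dict(totals)


def top_n_plus_other(totals: dict[str, int], n: int = 7) -> list[tuple[str, int]]:
    """Keep the n largest consumers and fold the remainder into one 'Other' bucket."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    head = ranked[:n]
    remainder = sum(value for _, value in ranked[n:])
    if remainder:
        head.append(("Other", remainder))
    return head
```

Because the key name is stored on each record at request time, a rollup like this keeps working after a key is revoked.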

Docs: features/authentication.md gained a full Usage Tracking section with the new tab, endpoints, response shape and migration notes.

🔗 PRs: #9920 (core + Sources tab), #9935 (per-user attribution in admin view).

🛰️ Distributed Mode v3

Distributed mode keeps hardening. This release fixes the two things that bit operators hardest in practice and lays the groundwork for the next round of UX.

Per-request routing across replicas (#9968) restores cross-node load balancing. The bug: ModelLoader.Load cached a *Model whose embedded InFlightTrackingClient was bound to a single (nodeID, replicaIndex). After the first request, every subsequent call reused that wrapper and pinned to whichever node won the first pick, even after the reconciler scaled the model out. The reproducer from the report:

dgx-spark1     loaded   in_flight=6
nvidia-thor1   loaded   in_flight=0       (← idle, never gets traffic)

Now SmartRouter.Route runs per request, the existing in_flight ASC, last_used ASC, available_vram DESC round-robin actually fires, and the replica-selection rule lives in one place (PickBestReplica) with a mirror spec asserting the SQL ORDER BY and the Go picker agree on a seeded dataset. probeHealth is now memoized per (nodeID, addr) with a 30s TTL and singleflight coalescing, so a burst of new requests doesn't stall on a HealthCheck that llama.cpp serializes against in-flight Predict.
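
The ordering rule (in_flight ASC, last_used ASC, available_vram DESC) can be expressed as a single sort key. The sketch below is a Python model of that rule, not the Go `PickBestReplica`; the `Replica` fields and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    node_id: str
    replica_index: int
    in_flight: int        # requests currently being served
    last_used: float      # unix seconds; smaller means used longer ago
    available_vram: int   # bytes


def pick_best_replica(replicas: list[Replica]) -> Optional[Replica]:
    """Least busy first, then least recently used, then most free VRAM."""
    if not replicas:
        return None
    return min(replicas, key=lambda r: (r.in_flight, r.last_used, -r.available_vram))
```

Running the pick on every request, rather than once at load time, is what lets the idle node from the reproducer start receiving traffic.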

Async per-node installs via the gallery job queue (#9928). POST /api/nodes/:id/backends/install used to block the request for up to 3 minutes while the worker pulled the image, freezing the React UI's Backends picker. It now returns HTTP 202 + jobID immediately, scoped to a one-element targetNodeIDs allowlist, with a node-scoped opcache row so concurrent installs on different nodes don't collide. The Operations panel surfaces a nodeID field for attribution.

Resilient backend installs with streaming progress (#9958). Two phases:

Phase 1: LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT env vars (default 15m, previously hardcoded 3m). A NATS round-trip timeout while the worker is still pulling no longer reports as a hard failure: per-node status becomes running_on_worker, the queue row stays alive without bumping Attempts, and ListBackends proactively clears install rows whose intent is satisfied (so the UI updates instantly instead of waiting up to 15m for the next reconciler tick).
Phase 2: workers publish debounced (~250ms) BackendInstallProgressEvent values on a transient nodes.<nodeID>.backend.install.<opID>.progress subject. The master subscribes for the duration of the request and forwards each event into OpStatus.UpdateStatus, so the admin UI gets per-byte progress for distributed installs the same way local-mode does, with no UI changes. Backward compatible: old workers stay silent, new masters tolerate silence.
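
A rough model of the Phase 2 publisher is shown below. It is a sketch under assumptions: the class and function names are hypothetical, the clock is injectable only to make the behavior easy to test, and the release notes do not say whether the worker publishes on the leading or trailing edge of the ~250ms window (this version publishes immediately, then holds back updates until the window has passed).

```python
import time
from typing import Any, Callable


def progress_subject(node_id: str, op_id: str) -> str:
    """Build the transient subject named in the notes for one install operation."""
    return f"nodes.{node_id}.backend.install.{op_id}.progress"


class DebouncedPublisher:
    """Forward at most one progress event per interval, keeping the latest one pending."""

    def __init__(self, publish: Callable[[Any], None], interval: float = 0.25,
                 clock: Callable[[], float] = time.monotonic):
        self._publish = publish
        self._interval = interval
        self._clock = clock
        self._last_sent = None
        self._pending = None

    def update(self, event: Any) -> None:
        now = self._clock()
        if self._last_sent is None or now - self._last_sent >= self._interval:
            self._publish(event)
            self._last_sent = now
            self._pending = None
        else:
            # Too soon: remember only the newest event.
            self._pending = event

    def flush(self) -> None:
        """Send the held-back event, for example when the install finishes."""
        if self._pending is not None:
            self._publish(self._pending)
            self._pending = None
```

Calling `flush()` at the end makes sure the final progress value is not lost inside the last window.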

Unified backend-logs entry point (#9949). /app/backend-logs/:modelId is now a single, mode-aware route. In standalone it's the local WebSocket view, unchanged. In distributed it probes nodesApi.getModels, filters by model_name, then routes: 0 hits → empty state with a link to Nodes; 1 hit → <Navigate replace> to the per-node logs URL preserving the ?from= deep-link timestamp; N hits → a picker listing each hosting worker with node id, replica index and load state. Every view that links to backend logs now points at the same URL.
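
The 0 / 1 / N routing decision can be summarized as a small pure function. This is a sketch of the decision only, not the React route: apart from `model_name`, the field names (`node_id`, `replica_index`, `state`) and the returned view labels are assumptions.

```python
def route_backend_logs(node_models: list[dict], model_id: str) -> dict:
    """Decide which view the unified backend-logs route should show in distributed mode."""
    hits = [m for m in node_models if m.get("model_name") == model_id]
    if not hits:
        # Nothing hosts the model: empty state with a link to Nodes.
        return {"view": "empty"}
    if len(hits) == 1:
        # Exactly one host: go straight to that node's logs.
        return {"view": "redirect", "node_id": hits[0]["node_id"]}
    # Several hosts: let the user choose.
    return {
        "view": "picker",
        "choices": [(m["node_id"], m.get("replica_index"), m.get("state")) for m in hits],
    }
```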

Bug-hunt harness. A new distributed test harness landed in tests/distributed/ to catch the kind of regressions the #9968 reproducer surfaced.

🔗 PRs: #9968, #9928, #9958, #9949, plus the tests(distributed): add bug-hunt harness commit.

🩺 Admin Traces UI: Stays Responsive Under Load

Two complementary caps fix the symptom where the admin Traces page sat in "loading" forever on a chatty agent-pool RAG deployment.

API-side (fix(traces): cap captured body size to keep admin Traces UI responsive #9946): LOCALAI_TRACING_MAX_BODY_BYTES (default 64 KiB) caps each captured request/response body in the trace middleware. The full payload still flows to the real client; only the trace copy is bounded. body_truncated + original body_bytes are recorded so the dashboard can surface that truncation happened. Observed before the fix on a live deployment: /api/traces returned 44.6 MB (466 traces, 447 /embeddings, top body 1.38 MB). The Traces UI Clear button is also kept enabled during loading, which is exactly when you need it.
Backend-side (fix(traces): cap backend trace Data to keep admin UI responsive #9960): RecordBackendTrace walks the Data map and replaces any string value larger than the cap with <truncated: N bytes>. Producers (core/backend/llm.go, core/trace/audio_snippet.go) apply head-preserving truncation upstream so the UI still shows useful leading content. TTS / audio_transform traces drop the base64 snippet when the encoded blob exceeds the cap (truncated base64 is undecodable; the React WaveformPlayer already no-ops without it).
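
Both caps can be modeled in a few lines. This is a sketch, not the Go middleware or `RecordBackendTrace`: the function names are hypothetical, the backend-side walk here only looks at top-level string values (how nested values are handled is not stated), and measuring string size in UTF-8 bytes is an assumption.

```python
DEFAULT_MAX_BODY_BYTES = 64 * 1024  # matches the documented 64 KiB default


def cap_body(body: bytes, max_bytes: int = DEFAULT_MAX_BODY_BYTES) -> dict:
    """Bound the trace copy of a body while recording that truncation happened."""
    return {
        "body": body[:max_bytes],
        "body_truncated": len(body) > max_bytes,
        "body_bytes": len(body),  # original size, before truncation
    }


def cap_trace_data(data: dict, max_bytes: int = DEFAULT_MAX_BODY_BYTES) -> dict:
    """Replace oversized string values with a '<truncated: N bytes>' marker."""
    capped = {}
    for key, value in data.items():
        if isinstance(value, str):
            size = len(value.encode("utf-8"))
            if size > max_bytes:
                capped[key] = f"<truncated: {size} bytes>"
                continue
        capped[key] = value
    return capped
```

Only the trace copy is touched in either function; the payload sent to the real client is never passed through them.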

Both knobs are live-tunable from the Traces settings panel.

🔗 PRs: #9946 (API side), #9960 (backend side).

🧊 Nix Flake for NixOS Users

New flake.nix + flake.lock ship a reproducible, dockerless setup for NixOS, plus a dev shell for hacking on LocalAI without a container.

🔗 PRs: #9851 (initial flake), #9894 (correct src path + dev shell).

🦾 Jetson Thor (L4T13) Backends Restored

The cuda13-nvidia-l4t-arm64-vllm / sglang / vllm-omni backends crashed at import with an undefined c10::MessageLogger symbol after the pypi.jetson-ai-lab.io/sbsa/cu130 mirror started shipping torch 2.11 next to vllm/sglang wheels built against torch 2.10. Per the PyTorch April 2026 announcement, all three backends now pull from PyPI's official aarch64 + cu130 wheels instead, with the L4T13 pyproject.toml retired in favor of the standard requirements-${profile}.txt pattern used everywhere else.

🔗 PR: #9950.

📎 Chat: File Attachments + Stream Usage + Selection

Three independent fixes that together make the chat experience visibly better:

Text-file attachments actually reach the model (regression from react-ui port) (fix: inject text-file content into chat completions messages #9896). .txt, .md, .csv, .json content was silently dropped in useChat.js (only image_url and audio_url branches added content; the else branch only pushed metadata). Home.jsx also never called file.text() for files attached from the home screen. Both fixed. PDF files still need a parser (PDF.js or server-side extraction) and are flagged as a known limitation.
Stream include_usage returns non-zero with tools (fix(openai): stream usage non-zero when tools are enabled #9941). Fixes #9927 ("Streaming usage accounting returns zeros when tools/function calling are enabled"). processTools discarded the cumulative TokenUsage from ComputeChoices, so the streaming trailer reported {0, 0, 0} whenever a tools array was present. The fix forwards the authoritative final usage via a sentinel chunk before close(responses), with the outer loop updated to capture before the empty-Choices skip. The OpenAI streaming spec contract is preserved (intermediate chunks still carry no usage).
Chat selection stops getting wiped every second (fix(react-ui/chat): stop wiping selection on every /api/operations poll (#9904) #9917). React 19 dropped the old lastHtml === nextHtml short-circuit in its DOM diff, so the 1s /api/operations poll re-assigning setOperations with a fresh array reference was collapsing text selection on every assistant message. Now JSON-compared and short-circuited. Bonus: the per-message copy button works over plain HTTP via a hidden-textarea + execCommand('copy') fallback when navigator.clipboard is unavailable.
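
For the streaming usage fix, the contract is easiest to see from the consumer side. The sketch below is a client-side illustration, not the Go server code: it assumes chunks have already been parsed into dicts in the OpenAI streaming shape, where intermediate chunks carry no usage and the final chunk has an empty `choices` list plus the usage totals. The function name is hypothetical.

```python
from typing import Optional


def collect_stream(chunks: list[dict]) -> tuple[str, Optional[dict]]:
    """Concatenate streamed content and pick up the usage trailer if present."""
    parts = []
    usage = None
    for chunk in chunks:
        # Capture usage first: the trailer chunk has no choices to iterate over.
        if chunk.get("usage"):
            usage = chunk["usage"]
        for choice in chunk.get("choices") or []:
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts), usage
```

Reading usage before skipping chunks without choices mirrors the ordering issue described above: skip first, and the trailer is silently lost.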

🔧 llama.cpp Stability + Refactors

tensor_buft_overrides sentinel terminator (fix(llama-cpp): terminate tensor_buft_overrides with sentinel #9919). Mirror upstream common/arg.cpp:645-658: pad placeholders at the end of the main vector so back().pattern == nullptr holds, and append a single {nullptr, nullptr} to the draft vector when non-empty.
Agents refactor (refactor(agents): bump skillserver, drop redundant Name from list_skills output #9916). list_skills and search_skills now return the same shape (only id, no duplicated name). Adds a Ginkgo regression test that drives the LocalAI FilesystemManager through an in-process MCP session.
Swagger refreshes (feat(swagger): update swagger #9872, feat(swagger): update swagger #9962) keep the OpenAPI surface in sync with the routes added this cycle.

🛠️ CI & Image Plumbing

Chronologically-orderable master tags. Master images are now also tagged master-<epoch>-<sha> so they sort by build time. The pre-existing master tag still moves with HEAD.
Backend signing CI: COSIGN_EXPERIMENTAL=1 is set for the oci-1-1 referrers mode in the backend-signing job to keep current cosign versions happy.
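
As a small illustration of why the `master-<epoch>-<sha>` scheme is useful, the tags can be ordered by build time with a simple parse. The helper names and the sample tags below are hypothetical; the exact epoch and sha formats used by CI are not stated in these notes.

```python
def parse_master_tag(tag: str) -> tuple[int, str]:
    """Split 'master-<epoch>-<sha>' into (epoch, sha)."""
    prefix, epoch, sha = tag.split("-", 2)
    if prefix != "master":
        raise ValueError(f"not a master build tag: {tag}")
    return int(epoch), sha


def sort_by_build_time(tags: list[str]) -> list[str]:
    """Oldest build first, using the epoch embedded in each tag."""
    return sorted(tags, key=lambda tag: parse_master_tag(tag)[0])


# Hypothetical tags for illustration only.
tags = [
    "master-1779600000-bbbbbbb",
    "master-1779500000-ccccccc",
    "master-1779700000-aaaaaaa",
]
ordered = sort_by_build_time(tags)
```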

🐛 Bug Fixes (recap)

🔁 fix(distributed): route per request across loaded replicas + cache probeHealth #9968
🧰 fix(distributed): make admin backend installs resilient and observable #9958
⏳ fix(nodes): make per-node backend install async via gallery job queue #9928
🧾 fix(traces): cap backend trace Data to keep admin UI responsive #9960
🧾 fix(traces): cap captured body size to keep admin Traces UI responsive #9946
🪟 fix(react-ui): unify backend-logs entry point for distributed mode #9949
📎 fix: inject text-file content into chat completions messages #9896
🧮 fix(openai): stream usage non-zero when tools are enabled #9941
🧷 fix(react-ui/chat): stop wiping selection on every /api/operations poll (#9904) #9917
🧱 fix(llama-cpp): terminate tensor_buft_overrides with sentinel #9919
🦾 fix(L4T13 backends): switch vllm/sglang/vllm-omni to PyPI aarch64+cu130 wheels #9950
🧊 fix(nix): correct flake src path and add dev shell #9894
🔐 Fix backend manifest merge signing on current cosign releases #9957
🧯 [utils] Fail immediately on extraction errors #9926

👒 Dependencies

Heavy bump cycle across submodules and Go/Python deps:

ggml-org/llama.cpp: 7 bumps (chore: ⬆️ Update ggml-org/llama.cpp to 87589042cac2c390cec8d68fb2fad64e0a2a252a #9855, chore: ⬆️ Update ggml-org/llama.cpp to 5cbaa5e69e09bde3334cd8c355570553a0dca027 #9876, chore: ⬆️ Update ggml-org/llama.cpp to 67ace021da905e27ecbdf1176b0eef578a5288c0 #9897, chore: ⬆️ Update ggml-org/llama.cpp to ad277572619fcfb6ddd38f4c6437283a4b2b8636 #9915, chore: ⬆️ Update ggml-org/llama.cpp to bb28c1fe246b72276ee1d00ce89306be7b865766 #9934, chore: ⬆️ Update ggml-org/llama.cpp to 1acee6bf8939948f9bcbf4b14034e4b475f06069 #9952, chore: ⬆️ Update ggml-org/llama.cpp to c0c7e147e7efa6c5858754b47259ba4880f8a906 #9963)
ggml-org/whisper.cpp: 4 bumps (chore: ⬆️ Update ggml-org/whisper.cpp to 47b9eb37a33c5031a1b667ace64477330b9f36c1 #9877, chore: ⬆️ Update ggml-org/whisper.cpp to afa2ea544fb4b0448916b4a31ecd33c8685bd482 #9898, chore: ⬆️ Update ggml-org/whisper.cpp to 8443cf05e3fa8ce1b32348e1bcbcf8fc31f7f3ae #9929, chore: ⬆️ Update ggml-org/whisper.cpp to 0ccd896f5b882628e1c077f9769735ef4ce52860 #9954)
ikawrakow/ik_llama.cpp: 7 bumps (chore: ⬆️ Update ikawrakow/ik_llama.cpp to c35189d83c91aad780aba62b89f2830cb2916223 #9866, chore: ⬆️ Update ikawrakow/ik_llama.cpp to 40aae0b6d86d50c0ee7011b3ce59a233203e430a #9875, chore: ⬆️ Update ikawrakow/ik_llama.cpp to 77413bc900f9a2bfd8a5407f184427bcc0825f6c #9899, chore: ⬆️ Update ikawrakow/ik_llama.cpp to 11a1fea9e291f12ce2c803a9d7812c30ca806bcf #9914, chore: ⬆️ Update ikawrakow/ik_llama.cpp to 48a55f74e4c6e2aeda363dd386c1ac9170a0af71 #9930, chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3d39cff8bffbd67296d6badd4076a1486a0715c #9953, chore: ⬆️ Update ikawrakow/ik_llama.cpp to 642c038ccdf3dd08e6d9ac6fdc3b1c311ebd8a02 #9966)
leejet/stable-diffusion.cpp: 4 bumps (chore: ⬆️ Update leejet/stable-diffusion.cpp to 5b0267e941cade15bd80089d89838795d9f4baa6 #9907, chore: ⬆️ Update leejet/stable-diffusion.cpp to 3a8788cb7d74f185d6b18688e9563015524ecaf5 #9933, chore: ⬆️ Update leejet/stable-diffusion.cpp to 0baf721215f45335a5df8caf0ecb34e870c956e7 #9955, chore: ⬆️ Update leejet/stable-diffusion.cpp to a397e03488cc27e1a42da646b82dfce9f50741c0 #9965)
antirez/ds4: 5 bumps (chore: ⬆️ Update antirez/ds4 to c9dd9499bfa57c1bbfbb4446eff963330ab5329b #9864, chore: ⬆️ Update antirez/ds4 to 599e49d253971451f710cb8323344e789906ed6c #9900, chore: ⬆️ Update antirez/ds4 to 2606543be7a8c125a32cee37f5d1d85dc78f2fcf #9909, chore: ⬆️ Update antirez/ds4 to 8d576642c39b9a2d782a80159ba84ef5a81c0b81 #9932, chore: ⬆️ Update antirez/ds4 to 444afce822057d87f14c4dec307dce24fd49b3ee #9964)
ace-step/acestep.cpp: bumped to ed53caf with wrapper API adapted (chore(acestep-cpp): bump pin to ed53caf and adapt wrapper to new API #9908, chore: ⬆️ Update ace-step/acestep.cpp to ed53caf164e4492a5620b2e3f2264629cf66da24 #9913)
Model gallery: checksum refreshes (chore(model-gallery): ⬆️ update checksum #9901, chore(model-gallery): ⬆️ update checksum #9910)
Go modules: alecthomas/kong 1.14→1.15 (chore(deps): bump github.com/alecthomas/kong from 1.14.0 to 1.15.0 #9881), aws/aws-sdk-go-v2 1.41.6→1.41.7 (chore(deps): bump github.com/aws/aws-sdk-go-v2 from 1.41.6 to 1.41.7 #9892), onsi/ginkgo/v2 2.28.2→2.29.0 (chore(deps): bump github.com/onsi/ginkgo/v2 from 2.28.2 to 2.29.0 #9882), golang.org/x/crypto 0.50→0.51 (chore(deps): bump golang.org/x/crypto from 0.50.0 to 0.51.0 #9886)
Python: transformers ≥5.8.1 (chore(deps): update transformers requirement from >=5.8.0 to >=5.8.1 in /backend/python/transformers #9883), sentence-transformers 5.4→5.5 (chore(deps): bump sentence-transformers from 5.4.0 to 5.5.0 in /backend/python/transformers #9888)

📖 Documentation

docs: ⬆️ update docs version mudler/LocalAI #9863
Plus inline docs updates folded into the feature PRs above (prompt-cache explainer, authentication / usage tracking section, backend signing guide).

🙌 New Contributors

@Azteczek made their first contribution in feat: add flake.nix for dockerless setup #9851
@inquam made their first contribution in fix: inject text-file content into chat completions messages #9896
@RinZ27 made their first contribution in [utils] Fail immediately on extraction errors #9926

Enjoy!

Full Changelog: v4.2.6...v4.3.0

This discussion was created from the release v4.3.0.
