v4.3.0 #9971
mudler
announced in
Announcements
🎉 LocalAI 4.3.0 Release! 🚀
LocalAI 4.3.0 is out!
This release hardens the trust boundary and improves defaults for speed. Backend OCI images now ship with keyless cosign signatures and a per-gallery verification: policy, with an opt-in strict mode that fails closed. The llama-cpp server-side prompt cache works by default: repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) collapse from minutes to seconds without touching YAML. Distributed mode gets another round of optimizations. Usage tracking grows a per-API-key + per-user Sources view so admins can finally answer "who is burning the GPU?". And, for everyone on a Jetson/DGX box, the L4T13 (cu130/aarch64) backends are back.
📌 TL;DR
- not_before revocation, opt-in strict mode.
- llama-cpp server-side prompt cache works out of the box. Repeated system prompts go from 5-8 min to seconds.
- probeHealth, async per-node installs with streaming progress, unified backend-logs entry point.
- LOCALAI_TRACING_MAX_BODY_BYTES caps API + backend trace payloads. Admin Traces page stops drowning in 40 MB embeddings.
- flake.nix + dev shell.
- vllm/sglang/vllm-omni L4T13 backends switched to PyPI aarch64+cu130 wheels (torch 2.10 ABI fix).
🚀 New Features & Major Enhancements
🔐 Signed Backends with Keyless Cosign
LocalAI now verifies that backend OCI images came from our CI, not a compromised registry or MITM. This closes a real trust gap: the gallery YAML told LocalAI which image to pull, but nothing checked the bytes.
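As a rough illustration of the policy-side checks this section describes (a not_before cutoff that invalidates older signatures, and a strict mode that turns a missing policy from a warning into a hard failure), here is a minimal Python sketch. It is not LocalAI's implementation, which lives in Go under pkg/oci/cosignverify; the names Policy and check_backend are hypothetical, and the signature is assumed to have already passed cryptographic verification.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Policy:
    # Hypothetical stand-in for a gallery's verification: block.
    not_before: datetime


def check_backend(policy: Optional[Policy], signed_at: datetime, strict: bool) -> str:
    """Return "ok" or "warn", or raise PermissionError on a hard failure.

    Assumption: signed_at is the timestamp of an already-verified signature.
    """
    if policy is None:
        if strict:
            # Strict mode fails closed when no policy is configured.
            raise PermissionError("no verification policy and strict mode is on")
        # Backward-compatible rollout: proceed, but warn.
        return "warn"
    if signed_at < policy.not_before:
        # Advancing not_before revokes every signature that predates it.
        raise PermissionError("signature predates the not_before cutoff")
    return "ok"


cutoff = Policy(not_before=datetime(2026, 1, 1, tzinfo=timezone.utc))
print(check_backend(cutoff, datetime(2026, 2, 1, tzinfo=timezone.utc), strict=True))
```

The point of the sketch is that revocation needs no key management: moving one date in the policy is enough to reject everything signed before it.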
The producer side (.github/workflows/backend_merge.yml) signs every merged backend image (and every per-arch entry under the manifest list) with sigstore/cosign keyless via Fulcio + Rekor, using OCI 1.1 referrers (no legacy :tag.sig). The consumer side (pkg/oci/cosignverify, built on sigstore-go) verifies signatures against a per-gallery verification: policy:
- not_before is the revocation lever: keyless Fulcio certs are ephemeral, so revocation is policy-side. Advance the date in the gallery YAML and every signature predating the cutoff is invalidated.
- --require-backend-integrity (or LOCALAI_REQUIRE_BACKEND_INTEGRITY=true) escalates missing policy / empty SHA256 from warn to hard-fail.
Rollout is backward-compatible: until a gallery ships a verification: block, installs proceed with a warning. The default backend/index.yaml will be populated next, and strict mode is opt-in. See .agents/backend-signing.md for the full producer + consumer story.
⚡ Prompt Cache: On by Default
llama-cpp ships with a server-side prompt cache, but until now LocalAI was not enabling it by default. Repeated system prompts (agents, Claude-Code-style coding assistants, OpenAI-compatible CLIs with long instructions) were re-prefilled on every call. With this release, the same workload collapses to seconds with no specific configuration on your side.
Two changes, one default flip each:
- kv_unified=true by default in grpc-server.cpp. The previous false was silently force-disabling cache_idle_slots at server init (the host prompt cache was being allocated but never written across requests).
- prompt_cache_all defaults to true at the YAML layer, matching upstream llama.cpp's own common.h default. The per-request cache_prompt knob is now on out of the box.
You can still opt out with options: ["kv_unified:false"] or prompt_cache_all: false, and there are new option keys (cache_idle_slots, checkpoint_every_nt) for tuning. Docs in docs/content/advanced/model-configuration.md got a worked example for the repeated-system-prompt workload and a proper explanation of how kv_unified, cache_ram, and cache_idle_slots interact.
📊 Per-API-Key Usage Tracking
Closes #9862. The usage page now answers "who spent these tokens?", not just "how many tokens were spent".
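To make the "who spent these tokens?" question concrete, the sketch below groups usage rows by (source, user_id, user_name), the grouping this section mentions for the Sources view. It is only an illustration in Python: the UsageRecord class, its total_tokens field and the tokens_by_source helper are hypothetical names, not LocalAI's schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class UsageRecord:
    # Hypothetical row shape; only the source values come from the release notes.
    source: str        # "apikey", "web", or "legacy"
    user_id: str
    user_name: str
    total_tokens: int


def tokens_by_source(records: List[UsageRecord]) -> Dict[Tuple[str, str, str], int]:
    """Sum tokens per (source, user_id, user_name) group."""
    totals: Dict[Tuple[str, str, str], int] = defaultdict(int)
    for r in records:
        totals[(r.source, r.user_id, r.user_name)] += r.total_tokens
    return dict(totals)


rows = [
    UsageRecord("web", "u1", "alice", 120),
    UsageRecord("web", "u2", "bob", 30),
    UsageRecord("apikey", "u1", "alice", 500),
    UsageRecord("web", "u1", "alice", 80),
]
for key, total in sorted(tokens_by_source(rows).items()):
    print(key, total)
```

Keying on the user as well as the source is what keeps Web UI traffic from collapsing into a single shared row.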
- usage_records gained Source (apikey/web/legacy), APIKeyID, APIKeyName, plus an idempotent backfill of pre-feature rows on InitDB.
- *UserAPIKey and the request source through the Echo context. Usage middleware snapshots the key id + name, so revoked keys stay readable in history (rendered as (revoked)).
- GET /api/auth/usage/sources (self, no legacy) and GET /api/auth/admin/usage/sources (admin, with user_id/api_key_id filters, 200-key truncation).
- (source, user_id, user_name) so Web UI session traffic is split per user instead of lumped into one global "Web UI" row, and every named-key row shows the owning account.
Docs: features/authentication.md gained a full Usage Tracking section with the new tab, endpoints, response shape and migration notes.
🛰️ Distributed Mode v3
Distributed mode keeps hardening. This release fixes the two things that bit operators hardest in practice and lays the groundwork for the next round of UX.
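The replica ordering that this release makes effective on every request (in_flight ASC, last_used ASC, available_vram DESC) can be sketched as a single sort key. This is a Python illustration only: the real picker is Go code (PickBestReplica), and the Replica class and pick_best_replica function here are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    # Hypothetical fields mirroring the three ordering columns.
    node_id: str
    in_flight: int        # requests currently being served
    last_used: float      # unix timestamp of the last routed request
    available_vram: int   # free VRAM in bytes


def pick_best_replica(replicas: List[Replica]) -> Replica:
    """Pick by in_flight ASC, then last_used ASC, then available_vram DESC."""
    if not replicas:
        raise ValueError("no loaded replicas to route to")
    # Negating available_vram turns the ascending min() into a descending tie-break.
    return min(replicas, key=lambda r: (r.in_flight, r.last_used, -r.available_vram))


replicas = [
    Replica("node-a", in_flight=1, last_used=10.0, available_vram=100),
    Replica("node-b", in_flight=0, last_used=20.0, available_vram=50),
    Replica("node-c", in_flight=0, last_used=5.0, available_vram=10),
]
print(pick_best_replica(replicas).node_id)
```

Because last_used advances each time a replica is chosen, re-running the pick per request rotates across idle replicas; caching the first result is what pinned traffic to one node.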
Per-request routing across replicas (#9968) restores cross-node load balancing. The bug: ModelLoader.Load cached a *Model whose embedded InFlightTrackingClient was bound to a single (nodeID, replicaIndex). After the first request, every subsequent call reused that wrapper and pinned to whichever node won the first pick, even after the reconciler scaled the model out. The report included a reproducer.
Now SmartRouter.Route runs per request, the existing in_flight ASC, last_used ASC, available_vram DESC round-robin actually fires, and the replica-selection rule lives in one place (PickBestReplica) with a mirror spec asserting the SQL ORDER BY and the Go picker agree on a seeded dataset. probeHealth is now memoized per (nodeID, addr) with a 30s TTL and singleflight coalescing, so a burst of new requests doesn't stall on a HealthCheck that llama.cpp serializes against in-flight Predict.
Async per-node installs via the gallery job queue (#9928).
POST /api/nodes/:id/backends/install used to block the request for up to 3 minutes while the worker pulled the image, freezing the React UI's Backends picker. It now returns HTTP 202 + jobID immediately, scoped to a one-element targetNodeIDs allowlist, with a node-scoped opcache row so concurrent installs on different nodes don't collide. The Operations panel surfaces a nodeID field for attribution.
Resilient backend installs with streaming progress (#9958). Two phases:
- LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT env vars (default 15m, previously hardcoded 3m). A NATS round-trip timeout while the worker is still pulling no longer reports as a hard failure: per-node status becomes running_on_worker, the queue row stays alive without bumping Attempts, and ListBackends proactively clears install rows whose intent is satisfied (so the UI updates instantly instead of waiting up to 15m for the next reconciler tick).
- BackendInstallProgressEvent values on a transient nodes.<nodeID>.backend.install.<opID>.progress subject. The master subscribes for the duration of the request and forwards each event into OpStatus.UpdateStatus, so the admin UI gets per-byte progress for distributed installs the same way local-mode does, with no UI changes. Backward compatible: old workers stay silent, new masters tolerate silence.
Unified backend-logs entry point (#9949).
/app/backend-logs/:modelId is now a single, mode-aware route. In standalone it's the local WebSocket view, unchanged. In distributed it probes nodesApi.getModels, filters by model_name, then routes: 0 hits → empty state with a link to Nodes; 1 hit → <Navigate replace> to the per-node logs URL preserving the ?from= deep-link timestamp; N hits → a picker listing each hosting worker with node id, replica index and load state. Every view that links to backend logs now points at the same URL.
Bug-hunt harness. A new distributed test harness landed in tests/distributed/ to catch the kind of regressions the #9968 reproducer surfaced.
🩺 Admin Traces UI: Stays Responsive Under Load
Two complementary caps fix the symptom where the admin Traces page sat in "loading" forever on a chatty agent-pool RAG deployment.
- LOCALAI_TRACING_MAX_BODY_BYTES (default 64 KiB) caps each captured request/response body in the trace middleware. The full payload still flows to the real client; only the trace copy is bounded. body_truncated + original body_bytes are recorded so the dashboard can surface that truncation happened. Observed before the fix on a live deployment: /api/traces returned 44.6 MB (466 traces, 447 /embeddings, top body 1.38 MB). The Traces UI Clear button is also kept enabled during loading, which is exactly when you need it.
- RecordBackendTrace walks the Data map and replaces any string value larger than the cap with <truncated: N bytes>. Producers (core/backend/llm.go, core/trace/audio_snippet.go) apply head-preserving truncation upstream so the UI still shows useful leading content. TTS / audio_transform traces drop the base64 snippet when the encoded blob exceeds the cap (truncated base64 is undecodable; the React WaveformPlayer already no-ops without it).
Both knobs are live-tunable from the Traces settings panel.
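The two caps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Go implementation: cap_body and cap_trace_data are hypothetical names, only top-level string values are handled, and sizes are measured in UTF-8 bytes.

```python
from typing import Any, Dict


def cap_body(body: bytes, max_bytes: int) -> Dict[str, Any]:
    """Bound the trace copy of a body and record that truncation happened."""
    return {
        "body": body[:max_bytes],
        "body_truncated": len(body) > max_bytes,
        "body_bytes": len(body),  # original size, kept for the dashboard
    }


def cap_trace_data(data: Dict[str, Any], max_bytes: int) -> Dict[str, Any]:
    """Replace oversized string values with a <truncated: N bytes> marker."""
    capped: Dict[str, Any] = {}
    for key, value in data.items():
        if isinstance(value, str):
            size = len(value.encode("utf-8"))
            if size > max_bytes:
                value = f"<truncated: {size} bytes>"
        capped[key] = value
    return capped


print(cap_body(b"abcdef", 4))
print(cap_trace_data({"prompt": "x" * 10, "tokens": 5, "model": "ok"}, 4))
```

Both functions return new values and leave their inputs alone, which matches the note that the full payload still reaches the real client and only the trace copy is bounded.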
🧊 Nix Flake for NixOS Users
New flake.nix + flake.lock ship a reproducible, dockerless setup for NixOS, plus a dev shell for hacking on LocalAI without a container.
🦾 Jetson Thor (L4T13) Backends Restored
The cuda13-nvidia-l4t-arm64-vllm / sglang / vllm-omni backends crashed at import with an undefined c10::MessageLogger symbol after the pypi.jetson-ai-lab.io/sbsa/cu130 mirror started shipping torch 2.11 next to vllm/sglang wheels built against torch 2.10. Per the PyTorch April 2026 announcement, all three backends now pull from PyPI's official aarch64 + cu130 wheels instead, with the L4T13 pyproject.toml retired in favor of the standard requirements-${profile}.txt pattern used everywhere else.
📎 Chat: File Attachments + Stream Usage + Selection
Three independent fixes that together make the chat experience visibly better:
- .txt, .md, .csv, .json content was silently dropped in useChat.js (only image_url and audio_url branches added content; the else branch only pushed metadata). Home.jsx also never called file.text() for files attached from the home screen. Both fixed. PDF files still need a parser (PDF.js or server-side extraction) and are flagged as a known limitation.
- include_usage returns non-zero with tools (#9941). Fixes #9927. processTools discarded the cumulative TokenUsage from ComputeChoices, so the streaming trailer reported {0, 0, 0} whenever a tools array was present. The fix forwards the authoritative final usage via a sentinel chunk before close(responses), with the outer loop updated to capture before the empty-Choices skip. The OpenAI streaming spec contract is preserved (intermediate chunks still carry no usage).
- lastHtml === nextHtml short-circuit in its DOM diff, so the 1s /api/operations poll re-assigning setOperations with a fresh array reference was collapsing text selection on every assistant message. Now JSON-compared and short-circuited. Bonus: the per-message copy button works over plain HTTP via a hidden-textarea + execCommand('copy') fallback when navigator.clipboard is unavailable.
🔧 llama.cpp Stability + Refactors
- tensor_buft_overrides sentinel terminator (#9919). Mirrors upstream common/arg.cpp:645-658: pad placeholders at the end of the main vector so back().pattern == nullptr holds, and append a single {nullptr, nullptr} to the draft vector when non-empty.
- refactor(agents): bump skillserver, drop redundant Name (#9916). list_skills and search_skills now return the same shape (only id, no duplicated name). Adds a Ginkgo regression that drives the LocalAI FilesystemManager through an in-process MCP session.
🛠️ CI & Image Plumbing
- master-<epoch>-<sha> so they sort by build time. The pre-existing master tag still moves with HEAD.
- COSIGN_EXPERIMENTAL=1 is set for the oci-1-1 referrers mode in the backend-signing job to keep current cosign versions happy.
🐛 Bug Fixes (recap)
- fix(distributed): route per request across loaded replicas + cache probeHealth #9968
- fix(distributed): make admin backend installs resilient and observable #9958
- fix(nodes): make per-node backend install async via gallery job queue #9928
- fix(traces): cap backend trace Data to keep admin UI responsive #9960
- fix(traces): cap captured body size to keep admin Traces UI responsive #9946
- fix(react-ui): unify backend-logs entry point for distributed mode #9949
- fix: inject text-file content into chat completions messages #9896
- fix(openai): stream usage non-zero when tools are enabled #9941
- fix(react-ui/chat): stop wiping selection on every /api/operations poll (#9904) #9917
- fix(llama-cpp): terminate tensor_buft_overrides with sentinel #9919
- fix(L4T13 backends): switch vllm/sglang/vllm-omni to PyPI aarch64+cu130 wheels #9950
- fix(nix): correct flake src path and add dev shell #9894
- Fix backend manifest merge signing on current cosign releases #9957
- [utils] Fail immediately on extraction errors #9926
👒 Dependencies
Heavy bump cycle across submodules and Go/Python deps:
- ggml-org/llama.cpp: 7 bumps (87589042cac2c390cec8d68fb2fad64e0a2a252a #9855, 5cbaa5e69e09bde3334cd8c355570553a0dca027 #9876, 67ace021da905e27ecbdf1176b0eef578a5288c0 #9897, ad277572619fcfb6ddd38f4c6437283a4b2b8636 #9915, bb28c1fe246b72276ee1d00ce89306be7b865766 #9934, 1acee6bf8939948f9bcbf4b14034e4b475f06069 #9952, c0c7e147e7efa6c5858754b47259ba4880f8a906 #9963)
- ggml-org/whisper.cpp: 4 bumps (47b9eb37a33c5031a1b667ace64477330b9f36c1 #9877, afa2ea544fb4b0448916b4a31ecd33c8685bd482 #9898, 8443cf05e3fa8ce1b32348e1bcbcf8fc31f7f3ae #9929, 0ccd896f5b882628e1c077f9769735ef4ce52860 #9954)
- ikawrakow/ik_llama.cpp: 7 bumps (c35189d83c91aad780aba62b89f2830cb2916223 #9866, 40aae0b6d86d50c0ee7011b3ce59a233203e430a #9875, 77413bc900f9a2bfd8a5407f184427bcc0825f6c #9899, 11a1fea9e291f12ce2c803a9d7812c30ca806bcf #9914, 48a55f74e4c6e2aeda363dd386c1ac9170a0af71 #9930, b3d39cff8bffbd67296d6badd4076a1486a0715c #9953, 642c038ccdf3dd08e6d9ac6fdc3b1c311ebd8a02 #9966)
- leejet/stable-diffusion.cpp: 4 bumps (5b0267e941cade15bd80089d89838795d9f4baa6 #9907, 3a8788cb7d74f185d6b18688e9563015524ecaf5 #9933, 0baf721215f45335a5df8caf0ecb34e870c956e7 #9955, a397e03488cc27e1a42da646b82dfce9f50741c0 #9965)
- antirez/ds4: 5 bumps (c9dd9499bfa57c1bbfbb4446eff963330ab5329b #9864, 599e49d253971451f710cb8323344e789906ed6c #9900, 2606543be7a8c125a32cee37f5d1d85dc78f2fcf #9909, 8d576642c39b9a2d782a80159ba84ef5a81c0b81 #9932, 444afce822057d87f14c4dec307dce24fd49b3ee #9964)
- ace-step/acestep.cpp: bumped to ed53caf with wrapper API adapted (#9908, ed53caf164e4492a5620b2e3f2264629cf66da24 #9913)
- alecthomas/kong 1.14→1.15 (#9881), aws/aws-sdk-go-v2 1.41.6→1.41.7 (#9892), onsi/ginkgo/v2 2.28.2→2.29.0 (#9882), golang.org/x/crypto 0.50→0.51 (#9886)
- transformers ≥5.8.1 (#9883), sentence-transformers 5.4→5.5 (#9888)
📖 Documentation
- docs: ⬆️ update docs version mudler/LocalAI #9863
🙌 New Contributors
Enjoy!
Full Changelog: v4.2.6...v4.3.0
This discussion was created from the release v4.3.0.