ci(linux): build fat package with GGML_BACKEND_DL + GGML_CPU_ALL_VARIANTS by Geramy · Pull Request #5 · lemonade-sdk/stable-diffusion.cpp

Geramy · 2026-05-06T20:10:54Z

What

Switch the Linux x86_64 `ubuntu-latest-cmake` (CPU) and `ubuntu-latest-rocm` (HIP) builds from the portable AVX2/FMA/F16C baseline introduced in #3 to a fat-package build:

```
-DGGML_NATIVE=OFF
-DGGML_BACKEND_DL=ON
-DGGML_CPU_ALL_VARIANTS=ON
```

The build now produces:

- `libstable-diffusion.so`: main shared library
- `libggml-cpu-sandybridge.so`: AVX
- `libggml-cpu-haswell.so`: AVX2 + FMA + F16C
- `libggml-cpu-skylakex.so`: haswell + AVX-512F
- `libggml-cpu-icelake.so`: skylakex + AVX-512 VNNI
- `libggml-cpu-alderlake.so`: haswell + AVX-VNNI (no AVX-512)
- `libggml-cpu-x64.so`: baseline x86-64 fallback (no AVX)

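At runtime, `ggml_backend_load_all_from_path` (already wired up by upstream PR leejet#1448) dlopens each variant, checks which CPU features the host supports, and loads the highest-tier variant that matches. The same zip works on an old Sandy Bridge laptop and on a current AVX-512 server, without the runner-of-the-day lottery that crashed master-593: with `-march=native`, the ISA baseline depended on whichever CI runner happened to build the release, and an AVX-512 runner produced binaries that SIGILL on hosts without AVX-512.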
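To make the selection rule concrete, here is a minimal Python model of it. It assumes each variant scores 0 when the host lacks any feature the variant was compiled for, and otherwise scores higher the more capable it is; this is loosely modeled on how ggml ranks CPU variants, but the weights and the names `WEIGHTS`, `VARIANTS`, and `pick_variant` are hypothetical. It treats alderlake as haswell plus AVX-VNNI, with no AVX-512.

```python
from typing import AbstractSet, Dict, FrozenSet

# Hypothetical per-feature weights. Each feature gets a distinct power of two,
# so two different feature sets can never produce the same total score.
WEIGHTS: Dict[str, int] = {
    "FMA": 1, "F16C": 2, "AVX": 4, "AVX2": 8,
    "AVX_VNNI": 16, "AVX512F": 32, "AVX512_VNNI": 64,
}

HASWELL: FrozenSet[str] = frozenset({"AVX", "AVX2", "FMA", "F16C"})

# Features each variant .so is assumed to require.
VARIANTS: Dict[str, FrozenSet[str]] = {
    "x64": frozenset(),
    "sandybridge": frozenset({"AVX"}),
    "haswell": HASWELL,
    "skylakex": HASWELL | {"AVX512F"},
    "icelake": HASWELL | {"AVX512F", "AVX512_VNNI"},
    "alderlake": HASWELL | {"AVX_VNNI"},
}


def score(required: AbstractSet[str], host: AbstractSet[str]) -> int:
    """Return 0 if the host lacks a required feature, else 1 + sum of weights."""
    if not required <= host:
        return 0
    return 1 + sum(WEIGHTS[f] for f in required)


def pick_variant(host: AbstractSet[str]) -> str:
    """Return the highest-scoring variant the host can run.

    x64 requires nothing, so it always scores at least 1 and acts as the
    fallback when no SIMD variant is eligible.
    """
    best_name, best_score = "x64", 0
    for name, required in VARIANTS.items():
        s = score(required, host)
        if s > best_score:
            best_name, best_score = name, s
    return best_name
```

For example, `pick_variant({"AVX", "AVX2", "FMA", "F16C"})` selects haswell, while a host that also reports AVX512F and AVX512_VNNI selects icelake. Because alderlake requires AVX-VNNI, an AVX-512 host without AVX-VNNI can never be handed that build.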

Tradeoff

Linux x86_64 zip: ~12 MB → ~50–80 MB. Acceptable IMO — Lemonade and similar consumers cache the extracted dir across model loads, so this is paid once at install. In exchange you stop choosing between portability and AVX-512 perf — you get both.

Why not Windows / macOS

- Windows: the AVX2 build already pins `-DGGML_NATIVE=OFF -DGGML_AVX2=ON`. It could get the same fat-package treatment for symmetry, but that's a separate change and isn't required to fix the consumer-side AVX-512 SIGILL pattern.
- macOS arm64: all Apple Silicon parts (M1 and later) share a NEON + DOTPROD baseline; i8mm and bf16 only arrive with M2, so `-march=native` stays portable as long as the macOS build runner is M1-class. Current Metal flags are already optimal.

Verification plan

1. Trigger a `workflow_dispatch` release on this branch (`create_release: true`).
2. Inspect the resulting `sd-{hash}-bin-Linux-Ubuntu-24.04-x86_64.zip`: it should contain `libstable-diffusion.so` plus several `libggml-cpu-*.so` files.
3. On an AVX2 host without AVX-512: `./sd-server -m model.safetensors` loads the haswell variant (or alderlake, if the CPU has AVX-VNNI), with no SIGILL.
4. On an AVX-512 host: the same command loads the skylakex or icelake variant, and perf should match a natively compiled build.
5. Bump the `sd-cpp` pins in lemonade-sdk/lemonade PR #1777 to the new tag and confirm `Test ollama (ubuntu-latest)` passes.
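The zip inspection in the plan above can be scripted. This is a minimal Python sketch, assuming the release zip is readable with the standard `zipfile` module and that the libraries may sit under a top-level directory inside the archive; the helper `cpu_variants_in_zip` and the script name are hypothetical, not part of the repo's CI.

```python
import posixpath
import re
import zipfile
from typing import BinaryIO, List, Union

# Matches e.g. "libggml-cpu-haswell.so" and captures "haswell".
VARIANT_RE = re.compile(r"libggml-cpu-(.+)\.so")


def cpu_variants_in_zip(zip_source: Union[str, BinaryIO]) -> List[str]:
    """Return the sorted CPU variant names found in a release zip.

    Raises ValueError if libstable-diffusion.so is missing or if no
    libggml-cpu-*.so variant libraries are present.
    """
    with zipfile.ZipFile(zip_source) as zf:
        # Compare basenames so the check works whether or not the archive
        # nests its files under a directory.
        names = {posixpath.basename(n) for n in zf.namelist()}
    if "libstable-diffusion.so" not in names:
        raise ValueError("libstable-diffusion.so not found in archive")
    variants = sorted(m.group(1) for m in map(VARIANT_RE.fullmatch, names) if m)
    if not variants:
        raise ValueError("no libggml-cpu-*.so variants found in archive")
    return variants


if __name__ == "__main__":
    import sys

    # Usage: python check_fat_zip.py sd-<hash>-bin-Linux-Ubuntu-24.04-x86_64.zip
    print(", ".join(cpu_variants_in_zip(sys.argv[1])))
```

Failing loudly on a missing variant set means a CI run that silently fell back to a single-ISA build would be caught before the tag is consumed downstream.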

Lineage

Replaces #3's portable AVX2 baseline. #3 fixed the SIGILL but left AVX-512-class hosts running AVX2 code; this gets full perf on those hosts.

Replace the AVX2/FMA/F16C portable baseline (#3) with a fat-package build that produces one libstable-diffusion.so plus one libggml-cpu-*.so per CPU variant: sandybridge, haswell, skylakex (AVX-512F), icelake (AVX-512 + VNNI), alderlake (AVX-VNNI, no AVX-512), and a baseline x86-64 fallback. At runtime ggml dlopens the variants and picks the highest-tier one the host CPU supports. AVX-512 hosts get AVX-512 perf and older machines fall back gracefully: no -march=native runner lottery, no SIGILL.

Tradeoff: the zip grows from ~12 MB to ~50–80 MB. That is acceptable for a one-time download, especially since downstream consumers (Lemonade) cache the extracted directory across model loads.

Applied to ubuntu-latest-cmake (CPU) and ubuntu-latest-rocm (HIP), since the HIPBLAS build still uses ggml CPU ops for parts of the pipeline. Windows AVX2 already pins GGML_NATIVE=OFF + AVX2 only, and all Apple Silicon generations share a NEON + DOTPROD baseline on macOS arm64 (i8mm and bf16 arrive with M2), so neither needs the same treatment as long as the macOS runner is M1-class.

Upstream PR leejet#1448 (commit b8079e2) already wired the runtime backend discovery code into libstable-diffusion.so; this just enables the build flags that produce the variant .so files.


Copilot

Pull request overview

Updates the Linux x86_64 CI build configuration to produce a “fat” CPU package by enabling ggml’s dynamic backend loading and building all CPU ISA variants, improving portability (avoiding SIGILL on non-AVX-512 hosts) while retaining high performance on newer CPUs.

Changes:

- Switch ubuntu-latest-cmake from a fixed AVX2/FMA/F16C baseline to GGML_BACKEND_DL=ON + GGML_CPU_ALL_VARIANTS=ON (with GGML_NATIVE=OFF).
- Apply the same fat-package approach to the ubuntu-latest-rocm (HIPBLAS) build.


+          # x64). At runtime ggml dlopens whichever variant is highest-priority on
+          # the host CPU, so an AVX-512 host gets AVX-512 perf and an AVX-512-less
+          # host falls back to haswell — same zip, no -march=native runner
+          # lottery, no SIGILL.


kenvandine pushed a commit that referenced this pull request Jun 1, 2026

Merge pull request #5 from Phqen1x/fix-cuda-names-final

10a7ba7

ci: standardize CUDA artifact names and compression formats

kenvandine requested a review from Copilot June 1, 2026 12:44

Copilot started reviewing on behalf of kenvandine June 1, 2026 12:45

Copilot AI reviewed Jun 1, 2026

Geramy wants to merge 1 commit into lemonade from geramy/cpu-all-variants
