feat: stream model conversion by shikaku2 · Pull Request #1581 · leejet/stable-diffusion.cpp

shikaku2 · 2026-05-29T19:19:39Z

Split out from draft PR #1573.

Summary

Changes --convert to stream converted tensors instead of allocating the entire converted model in one ggml_context before writing the output file.

This PR intentionally only covers the regular conversion memory/threading path. RMSE-guided conversion is not included here and will be handled separately after this is reviewed.

What changed

Collect output tensor metadata first without loading tensor data.
Write GGUF or safetensors metadata/header up front.
Load, convert, and write tensors in batches instead of keeping every converted tensor resident until the end.
Parallelize tensor loading/conversion within each batch.
Cap each batch by output tensor bytes, so large tensors still stream with bounded peak memory while smaller tensors can use available CPU threads.
Reuse the existing convert(input_path, vae_path, output_path, output_type, tensor_type_rules, convert_name) API and CLI behavior.
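The byte-capped batching described in the list above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `TensorMeta`, `plan_batches`, and `cap_bytes` are hypothetical names, and the greedy grouping of consecutive tensors is an assumption about how such a cap could be applied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical metadata record; the PR's real type is not shown in this thread.
struct TensorMeta {
    uint64_t out_bytes;  // size of the converted tensor in the output file
};

// Greedily group consecutive tensors so that each batch holds at most
// `cap_bytes` of converted output. A tensor larger than the cap gets a
// batch of its own, so it still streams through by itself.
std::vector<std::vector<size_t>> plan_batches(const std::vector<TensorMeta>& tensors,
                                              uint64_t cap_bytes) {
    assert(cap_bytes > 0);
    std::vector<std::vector<size_t>> batches;
    std::vector<size_t> current;
    uint64_t current_bytes = 0;
    for (size_t i = 0; i < tensors.size(); ++i) {
        const uint64_t sz = tensors[i].out_bytes;
        // Close the current batch if adding this tensor would exceed the cap.
        if (!current.empty() && current_bytes + sz > cap_bytes) {
            batches.push_back(current);
            current.clear();
            current_bytes = 0;
        }
        current.push_back(i);
        current_bytes += sz;
    }
    if (!current.empty()) {
        batches.push_back(current);
    }
    return batches;
}
```

Each batch would then be loaded and converted in parallel and written before the next batch starts, which is the stop-and-go behavior the review below questions.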

What is not included

No RMSE option or RMSE type selection.
No AIO/separate text encoder/diffusion/VAE packaging changes.
No --lazy-load runtime behavior changes.

Validation

cmake --build build -j16
git diff --check
Tiny safetensors -> GGUF conversion: build/bin/sd-cli -M convert -m /tmp/sdcpp-convert-tiny.safetensors -o /tmp/sdcpp-convert-tiny-final.gguf --type f16
Full SD3.5 Medium conversion: time build/bin/sd-cli -M convert -m /home/aaron/models/sd3.5-medium/sd3.5_medium.safetensors -o /tmp/sd3.5_medium_streaming_convert.gguf
- Output: /tmp/sd3.5_medium_streaming_convert.gguf, 4.8G
- Completed successfully in about 3.7s wall time on my machine

Notes

This is a draft because the new streaming writer path should get review and broader testing across output formats and platforms before being marked ready.

wbruna · 2026-05-30

First, about the coding style: this is placing format-specific logic inside convert.cpp. The format-specific code should go to the appropriate files inside model_io/, likely with a separate "write tensor" per file type. Note you should also avoid opening and closing the model files for each tensor, so some kind of "opened model file" abstraction will probably be needed. A "read the tensor at the specified offset" abstraction would probably make sense, too.

I tried this on a .safetensors -> Q4_K .gguf conversion. On my machine, it was never able to saturate all CPU cores, so it ran much slower than the normal conversion (around 1/2 to 1/3 of the speed). I/O didn't seem to be the bottleneck: system and wait times remained low.

Looking at the code, my guess would be the batching calculation: it would explain this behavior if for some reason it consistently used only 1 or 2 threads (the number of threads should also respect the --threads parameter by the way). The batching division also looks sub-optimal: you split up work between threads, then stop everything, write everything, then open threads again. So you are not allowing an overlap between the conversion and the writing; plus, a thread could finish much sooner than the others, and would stay idle until the next batch.

I would avoid the fixed batching, and use a true pipeline instead: either n read+convert threads + 1 write thread, or n read+convert+write threads, controlling for the memory budget with a condition variable. I would bet on the second option: if writing is the bottleneck, you'd naturally parallelize it as well.
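A rough sketch of the second option (n read+convert+write workers gated by a condition variable) is shown here. All names (`MemoryBudget`, `run_pipeline`, `process`) are hypothetical, and `process(i)` merely stands in for reading, converting, and writing tensor `i`; none of this is taken from the PR.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Tracks how many bytes of tensor data are in flight across all workers.
class MemoryBudget {
public:
    explicit MemoryBudget(uint64_t limit) : limit_(limit) {}

    // Block until `bytes` fits. A request larger than the whole budget is
    // admitted once nothing else is in flight, so it cannot deadlock.
    void acquire(uint64_t bytes) {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [&] { return used_ == 0 || used_ + bytes <= limit_; });
        used_ += bytes;
        peak_ = std::max(peak_, used_);
    }

    void release(uint64_t bytes) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            assert(used_ >= bytes);
            used_ -= bytes;
        }
        cv_.notify_all();
    }

    uint64_t peak() const {
        std::lock_guard<std::mutex> lock(mu_);
        return peak_;
    }

private:
    mutable std::mutex mu_;
    std::condition_variable cv_;
    uint64_t limit_;
    uint64_t used_ = 0;
    uint64_t peak_ = 0;
};

// Each worker pulls the next tensor index, waits for budget, processes the
// tensor end to end, then frees its share. There is no barrier between
// tensors, so a worker that finishes early immediately picks up more work.
template <typename Fn>
void run_pipeline(const std::vector<uint64_t>& tensor_bytes, int n_threads,
                  MemoryBudget& budget, Fn process) {
    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                const size_t i = next.fetch_add(1);
                if (i >= tensor_bytes.size()) return;
                budget.acquire(tensor_bytes[i]);
                process(i);
                budget.release(tensor_bytes[i]);
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

In this sketch `n_threads` would come from the `--threads` parameter, and the budget bounds peak memory independently of how many workers are running.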

Note you are not forced to write sequentially, either: you have offsets for each tensor, so they could be written as soon as they are ready, with each thread using its own open file object (I'd recommend preallocating the file at the beginning, to give the filesystem a better chance to avoid fragmentation issues). An out-of-order approach could also help with models with huge tensors, since you can try to overlap them with smaller ones.
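The out-of-order idea can be illustrated with standard streams alone. In this sketch, `Chunk`, `preallocate`, `write_at`, and `write_out_of_order` are hypothetical names, and the preallocation simply extends the file by writing its last byte; that sets the file length but may leave a sparse file, so it is only a portable stand-in for a real allocation call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// A converted tensor's bytes plus the absolute file offset they belong at.
struct Chunk {
    uint64_t offset;
    std::vector<char> data;
};

// Create (or truncate) the file and extend it to its final length.
bool preallocate(const std::string& path, uint64_t total_bytes) {
    std::ofstream f(path, std::ios::binary | std::ios::trunc);
    if (!f) return false;
    if (total_bytes > 0) {
        f.seekp(static_cast<std::streamoff>(total_bytes - 1));
        f.put('\0');
    }
    f.flush();
    return static_cast<bool>(f);
}

// Write one chunk at its own offset through the caller's private stream.
bool write_at(std::fstream& f, const Chunk& c) {
    f.seekp(static_cast<std::streamoff>(c.offset));
    f.write(c.data.data(), static_cast<std::streamsize>(c.data.size()));
    return static_cast<bool>(f);
}

// Every worker opens its own handle (in|out, so nothing is truncated) and
// writes its share of chunks in whatever order they come; the chunks must
// cover disjoint byte ranges.
bool write_out_of_order(const std::string& path, const std::vector<Chunk>& chunks,
                        int n_threads) {
    assert(n_threads > 0);
    const size_t n = static_cast<size_t>(n_threads);
    std::vector<char> ok(n, 1);  // one status slot per worker, no sharing
    std::vector<std::thread> workers;
    for (size_t t = 0; t < n; ++t) {
        workers.emplace_back([&, t] {
            std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
            if (!f) { ok[t] = 0; return; }
            for (size_t i = t; i < chunks.size(); i += n) {
                if (!write_at(f, chunks[i])) { ok[t] = 0; return; }
            }
            f.flush();
            if (!f) ok[t] = 0;
        });
    }
    for (auto& w : workers) w.join();
    for (char v : ok) {
        if (!v) return false;
    }
    return true;
}
```

Because the layout is fixed by the offsets, the resulting file does not depend on which worker finishes first.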

shikaku2 · 2026-05-31T18:26:49Z

Added a follow-up commit (504d5f8) for the review feedback:

moved streaming output format logic into model_io/ via GGUF/safetensors streaming writer classes
replaced fixed batches with a memory-budgeted worker pipeline
added per-worker open output handles and output preallocation
wired convert mode to respect --threads while preserving the existing convert() API

I also benchmarked against a fresh master clone at be65ac7, both built RelWithDebInfo with Vulkan enabled, converting safetensors to Q4_K GGUF with --threads 16.

| Model | Build | Wall | CPU | Max RSS | Internal timing |
|---|---|---|---|---|---|
| SD3.5 Medium | master `be65ac7` | `13.08s` | `1405%` | `3.39 GiB` | load/convert `12.40s` |
| SD3.5 Medium | streaming `504d5f8` | `12.81s` | `1437%` | `2.22 GiB` | streaming convert `12.60s` |
| SD3.5 Large | master `be65ac7` | `20.17s` | `909%` | `14.66 GiB` | load/convert `12.88s` |
| SD3.5 Large | streaming `504d5f8` | `13.10s` | `1345%` | `2.63 GiB` | streaming convert `12.88s` |

So the revised pipeline is roughly neutral on SD3.5 Medium (13.08s vs. 12.81s) and takes about 35% less wall-clock time on SD3.5 Large (20.17s vs. 13.10s) in this environment, with substantially lower peak RSS in both cases (3.39 GiB to 2.22 GiB, and 14.66 GiB to 2.63 GiB).

wbruna · 2026-06-04T15:40:43Z

I gave 504d5f8 a try. Looks like it's performing much better now.

The job division still has a few code smells, though. First, you are using separate I/O backends for reading (ifstream) and writing (FILE). If there is a reason to use FILE for writing, it needs to be very clear in comments; otherwise, just use an ofstream. Also, the whole "keep a table of n writers" logic is duplicated between safetensors and gguf.

A better approach could be splitting the writing logic from the model format. Say, an interface similar to:

write_metadata(empty output file, tensor metadata)
write_tensor(output file, tensor, offset)

then you'd have:

gguf_write_metadata(empty output file, tensor metadata)
gguf_write_tensor(output file, tensor, offset)
safetensors_write_metadata(empty output file, tensor metadata)
safetensors_write_tensor(output file, tensor, offset)

(note this could either be a small class hierarchy, or two sets of callbacks using the same signatures. I believe the hierarchy approach would be cleaner; it'd also avoid the need for that function template)
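The hierarchy variant could look roughly like the sketch below. Everything in it is hypothetical: `StreamingWriter`, `TensorInfo`, and `write_model` are illustrative names, and `ToyTextHeaderWriter` uses a made-up text header instead of the real GGUF or safetensors layouts, which are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <ostream>
#include <string>
#include <vector>

struct TensorInfo {
    std::string name;
    uint64_t offset;  // relative to the start of the data section
    uint64_t nbytes;
};

// Format-specific behavior lives behind this interface.
class StreamingWriter {
public:
    virtual ~StreamingWriter() = default;

    // Write the header to an empty output; return where tensor data starts.
    virtual uint64_t write_metadata(std::ostream& out,
                                    const std::vector<TensorInfo>& tensors) = 0;

    // Default: raw bytes at data_start + offset. A format may override this.
    virtual bool write_tensor(std::ostream& out, uint64_t data_start,
                              const TensorInfo& info, const char* data) {
        out.seekp(static_cast<std::streamoff>(data_start + info.offset));
        out.write(data, static_cast<std::streamsize>(info.nbytes));
        return static_cast<bool>(out);
    }
};

// Toy format for illustration only: one "name offset nbytes" line per
// tensor, then an empty line, then the data section.
class ToyTextHeaderWriter : public StreamingWriter {
public:
    uint64_t write_metadata(std::ostream& out,
                            const std::vector<TensorInfo>& tensors) override {
        for (const auto& t : tensors) {
            out << t.name << ' ' << t.offset << ' ' << t.nbytes << '\n';
        }
        out << '\n';
        return static_cast<uint64_t>(static_cast<std::streamoff>(out.tellp()));
    }
};

// Format-agnostic driver: it only sees the base class.
bool write_model(StreamingWriter& writer, const std::string& path,
                 const std::vector<TensorInfo>& tensors,
                 const std::vector<std::string>& payloads) {
    assert(tensors.size() == payloads.size());
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    const uint64_t data_start = writer.write_metadata(out, tensors);
    // Reverse order on purpose: offsets, not call order, decide the layout.
    for (size_t i = tensors.size(); i-- > 0;) {
        if (!writer.write_tensor(out, data_start, tensors[i], payloads[i].data())) {
            return false;
        }
    }
    out.flush();
    return static_cast<bool>(out);
}
```

With this shape, the threading and file-handle logic would only ever call the two virtual functions, so it would not need to be duplicated per format.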

Then you can decouple the whole multithread-writing logic from the output formats:

first, get either a base class pointer or appropriate callback for the requested format
open a single output stream, and use it to call the appropriate 'write_metadata' to write headers, preallocate the file, etc (and before I forget: you'll eventually want something like posix_fallocate to do the preallocation, but that can wait until the basic multi-platform logic is working)
open additional n-1 output streams, so each writing thread calls 'write_tensor' on its own file (and so you avoid the need for that file table)

By the way: with this change, at least the output gguf file isn't identical to the one generated on master. Although it doesn't need to be, a byte-identical output would be a good safety check that the code is working as intended.

Commits and review events

1fbb26b · feat: stream model conversion
wbruna suggested changes · May 30, 2026
504d5f8 · Refactor streaming conversion pipeline

shikaku2 wants to merge 2 commits into leejet:master from shikaku2:feat/streaming-convert
