feat: stream model conversion#1581
Conversation
wbruna
left a comment
There was a problem hiding this comment.
First, about the coding style: this is placing format-specific logic inside convert.cpp. The format-specific code should go to the appropriate files inside model_io/, likely with a separate "write tensor" per file type. Note you should also avoid opening and closing the model files for each tensor, so some kind of "opened model file" abstraction will probably be needed. A "read the tensor at the specified offset" abstraction would probably make sense, too.
I gave this a try for a .safetensors -> Q4_K .gguf. On my machine, it was never able to saturate all CPU cores, so it got much slower than the normal conversion (around 1/2 - 1/3 speed). I/O didn't seem to be the bottleneck: system and wait times remained low.
Looking at the code, my guess would be the batching calculation: it would explain this behavior if for some reason it consistently used only 1 or 2 threads (the number of threads should also respect the --threads parameter by the way). The batching division also looks sub-optimal: you split up work between threads, then stop everything, write everything, then open threads again. So you are not allowing an overlap between the conversion and the writing; plus, a thread could finish much sooner than the others, and would stay idle until the next batch.
I would avoid the fixed batching, and use a true pipeline instead: either n read+convert threads + 1 write thread, or n read+convert+write threads, controlling for the memory budget with a condition variable. I would bet on the second option: if writing is the bottleneck, you'd naturally parallelize it as well.
Note you are not forced to write sequentially, either: you have offsets for each tensor, so they could be written as soon as they are ready, with each thread using its own open file object (I'd recommend preallocating the file at the beginning, to give the filesystem a better chance to avoid fragmentation issues). An out-of-order approach could also help with models with huge tensors, since you can try to overlap them with smaller ones.
|
Added a follow-up commit (
I also benchmarked against a fresh master clone at
So the revised pipeline is roughly neutral on SD3.5 Medium and about 35% faster wall-clock on SD3.5 Large in this environment, with substantially lower peak RSS in both cases. |
|
I gave 504d5f8 a try. Looks like it's performing much better now. The job division still have a few code smells, though. First, you are using separate I/O backends for reading (ifstream) and writing (FILE). If there is a reason to use FILE for writing, it needs to be very clear in comments; otherwise, just use an ofstream. Also, the whole "keep a table of n writers" logic is duplicated between safetensors and gguf. A better approach could be splitting the writing logic from the model format. Say, an interface similar to:
then you'd have:
(note this could either be a small class hierarchy, or two sets of callbacks using the same signatures. I believe the hierarchy approach would be cleaner; it'd also avoid the need for that function template) Then you can decouple the whole multithread-writing logic from the output formats:
By the way: with this change, at least the output gguf file isn't identical to the one generated on master. Although it doesn't need to be, it'd be a good safety check, to be sure the code is working as intended. |
Split out from draft PR #1573: #1573
Summary
Changes
--convertto stream converted tensors instead of allocating the entire converted model in oneggml_contextbefore writing the output file.This PR intentionally only covers the regular conversion memory/threading path. RMSE-guided conversion is not included here and will be handled separately after this is reviewed.
What changed
convert(input_path, vae_path, output_path, output_type, tensor_type_rules, convert_name)API and CLI behavior.What is not included
--lazy-loadruntime behavior changes.Validation
cmake --build build -j16git diff --checkbuild/bin/sd-cli -M convert -m /tmp/sdcpp-convert-tiny.safetensors -o /tmp/sdcpp-convert-tiny-final.gguf --type f16time build/bin/sd-cli -M convert -m /home/aaron/models/sd3.5-medium/sd3.5_medium.safetensors -o /tmp/sd3.5_medium_streaming_convert.gguf/tmp/sd3.5_medium_streaming_convert.gguf, 4.8GNotes
This is a draft because the new streaming writer path should get review and broader testing across output formats and platforms before being marked ready.