Handle repo patch with non-UTF8 sequences#3918
Conversation
Git text diff is not actually text in the usual sense -- it may contain any nonprintable bytes. But since we send patches as files (multipart/form-data), not as JSON strings, we don't need to decode them at all. In addition, the missing --binary flag was added. Without this flag, modified binary files were effectively excluded from the patch with "Binary files a and b differ". Fixes: #3880
| repo_branch: Optional[str] = None | ||
| repo_hash: Optional[str] = None | ||
| repo_diff: Annotated[Optional[str], Field(exclude=True)] = None | ||
| repo_diff: Annotated[Optional[bytes], Field(exclude=True)] = None |
There was a problem hiding this comment.
(nit) Technically a breaking change for SDK users, since RemoteRunRepoData is stored in the public but undocumented run_repo_data attribute of the public RemoteRepo class
There was a problem hiding this comment.
Good point, but I'd consider all undocumented fields as private
There was a problem hiding this comment.
It looks like the PR also partially fixes and partially changes the behavior (from one unexpected to another unexpected) described #1679.
-
Locally committed binary files are now delivered as expected.
-
Locally committed large text files now cause
dstack applyto fail (previously, such files would be delivered with no contents).$ dstack apply GitCommandError: Cmd('git') failed due to: exit code(128) cmdline: git diff --binary 623de5df3d3d13007661c7ee1a13a8d5197e734a stderr: 'fatal: unable to generate diff for test-file ' $ ls -lh test-file -rw-r--r--. 1 root root 1.1G Jun 1 14:08 test-file
There was a problem hiding this comment.
It seems that large ASCII files (sometimes) are detected as binary files.
Note that both produce Binary files ... differ without --binary but with --binary a "truly" binary file is handled correctly despite being larger (2GB vs 1.4GB) while with a base64-encoded file Git fails with fatal: unable to generate diff . Weird.
dd if=/dev/urandom bs=1M count=1000 | base64 > blob.asc
dd if=/dev/urandom bs=1M count=2000 > blob.bin
ls -l blob.*
-rw-rw-r-- 1 def def 1.4G Jun 2 09:33 blob.asc
-rw-rw-r-- 1 def def 2.0G Jun 2 09:10 blob.bin
file blob.asc
blob.asc: ASCII text
git --no-pager diff --no-index -- /dev/null blob.asc
diff --git a/blob.asc b/blob.asc
new file mode 100644
index 0000000..f2d7c9e
Binary files /dev/null and b/blob.asc differ
git --no-pager diff --no-index --binary -- /dev/null blob.asc
diff --git a/blob.asc b/blob.asc
new file mode 100644
index 0000000..f2d7c9e
fatal: unable to generate diff for /dev/null
file blob.bin
blob.bin: data
git --no-pager diff --no-index -- /dev/null blob.bin
diff --git a/blob.bin b/blob.bin
new file mode 100644
index 0000000..f71bad7
Binary files /dev/null and b/blob.bin differ
git --no-pager diff --no-index --binary -- /dev/null blob.bin
diff --git a/blob.bin b/blob.bin
new file mode 100644
index 0000000000000000000000000000000000000000..f71bad7e725b891b8c0260707b8c34481f3da331
GIT binary patch
<ommitted>
literal 0
HcmV?d00001Another unexpected behavior still present:
-
Uncommitted (untracked) large text files are empty (
git difffails but its exit status is ignored due toignore_status=True):dd if=/dev/random bs=1M count=1000 | base64 > blob.asc 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB, 1000 MiB) copied, 4.00187 s, 262 MB/s git --no-pager diff --no-index --binary -- /dev/null blob.asc diff --git a/blob.asc b/blob.asc new file mode 100644 index 0000000..ce045ae fatal: unable to generate diff for /dev/null
Git text diff is not actually text in the usual sense -- it may contain any nonprintable bytes.
But since we send patches as files (multipart/form-data), not as JSON strings, we don't need to decode them at all.
In addition, the missing --binary flag was added.
Without this flag, modified binary files were effectively excluded from the patch with "Binary files a and b differ".
Fixes: #3880