Skip to content

Handle repo patch with non-UTF8 sequences#3918

Merged
un-def merged 1 commit into
masterfrom
issue_3880_fix_repo_non_utf8_patch
Jun 2, 2026
Merged

Handle repo patch with non-UTF8 sequences#3918
un-def merged 1 commit into
masterfrom
issue_3880_fix_repo_non_utf8_patch

Conversation

@un-def
Copy link
Copy Markdown
Collaborator

@un-def un-def commented May 29, 2026

Git text diff is not actually text in the usual sense -- it may contain any nonprintable bytes.

But since we send patches as files (multipart/form-data), not as JSON strings, we don't need to decode them at all.

In addition, the missing --binary flag was added.
Without this flag, modified binary files were effectively excluded from the patch with "Binary files a and b differ".

Fixes: #3880

Git text diff is not actually text in the usual sense -- it may contain
any nonprintable bytes.

But since we send patches as files (multipart/form-data), not
as JSON strings, we don't need to decode them at all.

In addition, the missing --binary flag was added.
Without this flag, modified binary files were effectively excluded
from the patch with "Binary files a and b differ".

Fixes: #3880
@un-def un-def requested a review from jvstme May 29, 2026 16:10
repo_branch: Optional[str] = None
repo_hash: Optional[str] = None
repo_diff: Annotated[Optional[str], Field(exclude=True)] = None
repo_diff: Annotated[Optional[bytes], Field(exclude=True)] = None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) Technically a breaking change for SDK users, since RemoteRunRepoData is stored in the public but undocumented run_repo_data attribute of the public RemoteRepo class

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, but I'd consider all undocumented fields as private

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the PR also partially fixes and partially changes the behavior (from one unexpected to another unexpected) described #1679.

  • Locally committed binary files are now delivered as expected.

  • Locally committed large text files now cause dstack apply to fail (previously, such files would be delivered with no contents).

    $ dstack apply
    GitCommandError: Cmd('git') failed due to: exit code(128)
      cmdline: git diff --binary 623de5df3d3d13007661c7ee1a13a8d5197e734a
      stderr: 'fatal: unable to generate diff for test-file
    '
    $ ls -lh test-file 
    -rw-r--r--. 1 root root 1.1G Jun  1 14:08 test-file

Copy link
Copy Markdown
Collaborator Author

@un-def un-def Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that large ASCII files (sometimes) are detected as binary files.

Note that both produce Binary files ... differ without --binary but with --binary a "truly" binary file is handled correctly despite being larger (2GB vs 1.4GB) while with a base64-encoded file Git fails with fatal: unable to generate diff . Weird.

dd if=/dev/urandom bs=1M count=1000 | base64 > blob.asc
dd if=/dev/urandom bs=1M count=2000 > blob.bin

ls -l blob.*
-rw-rw-r-- 1 def def 1.4G Jun  2 09:33 blob.asc
-rw-rw-r-- 1 def def 2.0G Jun  2 09:10 blob.bin

file blob.asc
blob.asc: ASCII text

git --no-pager diff --no-index -- /dev/null blob.asc
diff --git a/blob.asc b/blob.asc
new file mode 100644
index 0000000..f2d7c9e
Binary files /dev/null and b/blob.asc differ

git --no-pager diff --no-index --binary -- /dev/null blob.asc
diff --git a/blob.asc b/blob.asc
new file mode 100644
index 0000000..f2d7c9e
fatal: unable to generate diff for /dev/null

file blob.bin
blob.bin: data

git --no-pager diff --no-index -- /dev/null blob.bin
diff --git a/blob.bin b/blob.bin
new file mode 100644
index 0000000..f71bad7
Binary files /dev/null and b/blob.bin differ

git --no-pager diff --no-index --binary -- /dev/null blob.bin
diff --git a/blob.bin b/blob.bin
new file mode 100644
index 0000000000000000000000000000000000000000..f71bad7e725b891b8c0260707b8c34481f3da331
GIT binary patch
<ommitted>

literal 0
HcmV?d00001

Another unexpected behavior still present:

  • Uncommitted (untracked) large text files are empty (git diff fails but its exit status is ignored due to ignore_status=True):

    dd if=/dev/random bs=1M count=1000 | base64 > blob.asc
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB, 1000 MiB) copied, 4.00187 s, 262 MB/s
    
    git --no-pager diff --no-index --binary -- /dev/null blob.asc
    diff --git a/blob.asc b/blob.asc
    new file mode 100644
    index 0000000..ce045ae
    fatal: unable to generate diff for /dev/null

@un-def un-def merged commit 0750aff into master Jun 2, 2026
25 checks passed
@un-def un-def deleted the issue_3880_fix_repo_non_utf8_patch branch June 2, 2026 09:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: dstack apply fails with UnicodeDecodeError when a modified tracked file contains non-UTF-8 bytes

2 participants