gh-150871: Speed up JSON string decoding for long ASCII strings #150872
Open
gaborbernat wants to merge 3 commits into
scanstring_unicode scans each JSON string one character at a time for the closing quote, a backslash, or a control character. For the one-byte (ASCII/Latin-1) representation, skip eight bytes at a time with a word-at-a-time test using the same masks Objects/unicodeobject.c applies for ASCII scanning; the existing per-character loop then pins the exact byte and performs every decode decision. Two-byte and four-byte strings keep the current loop. Output is byte-identical, verified against test_json, a 347-input differential corpus, and all 340 nst/JSONTestSuite files. Long ASCII string values decode up to 6.3x faster; short keys, numbers, and non-Latin-1 strings are unaffected.
This was referenced Jun 3, 2026
Cover long runs that cross the scan windows with a terminator, backslash escape and \uXXXX escape at every offset in 1-byte and wider strings, plus strict and non-strict control-character handling at the window boundaries.
JSON documents whose payload is long text (log lines, description and content fields, base64 or embedded-document values) spend most of their decode time finding the end of each string. This speeds that up by scanning eight bytes at a time instead of one, with no SIMD intrinsics and no CPU detection, and it is byte-identical to the current decoder.
What we do now (scalar, one code point at a time)
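As a rough illustration of the per-character scan this section describes, here is a minimal sketch. It is not the code from the decoder: `scan_scalar` is a hypothetical name, it works on a plain byte buffer rather than a Python string object, and it always treats control characters as special, ignoring the strict/non-strict distinction.

```c
#include <stddef.h>

/* Hypothetical helper, not the decoder's actual loop: return the index of
   the first byte in buf[0..len) that needs attention (closing quote,
   backslash, or control character below 0x20), or len if there is none. */
static size_t
scan_scalar(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];                   /* one read */
        if (c == '"' || c == '\\' || c < 0x20) {    /* up to three comparisons */
            break;
        }
    }
    return i;
}
```

Every ordinary byte costs one full trip through this loop, which is the cost the rest of this description sets out to reduce.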
The current `scanstring_unicode` inner loop tests one character per iteration for the closing quote, a backslash, or a control character.

For a 64-character string with no escapes that is 64 iterations, each doing one read, up to three comparisons, and a loop branch: about 64 × 4 operations just to find the closing `"`. The CPU walks the string one byte at a time while its 64-bit registers sit mostly idle.

What SWAR does (8 bytes at a time, in one register)
SWAR is "SIMD within a register": load 8 bytes into a single `uint64_t` and test all 8 lanes at once with ordinary integer ops.

Two tricks make this work:

- `0x22 * 0x0101010101010101` puts `"` (0x22) into all 8 byte lanes. XOR-ing the word with it turns "this lane equals `"`" into "this lane is zero".
- `haszero(v)` = `(v - 0x0101…) & ~v & 0x8080…`: a zero lane borrows and lights its high bit, so the result is nonzero exactly when at least one lane is zero. Borrow propagation can also flag a `0x01` lane that sits above a zero lane, which is harmless here because only the nonzero test is used. For control characters it uses `w & 0xE0` in each lane, since a byte is `< 0x20` exactly when its top three bits are zero, again detected by `haszero`.

One loop iteration answers "do any of these 8 bytes need attention?" in about 6 integer ops. If not, it advances 8 bytes, so the 64-character string takes 8 iterations instead of 64.
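The two tricks can be sketched in a few lines of C. This is an illustration under assumptions, not the patch itself: `LANES_01`, `LANES_80`, and `window_needs_attention` are hypothetical names, and the helper only reports whether an 8-byte window contains anything special, not where.

```c
#include <stdint.h>
#include <string.h>

#define LANES_01 UINT64_C(0x0101010101010101)   /* 0x01 in every byte lane */
#define LANES_80 UINT64_C(0x8080808080808080)   /* 0x80 in every byte lane */

/* Nonzero exactly when at least one byte lane of v is zero. */
static uint64_t
haszero(uint64_t v)
{
    return (v - LANES_01) & ~v & LANES_80;
}

/* Hypothetical helper: 1 when any of the 8 bytes at p is a quote,
   a backslash, or a control character (< 0x20), else 0. */
static int
window_needs_attention(const unsigned char *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof w);                            /* unaligned-safe load */
    uint64_t quote  = haszero(w ^ (0x22 * LANES_01));   /* some lane == '"'  */
    uint64_t bslash = haszero(w ^ (0x5C * LANES_01));   /* some lane == '\\' */
    uint64_t ctrl   = haszero(w & (0xE0 * LANES_01));   /* some lane <  0x20 */
    return (quote | bslash | ctrl) != 0;
}
```

Because the helper only asks whether any lane matched, byte order within the word does not matter, so the same sketch works on little-endian and big-endian machines.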
The key design point: SWAR only skips, it never decides
When the mask is nonzero, meaning a `"`, `\`, or control char sits somewhere in the 8-byte window, the loop breaks and the original scalar loop re-scans those 8 bytes to find the exact position and do the actual work: terminate the string, handle the escape, or raise the error at the right index. Every decode decision stays on the proven scalar path. SWAR is purely a fast-forward over the runs of ordinary characters that make up the bulk of most strings.

Worked example
Chunk `hello wo` (8 ordinary ASCII bytes): `memcpy`, 3 `haszero` tests, all zero, `next += 8`. One iteration.

Chunk `lo","wor` (a `"` at offset 2): `mq` is nonzero, so it breaks. The scalar loop walks `l`, `o`, hits `"` at offset 2: identical to today, reached after skipping the prior runs.

Why it is a win, and its limits
The same word-at-a-time masks already appear in CPython (`ASCII_CHAR_MASK = 0x8080…`, `VECTOR_0101 = 0x0101…` in `unicodeobject.c`, and `UCS1_ASCII_CHAR_MASK` in `find_max_char.h`), so it is not exotic for this codebase.

Mental model: today we ask "is this byte special?" once per byte; SWAR asks "is any of these eight bytes special?" once per eight, and drops to per-byte only when the answer is yes.
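The mental model maps onto two loops: a fast-forward that only skips, followed by the per-byte loop that decides. The sketch below is a simplified stand-in, assuming a plain byte buffer and always-strict control-character handling; `scan_swar` and `LANES` are hypothetical names, not identifiers from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Broadcast one byte value into all 8 lanes of a 64-bit word. */
#define LANES(b) ((uint64_t)(b) * UINT64_C(0x0101010101010101))

/* Nonzero exactly when at least one byte lane of v is zero. */
static uint64_t
haszero(uint64_t v)
{
    return (v - LANES(0x01)) & ~v & LANES(0x80);
}

/* Hypothetical helper: index of the first quote, backslash, or control
   byte in buf[0..len), or len if there is none. */
static size_t
scan_swar(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    /* Fast-forward: skip whole 8-byte windows holding only ordinary bytes. */
    while (len - i >= 8) {
        uint64_t w;
        memcpy(&w, buf + i, 8);
        if (haszero(w ^ LANES('"')) | haszero(w ^ LANES('\\'))
                | haszero(w & LANES(0xE0))) {
            break;      /* something special is somewhere in this window */
        }
        i += 8;
    }

    /* Scalar loop: pins the exact byte; the only place that decides. */
    for (; i < len; i++) {
        unsigned char c = buf[i];
        if (c == '"' || c == '\\' || c < 0x20) {
            break;
        }
    }
    return i;
}
```

Since the second loop is an ordinary per-byte scan that starts wherever the fast-forward stopped, the returned index should match a purely scalar scan for every input, which is the property a differential test at every window offset checks.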
When and how this changes performance
`json.loads`, current decoder versus this change:

The standard pyperformance `bm_json_loads` document is short-string and dict dominated and shows no change. The benefit lands on documents whose strings are long.

Correctness
Output is byte-identical to the current decoder, error positions included. Verified three ways: the full `test_json` suite; a 347-input differential corpus (real-world JSON, plus a quote, backslash, raw control character, escape, and `\uXXXX` placed at every offset across the eight-byte window in all three string representations, plus surrogate pairs, lone surrogates, embedded nulls, and truncated escapes); and all 340 files of `nst/JSONTestSuite` (318 parsing and 22 transform, including the must-reject and implementation-defined cases). Every value and every raised error position matched the current implementation.

Benchmark
References for the bit tricks: Sean Anderson, Bit Twiddling Hacks (zero byte, byte equal to n, byte less than n); Henry S. Warren Jr., Hacker's Delight, 2nd ed., chapter 6.
It is not the SIMD parsing backend from #142915: it adds no intrinsics, no CPU detection, and no build configuration, and it does not depend on #125022.
Resolves #150871.