gh-150878: Speed up json.dumps(ensure_ascii=False) for long strings #150879
gaborbernat wants to merge 2 commits into
…ings escape_size() sizes the ensure_ascii=False encoder output one character at a time; a character needs escaping only when c == '"' || c == '\\' || c < 0x20, and non-ASCII is kept verbatim. For the one-byte representation, detect the no-escape case eight bytes at a time and return the verbatim size directly; a length guard keeps short strings on the original per-character loop. Strings with characters above U+00FF keep the current path. Output is byte-identical, verified against test_json and a 199-case dumps differential in both ensure_ascii modes. dumps of long 1-byte strings runs up to 5.8x faster (4.2x for Latin-1 text); short keys and non-Latin-1 strings are unaffected.
Please create tests to exercise those code paths explicitly.
Added
Why has the main code changed just for the new test?
Cover long runs that cross the scan windows and the short-string guard, with a special character at every offset in 1-byte and wider strings, plus the no-escape verbatim fast path and the escaped fallback.
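A minimal sketch of the kind of coverage this commit message describes, not the test added in the PR. It assumes the pure-Python `json.encoder.py_encode_basestring` as the reference encoder; the 40-character upper bound and the `SPECIALS` list are illustrative choices meant to span a few eight-byte windows, since the actual guard length is not stated here.

```python
import json
from json.encoder import py_encode_basestring  # pure-Python reference encoder

# Characters relevant to the ensure_ascii=False escaper, plus neighbours that
# must stay verbatim (space, DEL, a Latin-1 char, and a non-Latin-1 char that
# forces a wider string representation). Illustrative selection only.
SPECIALS = ['"', '\\', '\x00', '\x1f', '\n', ' ', '\x7f', '\xe9', '\u20ac']


def iter_cases(max_len=40):
    """Yield strings with one special character at every offset."""
    for filler in ('a', '\xe9'):  # ASCII filler and Latin-1 filler
        for length in range(1, max_len + 1):
            for offset in range(length):
                for ch in SPECIALS:
                    yield filler * offset + ch + filler * (length - offset - 1)


def check(s):
    """Compare json.dumps against the reference encoder and round-trip."""
    got = json.dumps(s, ensure_ascii=False)
    assert got == py_encode_basestring(s), (s, got)
    assert json.loads(got) == s


count = 0
for case in iter_cases():
    check(case)
    count += 1
print(count, "cases checked")
```

Comparing against the pure-Python encoder keeps the check independent of whichever C path handles the string.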
30d1025 to 27a63b9
Please do not force push.
Sorry only did it because my comment before had some unwanted contents in it by mistake.
Good catch, that was accidental. A
Please avoid having your agents run in automated mode. We do like the work but if it is just automated, we do not. And just do NOT force push. A single commit to revert a wrong addition is enough.
When json.dumps runs with ensure_ascii=False, it sizes each escaped string one character at a time in escape_size, after which write_escaped_unicode copies the string verbatim when nothing needs escaping. In this mode a character needs escaping only when c == '"', c == '\\', or c < 0x20; non-ASCII is kept verbatim. For a long string with no such character, common for text values including Western-European (Latin-1) text, that per-character sizing scan is pure overhead before the verbatim copy.

This detects the no-escape case on the one-byte (Latin-1) representation eight bytes at a time, returning the verbatim size after about one eighth of the work. It is the ensure_ascii=False counterpart to #150876; with the decode-side scan in #150872 the three changes cover JSON string scanning end to end, on three different code paths.

What we do now (scalar, one code point at a time)
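As a rough illustration of per-character sizing, here is a Python model of the rule stated above. It is not the C source of escape_size; the name `escape_size_scalar` and the `SHORT_ESCAPES` set are hypothetical, and sizes are counted in characters.

```python
# Characters that JSON writes as a two-character escape such as \" or \n.
SHORT_ESCAPES = {'"', '\\', '\b', '\f', '\n', '\r', '\t'}


def escape_size_scalar(s: str) -> int:
    """Model the output length of an ensure_ascii=False string, one char at a time."""
    size = 2  # opening and closing quotes
    for ch in s:
        if ch in SHORT_ESCAPES:
            size += 2  # backslash plus one letter or symbol
        elif ord(ch) < 0x20:
            size += 6  # other control characters become \u00XX
        else:
            size += 1  # kept verbatim, including all non-ASCII
    return size
```

For a string with nothing to escape, every iteration takes the last branch, so the loop only rediscovers `len(s) + 2`.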
A byte needs escaping when c == '"' || c == '\\' || c < 0x20. For a long string with none of those, the per-character loop reads and tests every byte just to learn the output is the input plus two quotes.

What SWAR does (8 bytes at a time, in one register)
SWAR is "SIMD within a register": load 8 bytes into a single uint64_t and test all 8 lanes at once with ordinary integer ops. haszero(v) = (v - 0x0101…) & ~v & 0x8080… is nonzero exactly when at least one lane of v is zero, so as a whole-word test it has no false positives or negatives. (Its lowest set bit marks the lowest zero lane; a borrow can also set the bit of a 0x01 lane directly above a zero lane, which is harmless for a scan that only asks whether any lane matched.) Broadcasting a byte (b * 0x0101…) and XOR-ing turns "equals b" into "is zero"; < 0x20 is "top three bits all zero", detected as haszero(w & 0xE0…). A Latin-1 byte (>= 0x80) is not in this set, so long runs of European text skip eight at a time too. At the first lane that needs escaping the loop breaks and the existing per-character loop computes the exact size and does the work. A length guard keeps short strings (the common dict key) on the original loop, where the fast path's setup would not pay off.

These are the same 0x0101…/0x8080… masks that Objects/unicodeobject.c and Objects/stringlib/find_max_char.h already use for ASCII scanning.

When and how this changes performance
json.dumps(..., ensure_ascii=False), current encoder versus this change:

This change is confined to ensure_ascii=False, the non-default mode, so it reaches fewer callers than the default-path change in #150876; within that mode the win matches.

Correctness
Output is byte-identical to the current encoder. Verified against the full test_json suite and a 199-case differential corpus that places each escape-relevant character (", \\, control chars, and characters above U+007F) at every offset across the eight-byte window, in both ensure_ascii=True and ensure_ascii=False modes. Every output matched.

Benchmark
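A minimal sketch of how a comparison like this might be timed, assuming it is run once on each build and the numbers compared by hand. The case names, string lengths, and repeat counts are illustrative assumptions, not the PR's benchmark setup, and no results are implied.

```python
import json
import timeit


def bench(label, value, number=500):
    """Return (label, best nanoseconds per json.dumps call)."""
    # Taking the minimum over several repeats reduces scheduler noise.
    best = min(timeit.repeat(lambda: json.dumps(value, ensure_ascii=False),
                             number=number, repeat=5))
    return label, best / number * 1e9


# Hypothetical inputs covering the situations discussed in the text.
CASES = {
    "long ASCII": "a" * 10_000,
    "long Latin-1": "\xe9" * 10_000,
    "long, escape at end": "a" * 9_999 + '"',
    "short key": "name",
    "non-Latin-1": "\u20ac" * 10_000,
}

results = [bench(label, value) for label, value in CASES.items()]
for label, ns in results:
    print(f"{label:24s} {ns:12.0f} ns per call")
```

The "escape at end" case is there to show the cost when the fast scan runs almost to completion and then has to fall back.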
References for the bit tricks: Sean Anderson, Bit Twiddling Hacks (zero byte, byte equal to n); Henry S. Warren Jr., Hacker's Delight, 2nd ed., chapter 6.
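The zero-byte and byte-equal-to-n tricks cited here can be modeled in Python, with a 64-bit mask standing in for uint64_t wraparound. This is a sketch of the technique only, not the PR's C code: the function names are hypothetical, the little-endian load is an arbitrary choice, and the short-string guard is omitted.

```python
ONES = 0x0101010101010101    # 0x01 in every byte lane
HIGHS = 0x8080808080808080   # 0x80 in every byte lane
MASK64 = 0xFFFFFFFFFFFFFFFF  # emulates uint64_t wraparound


def haszero(v: int) -> int:
    """Return nonzero if and only if some byte lane of the 64-bit word v is zero."""
    return ((v - ONES) & MASK64) & (~v & MASK64) & HIGHS


def word_needs_escape(w: int) -> bool:
    """True if any of the 8 lanes is '"', a backslash, or below 0x20."""
    quote = haszero(w ^ (ONES * 0x22))     # lanes equal to '"' become zero
    bslash = haszero(w ^ (ONES * 0x5C))    # lanes equal to '\\' become zero
    control = haszero(w & (ONES * 0xE0))   # lanes with top three bits clear
    return (quote | bslash | control) != 0


def no_escape_needed(data: bytes) -> bool:
    """Scan Latin-1 bytes eight at a time, then check the tail per byte."""
    n = len(data) - len(data) % 8
    for i in range(0, n, 8):
        if word_needs_escape(int.from_bytes(data[i:i + 8], "little")):
            return False  # real code would fall back to the exact scalar loop
    return not any(b in (0x22, 0x5C) or b < 0x20 for b in data[n:])
```

When `no_escape_needed` returns True, the output size is simply the input length plus two quotes; otherwise the model just reports False, where the change described above hands over to the existing per-character loop.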
This change is not the SIMD parsing backend from #142915: it adds no intrinsics, no CPU detection, and no build configuration, and it does not depend on #125022.
Resolves #150878.