test(retry/e2e): deflake retry-count tests by disabling telemetry#822
Merged
Conversation
The retry tests in PySQLRetryTestsMixin patch urllib3 globally
(HTTPSConnectionPool._get_conn / _validate_conn) and assert the
connection's own request was retried exactly N times. With telemetry on
(the default), the same connection makes two kinds of side-channel HTTP
calls through the same mocked pool that inflate the count and flake the
assertion (e.g. `assert call_count == 6` -> AssertionError under
merge-queue load):
1. A synchronous feature-flag GET during Connection.__init__
(is_telemetry_enabled -> FeatureFlagsContext._refresh_flags), and
2. Async telemetry POSTs to /telemetry-ext from the background executor
and the periodic flush thread.
The prior mitigation (_isolated_from_telemetry, #812) only attacked
source #2 and imperfectly, so the tests kept flaking.
Fix (test-only):
- Add enable_telemetry=False to the shared _retry_policy dict. It
short-circuits is_telemetry_enabled before the feature-flag fetch
(kills #1) and installs a NoopTelemetryClient (kills #2), removing
both sources at their origin without changing the retry behavior
under test. Every test in the mixin merges _retry_policy, so this
one line covers them all.
- Harden _isolated_from_telemetry with a backstop that no-ops the
actual HTTP boundary (_send_with_unified_client), closing the
already-submitted-future gap for any other caller of the helper.
- Add a version_string forward-compat shim to the SimpleHttpResponse
mock so the suite also runs on urllib3 >= 2.3 (harmless on the
2.2.x pinned in CI).
Verified against dogfood: test_oserror_retries and
test_retry_max_count_not_exceeded pass 3/3 deterministically for both
Thrift and SEA; the full PySQLRetryTestsMixin (36 tests) passes.
Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
7f26f09 to
ca0bafb
Compare
msrathore-db
approved these changes
Jun 3, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The retry tests in
PySQLRetryTestsMixin(tests/e2e/common/retry_test_mixins.py) patch urllib3 globally (HTTPSConnectionPool._get_conn/_validate_conn) and assert the connection's own request was retried exactly N times (e.g.assert mock.call_count == 6). Under merge-queue load these flake withAssertionErrorbecause the same connection makes side-channel HTTP calls through the same mocked pool, inflating the count.Seen on run 26867475731 / job 79234339492 (
test_oserror_retries[extra_params0/1],test_retry_max_count_not_exceeded[extra_params0]) — the re-run passed, confirming flakiness. CI logs show interleaved/telemetry-extPOSTs and feature-flag GETs during the test window.There are two leak sources, both triggered by the default
enable_telemetry=True:Connection.__init__(TelemetryHelper.is_telemetry_enabled→FeatureFlagsContext._refresh_flags→_http_client.request(GET, ...)) — deterministic, and not addressed by the prior mitigation./telemetry-extfrom the background executor + periodic flush thread.The prior fix (
_isolated_from_telemetry, #812) only attacked source #2, and imperfectly — so these kept flaking.Fix (test-only, no production changes)
enable_telemetry=Falseto the shared_retry_policydict. It short-circuitsis_telemetry_enabledbefore the feature-flag fetch (kills [Cleanup] Set up poetry, lockfile, mypy, and black #1) and installs aNoopTelemetryClient(kills [Cleanup] Move connector files to root #2), removing both sources at their origin without changing the retry behavior under test. Every test in the mixin merges_retry_policyinto its connection params, so this single line covers all ~13 tests._isolated_from_telemetry()with a backstop that no-ops the actual HTTP boundary (_send_with_unified_client), closing the already-submitted-future gap for any other caller of the helper.version_stringforward-compat shim to theSimpleHttpResponsemock so the suite also runs on urllib3 ≥ 2.3 (harmless on the 2.2.x pinned in CI; surfaced during local verification).Verification
Run against dogfood:
test_oserror_retriesandtest_retry_max_count_not_exceeded— 3/3 deterministic passes for both Thrift and SEA params.PySQLRetryTestsMixin— 36 tests, all pass.The retry behavior under test (the connection's own retried-request count) is unchanged; only the telemetry/feature-flag side-channel traffic is removed.
This pull request and its description were written by Isaac.