Skip to content

test(retry/e2e): deflake retry-count tests by disabling telemetry#822

Merged
vikrantpuppala merged 1 commit into
mainfrom
deflake/retry-tests-disable-telemetry
Jun 3, 2026
Merged

test(retry/e2e): deflake retry-count tests by disabling telemetry#822
vikrantpuppala merged 1 commit into
mainfrom
deflake/retry-tests-disable-telemetry

Conversation

@vikrantpuppala
Copy link
Copy Markdown
Contributor

Problem

The retry tests in PySQLRetryTestsMixin (tests/e2e/common/retry_test_mixins.py) patch urllib3 globally (HTTPSConnectionPool._get_conn / _validate_conn) and assert the connection's own request was retried exactly N times (e.g. assert mock.call_count == 6). Under merge-queue load these flake with AssertionError because the same connection makes side-channel HTTP calls through the same mocked pool, inflating the count.

Seen on run 26867475731 / job 79234339492 (test_oserror_retries[extra_params0/1], test_retry_max_count_not_exceeded[extra_params0]) — the re-run passed, confirming flakiness. CI logs show interleaved /telemetry-ext POSTs and feature-flag GETs during the test window.

There are two leak sources, both triggered by the default enable_telemetry=True:

  1. Synchronous feature-flag GET during Connection.__init__ (TelemetryHelper.is_telemetry_enabledFeatureFlagsContext._refresh_flags_http_client.request(GET, ...)) — deterministic, and not addressed by the prior mitigation.
  2. Async telemetry POSTs to /telemetry-ext from the background executor + periodic flush thread.

The prior fix (_isolated_from_telemetry, #812) only attacked source #2, and imperfectly — so these kept flaking.

Fix (test-only, no production changes)

  • Add enable_telemetry=False to the shared _retry_policy dict. It short-circuits is_telemetry_enabled before the feature-flag fetch (kills [Cleanup] Set up poetry, lockfile, mypy, and black #1) and installs a NoopTelemetryClient (kills [Cleanup] Move connector files to root #2), removing both sources at their origin without changing the retry behavior under test. Every test in the mixin merges _retry_policy into its connection params, so this single line covers all ~13 tests.
  • Harden _isolated_from_telemetry() with a backstop that no-ops the actual HTTP boundary (_send_with_unified_client), closing the already-submitted-future gap for any other caller of the helper.
  • Add a version_string forward-compat shim to the SimpleHttpResponse mock so the suite also runs on urllib3 ≥ 2.3 (harmless on the 2.2.x pinned in CI; surfaced during local verification).

Verification

Run against dogfood:

  • test_oserror_retries and test_retry_max_count_not_exceeded3/3 deterministic passes for both Thrift and SEA params.
  • Full PySQLRetryTestsMixin36 tests, all pass.

The retry behavior under test (the connection's own retried-request count) is unchanged; only the telemetry/feature-flag side-channel traffic is removed.

This pull request and its description were written by Isaac.

The retry tests in PySQLRetryTestsMixin patch urllib3 globally
(HTTPSConnectionPool._get_conn / _validate_conn) and assert the
connection's own request was retried exactly N times. With telemetry on
(the default), the same connection makes two kinds of side-channel HTTP
calls through the same mocked pool that inflate the count and flake the
assertion (e.g. `assert call_count == 6` -> AssertionError under
merge-queue load):

  1. A synchronous feature-flag GET during Connection.__init__
     (is_telemetry_enabled -> FeatureFlagsContext._refresh_flags), and
  2. Async telemetry POSTs to /telemetry-ext from the background executor
     and the periodic flush thread.

The prior mitigation (_isolated_from_telemetry, #812) only attacked
source #2 and imperfectly, so the tests kept flaking.

Fix (test-only):
  - Add enable_telemetry=False to the shared _retry_policy dict. It
    short-circuits is_telemetry_enabled before the feature-flag fetch
    (kills #1) and installs a NoopTelemetryClient (kills #2), removing
    both sources at their origin without changing the retry behavior
    under test. Every test in the mixin merges _retry_policy, so this
    one line covers them all.
  - Harden _isolated_from_telemetry with a backstop that no-ops the
    actual HTTP boundary (_send_with_unified_client), closing the
    already-submitted-future gap for any other caller of the helper.
  - Add a version_string forward-compat shim to the SimpleHttpResponse
    mock so the suite also runs on urllib3 >= 2.3 (harmless on the
    2.2.x pinned in CI).

Verified against dogfood: test_oserror_retries and
test_retry_max_count_not_exceeded pass 3/3 deterministically for both
Thrift and SEA; the full PySQLRetryTestsMixin (36 tests) passes.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala vikrantpuppala force-pushed the deflake/retry-tests-disable-telemetry branch from 7f26f09 to ca0bafb Compare June 3, 2026 08:56
@vikrantpuppala vikrantpuppala added this pull request to the merge queue Jun 3, 2026
Merged via the queue into main with commit 5292fc1 Jun 3, 2026
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants