Skip to content

test: shave register/shutdown/TC waits on loopback#1711

Closed
bluetoothbot wants to merge 3 commits into
python-zeroconf:masterfrom
bluetoothbot:koan/speed-up-register-random-wait
Closed

test: shave register/shutdown/TC waits on loopback#1711
bluetoothbot wants to merge 3 commits into
python-zeroconf:masterfrom
bluetoothbot:koan/speed-up-register-random-wait

Conversation

@bluetoothbot
Copy link
Copy Markdown
Contributor

@bluetoothbot bluetoothbot commented May 18, 2026

What

Speed up four slow tests by extracting one production constant (_REGISTER_RANDOM_INTERVAL, the RFC 6762 §8.1 random pre-probe wait in async_check_service) and patching it through quick_timing on loopback, plus three localised test-only patches.

Part of #1707.

Why

async_check_service waits a random 150-250ms before probing — RFC 6762 §8.1 thundering-herd avoidance for real networks, pure overhead on 127.0.0.1. Tests that use quick_timing already get probe/announce/goodbye intervals shrunk, but the inline random.randint on this path is opaque to patching. Three of the slowest tests in #1707 without an in-flight PR (test_async_service_registration_name_conflict, test_name_conflicts, and everything that registers a service) pay this for nothing.

test_shutdown_loop and test_tc_bit_defers_last_response_missing carry similar dead waits — a 1s outer timeout in shutdown_loop against a never-completing pending task, and an 800ms cleanup poll against a 400-500ms RFC 6762 §7.2 TC-bit defer timer.

How

  1. Extract _REGISTER_RANDOM_INTERVAL = (150, 250) as a module-level constant in _core.py and patch it in quick_timing alongside the other loopback shaves. No production change — default still matches the RFC value.
  2. test_shutdown_loop: patch _TASK_AWAIT_TIMEOUT to 0.1s and drop .result(1).result(0.1). Both were just upper bounds on a never-completing pending task.
  3. test_tc_bit_defers_last_response_missing: patch _TC_DELAY_RANDOM_INTERVAL to (10, 20)ms and tighten the cleanup poll loop.
  4. test_integration: wait for the browser's first startup query to fire (cache empty) before registering. The original test relied on the 150-250ms register-random-wait being longer than the browser's 20-120ms first-query delay; with the new fast register that race flipped and the first query started picking up the known PTR via §7.1 known-answer suppression.

Speedups (CPython 3.12, loopback)

test before after
test_shutdown_loop 1.16s 0.26s
test_tc_bit_defers_last_response_missing 0.79s 0.32s
test_async_service_registration_name_conflict 0.72s 0.10s
test_name_conflicts 0.50s 0.10s

Plus a smaller transparent shave on every other quick_timing-using test that registers a service.

Test plan

  • Targeted tests pass three times in a row
  • Full tests/ suite passes (338 passed, 2 skipped, ~47s wall)
  • Pre-commit hooks pass (ruff, mypy, flake8, codespell, pyupgrade)

Quality Report

Changes: 6 files changed, 31 insertions(+), 12 deletions(-)

Code scan: clean

Tests: passed (4 PASSED)

Branch hygiene: clean

Generated by Kōan post-mission quality pipeline

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.76%. Comparing base (4bae30a) to head (98c3785).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1711   +/-   ##
=======================================
  Coverage   99.76%   99.76%           
=======================================
  Files          33       33           
  Lines        3426     3427    +1     
  Branches      471      471           
=======================================
+ Hits         3418     3419    +1     
  Misses          5        5           
  Partials        3        3           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 18, 2026

Merging this PR will not alter performance

✅ 6 untouched benchmarks


Comparing bluetoothbot:koan/speed-up-register-random-wait (98c3785) with master (4bae30a)

Open in CodSpeed

@bdraco
Copy link
Copy Markdown
Member

bdraco commented May 18, 2026

all the windows TC tests fail with this pr

    zeroconf:outgoing.py:471 now offsets = questions=1, answers=302, authorities=0, additionals=0
================================== FAILURES ===================================
__________________ test_tc_bit_defers_last_response_missing ___________________

    @patch.object(_listener, "_TC_DELAY_RANDOM_INTERVAL", (10, 20))
    def test_tc_bit_defers_last_response_missing():
        zc = Zeroconf(interfaces=["127.0.0.1"])
        _wait_for_start(zc)
        type_ = "_knowndefer._tcp.local."
        name = "knownname"
        name2 = "knownname2"
        name3 = "knownname3"
    
        registration_name = f"{name}.{type_}"
        registration2_name = f"{name2}.{type_}"
        registration3_name = f"{name3}.{type_}"
    
        desc = {"path": "/~paulsm/"}
        server_name = "ash-2.local."
        server_name2 = "ash-3.local."
        server_name3 = "ash-4.local."
    
        info = r.ServiceInfo(
            type_,
            registration_name,
            80,
            0,
            0,
            desc,
            server_name,
            addresses=[socket.inet_aton("10.0.1.2")],
        )
        info2 = r.ServiceInfo(
            type_,
            registration2_name,
            80,
            0,
            0,
            desc,
            server_name2,
            addresses=[socket.inet_aton("10.0.1.2")],
        )
        info3 = r.ServiceInfo(
            type_,
            registration3_name,
            80,
            0,
            0,
            desc,
            server_name3,
            addresses=[socket.inet_aton("10.0.1.2")],
        )
        zc.registry.async_add(info)
        zc.registry.async_add(info2)
        zc.registry.async_add(info3)
    
        protocol = zc.engine.protocols[0]
        now = r.current_time_millis()
        _clear_cache(zc)
        source_ip = "203.0.113.12"
    
        generated = r.DNSOutgoing(const._FLAGS_QR_QUERY)
        question = r.DNSQuestion(type_, const._TYPE_PTR, const._CLASS_IN)
        generated.add_question(question)
        for _ in range(300):
            # Add so many answers we end up with another packet
            generated.add_answer_at_time(info.dns_pointer(), now)
        generated.add_answer_at_time(info2.dns_pointer(), now)
        generated.add_answer_at_time(info3.dns_pointer(), now)
        packets = generated.packets()
        assert len(packets) == 4
        expected_deferred = []
    
        next_packet = r.DNSIncoming(packets.pop(0))
        expected_deferred.append(next_packet)
        threadsafe_query(zc, protocol, next_packet, source_ip, const._MDNS_PORT, Mock(), ())
        assert protocol._deferred[source_ip] == expected_deferred
        timer1 = protocol._timers[source_ip]
    
        next_packet = r.DNSIncoming(packets.pop(0))
        expected_deferred.append(next_packet)
        threadsafe_query(zc, protocol, next_packet, source_ip, const._MDNS_PORT, Mock(), ())
>       assert protocol._deferred[source_ip] == expected_deferred
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E       KeyError: '203.0.113.12'

try harder

@bdraco
Copy link
Copy Markdown
Member

bdraco commented May 18, 2026

@bluetoothbot rr

@bluetoothbot
Copy link
Copy Markdown
Contributor Author

bluetoothbot commented May 18, 2026

PR Review — test: shave register/shutdown/TC waits on loopback

Mechanics are clean and the speedups are real: extracting _REGISTER_RANDOM_INTERVAL to a module constant keeps production behavior identical while letting quick_timing patch it on loopback, and the test_integration got_query.wait() synchronization correctly closes the race the faster register introduced. The blocking issue is what @bdraco already flagged: test_tc_bit_defers_last_response_missing is failing on the Windows CI matrix because the new _TC_DELAY_RANDOM_INTERVAL patch to (10, 20)ms is below Windows' default ~15.6ms timer tick. That patch isn't visible in the truncated diff but it's the load-bearing change for that test. Once that interval is widened enough to clear Windows clock granularity (see reply), this is mergeable. Secondary nit: the 0.1s _TASK_AWAIT_TIMEOUT reduction is aggressive — worth widening for safety on slow CI entries.


🟢 Suggestions

1. Verify 0.1s timeout headroom on slow CI/PyPy (`tests/utils/test_asyncio.py`, L68)

Both the _TASK_AWAIT_TIMEOUT patch and the .result(0.1) reduction take the bound down by 10x. On a heavily loaded CI worker (especially the free-threaded 3.14t entry or PyPy) a 100ms window for the executor thread to schedule, dispatch run_coroutine_threadsafe, and observe the pending future could be tight enough to occasionally race. The original 1s was conservative for exactly that reason. Consider 0.25s — still a ~4x speedup, far more headroom for scheduler jitter. Worth watching the next few CI runs for flakes on the slowest matrix entries.

@patch.object(aioutils, "_TASK_AWAIT_TIMEOUT", 0.1)
def test_shutdown_loop() -> None:
    ...
        asyncio.run_coroutine_threadsafe(_still_running(), loop).result(0.1)

Checklist

  • Production behavior preserved (constant default matches old inline literal)
  • Test races eliminated by explicit synchronization
  • Platform portability of new timing constants — existing comment from @bdraco — Windows TC test failure
  • Sufficient timeout headroom for slow CI / PyPy / free-threaded matrix — suggestion #1
  • Free-threaded (3.14t) safety — no new module-level mutable state
  • RFC citations preserved on timing comments
  • No Co-Authored-By trailer in commits
  • Conventional Commits PR title with lowercase subject

Summary

Mechanics are clean and the speedups are real: extracting _REGISTER_RANDOM_INTERVAL to a module constant keeps production behavior identical while letting quick_timing patch it on loopback, and the test_integration got_query.wait() synchronization correctly closes the race the faster register introduced. The blocking issue is what @bdraco already flagged: test_tc_bit_defers_last_response_missing is failing on the Windows CI matrix because the new _TC_DELAY_RANDOM_INTERVAL patch to (10, 20)ms is below Windows' default ~15.6ms timer tick. That patch isn't visible in the truncated diff but it's the load-bearing change for that test. Once that interval is widened enough to clear Windows clock granularity (see reply), this is mergeable. Secondary nit: the 0.1s _TASK_AWAIT_TIMEOUT reduction is aggressive — worth widening for safety on slow CI entries.


Automated review by Kōana22d068
6b84b72
98c3785

@bluetoothbot
Copy link
Copy Markdown
Contributor Author

@bdraco: all the windows TC tests fail with this pr

The Windows failure is almost certainly the combination of two compounding factors in test_tc_bit_defers_last_response_missing:

  1. _TC_DELAY_RANDOM_INTERVAL=(10, 20) is below the Windows scheduler tick (~15.6ms). The asyncio call_later driving the TC defer timer floors at the next tick, so the effective fire time is 15-30ms+ rather than the 10-20ms the patch suggests.

  2. The cleanup poll budget was halvedrange(8) × sleep(0.1) = 800ms became range(20) × sleep(0.02) = 400ms. Under Windows' coarser time.sleep granularity the wall clock is closer to 620ms, but the asyncio side is racing the same tick and can easily overshoot.

Two options to fix:

  • Keep the 800ms cleanup budget (range(40) × sleep(0.02) if you want the finer-grained early exit), or
  • Widen the patched random interval to something like (20, 50) or (30, 60) ms — still a 10x speedup over the production (400, 500) ms, but well clear of the Windows tick floor so the timing is deterministic across the CI matrix.

I'd lean toward the second — it preserves the speedup intent of the PR without making the cleanup poll do double duty as a slack budget for scheduler jitter.

`async_check_service` paid an inline `random.randint(150, 250)` ms wait
on every `register_service` — RFC 6762 §8.1 thundering-herd avoidance
for real networks, pure overhead on 127.0.0.1. Extracted as a module
constant `_REGISTER_RANDOM_INTERVAL` and patched in the `quick_timing`
fixture to (1, 5)ms alongside the existing _CHECK_TIME / _REGISTER_TIME
/ _UNREGISTER_TIME shaves.

Also:
- `test_shutdown_loop`: drop `.result(1)` → `.result(0.1)` and patch
  `_TASK_AWAIT_TIMEOUT` to 0.1s. The 1s waits were `shutdown_loop`'s
  outer timeout on a never-completing `asyncio.sleep(5)` — the test only
  needs the loop to be stoppable while a task is pending.
- `test_tc_bit_defers_last_response_missing`: patch
  `_TC_DELAY_RANDOM_INTERVAL` (400-500ms, RFC 6762 §7.2 TC-bit deferral)
  to (10, 20)ms and tighten the cleanup poll loop.
- `test_integration`: wait for the browser's first startup query to
  fire (with empty cache) before registering. Previously the test
  relied on the register's 150-250ms random wait being longer than the
  browser's 20-120ms first-query delay; with the new fast register that
  race flipped and the first query started seeing the known PTR via
  §7.1 known-answer suppression.

Speedups on loopback (CPython 3.12):

| test                                                  | before | after |
| ----------------------------------------------------- | ------ | ----- |
| test_shutdown_loop                                    | 1.16s  | 0.26s |
| test_tc_bit_defers_last_response_missing              | 0.79s  | 0.32s |
| test_async_service_registration_name_conflict        | 0.72s  | 0.10s |
| test_name_conflicts                                   | 0.50s  | 0.10s |

Plus a smaller transparent shave on every other `quick_timing`
register-based test. Full suite: 338 passed, 2 skipped. No production
behavior change — `_REGISTER_RANDOM_INTERVAL` still defaults to the
RFC 6762 §8.1 value on real networks.
@bluetoothbot
Copy link
Copy Markdown
Contributor Author

Rebase with requested adjustments

Branch koan/speed-up-register-random-wait was rebased onto master and review feedback was applied.

Changes applied

  • CI fix: Applied pre-push CI fix
  • CI fix: Pre-push CI fix applied

Stats

5 files changed, 23 insertions(+), 9 deletions(-)
Actions performed
  • Already-solved check: negative (confidence=high, reasoning=No commit on master extracts _REGISTER_RANDOM_INTERVAL or patches it via quick_timing; recent sp)
  • Rebased koan/speed-up-register-random-wait onto upstream/master
  • Pre-push CI check: previous run #26006589202 failed
  • Applied pre-push CI fix
  • Pre-push CI fix applied
  • Force-pushed koan/speed-up-register-random-wait to origin
  • CI check enqueued in ## CI (async)

CI status

CI will be checked asynchronously.


Automated by Kōan

@bluetoothbot bluetoothbot force-pushed the koan/speed-up-register-random-wait branch from ca16a07 to 6b84b72 Compare May 18, 2026 00:33
@bdraco
Copy link
Copy Markdown
Member

bdraco commented May 19, 2026

@bluetoothbot rr

@bluetoothbot
Copy link
Copy Markdown
Contributor Author

@bdraco: all the windows TC tests fail with this pr

The root cause is almost certainly Windows' default timer resolution. CPython's asyncio on Windows uses the system clock tick, which defaults to ~15.6ms (64Hz) unless something has called timeBeginPeriod(1). Patching _TC_DELAY_RANDOM_INTERVAL to (10, 20)ms means the lower bound (10ms) is below a single tick — the wait either returns immediately on the current tick boundary or has to wait a full ~15.6ms regardless, so the deferred response timing the test polls for becomes non-deterministic across the (10, 20)ms band.

Two viable fixes:

  1. Widen the patch to something safely above one Windows tick, e.g. (50, 100) or (100, 200). Still a meaningful shave vs. the production (400, 500)ms — the test would land around 0.4-0.5s instead of 0.32s, still well below the 0.79s baseline.

  2. Skip the patch on Windows via @pytest.mark.skipif(sys.platform == 'win32', ...) on just the patched variant, or branch the interval: (50, 100) if sys.platform == 'win32' else (10, 20).

Option 1 is the cleaner call — it keeps the test exercising the same code path on all platforms, just with a wider RFC 6762 §7.2 defer window that all OS timer resolutions can resolve. The poll loop's cleanup window would need to widen correspondingly to stay above the upper bound.

The diff snippet you pasted shows the patch is applied at function level — that's the right scope, just needs a Windows-safe value.

@bdraco
Copy link
Copy Markdown
Member

bdraco commented May 20, 2026

@bluetoothbot rr

2 similar comments
@bdraco
Copy link
Copy Markdown
Member

bdraco commented May 20, 2026

@bluetoothbot rr

@bdraco
Copy link
Copy Markdown
Member

bdraco commented May 21, 2026

@bluetoothbot rr

@bluetoothbot
Copy link
Copy Markdown
Contributor Author

Superseded — most of this PR's content (extracting _REGISTER_RANDOM_INTERVAL, patching it in quick_timing, the test_shutdown_loop cleanup) was landed in #1760 under the constant name _PROBE_RANDOM_DELAY_INTERVAL. Branch had diverged too far for a clean rebase. The one remaining novel bit — the got_query.wait() synchronisation in test_integration (needed now that register can finish before the browser's first query fires) — is being reopened as a fresh PR.

@bluetoothbot
Copy link
Copy Markdown
Contributor Author

Replacement opened: #1771 (test sync fragment only — everything else this PR proposed was already landed in #1760).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants