Summary
Adds a jstest reproducer for SERVER-92779 plus a design doc sketching the two-line accounting fix in BatchedDeleteStage that lets the TTL monitor walk past a band of expired orphan documents.
Files
- jstests/sharding/ttl_blocked_by_unowned_docs.js (184 lines)
- src/mongo/db/ttl/PROPOSED_FIX.md (130 lines, ~831 words)
Reproducer flow
2-shard ShardingTest with the range deleter disabled and ttlIndexDeleteTargetDocs lowered to 100, so only ~150 orphans are needed (vs the 50,000 default). The test inserts 150 docs at sk: 1..150 and one canary at sk: -1, splits at sk: 0, and moves the high-side chunk to shard1, leaving 150 expired orphans + 1 owned expired canary on shard0. After creating the TTL index, it asserts (via assert.soon) that shard0's canary gets deleted. Today that hangs, so the final block is gated behind a FIX_LANDED flag, currently false (the test skips with jsTest.log + quit()). The test is also tagged __TEMPORARILY_DISABLED_PENDING_SERVER_92779.
Root cause (proposed)
BatchedDeleteStage::_passTotalDocsStaged is incremented in the staging loop for every working-set member appended to the buffer, before the commit phase consults the ownership filter and skips orphans (cpp lines 389-394, 516). Once a shard's expired-orphan count crosses targetPassDocs, the stage stages only orphans, hits _passTargetMet() without committing a single delete, drains by skipping every staged orphan, and returns EOF. The TTL monitor's next sub-pass then starts from the same index position and reproduces the same stall.
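The stall can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the server code: Doc, PassResult, runPass and makeOrphanBand are hypothetical names, the counter result.staged stands in for _passTotalDocsStaged, and the orphans are assumed to sort ahead of the owned canary in the TTL index.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for one expired TTL index entry (hypothetical, not a MongoDB type).
struct Doc {
    bool owned;    // false models an orphan left behind by a chunk migration
    bool deleted;
};

struct PassResult {
    std::size_t staged = 0;   // stands in for _passTotalDocsStaged
    std::size_t deleted = 0;
};

// One delete pass with the current accounting: the counter grows at staging
// time, and ownership is only checked later, at commit time.
PassResult runPass(std::vector<Doc>& index, std::size_t targetPassDocs) {
    PassResult result;
    std::vector<std::size_t> buffer;

    // Staging phase: every pass starts from the front of the index.
    for (std::size_t i = 0; i < index.size() && result.staged < targetPassDocs; ++i) {
        if (index[i].deleted) continue;
        buffer.push_back(i);
        ++result.staged;
    }

    // Commit phase: orphans are skipped but were already counted above.
    for (std::size_t i : buffer) {
        if (!index[i].owned) continue;
        index[i].deleted = true;
        ++result.deleted;
    }
    return result;
}

// Layout assumed from the reproducer: a band of orphans, then one owned canary.
std::vector<Doc> makeOrphanBand(std::size_t orphans) {
    std::vector<Doc> index(orphans, Doc{false, false});
    index.push_back(Doc{true, false});
    return index;
}
```

In this model, with makeOrphanBand(150) and a target of 100, every call to runPass stages 100 orphans, deletes nothing, and leaves the canary untouched, no matter how many times the pass is repeated.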
Proposed fix (two-layer, defense in depth)
- Change A: in the orphan-skip branch of _deleteBatch, decrement _passTotalDocsStaged so the pass target counts only actual delete attempts. This honors the existing IDL docstring ("limits the number of expired documents removed").
- Change B: in TTLMonitor::_deleteExpiredWithIndex, treat examined > 0 && deleted == 0 && passTargetMet as "there is more to do" and return true to schedule another sub-pass. This is defense in depth against future commit-time skip paths.
Both diff sketches are in PROPOSED_FIX.md.
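For readers without PROPOSED_FIX.md at hand, the effect of the two changes can be approximated in a toy model. It is a sketch only and not the diff itself: runPass, stalledOnSkips, the batchSize parameter and the applyChangeA flag are hypothetical, staged stands in for _passTotalDocsStaged, and the batch-wise stage/commit interleaving is an assumption about how the stage proceeds.

```cpp
#include <cstddef>
#include <vector>

// Toy index entry (hypothetical, not a MongoDB type).
struct Doc {
    bool owned;
    bool deleted;
};

struct PassResult {
    std::size_t examined = 0;
    std::size_t deleted = 0;
    bool passTargetMet = false;
};

// One pass that alternates staging and committing in batches.
// applyChangeA toggles the proposed decrement in the orphan-skip branch.
PassResult runPass(std::vector<Doc>& index,
                   std::size_t targetPassDocs,
                   std::size_t batchSize,
                   bool applyChangeA) {
    PassResult result;
    std::size_t staged = 0;  // stands in for _passTotalDocsStaged
    std::size_t pos = 0;
    std::vector<std::size_t> buffer;

    while (true) {
        // Stage up to one batch, stopping early once the pass target is reached.
        while (pos < index.size() && staged < targetPassDocs && buffer.size() < batchSize) {
            if (!index[pos].deleted) {
                buffer.push_back(pos);
                ++staged;
                ++result.examined;
            }
            ++pos;
        }
        if (buffer.empty()) break;  // nothing left to stage

        // Commit the batch, consulting ownership per document.
        for (std::size_t i : buffer) {
            if (index[i].owned) {
                index[i].deleted = true;
                ++result.deleted;
            } else if (applyChangeA) {
                --staged;  // Change A: a skipped orphan no longer counts toward the target
            }
        }
        buffer.clear();

        if (staged >= targetPassDocs) {
            result.passTargetMet = true;
            break;
        }
    }
    return result;
}

// Change B condition: the pass hit its target yet deleted nothing.
bool stalledOnSkips(const PassResult& r) {
    return r.examined > 0 && r.deleted == 0 && r.passTargetMet;
}
```

In this model, 150 orphans followed by one owned canary with a target of 100 stall when applyChangeA is false (stalledOnSkips reports true), while with applyChangeA set the pass walks past the orphan band and deletes the canary. An index with only owned documents behaves identically under both settings.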
Tests
- jstests/sharding/ttl_blocked_by_unowned_docs.js — added (set FIX_LANDED = true and drop the disable-tag once the fix lands).
- Suggested follow-up unit test in batched_delete_stage_test.cpp that injects a mock ownership filter rejecting half of the staged docs and asserts that _passTargetMet() counts only owned staged docs.
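The shape of that follow-up check can be sketched outside the server's unit-test framework. Everything here is hypothetical (PassCounters, runFilteredPass, and the std::function filter standing in for the mock ownership filter), and the pass is simplified to stage and commit one document at a time with the Change A accounting applied.

```cpp
#include <cstddef>
#include <functional>

struct PassCounters {
    std::size_t examined = 0;
    std::size_t deleted = 0;
    std::size_t countedTowardTarget = 0;  // stands in for _passTotalDocsStaged
};

// Runs one pass over document ids 0..numDocs-1 with an injected ownership filter.
PassCounters runFilteredPass(std::size_t numDocs,
                             std::size_t targetPassDocs,
                             const std::function<bool(std::size_t)>& ownsDoc) {
    PassCounters c;
    for (std::size_t id = 0; id < numDocs && c.countedTowardTarget < targetPassDocs; ++id) {
        ++c.examined;
        ++c.countedTowardTarget;  // counted when staged
        if (ownsDoc(id)) {
            ++c.deleted;
        } else {
            --c.countedTowardTarget;  // rejected by the filter: undo the count
        }
    }
    return c;
}
```

With a filter that owns only even ids, 40 documents and a target of 10, the pass in this model ends with countedTowardTarget equal to deleted, which is the property the suggested unit test would assert.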
Risk
Behavior change is confined to the orphan-skip path. Collections without orphans see no observable change. Collections with orphans: TTL now makes forward progress past the orphan band — the intended fix.
Related
- SERVER-92779: TTL delete progress blocked by unowned documents (Needs Scheduling)
- Adjacent "audit/extend TTL tests" tickets:
- SERVER-96179: Further extension to ttl unit testing (Blocked)
- SERVER-97659: Add TTL unit tests that require reproducing timing conditions (Backlog)
- SERVER-97661: Audit js tests for TTL (Backlog)