TTL delete blocked when BatchedDeleteStage stages orphans against the pass target (reproducer + fix sketch for SERVER-92779)

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None

      Summary

      Adds a jstest reproducer for SERVER-92779 plus a design doc sketching the two-line accounting fix in BatchedDeleteStage that lets the TTL monitor walk past a band of expired orphan documents.

      Files

      • jstests/sharding/ttl_blocked_by_unowned_docs.js (184 lines)
      • src/mongo/db/ttl/PROPOSED_FIX.md (130 lines, ~831 words)

      Reproducer flow

      Uses a 2-shard ShardingTest with the range deleter disabled and ttlIndexDeleteTargetDocs lowered to 100, so only ~150 orphans are needed (vs the 50,000 default). The test inserts 150 docs at sk: 1..150 and one canary at sk: -1, splits at sk: 0, and moves the high-side chunk to shard1, leaving 150 expired orphans plus 1 owned expired canary on shard0. After creating the TTL index, it asserts (via assert.soon) that shard0's canary gets deleted. Today the canary is never deleted, so the test gates the final block on FIX_LANDED = false (it logs via jsTest.log and then calls quit()). The test is also tagged __TEMPORARILY_DISABLED_PENDING_SERVER_92779.

      Root cause (proposed)

      BatchedDeleteStage::_passTotalDocsStaged is incremented in the staging loop for every working-set member appended to the buffer, before the commit phase consults the ownership filter and skips orphans (cpp lines 389-394, 516). Once a shard's expired-orphan count crosses targetPassDocs, the stage buffers only orphans, reaches _passTargetMet() without committing a single delete, drains the buffer by skipping every staged orphan, and returns EOF. The TTL monitor's next sub-pass then starts from the same index position and reproduces the same stall.
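
      The stall can be illustrated with a toy model. This is only a sketch, not the server code: Doc, PassResult, runPass and makeShard0 are hypothetical names, staging is reduced to a prefix walk, and the model assumes the orphans sort ahead of the canary in TTL index order.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model only; none of these names exist in the server.
struct Doc {
    int sk;
    bool owned;  // false = orphan on this shard
};

struct PassResult {
    std::size_t staged = 0;   // stands in for _passTotalDocsStaged
    std::size_t deleted = 0;
    bool targetMet = false;
};

// One TTL sub-pass over `docs`, always starting from the first index entry.
PassResult runPass(std::vector<Doc>& docs, std::size_t targetPassDocs) {
    PassResult r;
    // Staging: every buffered document counts toward the pass target,
    // whether or not this shard owns it.
    while (r.staged < docs.size() && !r.targetMet) {
        ++r.staged;
        r.targetMet = targetPassDocs != 0 && r.staged >= targetPassDocs;
    }
    // Commit: ownership is checked only now; orphans are skipped and kept.
    std::vector<Doc> remaining;
    for (std::size_t i = 0; i < docs.size(); ++i) {
        if (i < r.staged && docs[i].owned) {
            ++r.deleted;
            continue;
        }
        remaining.push_back(docs[i]);
    }
    docs = std::move(remaining);
    return r;
}

// Shard0 after the chunk move in the reproducer: 150 orphans, then the canary.
// (Assumption: the orphans precede the canary in index order.)
std::vector<Doc> makeShard0() {
    std::vector<Doc> docs;
    for (int sk = 1; sk <= 150; ++sk) docs.push_back({sk, false});
    docs.push_back({-1, true});
    return docs;
}
```

      With a target of 100, every call to runPass stages the same 100 orphans, deletes nothing, and leaves all 151 documents in place, so the canary is never reached no matter how many sub-passes run.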

      Proposed fix (two-layer, defense in depth)

      • Change A: in the orphan-skip branch of _deleteBatch, decrement _passTotalDocsStaged so the pass target counts only genuine delete attempts. This honors the existing IDL docstring ("limits the number of expired documents removed").
      • Change B: in TTLMonitor::_deleteExpiredWithIndex, treat examined > 0 && deleted == 0 && passTargetMet as "there is more to do" and return true to schedule another sub-pass. This is defense in depth against future commit-time skip paths.

      Both diff sketches are in PROPOSED_FIX.md.
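
      Separately from those diffs, the intent of the two changes can be sketched in a toy model. Everything here is a hypothetical illustration: runPassWithChangeA, moreWorkAfterEmptyPass and makeShard0 are invented names, staging and commit are collapsed into a single per-document step, and the predicate covers only the condition Change B adds, not the function's full return logic.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model only; none of these names exist in the server.
struct Doc {
    int sk;
    bool owned;  // false = orphan on this shard
};

struct PassResult {
    std::size_t examined = 0;
    std::size_t counted = 0;  // stands in for _passTotalDocsStaged
    std::size_t deleted = 0;
    bool targetMet = false;
};

// Change A (sketch): an orphan skipped at commit time gives its slot back,
// so only real deletes count toward the pass target.
PassResult runPassWithChangeA(std::vector<Doc>& docs, std::size_t targetPassDocs) {
    PassResult r;
    std::vector<Doc> remaining;
    for (const Doc& d : docs) {
        if (r.targetMet) {  // pass is over; leave the rest untouched
            remaining.push_back(d);
            continue;
        }
        ++r.examined;
        ++r.counted;  // staged
        if (!d.owned) {
            --r.counted;  // orphan skip: undo the staging count
            remaining.push_back(d);
        } else {
            ++r.deleted;
        }
        r.targetMet = targetPassDocs != 0 && r.counted >= targetPassDocs;
    }
    docs = std::move(remaining);
    return r;
}

// Change B (sketch): a pass that examined documents, deleted none, and still
// met its target is reported as "more work to do".
bool moreWorkAfterEmptyPass(std::size_t examined, std::size_t deleted, bool passTargetMet) {
    return examined > 0 && deleted == 0 && passTargetMet;
}

// Shard0 layout from the reproducer: 150 orphans, then the owned canary.
std::vector<Doc> makeShard0() {
    std::vector<Doc> docs;
    for (int sk = 1; sk <= 150; ++sk) docs.push_back({sk, false});
    docs.push_back({-1, true});
    return docs;
}
```

      In this model a single pass with a target of 100 walks all 150 orphans, deletes the canary, and finishes with the target unmet, because only one document was counted.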

      Tests

      • jstests/sharding/ttl_blocked_by_unowned_docs.js — added (set FIX_LANDED = true and drop the disable-tag once the fix lands).
      • Suggested follow-up unit test in batched_delete_stage_test.cpp that injects a mock ownership filter rejecting half of the staged docs and asserts that _passTargetMet() accounts only for owned documents.
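
      The assertion that follow-up test would make can be sketched outside the server's unit-test framework. This is a hedged illustration only: countTowardTarget and CountResult are hypothetical names, the ownership filter is a plain std::function, and the real test would drive BatchedDeleteStage directly.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Toy model only; none of these names exist in the server.
struct CountResult {
    std::size_t examined = 0;
    std::size_t counted = 0;  // docs that count toward the pass target
};

// Walks staged shard keys in order and counts only the documents the injected
// ownership filter accepts, stopping once the target is reached.
CountResult countTowardTarget(const std::vector<int>& stagedKeys,
                              const std::function<bool(int)>& isOwned,
                              std::size_t targetPassDocs) {
    CountResult r;
    for (int sk : stagedKeys) {
        if (targetPassDocs != 0 && r.counted >= targetPassDocs) break;
        ++r.examined;
        ++r.counted;                    // staged
        if (!isOwned(sk)) --r.counted;  // rejected by the filter: not counted
    }
    return r;
}
```

      With keys 1..20 and a filter that accepts only even keys, a target of 10 is reached only after all 20 documents have been examined, and a target of 5 after the first 10.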

      Risk

      The behavior change is confined to the orphan-skip path. Collections without orphans see no observable change. On collections with orphans, TTL now makes forward progress past the orphan band, which is the intended effect.

            Assignee:
            Unassigned
            Reporter:
            Mehar Grewal