Fix NonexistentKeyGuarantee on resumed PDIB


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Storage Execution
    • Fully Compatible
    • ALL
    • Storage Execution 2026-07-06
    • 200

      Summary

      On a resumed primary-driven index build (PDIB) with replicated container writes, the bulk-load path can re-emit a container-insert (ci) oplog op for index keys that are already present in the index container. An applier that already holds those keys fatally rejects the duplicate with WT_DUPLICATE_KEY (-31801) -> fassert 34437. This is the crash observed in BF-43992.

      Root cause (confirmed in code)

      BulkBuilderImpl::_addKeyForCommit (src/mongo/db/index/index_access_method.cpp:1363-1392):

      • Starts with no _nonexistentKeyGuarantee -> ExistingKeyPolicy::reject: an already-present key returns KeyExists and is skipped without emitting an oplog op.
      • After the first successful insert it latches the guarantee on (:1377-1380) -> ExistingKeyPolicy::overwrite: every subsequent key is written blindly and emits a ci op even if it already exists.
      • The latch assumes already-present keys form a contiguous sorted prefix of the drained keys. True for a fresh build; unsound on a resume whose container holds a non-contiguous subset of the keys being drained.
      • _nonexistentKeyGuarantee is not persisted/restored on resume (resume constructor :1003-1022); it always starts unset.
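
      The latch behaviour described above can be modelled with a small standalone sketch. This is not the server code: the container is approximated by a std::set<int>, and DrainResult and drainWithLatch are hypothetical names introduced only for illustration. It shows why a contiguous prefix is harmless while a non-contiguous resumed container produces duplicate ci ops.

```cpp
#include <set>
#include <vector>

// Hypothetical stand-in for the policy named in the ticket; not the real MongoDB type.
enum class ExistingKeyPolicy { reject, overwrite };

struct DrainResult {
    std::vector<int> emittedOps;    // keys for which a "ci" op would be emitted
    std::vector<int> duplicateOps;  // emitted keys that were already in the container
};

// Drains sorted keys into `container`, mimicking the latch described in the ticket.
DrainResult drainWithLatch(std::set<int>& container, const std::vector<int>& sortedKeys) {
    DrainResult result;
    bool nonexistentKeyGuarantee = false;  // always starts unset, even on resume
    for (int key : sortedKeys) {
        const ExistingKeyPolicy policy = nonexistentKeyGuarantee
            ? ExistingKeyPolicy::overwrite
            : ExistingKeyPolicy::reject;
        const bool present = container.count(key) > 0;
        if (policy == ExistingKeyPolicy::reject && present) {
            continue;  // KeyExists: skipped, no oplog op emitted
        }
        if (present) {
            result.duplicateOps.push_back(key);  // blind overwrite of an existing key
        }
        container.insert(key);
        result.emittedOps.push_back(key);
        nonexistentKeyGuarantee = true;  // latch on after the first successful insert
    }
    return result;
}
```

      In this model, a container holding {1, 2, 3} drained with keys 1..5 emits ops only for 4 and 5. A container holding {1, 3, 5} drained with the same keys skips 1, inserts 2, latches, and then emits ops for 3 and 5 even though both were already present.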

      Producer/consumer asymmetry (crashes a follower, not the writer):

      • The primary's own write path tolerates KeyExists (index_access_method.cpp:446-449).
      • The applier does not: applyContainerOperations (src/mongo/db/repl/oplog.cpp) returns KeyExists, which the oplog applier turns into fassert 34437 (oplog_applier_utils.cpp:718, LOGV2 12337303 "Error applying grouped container operations").

        Crash signature (from BF-43992 logs)

        REPL  12337303  "Error applying grouped container operations"
                        op:"ci", ns:"admin.$container", container:"index-"
        ASSERT 23095    Fatal assertion 34437  KeyExists: -31801: WT_DUPLICATE_KEY
                        (OplogApplier), immediately after an "op":"n","msg":"new primary" entry
        

        The duplicated keys decode to ordinary integer-field index data keys (KeyString ctype 0x2C, kNumericPositive2ByteInt), and there are multiple distinct duplicated keys – not one repeated key.

        Relationship to SERVER-127943

      SERVER-127943 (Closed) is a sibling, not a fix for this. It fixed only the wildcard multikey metadata key form of this duplicate-ci family (by restoring _hasMultiKeyMetadataKeys on resume). Its fix (082c6350a5d) is present at the BF commit (66c084cd) – verified – yet BF-43992 still occurs because the duplicated keys here are ordinary data keys: a different leak of the same _addKeyForCommit latch. As that ticket noted, "the correct fix is the primary not emitting duplicate ci ops."

      Why it is rare (confirmed in code)

      Even when the primary re-emits a duplicate ci, the applier crashes only under a non-blind container write. Standbys sample blind writes with probability gWiredTigerBlindWriteRatio (default 0.999, wiredtiger_cursor_helpers.cpp); a blind/overwrite apply silently tolerates the duplicate. The fatal WT_DUPLICATE_KEY therefore requires the uncommon non-blind path.
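
      A rough feel for the rarity can be had from a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each duplicate ci op is sampled independently with the same blind-write probability; the ticket does not state the sampling granularity, and probAnyNonBlind is a hypothetical helper, not a server function.

```cpp
#include <cmath>

// Probability that at least one of `duplicateOps` duplicate ci ops is applied
// through the non-blind path, assuming independent per-op sampling (an
// assumption made for this sketch, not confirmed by the ticket).
double probAnyNonBlind(double blindWriteRatio, int duplicateOps) {
    return 1.0 - std::pow(blindWriteRatio, duplicateOps);
}
```

      Under that assumption and the default ratio of 0.999, a single duplicate op has a 0.1% chance of hitting the fatal path, and ten duplicate ops have a little under a 1% chance.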

      Suggested fix direction

      Stop the producer from emitting duplicate ci ops on resume. Minimal option: do not latch _nonexistentKeyGuarantee for a resumed load (keep ExistingKeyPolicy::reject for the whole resumed drain; reject already skips present keys correctly and emits only genuinely-absent ones). Do not "fix" it by tolerating KeyExists on the applier – that would mask genuine replication divergence (the same signature appears in real constraint-violation bugs) and would not stop the wasted oplog / replication churn.
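
      The minimal option can be illustrated with the same kind of standalone model. This is a sketch under stated assumptions, not the actual patch: the container is a std::set<int>, and drainKeys and its resumed parameter are hypothetical names. The only change from the latching behaviour is that a resumed load never sets the guarantee, so the reject-style check runs for every key.

```cpp
#include <set>
#include <vector>

// Returns the keys for which a "ci" op would be emitted.
// `resumed` is a hypothetical flag marking a resumed bulk load.
std::vector<int> drainKeys(std::set<int>& container,
                           const std::vector<int>& sortedKeys,
                           bool resumed) {
    std::vector<int> emitted;
    bool nonexistentKeyGuarantee = false;
    for (int key : sortedKeys) {
        // Reject-style check: skip keys that are already present.
        if (!nonexistentKeyGuarantee && container.count(key) > 0) {
            continue;
        }
        container.insert(key);
        emitted.push_back(key);
        // Only a fresh build may assume the remaining keys are absent.
        if (!resumed) {
            nonexistentKeyGuarantee = true;
        }
    }
    return emitted;
}
```

      With resumed set to true, a container holding {1, 3, 5} drained with keys 1..5 emits ops only for 2 and 4. The trade-off in this model is an existence check per key for the whole resumed drain instead of only for the leading prefix.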

      Reproduction status

      Not reproduced deterministically. The only known reproducer is pali_chaos (disagg_pali_chaos), which is non-deterministic and is the context in which BF-43992 was reported. A deterministic repro appears to require both (a) a non-contiguous resumed container and (b) a non-blind apply; manufacturing (a) outside the disaggregated-storage materialization timing was not achieved.

            Assignee:
            Gregory Noma
            Reporter:
            Alex Sarkesian