- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
-
None
-
Storage Execution
-
Fully Compatible
-
ALL
-
Storage Execution 2026-07-06
-
200
Summary
On a resumed primary-driven index build (PDIB) with replicated container writes, the bulk-load path can re-emit a container-insert (ci) oplog op for index keys that are already present in the index container. An applier that already holds those keys fatally rejects the duplicate with WT_DUPLICATE_KEY (-31801) -> fassert 34437. This is the crash observed in BF-43992.
Root cause (confirmed in code)
BulkBuilderImpl::_addKeyForCommit (src/mongo/db/index/index_access_method.cpp:1363-1392):
- Starts with no _nonexistentKeyGuarantee -> ExistingKeyPolicy::reject: an already-present key returns KeyExists and is skipped without emitting an oplog op.
- After the first successful insert it latches the guarantee on (:1377-1380) -> ExistingKeyPolicy::overwrite: every subsequent key is written blindly and emits a ci op even if it already exists.
- The latch assumes already-present keys form a contiguous sorted prefix of the drained keys. True for a fresh build; unsound on a resume whose container holds a non-contiguous subset of the keys being drained.
- _nonexistentKeyGuarantee is not persisted/restored on resume (resume constructor :1003-1022); it always starts unset.
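The latch behaviour described above can be illustrated with a toy model. This is a minimal sketch in Python, not the server code: `drain_with_latch` and `duplicate_ops` are hypothetical names, the container is modelled as a plain set, and integer keys stand in for KeyStrings. It only mirrors the reject-then-overwrite policy switch as described in this ticket.

```python
def drain_with_latch(sorted_keys, container):
    """Toy model of the reject -> overwrite latch (hypothetical, not server code).

    Returns the keys for which a ci op would be emitted.
    """
    emitted = []
    guarantee = False  # stands in for _nonexistentKeyGuarantee
    for key in sorted_keys:
        if guarantee:
            # overwrite policy: written blindly, op emitted even if present
            container.add(key)
            emitted.append(key)
            continue
        # reject policy: an already-present key is skipped, no op emitted
        if key in container:
            continue
        container.add(key)
        emitted.append(key)
        guarantee = True  # latch on after the first successful insert
    return emitted


def duplicate_ops(sorted_keys, container):
    """Emitted keys that were already in the container before the drain."""
    before = set(container)
    emitted = drain_with_latch(sorted_keys, set(container))
    return [k for k in emitted if k in before]


keys = [1, 2, 3, 4, 5]
print(duplicate_ops(keys, {1, 2}))  # present keys form a contiguous prefix
print(duplicate_ops(keys, {1, 4}))  # non-contiguous subset, as on a resume
```

In this model the contiguous-prefix case emits no duplicates, while the non-contiguous case re-emits key 4 after the latch has been set by key 2.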
Producer/consumer asymmetry (crashes a follower, not the writer):
- The primary's own write path tolerates KeyExists (index_access_method.cpp:446-449).
- The applier does not: applyContainerOperations (src/mongo/db/repl/oplog.cpp) returns KeyExists, which the oplog applier turns into fassert 34437 (oplog_applier_utils.cpp:718, LOGV2 12337303 "Error applying grouped container operations").
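The asymmetry can be sketched as follows. This is an illustrative Python model under stated assumptions: `KeyExists`, `strict_insert`, `primary_write` and `apply_ci` are hypothetical stand-ins, and the raised exception only represents the point at which the real applier would hit fassert 34437.

```python
class KeyExists(Exception):
    """Stands in for the KeyExists status (WT_DUPLICATE_KEY)."""


def strict_insert(container, key):
    # Non-blind insert: a present key is an error.
    if key in container:
        raise KeyExists(key)
    container.add(key)


def primary_write(container, key):
    """Writer side: a duplicate is tolerated and treated as a no-op."""
    try:
        strict_insert(container, key)
        return "inserted"
    except KeyExists:
        return "skipped"


def apply_ci(container, key):
    """Applier side: the error propagates (the real applier fasserts)."""
    strict_insert(container, key)
    return "applied"


primary = {7}
follower = {7}
print(primary_write(primary, 7))  # the writer survives the duplicate
try:
    apply_ci(follower, 7)
except KeyExists as exc:
    print("applier would abort on key", exc.args[0])
```

The same duplicate key is harmless on the node that produced the op and fatal on the node that applies it, which is why the crash shows up on a follower.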
Crash signature (from BF-43992 logs)
REPL 12337303 "Error applying grouped container operations" op:"ci", ns:"admin.$container", container:"index-" ASSERT 23095 Fatal assertion 34437 KeyExists: -31801: WT_DUPLICATE_KEY (OplogApplier), immediately after an "op":"n","msg":"new primary" entry
The duplicated keys decode to ordinary integer-field index data keys (KeyString ctype 0x2C, kNumericPositive2ByteInt), and there are multiple distinct duplicated keys – not one repeated key.
Relationship to SERVER-127943
SERVER-127943 (Closed) is a sibling, not a fix for this. It fixed only the wildcard multikey metadata key form of this duplicate-ci family (by restoring _hasMultiKeyMetadataKeys on resume). Its fix (082c6350a5d) is present at the BF commit (66c084cd) – verified – yet BF-43992 still occurs because the duplicated keys here are ordinary data keys: a different leak of the same _addKeyForCommit latch. As that ticket noted, "the correct fix is the primary not emitting duplicate ci ops."
Why it is rare (confirmed in code)
Even when the primary re-emits a duplicate ci, the applier crashes only under a non-blind container write. Standbys sample blind writes with probability gWiredTigerBlindWriteRatio (default 0.999, wiredtiger_cursor_helpers.cpp); a blind/overwrite apply silently tolerates the duplicate. The fatal WT_DUPLICATE_KEY therefore requires the uncommon non-blind path.
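A rough feel for the rarity, as a back-of-the-envelope sketch. It assumes that each duplicate ci apply samples blind vs. non-blind independently with the default ratio; the actual sampling granularity (per op, per cursor, per grouped batch) is not established in this ticket, so treat the numbers as illustrative only. `crash_probability` is a hypothetical helper.

```python
def crash_probability(n_duplicate_ops, blind_ratio=0.999):
    """P(at least one of n duplicate applies takes the non-blind path).

    Assumes independent sampling per apply, which is an assumption of
    this sketch and not something confirmed in the code.
    """
    return 1.0 - blind_ratio ** n_duplicate_ops


for n in (1, 10, 100, 1000):
    print(n, round(crash_probability(n), 4))
```

Under that assumption a single duplicate is fatal about once in a thousand applies, and the chance grows with the number of duplicated keys re-emitted by one resumed drain.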
Suggested fix direction
Stop the producer from emitting duplicate ci ops on resume. Minimal option: do not latch _nonexistentKeyGuarantee for a resumed load, i.e. keep ExistingKeyPolicy::reject for the whole resumed drain; reject already skips present keys correctly and emits ops only for genuinely absent ones. Do not "fix" this by tolerating KeyExists on the applier: that would mask genuine replication divergence (the same signature appears in real constraint-violation bugs) and would not stop the wasted oplog and replication churn.
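The minimal option can be modelled the same way. This is a hedged Python sketch, not a patch: `drain` and its `resumed` flag are hypothetical, the container is a plain set, and the only point shown is that never latching on a resumed load keeps the reject policy for the whole drain.

```python
def drain(sorted_keys, container, resumed):
    """Toy model of the suggested direction (hypothetical, not server code).

    Returns the keys for which a ci op would be emitted.
    """
    emitted = []
    guarantee = False  # stands in for _nonexistentKeyGuarantee
    for key in sorted_keys:
        if guarantee:
            # overwrite policy: only reachable on a fresh (non-resumed) load
            container.add(key)
            emitted.append(key)
            continue
        if key in container:
            continue  # reject policy: skip present keys, emit nothing
        container.add(key)
        emitted.append(key)
        if not resumed:
            guarantee = True  # latch only when the prefix assumption holds
    return emitted


keys = [1, 2, 3, 4, 5]
print(drain(keys, {1, 4}, resumed=True))   # only the absent keys are emitted
print(drain(keys, {1, 4}, resumed=False))  # latched path re-emits key 4
```

The cost of this option in the model is one existence check per key for the whole resumed drain, instead of only up to the first successful insert.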
Reproduction status
Not reproduced deterministically. The only known reproducer is pali_chaos (disagg_pali_chaos), which is non-deterministic and is the context in which BF-43992 was reported. A deterministic repro appears to require both (a) a non-contiguous resumed container and (b) a non-blind apply; manufacturing (a) outside the disaggregated-storage materialization timing was not achieved.
- depends on: SERVER-129967 Add API to delete all index entries for PDIB (Closed)
- is related to: SERVER-127943 Resumable primary-driven build of a multikey wildcard index crashes the secondary with WT_DUPLICATE_KEY (Closed)