Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 9.0.0-rc1
Affects Version/s: None
Component/s: None
Labels:
None

Assigned Teams:

Storage Execution
Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Sprint:
Storage Execution 2026-07-06
Linked BF Score:
200
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Summary

On a resumed primary-driven index build (PDIB) with replicated container writes, the bulk-load path can re-emit a container-insert (ci) oplog op for index keys that are already present in the index container. An applier that already holds those keys fatally rejects the duplicate with WT_DUPLICATE_KEY (-31801) -> fassert 34437. This is the crash observed in BF-43992.

Root cause (confirmed in code)

BulkBuilderImpl::_addKeyForCommit (src/mongo/db/index/index_access_method.cpp:1363-1392):

Starts with no _nonexistentKeyGuarantee -> ExistingKeyPolicy::reject: an already-present key returns KeyExists and is skipped without emitting an oplog op.
After the first successful insert it latches the guarantee on (:1377-1380) -> ExistingKeyPolicy::overwrite: every subsequent key is written blindly and emits a ci op even if it already exists.
The latch assumes already-present keys form a contiguous sorted prefix of the drained keys. True for a fresh build; unsound on a resume whose container holds a non-contiguous subset of the keys being drained.
_nonexistentKeyGuarantee is not persisted/restored on resume (resume constructor :1003-1022); it always starts unset.
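
The latch behaviour described above can be illustrated with a toy model. This is a minimal sketch in Python, not the server's C++: `drain`, `duplicate_ops`, and the use of a plain `set` as the index container are hypothetical stand-ins, and only the reject-then-overwrite latch is modelled.

```python
from enum import Enum


class ExistingKeyPolicy(Enum):
    REJECT = "reject"        # probe first; an existing key yields KeyExists
    OVERWRITE = "overwrite"  # write blindly, no existence check


def drain(sorted_keys, container):
    """Toy model of the latch; returns the keys for which a 'ci' op is emitted."""
    guarantee = False  # stands in for _nonexistentKeyGuarantee; always unset at start
    emitted = []
    for key in sorted_keys:
        policy = ExistingKeyPolicy.OVERWRITE if guarantee else ExistingKeyPolicy.REJECT
        if policy is ExistingKeyPolicy.REJECT and key in container:
            continue  # KeyExists: skipped, no oplog op emitted
        container.add(key)   # under OVERWRITE this happens even if the key exists
        emitted.append(key)  # a 'ci' op is emitted for this key
        guarantee = True     # latch on after the first successful insert
    return emitted


def duplicate_ops(sorted_keys, preexisting):
    """Emitted keys that were already in the container before the drain."""
    emitted = drain(sorted_keys, set(preexisting))
    return [k for k in emitted if k in preexisting]


keys = [1, 2, 3, 4, 5, 6]
fresh_like = duplicate_ops(keys, {1, 2, 3})       # contiguous sorted prefix
resumed_like = duplicate_ops(keys, {1, 2, 4, 6})  # non-contiguous subset
print(fresh_like, resumed_like)
```

In this model the contiguous-prefix case emits no duplicates, while the non-contiguous case re-emits keys 4 and 6 after the latch flips on key 3.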

Producer/consumer asymmetry (crashes a follower, not the writer):

The primary's own write path tolerates KeyExists (index_access_method.cpp:446-449).
The applier does not: applyContainerOperations (src/mongo/db/repl/oplog.cpp) returns KeyExists, which the oplog applier turns into fassert 34437 (oplog_applier_utils.cpp:718, LOGV2 12337303 "Error applying grouped container operations").
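
The asymmetry can be sketched as follows. This is a simplified Python illustration under stated assumptions: `container_insert`, `primary_write`, and `applier_apply` are hypothetical names, the container is a plain `set`, and the fatal assertion is modelled as a returned string rather than a process abort.

```python
class KeyExists(Exception):
    """Stands in for the KeyExists status (WT_DUPLICATE_KEY)."""


def container_insert(container, key):
    # Non-blind insert: an already-present key is an error.
    if key in container:
        raise KeyExists(key)
    container.add(key)


def primary_write(container, key):
    """Producer side: KeyExists is tolerated and the key is skipped."""
    try:
        container_insert(container, key)
        return "inserted"
    except KeyExists:
        return "skipped"


def applier_apply(container, key):
    """Consumer side: KeyExists is treated as fatal."""
    try:
        container_insert(container, key)
        return "applied"
    except KeyExists:
        return "fassert 34437"


primary_outcome = primary_write({10, 20}, 20)
follower_outcome = applier_apply({10, 20}, 20)
print(primary_outcome, follower_outcome)
```

The same duplicate key is harmless on the writer and fatal on the follower, which is why the crash surfaces away from the node that produced the op.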

Crash signature (from BF-43992 logs)
```
REPL  12337303  "Error applying grouped container operations"
                op:"ci", ns:"admin.$container", container:"index-"
ASSERT 23095    Fatal assertion 34437  KeyExists: -31801: WT_DUPLICATE_KEY
                $OplogApplier$, immediately after an "op":"n","msg":"new primary" entry
```
The duplicated keys decode to ordinary integer-field index data keys (KeyString ctype 0x2C, kNumericPositive2ByteInt), and there are multiple distinct duplicated keys – not one repeated key.

Relationship to SERVER-127943

SERVER-127943 (Closed) is a sibling, not a fix for this. It fixed only the wildcard multikey metadata key form of this duplicate-ci family (by restoring _hasMultiKeyMetadataKeys on resume). Its fix (082c6350a5d) is present at the BF commit (66c084cd) – verified – yet BF-43992 still occurs because the duplicated keys here are ordinary data keys: a different leak of the same _addKeyForCommit latch. As that ticket noted, "the correct fix is the primary not emitting duplicate ci ops."

Why it is rare (confirmed in code)

Even when the primary re-emits a duplicate ci, the applier crashes only under a non-blind container write. Standbys sample blind writes with probability gWiredTigerBlindWriteRatio (default 0.999, wiredtiger_cursor_helpers.cpp); a blind/overwrite apply silently tolerates the duplicate. The fatal WT_DUPLICATE_KEY therefore requires the uncommon non-blind path.
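
As a rough back-of-the-envelope illustration of the rarity, the chance that at least one duplicate ci op lands on the non-blind path can be estimated. This sketch assumes the blind/non-blind choice is sampled independently per op, which is an assumption made here for illustration and not something confirmed in the code; `crash_probability` is a hypothetical helper.

```python
def crash_probability(duplicate_ops: int, blind_ratio: float = 0.999) -> float:
    """P(at least one duplicate 'ci' is applied non-blind).

    Assumes each apply independently takes the blind path with
    probability blind_ratio (an assumption for illustration only).
    """
    return 1.0 - blind_ratio ** duplicate_ops


one = crash_probability(1)       # a single duplicate op: about 0.001
many = crash_probability(1000)   # a thousand duplicate ops: well over one half
print(one, many)
```

Under that assumption a single duplicate is almost always tolerated, but the risk grows with the number of duplicate ops a resumed drain emits.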

Suggested fix direction

Stop the producer from emitting duplicate ci ops on resume. Minimal option: do not latch _nonexistentKeyGuarantee for a resumed load (keep ExistingKeyPolicy::reject for the whole resumed drain; reject already skips present keys correctly and emits only genuinely absent ones). Do not "fix" it by tolerating KeyExists on the applier: that would mask genuine replication divergence (the same signature appears in real constraint-violation bugs) and would not stop the wasted oplog / replication churn.
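
The minimal option can be sketched in the same toy style. This is an illustrative Python model, not the actual patch: `drain` and its `resumed` flag are hypothetical, and a `set` stands in for the index container.

```python
def drain(sorted_keys, container, resumed):
    """Toy drain; returns the keys for which a 'ci' op would be emitted."""
    guarantee = False  # stands in for _nonexistentKeyGuarantee
    emitted = []
    for key in sorted_keys:
        if not guarantee and key in container:
            continue  # reject policy: KeyExists, skipped without an oplog op
        container.add(key)
        emitted.append(key)
        if not resumed:
            # Latch only when the contiguous-prefix assumption can hold.
            guarantee = True
    return emitted


preexisting = {1, 2, 4, 6}  # non-contiguous subset left by an interrupted build
emitted = drain([1, 2, 3, 4, 5, 6], set(preexisting), resumed=True)
duplicates = [k for k in emitted if k in preexisting]
print(emitted, duplicates)
```

With the latch disabled for the resumed drain, only the genuinely absent keys (3 and 5 in this example) are emitted, at the cost of an existence check on every key.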

Reproduction status

Not reproduced deterministically. The known reproducer is pali_chaos (disagg_pali_chaos), which is non-deterministic and is the context BF-43992 was reported from. A deterministic repro appears to require both (a) a non-contiguous resumed container and (b) a non-blind apply; manufacturing (a) outside the disaggregated-storage materialization timing was not achieved.

depends on

SERVER-129967 Add API to delete all index entries for PDIB

Closed

is related to

SERVER-127943 Resumable primary-driven build of a multikey wildcard index crashes the secondary with WT_DUPLICATE_KEY

Closed

related to

SERVER-130647 Utilize batched container writes in index build bulk load

Closed

Assignee: Gregory Noma
Reporter: Alex Sarkesian
Participants: Alex Sarkesian, Githook User, Gregory Noma
Votes: 0
Watchers: 5

Created: Jun 24 2026 07:31:36 PM UTC
Updated: Jul 14 2026 07:08:15 AM UTC
Resolved: Jul 01 2026 04:28:04 PM UTC

Fix NonexistentKeyGuarantee on resumed PDIB
