Resumable primary-driven build of a multikey wildcard index crashes the secondary with WT_DUPLICATE_KEY


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Storage Execution
    • Fully Compatible
    • ALL
    • Storage Execution 2026-06-08
    • 200

      Overview

      A resumable primary-driven index build (PDIB) of a multikey wildcard index crashes a secondary node with a fatal WT_DUPLICATE_KEY assertion while it applies the build's container-insert oplog ops. Root cause is confirmed in code (below). Found while extending PrimaryDrivenResumableIndexBuildTest (SERVER-127680).

      Trigger conditions (confirmed)

      Suite: no_passthrough_primary_driven_index_builds (enables featureFlagPrimaryDrivenIndexBuilds, featureFlagResumablePrimaryDrivenIndexBuilds, featureFlagContainerWrites). A single resumable build is sufficient: build a wildcard index over an array-valued field (so the wildcard is multikey) and resume it at the beginning of the bulk-load phase (load iteration 0, the scan->bulk-load boundary), e.g. by pausing there and stepping up a secondary. Reproduced reliably with --jobs=1.

      • Required: the wildcard is multikey (array values) AND the resume happens at the load-phase beginning (iteration 0).
      • Not triggered by: scalar (non-multikey) wildcard; a plain multikey b-tree; a unique index; resuming at the load middle/end; or a non-resumed build.
      • (Correction: an earlier version of this description said "repeated build/drop/rebuild" was required. That was imprecise: a single resume at the load beginning suffices.)

      Crash

      "s":"F","c":"ASSERT","id":23095,"ctx":"OplogApplier-0","msg":"Fatal assertion",
       "attr":{"msgid":34437,"error":"KeyExists: -31801: WT_DUPLICATE_KEY: attempt to insert an existing key",
               "location":"src/mongo/db/repl/oplog_applier_impl.cpp:695"}
      

      Preceded by "Error applying grouped container operations" (id 12337303) for "op":"ci","ns":"admin.$container","container":"index-<uuid>". The duplicate key decodes to the wildcard multikey-metadata marker {"":1,"":"<path>"} + a reserved RecordId, with an empty value.

      Root cause (confirmed in code)

      The duplicated entry is the wildcard multikey metadata key. It is identical across every document whose indexed path holds an array, because metadata keys use a fixed reserved RecordId (makeMultikeyMetadataKey, src/mongo/db/index/wildcard_key_generator.cpp:421-444), whereas data keys / b-tree keys embed the document's distinct RecordId.

      • Normal build: BulkBuilderImpl::setIsMultikey() sets _hasMultiKeyMetadataKeys = true (src/mongo/db/index/index_access_method.cpp:1108) during the scan; commit() then deduplicates the (sorted-adjacent, identical) metadata keys at index_access_method.cpp:1323 (if (_hasMultiKeyMetadataKeys && data.first.compare(_previousKey) == 0) continue;), so exactly one ci op is emitted.
      • Resume: the resume BulkBuilderImpl constructor (index_access_method.cpp:1066-1083) restores _keysInserted, _isMultiKey, _indexMultikeyPaths, and the sorter ranges, but does NOT restore _hasMultiKeyMetadataKeys (it defaults to false at line 1034). The load-phase resume re-enters commit() without the scan's setIsMultikey() call, so the flag stays false, the dedup at line 1323 is skipped, and every identical metadata key reaches _addKeyForCommit. After the first insert the policy flips reject->overwrite (lines 1435-1439), so each subsequent identical key is re-inserted and emits its own ci op (container_write::insert emits the oplog op on every successful insert).
      • The oplog then carries many duplicate ci ops for the same metadata key; the secondary applies them and a duplicate container insert raises WT_DUPLICATE_KEY -> fassert 34437 (src/mongo/db/repl/oplog_applier_impl.cpp:695).
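
The effect of the missing flag can be illustrated with a toy model. This is a sketch only: ToyBulkBuilder and its string keys are hypothetical and do not mirror the real BulkBuilderImpl; equal strings stand in for identical metadata keys (same marker, same reserved RecordId).

```cpp
#include <string>
#include <vector>

// Toy model only; names are hypothetical, not the server's actual types.
struct ToyBulkBuilder {
    // Set by the scan phase in a normal build; left false after a resume.
    bool hasMultiKeyMetadataKeys = false;

    // Drains already-sorted keys and returns the keys that would each be
    // written as one container-insert ("ci") op.
    std::vector<std::string> commit(const std::vector<std::string>& sortedKeys) const {
        std::vector<std::string> emitted;
        std::string previousKey;
        bool havePrevious = false;
        for (const std::string& key : sortedKeys) {
            // Adjacent identical keys are skipped only when the flag is set.
            if (hasMultiKeyMetadataKeys && havePrevious && key == previousKey) {
                continue;
            }
            emitted.push_back(key);
            previousKey = key;
            havePrevious = true;
        }
        return emitted;
    }
};
```

With three identical metadata keys followed by two distinct data keys, the model emits three ops when the flag is set and five when it is not; the two extra ops correspond to the duplicate ci ops the secondary cannot apply.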

      Why only these conditions

      • Only multikey wildcards: only wildcard indexes emit multikey metadata keys (shared reserved RecordId) -> only they produce cross-document identical keys that rely on the _hasMultiKeyMetadataKeys dedup. Plain multikey b-trees embed distinct RecordIds; scalar wildcards emit no metadata keys.
      • Only the load beginning: the metadata key sorts first (its leading element is the number 1, which sorts before the data keys' string path), so only an iteration-0 resume re-drains it with the dedup disabled. Mid/end resumes start past it, because the prior primary already emitted it once (deduped) before stepping down.
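
The ordering argument can be sketched with a small model. It assumes only what is stated above (a numeric leading element sorts before a string one); ToyKey and LeadingElement are hypothetical names, and the real ordering comes from the server's key comparison, not from std::variant.

```cpp
#include <algorithm>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for a key's leading element: a number or a string path.
// std::variant compares by alternative index first, so every int sorts before
// every std::string, which models "numbers sort before strings".
using LeadingElement = std::variant<int, std::string>;

struct ToyKey {
    LeadingElement leading;
    std::string rest;  // remaining key content, e.g. path or RecordId
};

bool operator<(const ToyKey& a, const ToyKey& b) {
    if (a.leading != b.leading) return a.leading < b.leading;
    return a.rest < b.rest;
}

// Returns the keys in the order a sorted drain would visit them.
std::vector<ToyKey> sortedKeys(std::vector<ToyKey> keys) {
    std::sort(keys.begin(), keys.end());
    return keys;
}
```

In this model a key whose leading element is the number 1 always lands at the front of the sorted output, which is why only a resume that starts draining from the very beginning revisits it.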

      Suggested fix

      Restore _hasMultiKeyMetadataKeys on resume: persist it in IndexStateInfo and set it in the resume constructor, or derive it (wildcard index AND _isMultiKey, which is already restored at line 1078). That re-enables the dedup so the resumed path emits a single ci op, matching the non-resumed path. (Making the secondary's container apply idempotent would mask the symptom, but the correct fix is the primary not emitting duplicate ci ops.)
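
A minimal sketch of the "derive it" option, under the assumption that the resume path knows whether the index is a wildcard index. ToyResumeInfo and ToyResumedBuilder are hypothetical illustrations, not the actual IndexStateInfo or BulkBuilderImpl.

```cpp
// Hypothetical snapshot of the state available when a build resumes.
struct ToyResumeInfo {
    bool isWildcard;  // assumption: the index type is known at resume time
    bool isMultiKey;  // already restored on resume, per the analysis above
};

// Hypothetical resumed builder that derives the dedup flag instead of
// leaving it at its default of false.
struct ToyResumedBuilder {
    bool isMultiKey;
    bool hasMultiKeyMetadataKeys;

    explicit ToyResumedBuilder(const ToyResumeInfo& info)
        : isMultiKey(info.isMultiKey),
          // Only multikey wildcard indexes emit metadata keys, so the flag
          // is derived from both conditions.
          hasMultiKeyMetadataKeys(info.isWildcard && info.isMultiKey) {}
};
```

Deriving the flag avoids a change to the persisted state, while persisting it would keep the resume path independent of how the flag is computed during the scan; the sketch does not settle which option is preferable.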

      Notes

      • Priority Major - P3 since the feature is flag-gated off in production; consider P2 if release-blocking.
      • Repro: the diversified PrimaryDrivenResumableIndexBuildTest (resumable_load_phase.js and the multi-phase variants) on branch for SERVER-127680.

            Assignee:
            Gregory Noma
            Reporter:
            Gregory Noma
            Votes:
            0
            Watchers:
            3
