Investigate invariant failure when resuming during collection scan phase

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Storage Execution
    • Fully Compatible
    • Storage Execution 2026-06-08

      Steps to reproduce:
      1. Build mongod with disagg and debugging symbols

      bazel build --config=dbg --compress_debug_compile=False --build_atlas install-dist-test

      2. Start a 3-node disagg cluster with PDIB (primary-driven index build) resumability enabled
      In tab 1, start a Python session that uses Resmoke, with log_config.py (Internal link) in the ./mongo directory and the virtualenv activated:

      import log_config as lc
      from buildscripts.resmokelib.testing.fixtures import interface
      
      fixture_logger = lc.logging.loggers.new_fixture_logger("DisaggReplicaSetFixture", 0)
      
      fixture = lc.make_fixture(
          "DisaggReplicaSetFixture", fixture_logger, 0, num_nodes=3,
          mongod_options={"set_parameters": {
              "enableTestCommands": 1,
              "writePeriodicNoops": 1,
              "featureFlagResumablePrimaryDrivenIndexBuilds": 1,
          }},
      )
      
      fixture.setup() # will take some time to start dependencies
      
      fixture.await_ready() 

      3. Load a sample collection with a significant amount of data
      In tab 2, with a local config.json file, generate and load the data using mgodatagen (requires Go):

      go run github.com/feliixx/mgodatagen@latest -f ./config.json --uri "mongodb://localhost:20026/" --batchsize 1000
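      The config.json used above is not attached to this ticket. As a rough sketch only, the script below writes one that would populate a people collection with the three fields indexed in step 6. The database name ("test"), the document count, and the string lengths are all assumptions chosen for illustration, not the values used in the original reproduction; adjust them until the build spills under the 50 MB limit set in step 5.

```python
import json
from pathlib import Path


def string_field(min_len, max_len):
    """mgodatagen 'string' generator: random string with a length in [min_len, max_len]."""
    return {"type": "string", "minLength": min_len, "maxLength": max_len}


def build_mgodatagen_config(count=2_000_000, raw_min=256, raw_max=512):
    """Return a mgodatagen config (a list with one collection entry)."""
    return [
        {
            "database": "test",      # assumption: default database of the shell used in step 5
            "collection": "people",  # collection named in the createIndexes call in step 6
            "count": count,          # assumption: large enough to force the index build to spill
            "content": {
                "name": string_field(8, 24),
                "email": string_field(12, 40),
                "rawData": string_field(raw_min, raw_max),
            },
        }
    ]


def write_config(path="config.json", **kwargs):
    """Serialize the config to disk and return it."""
    config = build_mgodatagen_config(**kwargs)
    Path(path).write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    write_config()
```

      Running the script in the directory where mgodatagen is invoked produces ./config.json, matching the -f argument above.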

      4. Monitor mongod logs
      Reusing tab 2:

      tail -f fixture.log | grep --line-buffered '"t":' 

      5. Configure the memory limit and failpoint so that the index build spills and then hangs after the collection scan phase
      In tab 3, in a mongo / mongosh shell connected to port 20026:

      db.adminCommand({setParameter: 1, maxIndexBuildMemoryUsageMegabytes: 50});
      db.adminCommand({configureFailPoint: "hangIndexBuildDuringBulkLoadPhase", mode: "alwaysOn", data: {iteration: NumberLong(0), indexNames: ["rawData_1"]}});

      6. Begin index build
      In tab 3, in a mongo / mongosh shell connected to port 20026:

      db.people.createIndexes([{name: 1}, {email: 1}, {rawData: 1}]);

      7. (After a few seconds) Stop the primary node from the Python session in tab 1, while the build is still in the collection scan phase but after the resume state has been written.

      fixture.nodes[0].mongod.stop(mode=interface.TeardownMode.KILL) 

      8. Observe the invariant failure (abort, signal 6) in fixture.log:
      Log sample:

       [Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.245+00:00"},"s":"I",  "c":"STORAGE",  "id":12500800,"ctx":"IndexBuildsCoordinatorMongod-0","msg":"Index build: resuming index build from phase","attr":{"buildUUID":{"uuid":{"$uuid":"5224cebf-7363-4fab-bd80-0fcdc546d04d"}},"method":"Primary driven","protocol":"primary driven","phase":"collection scan"}}
      [Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.245+00:00"},"s":"I",  "c":"STORAGE",  "id":20650,   "ctx":"IndexBuildsCoordinator-StepUp","msg":"Active index builds","attr":{"context":"IndexBuildsCoordinator::_onStepUpAsyncTaskFn","builds":[{"buildUUID":{"uuid":{"$uuid":"5224cebf-7363-4fab-bd80-0fcdc546d04d"}},"collectionUUID":{"uuid":{"$uuid":"50f6639e-9bde-4e4a-b040-bcf750b3085c"}},"indexNames":["name_1","email_1","rawData_1"],"protocol":"single phase"}]}}
      [Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.245+00:00"},"s":"I",  "c":"STORAGE",  "id":7508300, "ctx":"IndexBuildsCoordinator-StepUp","msg":"Finished performing asynchronous step-up checks on index builds"}
      [Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.248+00:00"},"s":"W",  "c":"DISAGG",   "id":10985301,"ctx":"Disagg-7","msg":"Ignoring unknown materialized LSN, could be due to failover or an out-of-order materialized offset notification.","attr":{"lsn":7644335664191766662}}
      [Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.256+00:00"},"s":"F",  "c":"ASSERT",   "id":23079,   "ctx":"IndexBuildsCoordinatorMongod-0","msg":"Invariant failure","attr":{"expr":"shard_role_details::getLocker(opCtx)->isWriteLocked()","location":"src/mongo/db/op_observer/op_observer_impl.cpp:2434:68:virtual void mongo::OpObserverImpl::onBatchedWriteCommit(OperationContext *, WriteUnitOfWork::OplogEntryGroupType, OpStateAccumulator *)"}}
      [Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.256+00:00"},"s":"F",  "c":"ASSERT",   "id":23080,   "ctx":"IndexBuildsCoordinatorMongod-0","msg":"\n\n***aborting after invariant() failure\n\n"}
      [Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.256+00:00"},"s":"F",  "c":"CONTROL",  "id":6384300, "ctx":"IndexBuildsCoordinatorMongod-0","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}

       
      If the failure is difficult to reproduce and the index build resumes successfully on sec0, try killing that node as well (i.e. repeat step 7 with fixture.nodes[1]) and see what happens on sec1.
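
      Since the failure may take several attempts to trigger, it can help to scan fixture.log for fatal entries rather than watching the tail output. The following is a minimal sketch, assuming every structured line has the "[node prefix] {JSON}" shape shown in the log sample; the function names are hypothetical helpers, not part of Resmoke.

```python
import json


def parse_fixture_line(line):
    """Split '[prefix] {json}' into (prefix, entry); return None for anything else."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        entry = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    # Structured mongod log entries carry a timestamp under "t".
    if not isinstance(entry, dict) or "t" not in entry:
        return None
    return line[:start].strip(), entry


def fatal_entries(lines):
    """Yield (prefix, entry) for every log entry with severity "F"."""
    for line in lines:
        parsed = parse_fixture_line(line)
        if parsed is not None and parsed[1].get("s") == "F":
            yield parsed


if __name__ == "__main__":
    with open("fixture.log") as log:
        for prefix, entry in fatal_entries(log):
            expr = entry.get("attr", {}).get("expr", "")
            print(prefix, entry.get("id"), entry.get("msg", "").strip(), expr)
```

      On the sample above, this would report the three "F" entries from sec1 (ids 23079, 23080, and 6384300), including the failed invariant expression for the first one.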

            Assignee:
            Gregory Noma
            Reporter:
            Alex Sarkesian