-
Type: Task
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: None
-
Storage Execution
-
Fully Compatible
-
Storage Execution 2026-06-08
Steps to reproduce:
1. Build mongod with disagg and debugging symbols
bazel build --config=dbg --compress_debug_compile=False --build_atlas install-dist-test
2. Start a 3-node disagg cluster with PDIB (primary-driven index build) resumability enabled
In tab 1, drive Resmoke from a Python session, with log_config.py (internal link) in the ./mongo directory and the virtualenv activated:
import log_config as lc
from buildscripts.resmokelib.testing.fixtures import interface

fixture_logger = lc.logging.loggers.new_fixture_logger("DisaggReplicaSetFixture", 0)
fixture = lc.make_fixture(
    "DisaggReplicaSetFixture",
    fixture_logger,
    0,
    num_nodes=3,
    mongod_options={
        "set_parameters": {
            "enableTestCommands": 1,
            "writePeriodicNoops": 1,
            "featureFlagResumablePrimaryDrivenIndexBuilds": 1,
        }
    },
)
fixture.setup()  # will take some time to start dependencies
fixture.await_ready()
3. Load sample collection with significant amount of data
In tab 2, with a local config.json file, generate and load data using mgodatagen (requires Go)
go run github.com/feliixx/mgodatagen@latest -f ./config.json --uri "mongodb://localhost:20026/" --batchsize 1000
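The config.json used in step 3 is not attached to this ticket. As a rough sketch only, the script below writes one that targets the people collection and populates the three fields indexed in step 6 (name, email, rawData). The database name "test", the document count, and the string lengths are all assumptions chosen for illustration, not the values used in the original reproduction. The layout follows mgodatagen's documented config shape (a JSON array of collection specs with "database", "collection", "count", and "content").

```python
import json


def build_mgodatagen_config(database="test", collection="people", count=1_000_000):
    """Return a mgodatagen-style config: a list with one collection spec.

    All defaults are assumptions for illustration; tune count and the
    rawData size until the index build exceeds the 50 MB memory limit.
    """
    return [
        {
            "database": database,      # assumed database name
            "collection": collection,  # matches db.people in step 6
            "count": count,            # assumed document count
            "content": {
                "name": {"type": "string", "minLength": 5, "maxLength": 20},
                "email": {"type": "string", "minLength": 10, "maxLength": 30},
                # Large values so the rawData_1 index is likely to spill.
                "rawData": {"type": "string", "minLength": 512, "maxLength": 1024},
            },
        }
    ]


if __name__ == "__main__":
    # Write the file that the mgodatagen command reads via -f ./config.json.
    with open("config.json", "w") as f:
        json.dump(build_mgodatagen_config(), f, indent=2)
```

Run it once in the directory where the mgodatagen command is invoked. If the index build does not spill, increase count or the rawData lengths.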
4. Monitor mongod logs
Reusing tab 2:
tail -f fixture.log | grep --line-buffered '"t":'
5. Configure the memory limit and failpoint so that the index build spills and hangs after the collection scan phase
In tab 3, in a mongo / mongosh shell connected to port 20026:
db.adminCommand({setParameter: 1, maxIndexBuildMemoryUsageMegabytes: 50});
db.adminCommand({configureFailPoint: "hangIndexBuildDuringBulkLoadPhase", mode: "alwaysOn", data: {iteration: NumberLong(0), indexNames: ["rawData_1"]}});
6. Begin index build
In tab 3, in a mongo / mongosh shell connected to port 20026:
db.people.createIndexes([{name: 1}, {email: 1}, {rawData: 1}]);
7. (After a few seconds) Stop primary node, while still in collection scan phase but after resume state has been written.
fixture.nodes[0].mongod.stop(mode=interface.TeardownMode.KILL)
8. Observe the crash (invariant failure followed by signal 6, Aborted) in fixture.log:
Log sample:
[Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.245+00:00"},"s":"I", "c":"STORAGE", "id":12500800,"ctx":"IndexBuildsCoordinatorMongod-0","msg":"Index build: resuming index build from phase","attr":{"buildUUID":{"uuid":{"$uuid":"5224cebf-7363-4fab-bd80-0fcdc546d04d"}},"method":"Primary driven","protocol":"primary driven","phase":"collection scan"}}
[Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.245+00:00"},"s":"I", "c":"STORAGE", "id":20650, "ctx":"IndexBuildsCoordinator-StepUp","msg":"Active index builds","attr":{"context":"IndexBuildsCoordinator::_onStepUpAsyncTaskFn","builds":[{"buildUUID":{"uuid":{"$uuid":"5224cebf-7363-4fab-bd80-0fcdc546d04d"}},"collectionUUID":{"uuid":{"$uuid":"50f6639e-9bde-4e4a-b040-bcf750b3085c"}},"indexNames":["name_1","email_1","rawData_1"],"protocol":"single phase"}]}}
[Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.245+00:00"},"s":"I", "c":"STORAGE", "id":7508300, "ctx":"IndexBuildsCoordinator-StepUp","msg":"Finished performing asynchronous step-up checks on index builds"}
[Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.248+00:00"},"s":"W", "c":"DISAGG", "id":10985301,"ctx":"Disagg-7","msg":"Ignoring unknown materialized LSN, could be due to failover or an out-of-order materialized offset notification.","attr":{"lsn":7644335664191766662}}
[Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.256+00:00"},"s":"F", "c":"ASSERT", "id":23079, "ctx":"IndexBuildsCoordinatorMongod-0","msg":"Invariant failure","attr":{"expr":"shard_role_details::getLocker(opCtx)->isWriteLocked()","location":"src/mongo/db/op_observer/op_observer_impl.cpp:2434:68:virtual void mongo::OpObserverImpl::onBatchedWriteCommit(OperationContext *, WriteUnitOfWork::OplogEntryGroupType, OpStateAccumulator *)"}}
[Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.256+00:00"},"s":"F", "c":"ASSERT", "id":23080, "ctx":"IndexBuildsCoordinatorMongod-0","msg":"\n\n***aborting after invariant() failure\n\n"}
[Disagg:j0:sec1] {"t":{"$date":"2026-05-26T22:46:51.256+00:00"},"s":"F", "c":"CONTROL", "id":6384300, "ctx":"IndexBuildsCoordinatorMongod-0","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
If the issue is difficult to reproduce and the index build resumes successfully on sec0, try killing that node as well (i.e. repeat step 7 with fixture.nodes[1]) and see what happens on sec1.
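When repeating the reproduction several times, it can help to pull only the fatal entries out of fixture.log rather than watching the tail by eye. The following is a minimal sketch, assuming every line of interest has the shape shown in the log sample above: a bracketed fixture prefix, a space, then one JSON object with an "s" severity field. The function names are hypothetical helpers, not part of Resmoke.

```python
import json


def parse_fixture_line(line):
    """Split '[prefix] {json}' into (prefix, entry dict); None if it does not match."""
    line = line.strip()
    if not line.startswith("["):
        return None
    end = line.find("] ")
    if end == -1:
        return None
    prefix, payload = line[1:end], line[end + 2:]
    try:
        entry = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(entry, dict):
        return None
    return prefix, entry


def fatal_entries(lines):
    """Yield (node prefix, log id, message, attr) for each severity "F" entry."""
    for line in lines:
        parsed = parse_fixture_line(line)
        if parsed is None:
            continue
        prefix, entry = parsed
        if entry.get("s") == "F":
            yield prefix, entry.get("id"), entry.get("msg"), entry.get("attr", {})


def scan_log(path="fixture.log"):
    """Return all fatal entries found in the given log file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return list(fatal_entries(f))
```

Calling scan_log() after step 8 should surface the entries with ids 23079, 23080 and 6384300 from the sample, along with the node prefix (for example Disagg:j0:sec1) that hit the invariant.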
- is depended on by: SERVER-126904 Test failover before a PDIB checkpoint during scan phase (Closed)
- related to: SERVER-127858 Unit test primary driven index build resume utilities with OpObserverImpl (In Code Review)