Type: Bug
Resolution: Fixed
Priority: Major - P3
Affects Version/s: 9.0.0-rc0
Component/s: None
Cluster Scalability
Fully Compatible
ALL
ClusterScalability 27Apr-11May
BF Summary
From BF-43019. The test has failed 9 times between 4/23 and 4/28.
This test is new and was introduced with SERVER-123403 in PR 51233 on 4/8.
Issue Overview
The test SkipsDuplicateOplogEntryOnRecoveryInStrictConsistency starts the change streams monitor (CSM) via awaitChangeStreamsMonitorStarted and waits for the _changeStreamsMonitorStarted future.
Then the test calls writeToCollection to perform 5 inserts, 1 update, and 2 deletes (overall document delta of 3).
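The expected delta can be checked with a small sketch. The function name expectedDocumentsDelta is hypothetical and not taken from the test file; it assumes that updates modify documents in place (no upserts), so only inserts and deletes change the document count.

```cpp
#include <cstdint>

// Hypothetical helper (not from the test file): net change in document count
// produced by a batch of writes. Updates are assumed to modify existing
// documents in place, so they do not affect the count.
constexpr std::int64_t expectedDocumentsDelta(std::int64_t numInserts,
                                              std::int64_t numDeletes,
                                              std::int64_t /*numUpdates*/) {
    return numInserts - numDeletes;
}

// 5 inserts, 2 deletes, 1 update: 5 - 2 = 3.
static_assert(expectedDocumentsDelta(5, 2, 1) == 3, "expected a delta of 3");
```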
While writeToCollection is still in progress, the recipient state machine moves ahead and writes the ReshardDoneCatchUp oplog entry within 3 ms of the change streams monitor starting (02:54:54.880 vs. 02:54:54.883 in the log lines below):
[2026/04/27 19:57:20.674] {"t":{"$date":"2026-04-28T02:54:54.880+00:00"},"s":"I", "c":"RESHARD", "id":9858301, "ctx":"ReshardingRecipientService-0","msg":"Starting the change streams monitor","attr":{"reshardingUUID":{"uuid":{"$uuid":"48d0995d-d678-44f7-bda9-0b17a8dffb7f"}}}}
[2026/04/27 19:57:20.674] {"t":{"$date":"2026-04-28T02:54:54.883+00:00"},"s":"I", "c":"RESHARD", "id":12340303,"ctx":"ReshardingRecipientService-2","msg":"Successfully wrote ReshardDoneCatchUp oplog entry","attr":{"namespace":"sourcedb.sourcecollection","reshardingUUID":{"uuid":{"$uuid":"48d0995d-d678-44f7-bda9-0b17a8dffb7f"}},"opTime":{"ts":{"$timestamp":{"t":1777344894,"i":12}},"t":1}}}
[2026/04/27 19:57:20.674] src/mongo/db/s/resharding/resharding_recipient_service_test.cpp:969: Failure
This means the CSM completed before writeToCollection performed any of its writes, which causes the assertion failure:
[2026/04/27 19:57:20.674] Expected equality of these values:
[2026/04/27 19:57:20.674] swDocumentsDelta.getValue()
[2026/04/27 19:57:20.674] Which is: 0
[2026/04/27 19:57:20.674] documentsDelta
[2026/04/27 19:57:20.674] Which is: 3
The test expected a delta of 3 documents, but the CSM had already finished before any write landed, so it reported a delta of 0.
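The effect of event ordering can be illustrated with a simplified model. This is only a sketch and not the CSM implementation: EventKind and deltaUpToTerminalEvent are hypothetical names, and the model assumes only that the delta is finalized when the monitor sees the terminal event, as described in this ticket.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical event kinds; real change stream events carry far more detail.
enum class EventKind { kInsert, kUpdate, kDelete, kDoneCatchUp };

// Counts the net document delta up to (not past) the terminal event,
// modeling a monitor that finalizes its delta when it completes.
std::int64_t deltaUpToTerminalEvent(const std::vector<EventKind>& events) {
    std::int64_t delta = 0;
    for (EventKind kind : events) {
        switch (kind) {
            case EventKind::kInsert:
                ++delta;
                break;
            case EventKind::kDelete:
                --delta;
                break;
            case EventKind::kUpdate:
                break;  // document count is unchanged
            case EventKind::kDoneCatchUp:
                return delta;  // monitor completes; later writes are not counted
        }
    }
    return delta;
}
```

In this model, a stream where kDoneCatchUp precedes the 5 inserts, 1 update, and 2 deletes yields 0, while the same writes followed by kDoneCatchUp yield 3.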
Reproduction
This issue is easily reproducible by adding a 100 ms sleep before writeToCollection, ensuring the CSM finishes before writeToCollection even begins.
...
if (!_noChunksToCopy) {
    sleepmillis(100);  // reproduce: let state machine race ahead
    writeToCollection(opCtx, recipientDoc, _numInserts, _numDeletes, _numUpdates);
}
...
Fix
The solution for this race condition is to use the same state-transition guard that other unit tests in this file already use:
// Wait for state transition to ensure writeToCollection documents are written
stateTransitionsGuard.wait(RecipientStateEnum::kStrictConsistency);
stateTransitionsGuard.unset(RecipientStateEnum::kStrictConsistency);
The guard pauses the recipient once it reaches kStrictConsistency, before it writes the ReshardDoneCatchUp oplog entry, and holds it there until unset is called. Because the test thread runs sequentially, awaitChangeStreamsMonitorStarted (which performs writeToCollection) returns before the test thread reaches stateTransitionsGuard.unset. This guarantees that all writes precede the terminal event the CSM observes.
Note that the documentsDelta calculation is performed when the CSM completes, so writeToCollection must finish before that point for the test to be deterministic.
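The ordering guarantee can be sketched with standard C++ threading primitives. This is a minimal illustration, not the test fixture's guard: PausePoint, arriveAndBlock, and runGuardedScenario are hypothetical names, and the "recipient" thread only snapshots a counter to stand in for the terminal event.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a state-transition guard: the worker blocks at a
// pause point until the test thread releases it.
class PausePoint {
public:
    // Worker side: announce arrival, then block until unset() is called.
    void arriveAndBlock() {
        std::unique_lock<std::mutex> lk(_mutex);
        _arrived = true;
        _cv.notify_all();
        _cv.wait(lk, [this] { return _released; });
    }
    // Test side: block until the worker has reached the pause point.
    void wait() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return _arrived; });
    }
    // Test side: let the worker continue past the pause point.
    void unset() {
        std::lock_guard<std::mutex> lk(_mutex);
        _released = true;
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _arrived = false;
    bool _released = false;
};

// Simulates one guarded run and returns the delta recorded at the terminal event.
std::int64_t runGuardedScenario(std::int64_t numInserts, std::int64_t numDeletes) {
    assert(numInserts >= 0 && numDeletes >= 0);
    PausePoint guard;
    std::mutex countMutex;
    std::int64_t liveDelta = 0;
    std::int64_t recordedDelta = 0;

    std::thread recipient([&] {
        guard.arriveAndBlock();  // paused before the terminal event
        std::lock_guard<std::mutex> lk(countMutex);
        recordedDelta = liveDelta;  // terminal event: delta is finalized here
    });

    {
        // Test thread performs its writes while the recipient is held back.
        std::lock_guard<std::mutex> lk(countMutex);
        liveDelta += numInserts - numDeletes;
    }
    guard.wait();   // recipient has reached the pause point
    guard.unset();  // writes are done, so release the recipient
    recipient.join();
    return recordedDelta;
}
```

Because the recipient thread cannot take its snapshot until unset is called, and unset is called only after the writes, runGuardedScenario(5, 2) returns 3 regardless of thread scheduling.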
Big Picture
The purpose of this unit test is to verify that ReshardDoneCatchUp behaves as expected when a stepdown interrupts the strict-consistency transition. This fix makes the test setup deterministic so the meaningful assertions at the end of the test are actually reached.
This failure occurs before the stepdown/stepup ever takes place.
is related to: SERVER-123403 Ensure the final "ReshardDoneCatchUpChangeEventO2Field" oplog entry always recorded (Closed)