Unit test SkipsDuplicateOplogEntryOnRecoveryInStrictConsistency fails sporadically

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 9.0.0-rc0
    • Affects Version/s: 9.0.0-rc0
    • Component/s: None
    • None
    • Cluster Scalability
    • Fully Compatible
    • ALL
    • The issue relies on a race condition with specific timing.

      To reproduce it deterministically, add a sleepmillis(100) line immediately before writeToCollection. The recipient state machine then advances all the way to completion before the writeToCollection call occurs.

    • ClusterScalability 27Apr-11May

      BF Summary

      From BF-43019. The test failed 9 times between 4/23 and 4/28.

      This test is new and was introduced with SERVER-123403 in PR 51233 on 4/8.

      Issue Overview

      The test SkipsDuplicateOplogEntryOnRecoveryInStrictConsistency starts the change streams monitor (CSM) via awaitChangeStreamsMonitorStarted and waits for the _changeStreamsMonitorStarted future.

      Then the test calls writeToCollection to perform 5 inserts, 1 update, and 2 deletes (overall document delta of 3).
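      The expected delta follows from simple arithmetic: inserts add documents, deletes remove them, and updates leave the count unchanged. A minimal sketch of that calculation, using a hypothetical helper (expectedDocumentsDelta is not a function in the test file):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the MongoDB codebase: net change in
// document count produced by a batch of writes. Updates modify documents in
// place, so they do not contribute to the delta.
std::int64_t expectedDocumentsDelta(std::int64_t numInserts,
                                    std::int64_t numDeletes,
                                    std::int64_t numUpdates) {
    assert(numInserts >= 0 && numDeletes >= 0 && numUpdates >= 0);
    (void)numUpdates;  // counted as a write, but leaves the total unchanged
    return numInserts - numDeletes;
}
```

      With the values used by the test (5 inserts, 2 deletes, 1 update), the helper returns 3.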

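      While writeToCollection is proceeding, the recipient state machine continues forward and writes the ReshardDoneCatchUp oplog entry about 3 ms after the CSM starts (02:54:54.880 versus 02:54:54.883 in the $date fields; the harness timestamp prefix is identical on every line):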

        [2026/04/27 19:57:20.674] {"t":{"$date":"2026-04-28T02:54:54.880+00:00"},"s":"I", "c":"RESHARD", "id":9858301, "ctx":"ReshardingRecipientService-0","msg":"Starting the change streams monitor","attr":{"reshardingUUID":{"uuid":{"$uuid":"48d0995d-d678-44f7-bda9-0b17a8dffb7f"}}}}
      
        [2026/04/27 19:57:20.674] {"t":{"$date":"2026-04-28T02:54:54.883+00:00"},"s":"I", "c":"RESHARD", "id":12340303,"ctx":"ReshardingRecipientService-2","msg":"Successfully wrote ReshardDoneCatchUp oplog entry","attr":{"namespace":"sourcedb.sourcecollection","reshardingUUID":{"uuid":{"$uuid":"48d0995d-d678-44f7-bda9-0b17a8dffb7f"}},"opTime":{"ts":{"$timestamp":{"t":1777344894,"i":12}},"t":1}}}
      
        [2026/04/27 19:57:20.674] src/mongo/db/s/resharding/resharding_recipient_service_test.cpp:969: Failure
        

      This means the CSM completed before writeToCollection performed any writes, which causes the assertion failure:

        [2026/04/27 19:57:20.674] Expected equality of these values:
        [2026/04/27 19:57:20.674]   swDocumentsDelta.getValue()
        [2026/04/27 19:57:20.674]     Which is: 0
        [2026/04/27 19:57:20.674]   documentsDelta
        [2026/04/27 19:57:20.674]     Which is: 3
        

      A delta of three documents was expected, but no writes landed before the CSM had already finished.
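      The two orderings can be modeled without threads. The sketch below is a simplified illustration, assuming a monitor that stops counting once it sees its terminal event; DeltaMonitor and the helper functions are hypothetical names, not the real change streams monitor API.

```cpp
#include <cstdint>

// Hypothetical model of a monitor that counts document changes until it sees
// a terminal event. An illustration only, not the real change streams monitor.
class DeltaMonitor {
public:
    void onInsert() { if (!_done) ++_delta; }
    void onDelete() { if (!_done) --_delta; }
    void onTerminalEvent() { _done = true; }  // the delta is final from here on
    std::int64_t finalDelta() const { return _delta; }

private:
    std::int64_t _delta = 0;
    bool _done = false;
};

// Applies the test's workload: 5 inserts and 2 deletes. The single update is
// omitted because it does not change the document count.
void applyWrites(DeltaMonitor& monitor) {
    for (int i = 0; i < 5; ++i) monitor.onInsert();
    for (int i = 0; i < 2; ++i) monitor.onDelete();
}

// Intended ordering: the writes land, then the terminal event arrives.
std::int64_t deltaWhenWritesLandFirst() {
    DeltaMonitor monitor;
    applyWrites(monitor);
    monitor.onTerminalEvent();
    return monitor.finalDelta();
}

// Failing ordering: the terminal event arrives before any write.
std::int64_t deltaWhenMonitorFinishesFirst() {
    DeltaMonitor monitor;
    monitor.onTerminalEvent();
    applyWrites(monitor);
    return monitor.finalDelta();
}
```

      In this model the first ordering yields the expected delta of 3, and the second yields 0, matching the assertion output above.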

      Reproduction

      This issue is easily reproducible by adding a 100 ms sleep before writeToCollection, ensuring the CSM finishes before writeToCollection even begins.

        ...
                    if (!_noChunksToCopy) {
                        sleepmillis(100);  // reproduce: let state machine race ahead
                        writeToCollection(opCtx, recipientDoc, _numInserts, _numDeletes, _numUpdates);
                    }
        ...
        

      Fix

      The solution for this race condition is to use the same state-transition guard that other unit tests in this file already use:

            // Wait for state transition to ensure writeToCollection documents are written
            stateTransitionsGuard.wait(RecipientStateEnum::kStrictConsistency);
            stateTransitionsGuard.unset(RecipientStateEnum::kStrictConsistency);
        

      The guard pauses the recipient once it reaches kStrictConsistency, before it writes the ReshardDoneCatchUp oplog entry. Because the test thread runs sequentially, awaitChangeStreamsMonitorStarted (which performs writeToCollection) returns before the test thread calls stateTransitionsGuard.unset, so all writes are guaranteed to precede the terminal event the CSM observes.

      Note that the documentsDelta calculation is performed when the CSM completes, so writeToCollection must finish before that point for the test to be deterministic.
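      The ordering guarantee can be illustrated with a small standalone sketch. It assumes a simplified guard built on a mutex and condition variable; PauseGuard and runGuardedScenario are hypothetical names, and the real stateTransitionsGuard in the test file may be implemented differently.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the test's state-transition guard: the worker
// parks at a pause point until the test thread releases it.
class PauseGuard {
public:
    // Called by the worker (the "recipient") when it reaches the guarded state.
    void pauseHere() {
        std::unique_lock<std::mutex> lk(_mutex);
        _reached = true;
        _cv.notify_all();
        _cv.wait(lk, [this] { return _released; });
    }

    // Called by the test thread: block until the worker is parked.
    void wait() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return _reached; });
    }

    // Called by the test thread: let the worker continue.
    void unset() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _released = true;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _reached = false;
    bool _released = false;
};

// Simulates the fixed test and returns the delta seen at the terminal event.
int runGuardedScenario() {
    PauseGuard guard;
    std::atomic<int> documentsDelta{0};
    int observedDelta = -1;

    std::thread recipient([&] {
        guard.pauseHere();                      // parked at "strict consistency"
        observedDelta = documentsDelta.load();  // terminal event: snapshot delta
    });

    documentsDelta += 5;  // 5 inserts
    documentsDelta -= 2;  // 2 deletes
    guard.wait();         // recipient has reached the pause point
    guard.unset();        // release it only after the writes are done
    recipient.join();
    return observedDelta;
}
```

      Because the worker cannot pass pauseHere until unset is called, and unset is called only after the writes, the snapshot always sees a delta of 3 regardless of thread scheduling.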

      Big Picture

      The purpose of this unit test is to verify that ReshardDoneCatchUp behaves as expected when a stepdown interrupts the strict-consistency transition. This fix makes the test setup deterministic so the meaningful assertions at the end of the test are actually reached.

      This failure occurs before the stepdown/stepup ever takes place.

        1. BF-43019_parsley.txt
          4.55 MB
          Jordan Glassley

            Assignee:
            Jordan Glassley
            Reporter:
            Jordan Glassley