Core Server / SERVER-31223

fix race in StepDownTest::OnlyOneStepDownCmdIsAllowedAtATime

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Repl 2017-10-02, Repl 2017-10-23

    Description

      There is a race in this test between the thread started in stepDown_nonBlocking() and the call to ReplicationCoordinator::stepDown():

      https://github.com/mongodb/mongo/blob/7626535bbcc2f90b7815cbf1a8e6d2c0bef732f1/src/mongo/db/repl/replication_coordinator_impl_test.cpp#L2046

      replication_coordinator_impl_test.cpp

      TEST_F(StepDownTest, OnlyOneStepDownCmdIsAllowedAtATime) {
          OpTime optime1(Timestamp(100, 1), 1);
          OpTime optime2(Timestamp(100, 2), 1);

          // No secondary is caught up
          auto repl = getReplCoord();
          repl->setMyLastAppliedOpTime(optime2);
          repl->setMyLastDurableOpTime(optime2);
          ASSERT_OK(repl->setLastAppliedOptime_forTest(1, 1, optime1));
          ASSERT_OK(repl->setLastAppliedOptime_forTest(1, 2, optime1));

          simulateSuccessfulV1Election();

          ASSERT_TRUE(getReplCoord()->getMemberState().primary());

          // Step down where the secondary actually has to catch up before the stepDown can succeed.
          // On entering the network, _stepDownContinue should cancel the heartbeats scheduled for
          // T + 2 seconds and send out a new round of heartbeats immediately.
          // This makes it unnecessary to advance the clock after entering the network to process
          // the heartbeat requests.
          auto result = stepDown_nonBlocking(false, Seconds(10), Seconds(60));

          // Now while the first stepdown request is waiting for secondaries to catch up, attempt another
          // stepdown request and ensure it fails.
          const auto opCtx = makeOperationContext();
          auto status = getReplCoord()->stepDown(opCtx.get(), false, Seconds(10), Seconds(60));
          ASSERT_EQUALS(ErrorCodes::ConflictingOperationInProgress, status);

          // Now ensure that the original stepdown command can still succeed.
          catchUpSecondaries(optime2);

          ASSERT_OK(*result.second.get());
          ASSERT_TRUE(repl->getMemberState().secondary());
      }
      

      If the main test thread calls stepDown() before the TopologyCoordinator enters the attemptingToStepDown state, this test will block.
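
      One way to close a window like this is to have the main thread wait until the background stepdown has visibly started before issuing the second call. The sketch below is a toy model in standard C++, not MongoDB code: ToyCoordinator, waitUntilAttemptingStepDown(), catchUpSecondaries() (as a member) and runTestWithoutRace() are all hypothetical names, and a single boolean flag stands in for the TopologyCoordinator state. It only illustrates the ordering constraint, under the assumption that the test could observe that state.

```cpp
#include <cassert>
#include <condition_variable>
#include <future>
#include <mutex>

// Toy stand-in for the replication coordinator. Hypothetical, not MongoDB's API.
enum class StepDownResult { kOk, kConflict };

class ToyCoordinator {
public:
    // Rejects the call if a stepdown is already in flight; otherwise marks
    // the stepdown as started and blocks until the secondaries catch up.
    StepDownResult stepDown() {
        std::unique_lock<std::mutex> lk(_mutex);
        if (_attemptingStepDown) {
            return StepDownResult::kConflict;
        }
        _attemptingStepDown = true;
        _cv.notify_all();  // wake any thread in waitUntilAttemptingStepDown()
        _cv.wait(lk, [this] { return _caughtUp; });
        assert(_attemptingStepDown);  // only the in-flight stepdown clears the flag
        _attemptingStepDown = false;
        return StepDownResult::kOk;
    }

    // Blocks until some thread has entered the "attempting stepdown" state.
    void waitUntilAttemptingStepDown() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return _attemptingStepDown; });
    }

    // Lets the in-flight stepdown finish.
    void catchUpSecondaries() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _caughtUp = true;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _attemptingStepDown = false;
    bool _caughtUp = false;
};

struct Outcome {
    StepDownResult first;
    StepDownResult second;
};

// Mirrors the shape of the test: start one stepdown in the background,
// attempt a second one, then let the first one finish.
Outcome runTestWithoutRace() {
    ToyCoordinator coord;
    auto first = std::async(std::launch::async, [&coord] { return coord.stepDown(); });

    // Without this wait, the call below could win the race, become the
    // "first" stepdown itself, and block forever, because catchUpSecondaries()
    // is only reached after it returns.
    coord.waitUntilAttemptingStepDown();

    StepDownResult second = coord.stepDown();
    coord.catchUpSecondaries();
    return {first.get(), second};
}
```

      In this toy model, runTestWithoutRace() always reports a conflict for the second call and success for the first. Whether the real test could use a similar wait depends on what state the coordinator exposes to tests, which this sketch does not attempt to answer.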

          People

            benety.goh@mongodb.com Benety Goh