[SERVER-31223] fix race in StepDownTest::OnlyOneStepDownCmdIsAllowedAtATime Created: 22/Sep/17  Updated: 03/Oct/17  Resolved: 03/Oct/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Benety Goh Assignee: Benety Goh
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-28544 Stepdown command must take global loc... Closed
is related to SERVER-31341 Synchronize unit tests that wait for ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2017-10-02, Repl 2017-10-23
Participants:
Linked BF Score: 0

 Description   

There is a race in this test between the thread started in stepDown_nonBlocking() and the call to ReplicationCoordinator::stepDown():

https://github.com/mongodb/mongo/blob/7626535bbcc2f90b7815cbf1a8e6d2c0bef732f1/src/mongo/db/repl/replication_coordinator_impl_test.cpp#L2046

replication_coordinator_impl_test.cpp

TEST_F(StepDownTest, OnlyOneStepDownCmdIsAllowedAtATime) {
    OpTime optime1(Timestamp(100, 1), 1);
    OpTime optime2(Timestamp(100, 2), 1);

    // No secondary is caught up
    auto repl = getReplCoord();
    repl->setMyLastAppliedOpTime(optime2);
    repl->setMyLastDurableOpTime(optime2);
    ASSERT_OK(repl->setLastAppliedOptime_forTest(1, 1, optime1));
    ASSERT_OK(repl->setLastAppliedOptime_forTest(1, 2, optime1));

    simulateSuccessfulV1Election();

    ASSERT_TRUE(getReplCoord()->getMemberState().primary());

    // Step down where the secondary actually has to catch up before the stepDown can succeed.
    // On entering the network, _stepDownContinue should cancel the heartbeats scheduled for
    // T + 2 seconds and send out a new round of heartbeats immediately.
    // This makes it unnecessary to advance the clock after entering the network to process
    // the heartbeat requests.
    auto result = stepDown_nonBlocking(false, Seconds(10), Seconds(60));

    // Now while the first stepdown request is waiting for secondaries to catch up, attempt another
    // stepdown request and ensure it fails.
    const auto opCtx = makeOperationContext();
    auto status = getReplCoord()->stepDown(opCtx.get(), false, Seconds(10), Seconds(60));
    ASSERT_EQUALS(ErrorCodes::ConflictingOperationInProgress, status);

    // Now ensure that the original stepdown command can still succeed.
    catchUpSecondaries(optime2);

    ASSERT_OK(*result.second.get());
    ASSERT_TRUE(repl->getMemberState().secondary());
}

If the main test thread calls stepDown() before the TopologyCoordinator enters the attemptingToStepDown state, this test will block.
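
One possible way to close a race of this shape is to make the test thread wait until the background step-down has actually begun before issuing the second request. The following is a minimal standalone sketch using only the C++ standard library. FakeCoordinator, waitUntilSteppingDown(), release() and secondStepDownIsRejectedWhileFirstIsPending() are hypothetical names invented for illustration; they are not part of MongoDB's ReplicationCoordinator or its test fixtures, and the fix eventually applied in SERVER-31341 may look different.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a coordinator; NOT MongoDB's ReplicationCoordinator.
class FakeCoordinator {
public:
    enum class Result { kOk, kConflict };

    // Rejects a concurrent attempt; otherwise blocks until release() is called.
    Result stepDown() {
        std::unique_lock<std::mutex> lk(_mutex);
        if (_steppingDown) {
            return Result::kConflict;
        }
        _steppingDown = true;
        _cv.notify_all();  // wake any thread blocked in waitUntilSteppingDown()
        _cv.wait(lk, [this] { return _released; });
        _steppingDown = false;
        return Result::kOk;
    }

    // Test-only hook (an assumption of this sketch): blocks until a step-down
    // attempt is in progress.
    void waitUntilSteppingDown() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return _steppingDown; });
    }

    // Simulates the secondaries catching up, which lets the pending
    // step-down finish.
    void release() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _released = true;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _steppingDown = false;
    bool _released = false;
};

// Returns true when the first request succeeds and the second is rejected.
bool secondStepDownIsRejectedWhileFirstIsPending() {
    FakeCoordinator coord;
    FakeCoordinator::Result first = FakeCoordinator::Result::kConflict;

    // First request runs on a background thread and blocks inside stepDown().
    std::thread worker([&] { first = coord.stepDown(); });

    // Without this wait, the call below could win the race, become the
    // "first" step-down itself, and block forever.
    coord.waitUntilSteppingDown();

    FakeCoordinator::Result second = coord.stepDown();

    coord.release();
    worker.join();
    return first == FakeCoordinator::Result::kOk &&
           second == FakeCoordinator::Result::kConflict;
}
```

Calling secondStepDownIsRejectedWhileFirstIsPending() from a test body should return true on every run, because the second request is only issued after the first has marked itself as in progress.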



 Comments   
Comment by Benety Goh [ 03/Oct/17 ]

Test case disabled. The test will be fixed in SERVER-31341.

Comment by Benety Goh [ 03/Oct/17 ]

Test disabled in this commit:

Author: Spencer T Brody (stbrody) <spencer@mongodb.com>

Message: SERVER-31341 Temporarily disable hanging unit test
Branch: master
https://github.com/mongodb/mongo/commit/d0905224d451af6b897a4e8fd1ae77bcf66ae6de

Comment by Benety Goh [ 22/Sep/17 ]

Alternatively, we can remove this test case if there is equivalent coverage in topology_coordinator_impl_v1_test.cpp.
