Type: Bug
Resolution: Won't Fix
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2017-10-02, Repl 2017-10-23
Linked BF Score: 0
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

There is a race in this test between the thread started in stepDown_nonBlocking() and the call to ReplicationCoordinator::stepDown():

https://github.com/mongodb/mongo/blob/7626535bbcc2f90b7815cbf1a8e6d2c0bef732f1/src/mongo/db/repl/replication_coordinator_impl_test.cpp#L2046

replication_coordinator_impl_test.cpp

TEST_F(StepDownTest, OnlyOneStepDownCmdIsAllowedAtATime) {
    OpTime optime1(Timestamp(100, 1), 1);
    OpTime optime2(Timestamp(100, 2), 1);

    // No secondary is caught up
    auto repl = getReplCoord();
    repl->setMyLastAppliedOpTime(optime2);
    repl->setMyLastDurableOpTime(optime2);
    ASSERT_OK(repl->setLastAppliedOptime_forTest(1, 1, optime1));
    ASSERT_OK(repl->setLastAppliedOptime_forTest(1, 2, optime1));

    simulateSuccessfulV1Election();

    ASSERT_TRUE(getReplCoord()->getMemberState().primary());

    // Step down where the secondary actually has to catch up before the stepDown can succeed.
    // On entering the network, _stepDownContinue should cancel the heartbeats scheduled for
    // T + 2 seconds and send out a new round of heartbeats immediately.
    // This makes it unnecessary to advance the clock after entering the network to process
    // the heartbeat requests.
    auto result = stepDown_nonBlocking(false, Seconds(10), Seconds(60));

    // Now while the first stepdown request is waiting for secondaries to catch up, attempt another
    // stepdown request and ensure it fails.
    const auto opCtx = makeOperationContext();
    auto status = getReplCoord()->stepDown(opCtx.get(), false, Seconds(10), Seconds(60));
    ASSERT_EQUALS(ErrorCodes::ConflictingOperationInProgress, status);

    // Now ensure that the original stepdown command can still succeed.
    catchUpSecondaries(optime2);

    ASSERT_OK(*result.second.get());
    ASSERT_TRUE(repl->getMemberState().secondary());
}

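If the main test thread calls stepDown() before the TopologyCoordinator has entered the attempting-to-step-down state, that call does not fail with ConflictingOperationInProgress. It can instead become the in-progress stepdown itself and wait for the secondaries to catch up, which only happens later on the same thread (catchUpSecondaries()), so the test blocks.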
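The race can be reproduced in miniature with the C++ sketch below, which also shows one way to remove it. It does not use MongoDB code: `ToyCoordinator`, `waitForStepDownAttempt()`, `catchUpSecondaries()` and `runTest()` are hypothetical stand-ins. The model assumes that a stepdown marks itself as in progress under a mutex and then waits for secondaries to catch up, and that a second request arriving while that mark is set fails with ConflictingOperationInProgress. The test thread waits until the background stepdown is observably in progress before issuing the conflicting request. The ordering the test relies on is then enforced rather than left to the scheduler.

```cpp
#include <cassert>
#include <condition_variable>
#include <future>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for the replication coordinator (not MongoDB's API):
// only one stepdown may be in progress, and an in-progress stepdown waits
// until the secondaries are caught up.
class ToyCoordinator {
public:
    // Returns "OK" once caught up, or "ConflictingOperationInProgress" if
    // another stepdown already holds the attempting-to-step-down state.
    std::string stepDown() {
        std::unique_lock<std::mutex> lk(mu_);
        if (attemptingStepDown_) {
            return "ConflictingOperationInProgress";
        }
        attemptingStepDown_ = true;
        cv_.notify_all();  // wake any thread waiting for the state change
        cv_.wait(lk, [this] { return caughtUp_; });  // releases mu_ while waiting
        attemptingStepDown_ = false;
        return "OK";
    }

    // Hypothetical test hook: block until some thread has entered the
    // attempting-to-step-down state.
    void waitForStepDownAttempt() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return attemptingStepDown_; });
    }

    // Hypothetical test hook: let any waiting stepdown complete.
    void catchUpSecondaries() {
        std::lock_guard<std::mutex> lk(mu_);
        caughtUp_ = true;
        cv_.notify_all();
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    bool attemptingStepDown_ = false;
    bool caughtUp_ = false;
};

// Mirrors the shape of the unit test: a background stepdown, a conflicting
// stepdown on the main thread, then catch-up. Returns {first, second} results.
std::pair<std::string, std::string> runTest() {
    ToyCoordinator coord;
    auto first = std::async(std::launch::async, [&coord] { return coord.stepDown(); });

    // Synchronization point: do not issue the second request until the first
    // one is known to be in progress. Without this call the main thread may
    // win the race, become the in-progress stepdown itself, and block forever,
    // because catchUpSecondaries() below has not been called yet.
    coord.waitForStepDownAttempt();

    std::string second = coord.stepDown();
    coord.catchUpSecondaries();
    return {first.get(), second};
}
```

Removing the `waitForStepDownAttempt()` call reintroduces the hang in this model. If the main thread reaches `stepDown()` first, it becomes the in-progress stepdown and waits for a catch-up that only the main thread itself would trigger.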

is related to

SERVER-28544 Stepdown command must take global lock in exclusive mode (Closed)

SERVER-31341 Synchronize unit tests that wait for asynchronous stepdown attempts (Closed)

Assignee: Benety Goh
Reporter: Benety Goh
Participants: Benety Goh
Votes: 0
Watchers: 1

Created: Sep 22 2017 05:42:08 PM UTC
Updated: Oct 03 2017 02:01:58 PM UTC
Resolved: Oct 03 2017 02:01:58 PM UTC