Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: 2.5.0
Affects Version/s: 2.4.3
Component/s: Replication
Labels: None
Operating System: ALL
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

If a member has not caught up to minvalid and also has any non-trivial amount of additional replication lag, then it can inadvertently delay its transition to SECONDARY until it has fully eliminated its replication lag. The expected behavior is for the member to transition to SECONDARY as soon as it catches up to minvalid.

The bug is caused by the interaction of two pieces of functionality in SyncTail::oplogApplication(): the batch size limit, and a timer that controls when the member attempts to transition to SECONDARY. The timer is reset each time a new batch is started and then triggers a once-per-second call to tryToGoLiveAsASecondary(); however, the first call happens only after the timer reaches t=1 second. If a batch takes less than one second to process, the timer is reset before it fires, so tryToGoLiveAsASecondary() is not called during that batch.

This can be triggered in the following example scenario:

1. A secondary is shut down while in the middle of processing an oplog batch (any time between the minvalid write and the oplog write).
2. The member is brought back into the replica set hours later.

The member will call tryToGoLiveAsASecondary() when it first starts up, but the SECONDARY transition will not occur because the minvalid condition will fail (as expected). Then, it will start processing oplog entries. Since it has replication lag, it will fetch oplog entries as fast as possible and hit the batch limit in less than one second each time. It will thus not reach the next tryToGoLiveAsASecondary() call until it has fully caught up.

Note that if, in the above scenario, the secondary instead happened to have been shut down between batches, then it will transition to SECONDARY as soon as it is brought up (since the minvalid condition will succeed during the first tryToGoLiveAsASecondary() call).
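
The interaction described above can be illustrated with a small simulation. This is only a sketch of the reported behavior, not the actual SyncTail code and not the fix that shipped: the function batches_until_secondary(), its parameters, and the check_every_batch flag are hypothetical names introduced here. It assumes every batch takes the same amount of time and that minvalid is reached no later than the point where the member is fully caught up.

```python
def batches_until_secondary(total_batches, minvalid_batch, batch_seconds,
                            check_every_batch=False):
    """Return how many batches are applied before the SECONDARY transition.

    Hypothetical model, not MongoDB code:
      total_batches     -- full batches needed to eliminate replication lag
      minvalid_batch    -- number of applied batches at which minvalid is reached
      batch_seconds     -- time taken to fill and process one full batch
      check_every_batch -- if True, model the expected behavior by checking
                           at the start of every batch (an assumption, not
                           necessarily how the bug was fixed)
    """
    applied = 0

    def try_to_go_live():
        # Stand-in for tryToGoLiveAsASecondary(): succeeds once minvalid is reached.
        return applied >= minvalid_batch

    # Call made at startup; it fails if the member stopped mid-batch.
    if try_to_go_live():
        return applied

    while applied < total_batches:
        # The timer is reset when the batch starts, so the timer-driven call
        # only fires if this batch runs for at least one second.
        timer_fires = batch_seconds >= 1.0
        if (check_every_batch or timer_fires) and try_to_go_live():
            return applied
        applied += 1

    # Fully caught up: the member waits for new entries, the timer finally
    # reaches one second, and the call succeeds.
    return applied


# Member is 100 batches behind and reaches minvalid after 3 batches.
print(batches_until_secondary(100, 3, batch_seconds=0.2))   # fast batches
print(batches_until_secondary(100, 3, batch_seconds=1.5))   # slow batches
print(batches_until_secondary(100, 3, batch_seconds=0.2, check_every_batch=True))
```

In this model, fast batches postpone the transition until all 100 batches are applied, while slow batches (or a check at every batch) allow it after 3. Passing minvalid_batch=0 models the case where the member was shut down between batches, in which the startup call succeeds immediately.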

Assignee: Eric Milkie
Reporter: J Rassi (Inactive)
Participants: auto, Eric Milkie, J Rassi
Votes: 0
Watchers: 5

Created: Apr 25 2013 07:47:53 PM UTC
Updated: Jul 11 2016 05:39:08 PM UTC
Resolved: May 08 2013 05:59:07 PM UTC
