If a member
- has not caught up to minvalid, and
- has any non-trivial amount of additional replication lag
then its transition to SECONDARY can be inadvertently delayed until it has fully eliminated its replication lag. The expected behavior is for the member to transition to SECONDARY as soon as it catches up to minvalid.
The bug is caused by the interaction of two pieces of functionality in SyncTail::oplogApplication(): the batch size limit, and a timer that controls when the member attempts to transition to SECONDARY. The timer is reset whenever a new batch is started, and it then triggers a call to tryToGoLiveAsASecondary() once per second; however, the first call happens only after the timer reaches t=1 second. If a batch takes less than one second to process, the timer is reset for the next batch before it ever reaches one second, so tryToGoLiveAsASecondary() is not called during that batch.
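The timer interaction described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server code: the function name count_go_live_attempts, the poll_step_ms parameter, and the idea of polling a simulated clock are all hypothetical. The only behavior taken from the description is that the timer restarts with each batch and that a check fires once per elapsed whole second.

```python
def count_go_live_attempts(batch_durations_ms, poll_step_ms=100):
    """Count how often a once-per-second check would fire (toy model).

    Assumption: the timer restarts at zero for every batch, and the check
    fires each time the timer crosses a new whole second within that batch.
    """
    attempts = 0
    for duration_ms in batch_durations_ms:
        elapsed_ms = 0        # timer is reset when a new batch starts
        last_checked_s = 0    # last whole second already handled
        while elapsed_ms < duration_ms:
            now_s = elapsed_ms // 1000
            if now_s > last_checked_s:
                last_checked_s = now_s
                attempts += 1  # stands in for tryToGoLiveAsASecondary()
            elapsed_ms += poll_step_ms
    return attempts


if __name__ == "__main__":
    # Fifty fast batches (20 seconds of work in total) never trigger a check.
    print(count_go_live_attempts([400] * 50))
    # A single slow batch triggers a check at t=1s and again at t=2s.
    print(count_go_live_attempts([2500]))
```

In this model the number of checks depends on how long each individual batch lasts, not on the total time spent applying batches, which mirrors the delay described above.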
This can be triggered in the following example scenario:
- A secondary is shut down while in the middle of processing an oplog batch (any time between the minvalid write and the oplog write)
- The member is brought back into the replica set hours later
The member will call tryToGoLiveAsASecondary() when it first starts up, but the SECONDARY transition will not occur because the minvalid condition will fail (as expected). Then, it will start processing oplog entries. Since it has replication lag, it will fetch oplog entries as fast as possible, so each batch will hit the batch limit in less than one second. As a result, the next tryToGoLiveAsASecondary() call will not happen until the member has fully caught up.
Note that if, in the above scenario, the secondary had instead been shut down between batches, it would transition to SECONDARY as soon as it is brought up (since the minvalid condition would succeed during the first tryToGoLiveAsASecondary() call).
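The difference between the two shutdown points can be sketched with a simplified model. Everything here is hypothetical (the MemberState class, its field names, and the integer timestamps) and is not the server's data layout. The only detail taken from the description is the ordering within a batch: the minvalid write happens before the oplog write.

```python
from dataclasses import dataclass


@dataclass
class MemberState:
    last_applied: int = 0  # hypothetical: newest entry written to the local oplog
    minvalid: int = 0      # hypothetical: point the member must reach to be consistent


def apply_batch(state, batch_end, shutdown_mid_batch=False):
    """Apply one batch in the order described: minvalid first, then the oplog."""
    state.minvalid = batch_end       # step 1: minvalid write
    if shutdown_mid_batch:
        return                       # member stops before the oplog write
    state.last_applied = batch_end   # step 2: oplog write


def minvalid_satisfied(state):
    """Stand-in for the minvalid condition checked at startup."""
    return state.last_applied >= state.minvalid


if __name__ == "__main__":
    clean = MemberState()
    apply_batch(clean, batch_end=10)
    print(minvalid_satisfied(clean))        # shut down between batches

    interrupted = MemberState()
    apply_batch(interrupted, batch_end=10)
    apply_batch(interrupted, batch_end=20, shutdown_mid_batch=True)
    print(minvalid_satisfied(interrupted))  # shut down mid-batch
```

Under these assumptions, only the member interrupted mid-batch restarts with minvalid ahead of its last applied entry, so it is the only one that depends on a later tryToGoLiveAsASecondary() call.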