[SERVER-25160] Drain and catchup modes shouldn't continue on stepdown Created: 19/Jul/16  Updated: 25/Jan/17  Resolved: 25/Aug/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.3.12

Type: Bug
Priority: Major - P3
Reporter: Siyuan Zhou
Assignee: Siyuan Zhou
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-7200 use oplog as op buffer on secondaries Closed
Duplicate
is duplicated by SERVER-24545 Replicaset secondary abnormal exit. Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.2
Sprint: Repl 18 (08/05/16), Repl 2016-08-29
Participants:

 Description   

If the primary steps down while in drain mode, it should stop draining and clean up its state.



 Comments   
Comment by Githook User [ 25/Aug/16 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-25160 Unit test for stepping down in catch-up mode.
Branch: master
https://github.com/mongodb/mongo/commit/e589562b858061cf82dd430115c82033203db018

Comment by Siyuan Zhou [ 15/Aug/16 ]

Step-down can happen at two points: 1) while the primary is scanning freshness on nodes; 2) while the primary is trying to catch up. The first case has been covered and unit tested. The second case will be fixed as part of SERVER-7200 by redbeard0531.

The solution is to check primaryship when finishing drain mode. In 3.3, step-down signals the replication coordinator to finish catch-up; the replication coordinator then finishes catch-up and enters drain mode as normal. Bgsync sees the drain mode and puts a sentinel in the oplog buffer to let the applier exit drain mode. Finally, the applier calls signalDrainComplete() and notices that the node is no longer the primary, so it cleans up its state and stops the transition to primary.
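
For illustration only, the flow described above can be modeled with a small toy in Python. This is a sketch under stated assumptions, not the server code: ToyCoordinator, run_applier, SENTINEL and every field name are hypothetical, and only the name signalDrainComplete() comes from this ticket (rendered here as signal_drain_complete). It shows the one decision that matters: primaryship is checked when drain mode finishes, so a node that stepped down never starts accepting writes.

```python
from collections import deque

# Hypothetical marker that the fetcher side enqueues to end drain mode.
SENTINEL = object()


class ToyCoordinator:
    """Toy stand-in for the replication coordinator (all names hypothetical)."""

    def __init__(self):
        self.is_primary_candidate = True  # won the election, not yet writable
        self.in_catchup = True
        self.draining = False
        self.accepting_writes = False
        self.buffer = deque()             # stands in for the oplog buffer

    def finish_catchup(self):
        # Leaving catch-up always enters drain mode, stepped down or not.
        if not self.in_catchup:
            return
        self.in_catchup = False
        self.draining = True
        # The fetcher side sees drain mode and enqueues the sentinel.
        self.buffer.append(SENTINEL)

    def step_down(self):
        # Step-down gives up the role and asks catch-up to finish.
        self.is_primary_candidate = False
        self.finish_catchup()

    def signal_drain_complete(self):
        if not self.draining:
            return
        self.draining = False             # clean up drain state in either case
        if self.is_primary_candidate:     # the primaryship check
            self.accepting_writes = True  # complete the transition to primary


def run_applier(coord):
    """Apply buffered ops until the sentinel, then signal drain completion."""
    applied = []
    while coord.buffer:
        op = coord.buffer.popleft()
        if op is SENTINEL:
            coord.signal_drain_complete()
            break
        applied.append(op)
    return applied
```

In this model, calling step_down() during catch-up and then run_applier() still drains the buffered operations, but leaves accepting_writes false and draining cleared. Calling finish_catchup() without a step-down ends with accepting_writes true.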

SERVER-7200 will also backport the behavior. I'll keep this ticket open to track the work of adding unit tests and backport after SERVER-7200 gets pushed.

Currently, when a primary steps down in drain mode, it still tries to finish the transition to primary and allows external writes. However, after the node writes the no-op "new primary" entry to the oplog, bgsync notices the divergence from its sync source and triggers a rollback, which disables external writes, so the effect of this bug is limited.
