[SERVER-27120] Increase synchronization between producer/applier threads and stepdown/stepup
Created: 18/Nov/16
Updated: 31/Oct/19
Resolved: 28/Feb/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.4.4, 3.5.4

Type: Task
Priority: Major - P3
Reporter: Spencer Brody (Inactive)
Assignee: Siyuan Zhou
Resolution: Done
Votes: 0
Labels: bkp, todo_in_code
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-26970 isMaster can return isMaster: true wh... Closed
Related
related to SERVER-28272 extend timeout in step_down_during_dr... Closed
related to SERVER-42555 Complete TODO listed in SERVER-27120 Closed
related to SERVER-43451 Complete TODO listed in SERVER-27120 Closed
related to SERVER-44204 Complete TODO listed in SERVER-27120 Closed
related to SERVER-29222 Remove sentinel handling in OplogBuff... Backlog
is related to SERVER-24536 nodes can run an election while stepp... Closed
is related to SERVER-26403 Clean primary states on stepdown Closed
is related to SERVER-28181 Deadlock involving the mutexes of opl... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.4, v3.2, v3.0
Sprint: Repl 2016-12-12, Repl 2017-01-23, Repl 2017-02-13, Repl 2017-03-06
Participants:
Linked BF Score: 0

 Comments   
Comment by Githook User [ 31/Oct/19 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-42555 Remove out-of-date comments of SERVER-27120.
Branch: master
https://github.com/mongodb/mongo/commit/acfe11e80f5d8b3988802a9dfcd08f913a717dac

Comment by Githook User [ 04/Apr/17 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-27120 Fix js test for no journal builds.

(cherry picked from commit 4403f1b1b503e67058ef7425d7e278be1d5b3e84)
Branch: v3.4
https://github.com/mongodb/mongo/commit/74cebe59bfbac098d898add311ddbaa9829de8d3

Comment by Githook User [ 04/Apr/17 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-27120 Increase synchronization between producer/applier threads and stepdown/stepup

SERVER-27913 Make sure the last applied hash is corresponding to the last applied optime in bgsync start()

(cherry picked from commit 1da3111dc238698e4e70672b7ba260a368121e50)
Branch: v3.4
https://github.com/mongodb/mongo/commit/95c5e610ed9922a5d517c2f0109b1600096e1af2

Comment by Siyuan Zhou [ 03/Mar/17 ]

This issue does not cause data corruption on 3.2, so we decided not to backport the fix to that branch.

I backported the StepUp command to 3.2 locally and modified the test file to run the scenario described in Spencer's first comment. After step 5, step 6 does not happen: because the last oplog entry is the sentinel, bgsync in 3.2 refuses to sync anything until the node exits drain mode. As a result, the node spins around line 239.

Contrary to what we originally thought, 3.2 does not run into data corruption. This patch is mostly a clean-up on master (and 3.4) that fixes inconsistent states when a node steps down during draining.

Comment by Githook User [ 03/Mar/17 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-27120 Fix js test for no journal builds.
Branch: master
https://github.com/mongodb/mongo/commit/4403f1b1b503e67058ef7425d7e278be1d5b3e84

Comment by Githook User [ 28/Feb/17 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-27120 Increase synchronization between producer/applier threads and stepdown/stepup

SERVER-27913 Make sure the last applied hash is corresponding to the last applied optime in bgsync start()
Branch: master
https://github.com/mongodb/mongo/commit/1da3111dc238698e4e70672b7ba260a368121e50

Comment by Spencer Brody (Inactive) [ 18/Nov/16 ]

The minimal fix is to refuse to run for election if _isWaitingForApplierToDrain or _isCatchingUp is true.
Additionally, on stepdown we should try to signal the producer to stop producing new ops, so that the applier finishes its current batch and signals drain complete sooner.
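
The two ideas in this comment can be sketched roughly as follows. This is only an illustrative model, not the code from the actual patch: the class ReplStateSketch and all of its method names are hypothetical, and only the flag names _isWaitingForApplierToDrain and _isCatchingUp come from this ticket. It assumes the drain flag stays set after a stepdown until the applier reports that draining is complete.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical, simplified stand-in for the replication coordinator state.
// Only the two flag names below are taken from the ticket text.
class ReplStateSketch {
public:
    // The node has won an election and is catching up.
    void beginCatchUp() {
        std::lock_guard<std::mutex> lk(_mutex);
        _isCatchingUp = true;
    }

    // Catch-up is over; the applier now has to drain the buffered ops.
    void beginDrain() {
        std::lock_guard<std::mutex> lk(_mutex);
        _isCatchingUp = false;
        _isWaitingForApplierToDrain = true;
    }

    // The applier reports that it has finished draining.
    void signalDrainComplete() {
        std::lock_guard<std::mutex> lk(_mutex);
        _isWaitingForApplierToDrain = false;
    }

    // Minimal fix: refuse to run for election while catching up or draining.
    bool canRunForElection() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return !_isWaitingForApplierToDrain && !_isCatchingUp;
    }

    // On stepdown, ask the producer to stop fetching new ops so the applier
    // can finish its current batch sooner. The drain flag is left untouched
    // here (an assumption of this sketch); only the applier clears it.
    void stepDown() {
        _producerShouldStop.store(true);
    }

    // Polled by the (hypothetical) producer loop before fetching more ops.
    bool producerShouldStop() const {
        return _producerShouldStop.load();
    }

private:
    mutable std::mutex _mutex;
    bool _isWaitingForApplierToDrain = false;
    bool _isCatchingUp = false;
    std::atomic<bool> _producerShouldStop{false};
};

// Usage: a stepdown during draining must not allow a new election
// until the applier has signaled drain complete.
inline void exampleStepdownDuringDrain() {
    ReplStateSketch state;
    state.beginCatchUp();
    state.beginDrain();
    state.stepDown();
    assert(state.producerShouldStop());
    assert(!state.canRunForElection());
    state.signalDrainComplete();
    assert(state.canRunForElection());
}
```

In this sketch the election check and the producer signal are deliberately independent: the first closes the window in which a node could run for election while still draining, and the second only shortens that window.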
