[SERVER-36997] Stepdown thread should perform w:all write after stepping up and restarting node
Created: 05/Sep/18  Updated: 27/Oct/23  Resolved: 18/Apr/22

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tess Avitabile (Inactive) Assignee: Backlog - Replication Team
Resolution: Gone away Votes: 0
Labels: tig-bfday-eligible, tig-nep, tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-36960 Stepdown thread should handle AutoRec... Closed
Assigned Teams:
Replication
Operating System: ALL
Sprint: STM 2019-09-09, STM 2020-01-09
Participants:
Linked BF Score: 19
Story Points: 3

 Description   

After the stepdown thread steps up a node, any of the other nodes in the set may go into rollback. This rollback can happen at unpredictable times, since it is based on when the node gets a new batch and triggers the OplogStartMissing error. When rollback occurs, the node closes all of its connections, and the stepdown thread is not robust to connections closing at arbitrary points. The stepdown thread can wait for all rollbacks to complete by performing a w:"all" write after stepping up a node.
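A minimal sketch of the wait described above, in Python. It assumes that w:"all" means a numeric write concern equal to the number of data-bearing members, and that `db` is any object exposing a PyMongo-style `command()` method. The function names, the scratch collection name, and the default timeout are hypothetical and are not taken from the stepdown thread's actual code.

```python
def build_w_all_insert(num_nodes, wtimeout_ms=60000):
    """Build an insert command that every data-bearing member must acknowledge.

    Assumption: "w:all" is expressed as w == num_nodes.
    """
    return {
        # The command name must be the first key of the command document.
        "insert": "stepdown_w_all",  # hypothetical scratch collection
        "documents": [{"purpose": "wait for rollbacks to finish"}],
        "writeConcern": {"w": num_nodes, "wtimeout": wtimeout_ms},
    }


def wait_for_rollbacks(db, num_nodes, wtimeout_ms=60000):
    """Block until the write is acknowledged by all members.

    A member that is still in rollback cannot acknowledge the write, so a
    successful reply implies that every member has left rollback and caught up.
    """
    reply = db.command(build_w_all_insert(num_nodes, wtimeout_ms))
    # A raw insert reply can report a write concern failure alongside ok: 1,
    # so the reply is checked explicitly.
    if "writeConcernError" in reply:
        raise RuntimeError(
            "w:all write was not acknowledged: %r" % (reply["writeConcernError"],)
        )
    return reply
```

With a real connection, `db` could be a PyMongo `Database` obtained from a client connected to the new primary. Retrying on a dropped connection is left out of the sketch.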



 Comments   
Comment by Robert Guo (Inactive) [ 11/Apr/22 ]

judah.schvimer@mongodb.com I don't believe there's increased urgency. The BF has not occurred recently and may have been addressed elsewhere.

Comment by Iryna Zhuravlova [ 08/Apr/22 ]

robert.guo is wondering if anyone on the Replication team is interested in taking this?

Comment by Robert Guo (Inactive) [ 13/Nov/19 ]

We should also do a w:all write after a restart so that setFeatureCompatibilityVersion calls are propagated to all nodes. Otherwise, a secondary will drop all connections when setFCV is called.

Comment by Max Hirschhorn [ 13/Sep/18 ]

In order to allow stepdowns and rollbacks to happen concurrently in sharded clusters on the CSRS and replica set shards, we should do this in two phases: (1) for each replica set, step down/terminate/kill a node in the replica set and step a new node up, and (2) for each replica set, do a w=all write to wait for all rollbacks to have finished.
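The two-phase ordering can be sketched as follows, in Python. The function name `run_stepdown_round` and both callables are hypothetical placeholders for whatever the stepdown thread uses to change the primary and to issue the w=all write. The sketch shows only the ordering, not the real implementation.

```python
def run_stepdown_round(replica_sets, change_primary, wait_for_rollbacks):
    """Run one stepdown round across all replica sets in two phases.

    replica_sets: iterable of replica set handles (CSRS and shards).
    change_primary: hypothetical callable that steps down/terminates/kills a
        node in the given replica set and steps a new node up.
    wait_for_rollbacks: hypothetical callable that performs a w=all write
        against the given replica set.
    """
    replica_sets = list(replica_sets)

    # Phase 1: change the primary of every replica set first, so that any
    # resulting rollbacks can proceed concurrently across the cluster.
    for rs in replica_sets:
        change_primary(rs)

    # Phase 2: only then wait on each replica set for its rollbacks to finish.
    for rs in replica_sets:
        wait_for_rollbacks(rs)
```

Interleaving the two steps per replica set would instead serialize the rollbacks, because each wait would have to complete before the next primary change began.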

Generated at Thu Feb 08 04:44:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.