[SERVER-36997] Stepdown thread should perform w:all write after stepping up and restarting node Created: 05/Sep/18 Updated: 27/Oct/23 Resolved: 18/Apr/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tess Avitabile (Inactive) | Assignee: | Backlog - Replication Team |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | tig-bfday-eligible, tig-nep, tig-resmoke |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Replication |
| Operating System: | ALL |
| Sprint: | STM 2019-09-09, STM 2020-01-09 |
| Participants: | |
| Linked BF Score: | 19 |
| Story Points: | 3 |
| Description |
|
After the stepdown thread steps up a node, any of the other nodes in the set may go into rollback. This rollback can happen at unpredictable times, since it is based on when the node gets a new batch and triggers the OplogStartMissing error. When rollback occurs, the node closes all of its connections, and the stepdown thread is not robust to connections closing at arbitrary points. The stepdown thread can wait for all rollbacks to complete by performing a w:"all" write after stepping up a node. |
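A minimal sketch of the proposed wait, not the actual resmoke hook code. It assumes a PyMongo-style client connected to the new primary and assumes that "w:all" is expressed as a numeric write concern equal to the number of data-bearing nodes. The function name, database name, collection name, and default timeout are all hypothetical.

```python
def await_all_nodes_caught_up(client, num_data_nodes, wtimeout_ms=60_000):
    """Insert one document and wait until every data-bearing node acknowledges it.

    `client` is assumed to be a PyMongo-style client for the new primary.
    Per the description above, the write is meant to return only once any
    rollback triggered by the step-up has completed.
    """
    # Hypothetical scratch namespace; any writable collection would do.
    db = client.get_database("stepdown_sketch")
    reply = db.command(
        "insert",
        "await_rollback",
        documents=[{"reason": "wait for rollbacks after step-up"}],
        # Assumption: "w:all" means w = number of data-bearing nodes.
        writeConcern={"w": num_data_nodes, "wtimeout": wtimeout_ms},
    )
    # Surface a write concern failure (e.g. timeout) if the reply carries one.
    if "writeConcernError" in reply:
        raise RuntimeError(f"w:all write not acknowledged: {reply['writeConcernError']}")
    return reply
```

The stepdown thread would call this right after stepping up a node and before touching any other connection, so that connection closes caused by rollback happen while it is only waiting on the primary.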
| Comments |
| Comment by Robert Guo (Inactive) [ 11/Apr/22 ] |
|
judah.schvimer@mongodb.com I don't believe there's increased urgency. The BF has not occurred recently and may have been addressed elsewhere. |
| Comment by Iryna Zhuravlova [ 08/Apr/22 ] |
|
robert.guo is wondering whether anyone on the Replication team is interested in taking this. |
| Comment by Robert Guo (Inactive) [ 13/Nov/19 ] |
|
We should also do a w:all write after a restart so that setFeatureCompatibilityVersion calls are propagated to all nodes. Otherwise, a secondary will drop all connections when setFCV is called. |
| Comment by Max Hirschhorn [ 13/Sep/18 ] |
|
To allow stepdowns and rollbacks to happen concurrently in sharded clusters on the CSRS and replica set shards, we should do this in two phases: (1) for each replica set, step down/terminate/kill a node in the replica set and step a new node up; and (2) for each replica set, do a w=all write to wait for all rollbacks to have finished. |
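The two-phase ordering above could be sketched roughly as follows. This is an illustration only: `run_stepdown_round` and both callback parameters are hypothetical names, and the callbacks stand in for whatever the stepdown thread actually does to change primaries and to issue the w=all write.

```python
def run_stepdown_round(replica_sets, step_up_new_primary, await_all_nodes):
    """Run one stepdown round over every replica set in two phases.

    `step_up_new_primary(rs)` and `await_all_nodes(rs)` are hypothetical
    callbacks supplied by the caller.
    """
    # Phase 1: change the primary of every replica set first, so that
    # rollbacks in the different sets can proceed concurrently.
    for rs in replica_sets:
        step_up_new_primary(rs)

    # Phase 2: only then wait, set by set, for a w=all write to be
    # acknowledged, which is meant to indicate that rollbacks have finished.
    for rs in replica_sets:
        await_all_nodes(rs)
```

Splitting the loop this way means no replica set is waited on until all of them have had a new primary stepped up, rather than stepping up and waiting on each set in turn.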