[SERVER-81115] ReplicaSetAwareService Can Be Shutdown While Node is Still Primary Created: 15/Sep/23 Updated: 10/Nov/23 |
|
| Status: | Investigating |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Brett Nawrocki | Assignee: | Lingzhi Deng |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Replication |
| Sprint: | Repl 2023-10-02, Repl 2023-10-16, Repl 2023-10-30 |
| Participants: | |
| Description |
|
On stepdown, ReplicaSetAwareServices are notified as part of the stepdown actions, and only after the node is no longer writable. On shutdown, this property holds in most cases, because we first trigger a stepdown before shutting down the replication coordinator (and therefore the ReplicaSetAwareServices). However, in rare cases the node can remain primary even after ReplicaSetAwareServices are shut down, if the stepdown attempt during shutdown fails (for example, because no secondaries are caught up at the time of the shutdown).

The stepdown can fail because, despite what the forceShutdown parameter might suggest, the stepdown attempt is not forced. Instead, forceShutdown only determines whether we return the actual error when the stepdown attempt fails, or swallow it and return OK anyway. Note also that the caller invariants that stepDownForShutdown returns OK, which is to say that we take no meaningful action if the stepdown attempt fails (e.g. by deciding to abort the shutdown).

The ultimate consequence is that PrimaryOnlyService (built on top of ReplicaSetAwareService) needed to expose its shutdown state (see

This ticket exists to answer a few questions:
|
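As a rough illustration of the shutdown path described in this ticket, the following is a minimal, self-contained C++ model, not the actual server code. Only stepDownForShutdown and forceShutdown are names taken from the description. Status, NodeState, tryStepDown, simulateShutdown and the secondaryCaughtUp flag are hypothetical stand-ins introduced for the sketch, and the real signatures almost certainly differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a status type; not the real mongo::Status.
struct Status {
    bool ok;
    std::string reason;
};

// Hypothetical model of the node state that matters for this ticket.
struct NodeState {
    bool isPrimary = true;        // node starts out as a writable primary
    bool servicesRunning = true;  // ReplicaSetAwareServices are up
};

// Unforced stepdown attempt: it fails when no secondary is caught up.
Status tryStepDown(NodeState& node, bool secondaryCaughtUp) {
    if (!secondaryCaughtUp) {
        return {false, "no caught-up secondary"};
    }
    node.isPrimary = false;
    return {true, ""};
}

// Models the behaviour described above: forceShutdown does NOT force the
// stepdown. It only decides whether a failure is reported or swallowed.
Status stepDownForShutdown(NodeState& node, bool secondaryCaughtUp, bool forceShutdown) {
    Status status = tryStepDown(node, secondaryCaughtUp);
    if (!status.ok && forceShutdown) {
        return {true, ""};  // error swallowed; the node may still be primary
    }
    return status;
}

// Models the caller: it asserts on OK and then shuts the services down
// without checking whether the node actually stopped being primary.
NodeState simulateShutdown(bool secondaryCaughtUp, bool forceShutdown) {
    NodeState node;
    Status status = stepDownForShutdown(node, secondaryCaughtUp, forceShutdown);
    assert(status.ok);  // stands in for the caller's invariant
    node.servicesRunning = false;
    return node;
}
```

In this model, simulateShutdown(false, true) returns a state in which the services are stopped while isPrimary is still true, which is the condition this ticket is concerned with. With forceShutdown set to false and no caught-up secondary, the assertion fires instead, mirroring the fact that the caller has no recovery path for a failed stepdown.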