[SERVER-59586] Catchup takeover cannot be scheduled if primary has caught up but stuck before being writeable primary Created: 25/Aug/21 Updated: 06/Dec/22 Resolved: 30/Aug/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Wenbin Zhu | Assignee: | Backlog - Replication Team |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Replication |
| Case: | (copied to CRM) | ||||
| Description |
|
Secondaries schedule a catchup takeover when they see that the primary's last applied opTime is behind their own. However, if the primary is already caught up (either it caught up through the normal catchup mechanism on stepup, or it was already ahead by the time it was elected), no secondary can schedule a catchup takeover. Therefore, if the primary is stuck before it becomes a writable primary (e.g. stuck bumping the config term), the whole system freezes because no catchup takeover is scheduled. So it seems that we should relax the criteria for scheduling a catchup takeover so that they depend on whether the primary has written a new entry in the new term (which indicates that it has become writable), without requiring that the primary be behind. |
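
To make the difference between the two criteria concrete, here is a minimal sketch in Python. It is not the server's implementation: the `OpTime` class, both function names, and the assumption that opTimes compare by term first and then by timestamp are illustrative choices made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class OpTime:
    """Hypothetical stand-in for an opTime; ordered by term, then timestamp."""
    term: int
    timestamp: int


def should_schedule_current(my_last_applied: OpTime,
                            primary_last_applied: OpTime) -> bool:
    # Criterion as described in the ticket: a secondary schedules a
    # catchup takeover only when the primary is behind it.
    return primary_last_applied < my_last_applied


def should_schedule_relaxed(current_term: int,
                            primary_last_applied: OpTime) -> bool:
    # Proposed criterion (assumed shape): schedule a takeover as long as
    # the primary has not yet written an entry in the new term, i.e. it
    # has not shown that it became writable.
    return primary_last_applied.term < current_term


if __name__ == "__main__":
    # Primary was elected in term 5, is fully caught up, but is stuck
    # before writing its first entry in term 5.
    caught_up = OpTime(term=4, timestamp=100)
    print(should_schedule_current(caught_up, caught_up))
    print(should_schedule_relaxed(5, caught_up))
    # Once the primary writes in term 5, the relaxed check stops firing.
    print(should_schedule_relaxed(5, OpTime(term=5, timestamp=101)))
```

In the stuck-primary scenario the current check never fires because the opTimes are equal, while the relaxed check keeps firing until the primary writes its first entry in the new term.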
| Comments |
| Comment by Wenbin Zhu [ 30/Aug/21 ] |
|
Superseded by PM-1039 |
| Comment by Wenbin Zhu [ 25/Aug/21 ] |
|
We might want to have a Storage Watchdog and integrate it into PM-1039 as a more generic solution, as pointed out by judah.schvimer. |
| Comment by Wenbin Zhu [ 25/Aug/21 ] |
|
Yeah, "improvement" makes sense to me. I will change the category to "improvement", thanks. As for the catchup timeout, I initially thought it would also help break out of drain mode and step down, but after checking the code, it seems to only be able to abort catchup mode. So I guess that statement was incorrect; please ignore it. |
| Comment by Samyukta Lanka [ 25/Aug/21 ] |
|
I guess I'd argue that the primary being stuck is the issue, not that another node didn't perform a takeover. I think the latter is more of an undesirable behavior (or a potential improvement) than a bug. I'm not sure I understand your point about disabling catchup timeout though. What other cases are you thinking of that the catchup timeout handles? |
| Comment by Wenbin Zhu [ 25/Aug/21 ] |
|
samy.lanka Yeah, if we go with the proposed solution, it sounds more like a feature request, but its original purpose is to solve a bug. We can change the ticket category once we decide on the solution. Also note that since catchup takeover was introduced, catchup timeout has been disabled on the primary by default. So this also seems like kind of a bug if we define catchup takeover as only being about taking over from a primary that has been catching up on the oplog for too long, because catchup timeout is able to handle more cases. |
| Comment by Samyukta Lanka [ 25/Aug/21 ] |
|
It sounds like this is a proposal to change catchup takeover from being about primary catchup taking too long to being about a node taking too long to be able to start accepting writes after being elected (for whatever reason). In that sense, this feels like a new feature request rather than a bug. |