[SERVER-50140] Initial sync cannot survive unclean restart of the sync source Created: 06/Aug/20 Updated: 29/Oct/23 Resolved: 12/Aug/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 4.4.0 |
| Fix Version/s: | 4.4.1, 4.7.0 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Suganthi Mani | Assignee: | Lingzhi Deng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | KP44 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4 |
| Sprint: | Repl 2020-08-24 |
| Participants: | |
| Description |
|
Investigate the behavior of resumable initial sync when sync source restarts after unclean shutdown. |
| Comments |
| Comment by Githook User [ 13/Aug/20 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'} Message: (cherry picked from commit 06169f718a7aec04f952979ffa6590e4334dea5a) |
| Comment by Githook User [ 12/Aug/20 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}Message: |
| Comment by Suganthi Mani [ 07/Aug/20 ] |
|
Just adding some additional points to Eric's comment: to my understanding, after the durable history project, both replication startup recovery and replication rollback execute the WT method rollback_to_stable(). |
| Comment by Eric Milkie [ 07/Aug/20 ] |
I'm not really sure why people would care whether a rollback was "replication rollback" or not. The method used to do rollback is almost the same – recover-to-timestamp is used both for replication rollback and at startup. So in some sense, it would be more accurate to count unclean shutdowns as a "rollback". Indeed, after we switched table logging to be almost all "unlogged", that had the effect of converting a lot of situations that would have been "replication rollback" into ones where the rollback occurred at startup recovery instead, so the metrics are already biased as of that change. |
| Comment by Judah Schvimer [ 07/Aug/20 ] |
|
AFAIK, the rollback ID is only used for initial sync, sync source selection post rollback via refetch, and downstream implementations of initial sync. One concern with incrementing the rollback ID would be that it would make metrics no longer accurate about how many replication rollbacks we had. We could fix that concern by adding a metric specifically for replication rollback counting. I would expect downstream implementations of initial sync to want the same fix as this bug. Sync source selection post rollback via refetch would also want the same fix. That uses the rollback ID to confirm that a potential sync source still has a node's minValid oplog entry. Since this is a way for a potential sync source to lose a minValid oplog entry, it would make sense to bump the rollback ID in this case. |
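To make the rollback ID reasoning above concrete, here is a minimal sketch in Python of a toy model. It is not the committed fix and not MongoDB code. Every name in it (`Node`, `restart`, `replication_rollbacks`, `initial_sync`, `SyncSourceChanged`) is hypothetical. It assumes the proposal discussed in this thread: an unclean restart bumps the rollback ID, while a separate counter keeps the replication rollback metric accurate.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy sync source. All fields and methods are hypothetical."""
    rollback_id: int = 1
    # Separate metric (assumption): counts only formal replication rollbacks.
    replication_rollbacks: int = 0

    def replication_rollback(self) -> None:
        # A formal replication rollback bumps both the ID and the metric.
        self.rollback_id += 1
        self.replication_rollbacks += 1

    def restart(self, clean: bool) -> None:
        # Assumed proposal: startup recovery after an unclean shutdown may
        # discard oplog entries, so bump the ID but leave the metric alone.
        if not clean:
            self.rollback_id += 1


class SyncSourceChanged(Exception):
    """Raised when the source's rollback ID changed during the sync."""


def initial_sync(source: Node, during_sync) -> None:
    # Record the source's rollback ID before copying any data.
    rid_at_start = source.rollback_id
    # Stand-in for cloning and oplog fetching; the callback lets a caller
    # simulate events that happen on the source while the sync is running.
    during_sync(source)
    # If the ID moved, the source may have lost entries we depend on.
    if source.rollback_id != rid_at_start:
        raise SyncSourceChanged(
            f"rollback ID changed from {rid_at_start} to {source.rollback_id}"
        )
```

In this model a clean restart of the source leaves the sync unaffected, an unclean restart makes `initial_sync` raise, and `replication_rollbacks` stays at zero in both cases, which is the property the separate metric is meant to preserve.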
| Comment by Suganthi Mani [ 07/Aug/20 ] |
|
Even if it's not the rollback ID, we can have a separate ID to track unclean shutdowns. But I am also curious how we can distinguish between a clean shutdown-restart and an unclean shutdown-restart. Is it something we can do by using this logic? Or is it possible to use "oplogTruncateAfterPoint" to get that information, since this field is cleared during a clean shutdown on a primary and is set on a secondary only if that secondary crashes while writing oplog entries for a batch? |
| Comment by Matthew Russotto [ 07/Aug/20 ] |
|
Does it make sense to have a node increment its rollback ID during restart from unclean shutdown? Since it could have effectively "rolled back" even if no formal rollback happened? Or would that have undesirable side effects? |
| Comment by Eric Milkie [ 06/Aug/20 ] |
|
It sounds like this situation would be hard to hit in practice, since you would need to have a primary node undergo an unclean shutdown, restart, and then NOT undergo any rollback. |