[SERVER-28068] Do not go into rollback due to falling off the back of your sync source's oplog Created: 21/Feb/17 Updated: 08/Nov/22 Resolved: 21/May/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Xuerui Fa |
| Resolution: | Done | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Repl 2020-04-20, Repl 2020-05-04, Repl 2020-05-18, Repl 2020-06-01 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 0 |
| Description |
|
Currently we treat falling off the back of the oplog the same as being on a divergent branch of history here. Before deciding to go into rollback, we should check whether we've fallen off the back of our sync source's oplog and, if so, switch sync sources (with an InvalidSyncSource error) instead of rolling back (OplogStartMissing error). We can do this the same way the SyncSourceResolver does: by comparing our last fetched optime to the sync source's oldest optime. |
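The proposed check can be sketched as follows. This is a minimal illustration, not the server's implementation: the `OpTime` class, the `FetchOutcome` enum, and `classify_missing_start` are hypothetical names, and the sketch assumes staleness is judged by comparing timestamps only.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class OpTime:
    """Simplified optime (hypothetical): an integer timestamp plus the election term."""
    timestamp: int
    term: int


class FetchOutcome(Enum):
    INVALID_SYNC_SOURCE = "InvalidSyncSource"  # fell off the back: switch sync sources
    OPLOG_START_MISSING = "OplogStartMissing"  # possibly divergent history: roll back


def classify_missing_start(last_fetched: OpTime, source_oldest: OpTime) -> FetchOutcome:
    """Classify the case where our last fetched entry is not in the source's oplog."""
    # Assumption: if our last fetched entry is older than the oldest entry the
    # sync source still holds, the entry was truncated from its oplog, so we
    # have fallen off the back rather than diverged.
    if last_fetched.timestamp < source_oldest.timestamp:
        return FetchOutcome.INVALID_SYNC_SOURCE
    return FetchOutcome.OPLOG_START_MISSING
```

With these definitions, a node whose last fetched timestamp is 100 against a source whose oldest timestamp is 250 would be told to change sync sources; a node at 300 against the same source would fall through to the rollback path.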
| Comments |
| Comment by Githook User [ 21/May/20 ] |
|
Author: {'name': 'Xuerui Fa', 'email': 'xuerui.fa@mongodb.com', 'username': 'XueruiFa'}Message: |
| Comment by Judah Schvimer [ 11/Mar/19 ] |
|
This will likely be much less important after the initial sync semantics project, since nodes will no longer fall off the back of their sync sources' oplogs, at least in Atlas. |
| Comment by Judah Schvimer [ 08/Mar/19 ] |
|
In discussion with jason.chan I realized that this is more common than I previously thought. If the oplog fetching cursor is killed due to falling off the back of the oplog, it will return an error to the OplogFetcher, causing the node to retry its find command on the oplog. This doesn't go through sync source resolution, so the node will go into rollback instead of seeing that it has fallen off its sync source's oplog. One way to easily disambiguate these cases is by checking the term. I think we can check the term of the two oplog entries here. If the terms are the same, I think it definitely means we fell off the back of our sync source's oplog, since we are on the same branch of history. This doesn't guarantee that we never go into rollback after falling off the back of our sync source's oplog, but it does fix almost all occurrences when term changes are infrequent. siyuan.zhou and tess.avitabile, thoughts? I still don't think this is high priority, since this is rare and changing sync sources isn't so likely to save the lagging node. |
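The term heuristic from this comment might look like the sketch below. The function names are hypothetical, and because the comment does not say which two oplog entries are compared, the sketch assumes they are the node's last fetched entry and the first entry returned by the sync source.

```python
def fell_off_back_by_term(local_term: int, remote_term: int) -> bool:
    """Return True when both entries carry the same term.

    Assumption: equal terms mean both entries sit on the same branch of
    history, so a missing start entry indicates a truncated oplog rather
    than divergence.
    """
    return local_term == remote_term


def next_action(local_term: int, remote_term: int) -> str:
    """Pick a follow-up action (hypothetical labels) when the start entry is missing."""
    if fell_off_back_by_term(local_term, remote_term):
        return "change_sync_source"
    # Differing terms do not prove divergence; as the comment notes, a node
    # that fell off the back across a term change still ends up here.
    return "rollback"
```

The check is deliberately one-sided: it only rules out rollback when the terms match, which is why it helps most when term changes are infrequent.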
| Comment by Githook User [ 04/Dec/17 ] |
|
Author: {'username': 'xdg', 'email': 'xdg@xdg.me', 'name': 'David Golden'}Message: This avoids a bug due to https://jira.mongodb.org/browse/SERVER-28068 |
| Comment by Githook User [ 29/Nov/17 ] |
|
Author: {'name': 'David Golden', 'username': 'xdg', 'email': 'xdg@xdg.me'}Message: This avoids a bug due to https://jira.mongodb.org/browse/SERVER-28068 |
| Comment by Judah Schvimer [ 14/Mar/17 ] |
|
No, it is still possible. When this happens we will always get an UnrecoverableRollbackException and crash, so I do not think we will ever roll back writes that shouldn't be rolled back. Additionally, this happens when our sync source is a valid sync source apart from being too far ahead, so we should trust its history. |
| Comment by Crystal Horn [ 09/Mar/17 ] |
|
judah.schvimer, was this done in your recent commit? Can this be closed now? |
| Comment by Spencer Brody (Inactive) [ 21/Feb/17 ] |
|
Okay, yeah, I guess it would indeed be better to behave as we do when we come up already too stale to sync from anyone, which is that we log that we are too stale and then go into RECOVERING (and will continue to look for viable sync sources to become available, after |
| Comment by Judah Schvimer [ 21/Feb/17 ] |
|
We can't find a common point, get an UnrecoverableRollbackException, and fassert. |
| Comment by Spencer Brody (Inactive) [ 21/Feb/17 ] |
|
What's the current behavior if we go into ROLLBACK but then are too stale to sync from anyone? |