[SERVER-21700] Do not relax constraints during steady state replication Created: 30/Nov/15 Updated: 10/Nov/23 Resolved: 12/Mar/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.7.0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Scott Hernandez (Inactive) | Assignee: | Matthew Russotto |
| Resolution: | Done | Votes: | 0 |
| Labels: | PM-843, former-quick-wins |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.4 |
| Sprint: | Repl 2020-02-10, Repl 2020-02-24, Repl 2020-03-09, Repl 2020-03-23 |
| Participants: | |
| Linked BF Score: | 30 |
| Description |
|
In order to do this, we would need to decide what should be done when the document targeted by an update cannot be found. The current behavior of potentially creating a partial document has downsides when users make changes outside of replication, such as during manual maintenance (or after user/admin mistakes). We may also want to make this change only during normal replication, not when re-applying a failed batch, during rollback, or in other cases where we may be working with repeated operations. |
| Comments |
| Comment by Githook User [ 12/Mar/20 ] |
|
Author: Matthew Russotto <matthew.russotto@10gen.com> (mtrussotto)
Message: |
| Comment by Judah Schvimer [ 04/Dec/19 ] |
|
This may depend on |
| Comment by Judah Schvimer [ 06/Sep/19 ] |
|
Note: we cannot allow document validation on secondaries, since we can bypass it on primaries without logging in the oplog that we did so. |
| Comment by Judah Schvimer [ 03/Sep/19 ] |
|
We also relax NamespaceNotFound errors in recovering and on deletes. I don't know why deletes here are special compared to updates or inserts. The justification for relaxing NamespaceNotFound errors in recovering only applies to eMRC=F now that the ticket mentioned in the todo in the code is complete. |
| Comment by A. Jesse Jiryu Davis [ 20/Aug/19 ] |
|
Additional issue to consider: today, users can call applyOps directly with oplogApplicationMode: "InitialSync" and alwaysUpsert: false. We expect this not to be an upsert, and therefore to fail if the target document does not exist; see apply_ops_mode.js. However, if applyOps is actually being executed during initial sync with alwaysUpsert: false and the target document does not exist, we ignore the failure and continue applying the rest of the operations in the applyOps command, as well as the rest of the oplog entries in the overall batch. This is surprising: the behavior of applyOps differs depending on whether a user calls it with oplogApplicationMode: "InitialSync" or it is actually executed during initial sync. |
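The divergence above can be sketched as follows (in Python rather than the server's C++; `apply_update`, `UpdateFailed`, and the flag name are hypothetical illustrations, not server internals):

```python
# Sketch of the inconsistency: a user-invoked applyOps with
# oplogApplicationMode "InitialSync" and alwaysUpsert: false fails on a
# missing target document, while the same op applied during an actual
# initial sync is silently skipped and the batch continues.

class UpdateFailed(Exception):
    pass

def apply_update(docs, op, during_real_initial_sync):
    """Apply an update op against `docs` (a dict keyed by _id)."""
    if op["_id"] in docs:
        docs[op["_id"]].update(op["fields"])
        return "updated"
    if during_real_initial_sync:
        # Internal initial sync ignores the failure and keeps applying.
        return "skipped"
    # User-invoked applyOps with alwaysUpsert: false surfaces the failure.
    raise UpdateFailed(f"no document with _id {op['_id']}")

op = {"_id": 1, "fields": {"a": 2}}
assert apply_update({}, op, during_real_initial_sync=True) == "skipped"
try:
    apply_update({}, op, during_real_initial_sync=False)
    raised = False
except UpdateFailed:
    raised = True
assert raised
```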
| Comment by Judah Schvimer [ 12/Aug/19 ] |
|
We may want or need to stop putting atomic applyOps commands into the oplog as the user gives them to us and start transforming them into the writes we actually do (but keep them in applyOps to be atomic and have no upgrade/downgrade concerns). |
| Comment by Judah Schvimer [ 09/Aug/19 ] |
|
Note that during initial sync we currently do not turn updates into upserts, so we should never have to turn updates into upserts during oplog application. We will only upsert if the 'b' field is set or if an 'applyOps' oplog entry specifies to 'alwaysUpsert'. |
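The rule stated above can be condensed into a small predicate (a sketch only; the oplog 'b' field and the applyOps alwaysUpsert option come from the comment, while the helper name is hypothetical):

```python
# An update oplog entry is applied as an upsert only when its 'b' field is
# true, or when it arrives inside an applyOps entry that sets alwaysUpsert.

def should_upsert(entry, in_apply_ops=False, always_upsert=False):
    """Decide whether an update oplog entry may be applied as an upsert."""
    if in_apply_ops and always_upsert:
        return True
    return bool(entry.get("b", False))

assert should_upsert({"op": "u", "b": True}) is True
assert should_upsert({"op": "u"}) is False
assert should_upsert({"op": "u"}, in_apply_ops=True, always_upsert=True) is True
```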
| Comment by Judah Schvimer [ 09/Aug/19 ] |
|
We will need to do something about updates on capped collections. A document may be deleted on the secondary before the primary, so an update that worked on the primary could fail on the secondary. This currently becomes an upsert, which can cause problems like |
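A hypothetical timeline of that hazard (plain dicts stand in for collections; nothing here is server code):

```python
# The secondary's capped collection evicts a document before the primary's
# does, so an update that succeeded on the primary targets a missing
# document on the secondary. Today's upsert fallback then resurrects the
# evicted document, so the two nodes diverge.

primary = {1: {"x": 0}, 2: {"x": 0}}
secondary = {2: {"x": 0}}   # doc 1 was already evicted on the secondary

# The primary's update to doc 1 succeeds and is written to the oplog.
primary[1]["x"] = 1
oplog_update = {"_id": 1, "x": 1}

# On the secondary the target is missing, and the upsert fallback
# recreates a document the capped collection had already dropped.
if oplog_update["_id"] not in secondary:
    secondary[oplog_update["_id"]] = {"x": oplog_update["x"]}

assert 1 in secondary   # resurrected, violating capped-collection semantics
```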
| Comment by Ratika Gandhi [ 08/Aug/19 ] |
|
Will be scheduled in Repl's Q3 quick win bucket |
| Comment by Judah Schvimer [ 08/Aug/19 ] |
|
The plan is to:
As a result, I think it may make sense to do this all together. The implementer can decide. An open question is "Should not being able to apply an op cause an election instead of a crash?". |
| Comment by Judah Schvimer [ 07/Aug/19 ] |
|
I'll file separate tickets for 2-4 above and for command acceptable errors, so there will be 5 tickets in total. I'm going to mark them all "depends on" the oplog application refactor (though not part of that epic) so we're not messing around in there too much at once. Since it's been like this for a while, I think it's ok not to backport, and doing the work after the refactor will likely be easier. |
| Comment by Eric Milkie [ 07/Aug/19 ] |
|
Great, I concur. Can we change the title and description to reflect this? Or should we file separate tickets for that work? Some of those changes might be harder than others, and we may want to deploy them piecemeal. |
| Comment by Judah Schvimer [ 07/Aug/19 ] |
|
I wasn't clear. I'm hoping this ticket can tighten up all of these constraints. I agree that we currently relax them all unnecessarily. |
| Comment by Eric Milkie [ 07/Aug/19 ] |
|
In today's code are we continuing to do all 4, or have we tightened up some things already? |
| Comment by Judah Schvimer [ 07/Aug/19 ] |
|
Outside of initial sync, recovering, and the applyOps command (so during steady state replication), we do not want to:
Command "acceptable errors" should also only be "acceptable" outside of steady state replication. Violating any of these constraints during steady state would imply data corruption. |
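The policy argued for in this thread can be sketched as a mode check (a hedged illustration, not server code; the mode names and the election-vs-crash choice mirror the open question raised earlier in the comments):

```python
# Errors such as a missing target document or an unknown namespace are
# tolerable during initial sync, recovery, or a user-invoked applyOps,
# but during steady state replication they signal corruption and must
# not be silently relaxed.

RELAXED_MODES = {"initial_sync", "recovering", "apply_ops_cmd"}

def handle_application_error(mode, error):
    """Return 'ignore' only when the mode permits relaxed constraints."""
    if mode in RELAXED_MODES:
        return "ignore"
    # Steady state: fail loudly (crash, or possibly step down for an
    # election -- the open question from the 08/Aug/19 comment).
    raise RuntimeError(f"oplog application failed in steady state: {error}")

assert handle_application_error("initial_sync", "DocumentNotFound") == "ignore"
try:
    handle_application_error("secondary", "NamespaceNotFound")
    fatal = False
except RuntimeError:
    fatal = True
assert fatal
```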