[SERVER-21700] Do not relax constraints during steady state replication Created: 30/Nov/15  Updated: 10/Nov/23  Resolved: 12/Mar/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Improvement Priority: Major - P3
Reporter: Scott Hernandez (Inactive) Assignee: Matthew Russotto
Resolution: Done Votes: 0
Labels: PM-843, former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-26358 Reconsider converting update to upser... Closed
is duplicated by SERVER-38724 Under rare conditions, updates to cap... Closed
is duplicated by SERVER-27594 Don't apply update oplog entries as u... Closed
Gantt Dependency
has to be done before SERVER-46221 Remove oplogApplicationEnforcesSteady... Open
Problem/Incident
causes SERVER-47022 Oplog application mode must be set to... Backlog
Related
related to SERVER-31387 oplog application conflates upserting... Closed
related to SERVER-32346 When applying applyOps commands durin... Closed
related to SERVER-32937 Relax renameCollection namespace leng... Closed
related to SERVER-43193 Always disable document validation on... Closed
related to SERVER-44450 Do not add fromMigrate field to apply... Closed
related to SERVER-46550 Enabling sharding creates a database ... Closed
related to SERVER-46603 disallow empty collection index build... Closed
related to SERVER-46656 Secondary and primary can disagree on... Closed
related to SERVER-50532 Cannot require that database exists w... Closed
related to SERVER-61532 Opcounters to detect constraint viola... Closed
related to SERVER-47383 Complete TODO listed in SERVER-46656 Closed
related to SERVER-71490 Move steady state replication constra... Closed
related to SERVER-27205 Remove implicit collection creation f... Closed
related to SERVER-33719 createCollectionForApplyOps should in... Closed
related to SERVER-45033 Log operations we do, not those we we... Closed
related to SERVER-48593 createIndexForApplyOps() should alway... Closed
is related to SERVER-47053 Relax oplog application constraints i... Closed
is related to SERVER-54150 Recovery from a stable checkpoint sho... Closed
is related to SERVER-21829 Nodes should consider themselves non-... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4
Sprint: Repl 2020-02-10, Repl 2020-02-24, Repl 2020-03-09, Repl 2020-03-23
Participants:
Linked BF Score: 30

 Description   

In order to do this, we would need to decide what should be done when a document to update cannot be found. The current behavior of potentially creating a partial document has some downsides when users make changes independently of replication, such as during manual maintenance (or through user/admin mistakes).

We may also want to make this change only during normal replication, not when re-applying a failed batch, during rollback, or in other cases where we may be working with repeated operations.
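
For illustration, a minimal mongo shell sketch (collection and field names invented) of how applying an update as an upsert produces a partial document:

  // A "partial document" appears when a $set update targeting a missing
  // document is applied as an upsert: only the query fields and the $set
  // fields are materialized, not the rest of the original document.
  db.users.drop();

  // Suppose { _id: 1, name: "alice", balance: 100 } was removed outside of
  // replication (manual maintenance, an admin mistake), and oplog
  // application now applies an update to it as an upsert:
  db.users.update({ _id: 1 }, { $set: { balance: 150 } }, { upsert: true });

  // Result: { _id: 1, balance: 150 } -- the "name" field is gone. The
  // document has been silently reconstructed with only a subset of its
  // fields.
  printjson(db.users.findOne({ _id: 1 }));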



 Comments   
Comment by Githook User [ 12/Mar/20 ]

Author: Matthew Russotto (mtrussotto) <matthew.russotto@10gen.com>

Message: SERVER-21700 Do not relax constraints during steady state replication.
Branch: master
https://github.com/mongodb/mongo/commit/497f50a5e25db6171290f6e791ad02dd2b607498

Comment by Judah Schvimer [ 04/Dec/19 ]

This may depend on SERVER-44450 as described in this comment.

Comment by Judah Schvimer [ 06/Sep/19 ]

Note: we cannot enforce document validation on secondaries, since validation can be bypassed on a primary without any record of the bypass in the oplog.
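
A minimal shell sketch of why (collection name and validator invented):

  // Document validation can be bypassed on a primary, and the resulting
  // oplog entry carries no indication that validation was skipped.
  db.createCollection("orders", {
    validator: { $jsonSchema: { required: ["sku"] } }
  });

  // This insert violates the validator but succeeds because of the bypass;
  // a secondary that re-validated it during oplog application would
  // wrongly fail the write.
  db.runCommand({
    insert: "orders",
    documents: [{ _id: 1 }],          // missing the required "sku" field
    bypassDocumentValidation: true
  });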

Comment by Judah Schvimer [ 03/Sep/19 ]

We also relax NamespaceNotFound errors in recovering mode and on deletes. I don't know why deletes are special here compared to updates or inserts. The justification for relaxing NamespaceNotFound errors in recovering mode only applies to eMRC=F (enableMajorityReadConcern: false) now that the ticket mentioned in the TODO in the code is complete.

Comment by A. Jesse Jiryu Davis [ 20/Aug/19 ]

Additional issue to consider: today, users can call applyOps directly with oplogApplicationMode: "InitialSync" and alwaysUpsert: false. We expect this not to be an upsert, and therefore to fail if the target document does not exist; see apply_ops_mode.js.

However, if applyOps is actually being executed during initial sync with alwaysUpsert: false and the target document does not exist, we ignore the failure and continue applying the rest of the operations in the applyOps command, as well as the rest of the oplog entries in the overall batch.

This is surprising: the behavior of applyOps differs depending on whether a user calls it with oplogApplicationMode: "InitialSync" or it is actually being executed during initial sync.
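
A sketch of the user-facing call in question (namespace and values invented), which apply_ops_mode.js expects to fail when the target document is missing:

  // User-issued applyOps in "InitialSync" mode with alwaysUpsert: false.
  // Called directly, this fails if { _id: 42 } does not exist; executed
  // during a real initial sync, the same failure would be ignored.
  db.adminCommand({
    applyOps: [
      { op: "u", ns: "test.coll", o2: { _id: 42 }, o: { $set: { x: 1 } } }
    ],
    oplogApplicationMode: "InitialSync",
    alwaysUpsert: false
  });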

Comment by Judah Schvimer [ 12/Aug/19 ]

We may want or need to stop putting atomic applyOps commands into the oplog exactly as the user gives them to us, and instead transform them into the writes we actually perform (while keeping them inside applyOps so they remain atomic and raise no upgrade/downgrade concerns).
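
A hypothetical sketch of that transformation (entry shapes simplified, names invented):

  // What the user hands us today: an atomic applyOps logged as-is, whose
  // update may or may not match a document at application time.
  var received = {
    applyOps: [{ op: "u", ns: "test.c", o2: { _id: 1 }, o: { $set: { a: 1 } } }]
  };

  // What we might log instead: the write we actually performed. If the
  // update matched nothing and was upserted, record the resulting insert,
  // still wrapped in applyOps so the batch stays atomic.
  var logged = {
    applyOps: [{ op: "i", ns: "test.c", o: { _id: 1, a: 1 } }]
  };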

Comment by Judah Schvimer [ 09/Aug/19 ]

Note that during initial sync we currently do not turn updates into upserts, so we should never have to turn updates into upserts during oplog application. We will only upsert if the 'b' field is set on the oplog entry or if an 'applyOps' oplog entry specifies 'alwaysUpsert'.
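
For reference, a sketch of the two shapes that do request an upsert (field layout follows the traditional oplog update format; namespace and values invented):

  // A plain update oplog entry with the legacy 'b' (upsert) flag set:
  var upsertEntry = {
    op: "u", ns: "test.c",
    o2: { _id: 7 }, o: { $set: { x: 1 } },
    b: true                          // upsert if the target is missing
  };

  // An applyOps command that forces upsert semantics for its updates:
  var upsertApplyOps = {
    applyOps: [{ op: "u", ns: "test.c", o2: { _id: 7 }, o: { $set: { x: 1 } } }],
    alwaysUpsert: true
  };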

Comment by Judah Schvimer [ 09/Aug/19 ]

We will need to do something about updates on capped collections. A document may be deleted on the secondary before it is deleted on the primary, so an update that succeeded on the primary could fail on the secondary. Such an update currently becomes an upsert, which can cause problems like SERVER-38724. Instead of turning updates into upserts on capped collections, simply ignoring an update that modifies 0 documents on a capped collection (something we will want to error on for normal collections) seems cleaner; see the sketch below.
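
A single-node shell sketch of how the mismatch arises (sizes and names invented); on a real replica set each node evicts capped documents on its own schedule:

  // Capped collections evict their oldest documents on overflow, so a
  // document can already be gone on the secondary while still present on
  // the primary.
  db.createCollection("log", { capped: true, size: 4096, max: 2 });
  db.log.insert({ _id: 1, msg: "a" });
  db.log.insert({ _id: 2, msg: "b" });
  db.log.insert({ _id: 3, msg: "c" });   // evicts _id: 1

  // This update now matches 0 documents -- the case that today becomes an
  // upsert, and that this ticket proposes to simply ignore for capped
  // collections.
  printjson(db.log.update({ _id: 1 }, { $set: { msg: "x" } }));  // nMatched: 0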

Comment by Ratika Gandhi [ 08/Aug/19 ]

Will be scheduled in Repl's Q3 quick win bucket

Comment by Judah Schvimer [ 08/Aug/19 ]

The plan is to:

  1. Turn this constraint enforcement on in our tests, but not by default.
  2. Add a serverStatus metric that we increment whenever one of these constraints is broken; i.e., even when the constraints aren't being enforced, record when their enforcement would have caused a node to crash. This will let us collect data on 4.4 about whether future releases should turn the enforcement on by default. (A sketch follows at the end of this comment.)

As a result, I think it may make sense to do this all together. The implementer can decide.

An open question is whether failing to apply an op should cause an election instead of a crash.
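
A hypothetical sketch of how item 2's counter might be read (the metrics.repl.apply path exists in serverStatus today, but the violation counter named here is invented, not the one ultimately implemented):

  // Inspect oplog-application metrics; the proposed counter would record
  // how often an unenforced constraint would have crashed the node.
  var apply = db.serverStatus().metrics.repl.apply;
  printjson(apply);  // would gain e.g. { oplogConstraintViolations: NumberLong(0) }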

Comment by Judah Schvimer [ 07/Aug/19 ]

I'll file separate tickets for items 2-4 above and for command "acceptable errors"; there will be 5 tickets in total. I'm going to mark them all "depends on" the oplog application refactor (though not part of that epic) so we're not messing around in there too much at once. Since it's been like this for a while, I think it's OK not to backport, so doing this after the refactor will likely be easier.

Comment by Eric Milkie [ 07/Aug/19 ]

Great, I concur. Can we change the title and description to reflect this? Or should we file separate tickets for that work? Some of those might be harder than others, and we may want to deploy those changes piecemeal.

Comment by Judah Schvimer [ 07/Aug/19 ]

I wasn't clear. I'm hoping this ticket can tighten up all of these constraints. I agree that we currently relax them all unnecessarily.

Comment by Eric Milkie [ 07/Aug/19 ]

In today's code are we continuing to do all 4, or have we tightened up some things already?

Comment by Judah Schvimer [ 07/Aug/19 ]

Outside of initial sync, recovering mode, and the applyOps command (i.e., during steady state replication), we do not want to:

  1. turn updates into upserts
  2. turn inserts into upserts
  3. ignore index constraints
  4. ignore a delete operation that deletes 0 documents (we should always delete exactly 1 document; see the sketch at the end of this comment).

Command "acceptable errors" should also only be "acceptable" outside of steady state replication.

Any of these constraints being violated would imply data corruption.
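
As a concrete illustration of item 4, a shell sketch (namespace and values invented) of a delete that matches nothing; relaxed oplog application treats it as a no-op, while steady-state enforcement would treat it as evidence of corruption:

  // Apply a delete oplog entry for a document that does not exist. During
  // steady state replication this should never happen on a healthy node:
  // the primary only logged the delete because it deleted exactly one
  // document.
  db.adminCommand({
    applyOps: [
      { op: "d", ns: "test.coll", o: { _id: 99 } }   // nothing to delete
    ]
  });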

Generated at Thu Feb 08 03:58:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.