[SERVER-38356] Forbid dropping oplog in standalone mode on storage engines that support replSetResizeOplog Created: 03/Dec/18  Updated: 29/Oct/23  Resolved: 08/Jul/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.0.4
Fix Version/s: 4.2.1, 4.3.1

Type: Improvement Priority: Major - P3
Reporter: Kevin Pulo Assignee: Vishnu Kaushik
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Documented
is documented by DOCS-12863 Investigate changes in SERVER-38356: ... Closed
Problem/Incident
Related
related to SERVER-38174 Starting replica set member standalon... Closed
related to SERVER-42129 Modify test to account for the epheme... Closed
related to SERVER-42131 Modify test to account for storage en... Closed
related to SERVER-44440 Consider disallowing users from writi... Backlog
related to DOCS-12230 Manual oplog resize in 4.0 after uncl... Closed
is related to SERVER-41792 Starting replica set member standalon... Closed
is related to TOOLS-2332 oplog_replay_local_rs.js fails on ser... Closed
is related to SERVER-41818 Add a new method in storage API for f... Closed
is related to SERVER-47558 Revert SERVER-38356 on 4.0 Closed
is related to SERVER-47567 Prevent incorrectly dropping oplog on... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.2, v4.0
Sprint: Repl 2019-06-03, Repl 2019-06-17, Repl 2019-07-01, Repl 2019-07-15
Participants:
Linked BF Score: 47

 Description   

This ticket banned dropping the oplog in standalone mode entirely on storage engines that support the replSetResizeOplog command.
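For illustration, the supported way to resize the oplog remains the replSetResizeOplog command run against a replica set member, while the standalone drop is now rejected (the 16000 MB size below is just an example):

    // While connected to a replica set member (not a standalone); size is in megabytes:
    db.adminCommand({replSetResizeOplog: 1, size: 16000});
    // With this change, the following is refused on a standalone whose storage
    // engine supports replSetResizeOplog:
    db.getSiblingDB("local").oplog.rs.drop();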

Original Description

Currently the oplog cannot be dropped while running in replset mode, but it can be dropped as a standalone. Until recently, the documented procedure to resize the oplog included dropping the oplog while running as a standalone. However, performing this procedure on an uncleanly shut down 4.0 mongod causes committed writes to be lost (because they only existed in the oplog, and the resize preserves only the final oplog entry; see DOCS-12230 and SERVER-38174 for more details). It would be much better if attempting this procedure on 4.0 did not result in oplog entries being lost, e.g. if dropping the oplog failed.

Completely forbidding oplog drop (even when standalone) would interfere with the use case of restoring a filesystem snapshot as a test standalone. A better alternative would be to forbid dropping the oplog only if local.system.replset contains documents. This way, users who are sure they want to drop the oplog can do so by first removing the documents from local.system.replset (which can't be dropped, but can have its contents removed) and then restarting the standalone, whereas users who are just trying to perform a manual oplog resize will be stopped before any data loss.
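A rough shell sketch of that proposed workflow (hypothetical, since this alternative was not ultimately implemented; the ticket instead banned the drop outright on affected engines):

    var localDb = db.getSiblingDB("local");
    // local.system.replset cannot be dropped, but its documents can be removed:
    localDb.system.replset.deleteMany({});
    // ...restart mongod as a standalone; with local.system.replset now empty,
    // the oplog drop would be allowed under this proposal:
    localDb.oplog.rs.drop();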

If we choose not to do this, then at the very least we should improve the "standalone-but-replset-config-exists" startup warning to specifically warn against manually resizing the oplog.



 Comments   
Comment by Githook User [ 15/Apr/20 ]

Author:

{'name': 'Tess Avitabile', 'email': 'tess.avitabile@mongodb.com', 'username': 'tessavitabile'}

Message: Revert "SERVER-38356 added functionality to forbid dropping the oplog, modified tests to get around Evergreen issue"

This reverts commit 58e4edb8237288f45f55cd8a59ea96a955489353.
Branch: v4.0
https://github.com/mongodb/mongo/commit/3715b6221884b30b15f183f813675e27f30123eb

Comment by Githook User [ 03/Sep/19 ]

Author:

{'name': 'Suganthi Mani', 'username': 'smani87', 'email': 'suganthi.mani@mongodb.com'}

Message: SERVER-38356 Fix copydb_illegal_collections.js to not create
local.oplog.rs collection.
Branch: v4.0
https://github.com/mongodb/mongo/commit/d4ccbcfad2b7b47593054c3319f80b9ca922e066

Comment by Githook User [ 30/Aug/19 ]

Author:

{'name': 'Suganthi Mani', 'username': 'smani87', 'email': 'suganthi.mani@mongodb.com'}

Message: SERVER-38356 added functionality to forbid dropping the oplog, modified tests to get around Evergreen issue

(cherry picked from commit a3244d8ac0ae530e2394248e72aadb27241adba3)
Branch: v4.0
https://github.com/mongodb/mongo/commit/58e4edb8237288f45f55cd8a59ea96a955489353

Comment by Githook User [ 28/Aug/19 ]

Author:

{'name': 'Suganthi Mani', 'username': 'smani87', 'email': 'suganthi.mani@mongodb.com'}

Message: SERVER-38356 added functionality to forbid dropping the oplog, modified tests to get around Evergreen issue

(cherry picked from commit a3244d8ac0ae530e2394248e72aadb27241adba3)
Branch: v4.2
https://github.com/mongodb/mongo/commit/86584a342319393bd0cf68624f8738b94c721201

Comment by Githook User [ 08/Jul/19 ]

Author:

{'name': 'Vishnu Kaushik', 'username': 'kauboy26', 'email': 'vishnu.kaushik@mongodb.com'}

Message: SERVER-38356 added functionality to forbid dropping the oplog, modified tests to get around Evergreen issue
Branch: master
https://github.com/mongodb/mongo/commit/a3244d8ac0ae530e2394248e72aadb27241adba3

Comment by Judah Schvimer [ 19/Jun/19 ]

suganthi.mani, thanks for the detailed write up. I agree with it all.

Do we need to document this behavior?

I think we should file a docs ticket and let the docs team decide.

Comment by Suganthi Mani [ 18/Jun/19 ]

Below is a chart showing oplog drop supportability for standalone nodes if we implement this as mentioned here.

Version   MMAPv1           *WT + *EMRC false   *WT + *EMRC true
4.0       Yes              Yes                 No
4.2       Not Applicable   No                  No

 *EMRC - enableMajorityReadConcern
*WT - WiredTiger.

As mentioned in DOCS-12230, the problem is that if we allow dropping the oplog to perform a manual resize of the oplog collection, entries can be missed while replaying the oplog during startup recovery, leading to data inconsistencies between nodes. Consider the case below:
1) Let's say we have a 2-node replica set (primary and secondary).
2) The secondary node gets killed in the middle of applying an oplog batch (i.e. an unclean shutdown). Let's assume the ops got written to the oplog but not yet applied, and that the oplog has the entries below, all for the foo collection.

old.1 (storageRecoveryTs if EMRC true / appliedThroughTs if EMRC false):  {ts:1, op:"i", o:{_id:1}}
old.2 (unapplied):                                                         {ts:2, op:"i", o:{_id:2}}
old.3 (unapplied):                                                         {ts:3, op:"i", o:{_id:3}}
old.4 (unapplied):                                                         {ts:4, op:"i", o:{_id:4}}

3) The secondary node gets restarted as a standalone.
4) As a result of the manual oplog resize, the oplog now contains only the entry below.

new.1 (unapplied):  {ts:4, op:"i", o:{_id:4}}

5) Restart the secondary node again with --replSet. This means that:

  • For 4.0 with the WiredTiger storage engine:
    • With EMRC=true (stable checkpoints), we would replay oplog entries greater than the storage recovery timestamp (the stable checkpoint timestamp) up to the top of the oplog.
    • With EMRC=false (unstable checkpoints), we would replay oplog entries greater than the appliedThrough timestamp up to the top of the oplog.
  • For 4.2 with the WiredTiger storage engine:
    • Regardless of the EMRC value, we would replay oplog entries greater than the storage recovery timestamp (the stable checkpoint timestamp) up to the top of the oplog.

This means we would miss applying the oplog entries in slots old.2 and old.3 (from step 2) during startup recovery, which would lead to data inconsistencies between this node and the other nodes in the replica set.
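For context, the "manual oplog resize" in step 4 refers to the legacy standalone procedure, which looks roughly like this (a sketch only; the 16 GB size is illustrative):

    var localDb = db.getSiblingDB("local");
    // Keep only the newest oplog entry, drop the oplog, recreate it at the new
    // size, and re-insert that single entry -- everything older is discarded.
    var lastEntry = localDb.oplog.rs.find().sort({$natural: -1}).limit(1).next();
    localDb.oplog.rs.drop();
    localDb.createCollection("oplog.rs", {capped: true, size: 16 * 1024 * 1024 * 1024});
    localDb.oplog.rs.insert(lastEntry);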

I was trying to reproduce this problem. I expected startup recovery (replaying entries from the oplog) to succeed and to then see data inconsistency (as per DOCS-12230). Instead, the server crashed with a fatal assertion while trying to replay oplog entries during startup recovery, because the old.1 entry was missing. And it's good that we are not silently missing the data. kevin.pulo, let me know if I am missing something.

Thoughts:

  • Either way, whether it leads to data inconsistency or a server crash, we should fix the problem. So I would suggest banning the oplog drop in standalone mode from 4.0 onwards for the WiredTiger storage engine, regardless of the enableMajorityReadConcern value.
    • To implement it, on 4.0 and 4.2 we can just check supportsRecoveryTimestamp(), which returns true for the WiredTiger storage engine regardless of the EMRC value, and false for MMAPv1.
    • I am also going to file a storage ticket to expose a storage interface method that reports whether the storage engine supports the replSetResizeOplog command.
  • Since the replSetResizeOplog command is not available for MMAPv1, the only way to resize the oplog there is to drop it. This means that for the MMAPv1 storage engine it is possible to hit the above server crash after an unclean shutdown, since MMAPv1 also replays oplog entries from the appliedThrough timestamp to the top of the oplog during startup recovery, and we are OK with that. --> Do we need to document this behavior?
  • One more concern with the approach mentioned here: on 4.0, with the WiredTiger storage engine, consider the scenario where 1) we start the node with --replSet and EMRC=true, 2) we restart the node as a standalone with EMRC=false, in which case supportsRecoverToStableTimestamp() returns false and we would be able to drop the oplog, and 3) we restart the node again with --replSet and EMRC=true. So on 4.0 it is better to ban the oplog drop entirely for the WiredTiger storage engine.

Let me know if anyone has any concerns on banning the oplog drop entirely for WiredTiger storage engine (that supports replSetResizeOplog cmd).

Comment by Tess Avitabile (Inactive) [ 05/Jun/19 ]

Is it intentional that on 4.0, standalone nodes with enableMajorityReadConcern=false (where supportsRecoverToStableTimestamp() is false) do not perform startup recovery by applying oplog entries from the recovery timestamp?

Good point. This behavior may not be correct if the user has just toggled enableMajorityReadConcern. On 4.0, when enableMajorityReadConcern=false, the server takes unstable checkpoints, so it should not perform startup recovery by applying oplog entries. In this case, it is correct that standalone nodes with enableMajorityReadConcern=false do not perform startup recovery by applying oplog entries. However, if the user was running with enableMajorityReadConcern=true, then restarted in standalone mode with enableMajorityReadConcern=false and recoverFromOplogAsStandalone, then it will start up from a stable checkpoint, in which case it should perform recovery by applying oplog entries. We should probably make the decision of whether to apply oplog entries when enableMajorityReadConcern=false and recoverFromOplogAsStandalone=true based on the type of checkpoint we start up from, so it sounds like this may be a bug.

As far as I can tell supportsRecoverToStableTimestamp() and supportsRecoveryTimestamp() are essentially the same on 4.2 and 4.0. William Schultz or Daniel Gottlieb, do you know what Tess had in mind?

We have these two predicates to distinguish between the ability to perform rollback using RTT (which we never do when enableMajorityReadConcern=false) and the ability to start up from a stable checkpoint (which we essentially always do on 4.2 when enableMajorityReadConcern=false, and we do on 4.0 when enableMajorityReadConcern=false only if the server had been shut down with enableMajorityReadConcern=true).

Comment by Judah Schvimer [ 05/Jun/19 ]

The concern here is that if on clean restart the node has not applied all of its oplog entries, then we do not want to allow dropping the oplog. All storage engines that allow a clean restart to not have applied all oplog entries also support the replSetResizeOplog command, so they do not need to allow dropping the oplog. As far as I can tell supportsRecoverToStableTimestamp() and supportsRecoveryTimestamp() are essentially the same on 4.2 and 4.0. william.schultz or daniel.gottlieb do you know what Tess had in mind?

Comment by Suganthi Mani [ 05/Jun/19 ]

tess.avitabile/judah.schvimer Just wanted to clarify the solution for 4.0: why can't we have the same check (supportsRecoveryTimestamp() is true) on 4.0 as on 4.2?

Another thing I noticed is that if a node is standalone and the server parameter recoverFromOplogAsStandalone is set to true, we perform startup recovery by applying oplog entries from the recovery timestamp, provided supportsRecoverToStableTimestamp() returns true.

Is it intentional that on 4.0, standalone nodes with enableMajorityReadConcern=false (where supportsRecoverToStableTimestamp() is false) do not perform startup recovery by applying oplog entries from the recovery timestamp?

Comment by Tess Avitabile (Inactive) [ 08/Jan/19 ]

Sounds good. We can forbid dropping local.oplog.rs on 4.0 if supportsRecoverToStableTimestamp() is true and on 4.2 if supportsRecoveryTimestamp() is true (on 4.2 with enableMajorityReadConcern=false, supportsRecoverToStableTimestamp() is false, but we still perform startup recovery by applying oplog entries from the recovery timestamp). I'll put this into the quick wins for next quarter.

Comment by Kevin Pulo [ 08/Jan/19 ]

The main problem with completely forbidding dropping the oplog is that it wouldn't be backportable to 4.0, because it's still the only way to resize the oplog in MMAPv1. But this whole issue only exists for storage engines that support recovery to timestamp. So how about we prevent dropping local.oplog.rs if supportsRecoverToStableTimestamp() is true?

Comment by Asya Kamsky [ 21/Dec/18 ]

Why not forbid dropping the oplog entirely?

I don't see a need for force:true because if you know what you are doing you can drop it anyway.

If you are converting a replica set backup to a standalone, you should just drop the local database, which avoids any sort of inconsistency issue.
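In shell terms, that conversion step is simply (illustrative):

    // On the restored standalone, drop all replica set state (oplog, replset
    // config, etc.) in one step instead of dropping only the oplog:
    db.getSiblingDB("local").dropDatabase();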

Comment by Kevin Pulo [ 20/Dec/18 ]

I'm surprised by the aversion to adding force: true. Although drop is a DDL command, the situation we're talking about — dropping the oplog (already a special internal system collection) while in a special state (standalone after unclean shutdown) — is maintenance, not a regular operation. This is compounded by the strong potential for unexpected data loss in this situation. There are several other maintenance commands (including within repl) which use force: true (and have for a long time) when we want safe behavior by default, but still need to permit risky operations in rare maintenance situations:

  • replSetReconfig
  • replSetStepDown
  • compact
  • shutdown
  • splitVector

For a startup warning to have a chance of being noticed, it would need to be a separate new warning from the existing ones, and would need to specifically call out that dropping the oplog while in this state (standalone after unclean shutdown) is likely to result in data loss, and that the supported method of resizing the oplog has changed, with a link to the relevant docs. As previously mentioned, in addition to not being noticed, there are other failure modes for this approach, eg. a pre-existing mongo shell will not re-check startup warnings when reconnecting (I've just filed SERVER-38718 for this).

Comment by Gregory McKeon (Inactive) [ 17/Dec/18 ]

We're worried about adding a "force" parameter for only a single command - this would be inconsistent with our other DDL ops.

arnie.listhaus also suggested doing replication recovery at startup by default when in standalone mode. We don't want to do this because it interferes with maintenance that is performed in standalone mode, such as truncating the oplog for point-in-time backups and diagnosing the cache pressure of replication recovery.

Adding a startup warning letting users know that they no longer need to drop the oplog to resize it is our preferred option - do you think this would be noticed enough by users to be effective, kevin.pulo arnie.listhaus?

Comment by Kevin Pulo [ 12/Dec/18 ]

Ok, that's fair enough.

How about instead requiring a force: true parameter to the drop command, when in this state? The error message could educate the admin about this issue, refer them to the docs and the replSetResizeOplog command, etc. And that if they really want to drop the oplog, they can re-run the drop command with force: true.

This should prevent any accidents before they actually happen, while also still allowing arbitrary maintenance in the rare cases it might be necessary, and without being a huge development burden.
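The proposed invocation would look something like the following (hypothetical; a force field for drop was never implemented):

    // Default: rejected, with an error message pointing at replSetResizeOplog and the docs.
    db.getSiblingDB("local").runCommand({drop: "oplog.rs"});
    // Explicit opt-in for rare maintenance situations (hypothetical force field):
    db.getSiblingDB("local").runCommand({drop: "oplog.rs", force: true});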

Comment by Gregory McKeon (Inactive) [ 10/Dec/18 ]

We want to enable users to do arbitrary maintenance in standalone mode, so we don't want to ban dropping the oplog. We don't think adding a startup warning would be helpful, because it doesn't occur at the same time the user performs the drop. If you feel strongly about the warning, let us know.
