[SERVER-38174] Starting replica set member standalone can lose committed writes starting in MongoDB 4.0 Created: 16/Nov/18  Updated: 27/Oct/23  Resolved: 28/Nov/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.0.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Judah Schvimer
Resolution: Works as Designed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
  Documented:
    is documented by DOCS-12213 Clarify when and how to put replica s... Closed
  Related:
    is related to DOCS-12230 Manual oplog resize in 4.0 after uncl... Closed
    is related to SERVER-38356 Forbid dropping oplog in standalone m... Closed
Operating System: ALL
Sprint: Repl 2018-12-03
Participants:

 Description   

Starting in 4.0, we perform recovery at startup (for everything but the oplog) by replaying the oplog rather than relying on the journal. However, this is not done by default when a replica set member is started as a standalone, so we can lose committed writes in this case. Restarting a replica set member as a standalone is a common, documented maintenance procedure, so losing committed writes in this case seems possibly problematic. This is also a change in behavior from 3.6.

There is an undocumented parameter, recoverFromOplogAsStandalone, that performs oplog replay when the node is started as a standalone; I suggest that this be enabled by default.
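As a rough sketch, enabling that recovery path at startup looks like the following (the dbpath is illustrative, and note the parameter also forces the node into read-only mode, as discussed in the comments below):

```shell
# Sketch only: restart a replica set member as a standalone while still
# replaying any oplog entries ahead of the last checkpoint.
# The --dbpath value is illustrative.
mongod --dbpath /data/db \
       --setParameter recoverFromOplogAsStandalone=true
```

Without the parameter, the same standalone restart comes up without replaying the trailing oplog entries, which is the surprising behavior described above.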



 Comments   
Comment by Andy Schwerin [ 19/Nov/18 ]

I just filed DOCS-12213 to cover the documentation. I was going to make the same proposal, judah.schvimer.

Comment by Judah Schvimer [ 19/Nov/18 ]

I think you bring up valid reasons for documenting this unexpected behavior. Anyone opposed to closing this as "Works As Designed" with documentation changes needed?

Comment by Bruce Lucas (Inactive) [ 17/Nov/18 ]

OK. I think we need to document and communicate more clearly:

  • the supported maintenance operations running standalone
  • the existence and necessity of the recoverFromOplogAsStandalone flag for maintenance operations that require read access to data, e.g. data recovery
  • the difference in behavior between 3.6 and 4.0

Comment by Andy Schwerin [ 16/Nov/18 ]

Maintenance operations that change the data are dangerous on standalones anyway. The only safe maintenance would be writes to local collections and index builds. Index builds are only safe in retrospect, if no conflicting document is found after rejoining the cluster, but that was true in 3.6, too.

We didn’t enter this decision lightly, and we did do it intentionally, so I think this is as-designed.

Comment by Bruce Lucas (Inactive) [ 16/Nov/18 ]

Independent of particular use cases, we support an invariant that committed writes can be read, and it seems surprising that that invariant doesn't hold when you start the node standalone, especially when it did in 3.6. I'm leery of saying that there are no legitimate maintenance procedures that depend on reading committed writes.

Comment by Eric Milkie [ 16/Nov/18 ]

Your concerns about rolling index builds are a problem regardless of whether or not the writes are pending in the local oplog buffer or pending in an upstream node's buffer, even before this new behavior was added.

Comment by Bruce Lucas (Inactive) [ 16/Nov/18 ]

Agreed, "lost" is not quite the right word.

I don't know that we restrict maintenance operations to rolling index builds; our documentation describes it as a general procedure for maintenance.

Wouldn't rolling index builds, though, be an example of something that could misbehave? For example, the unapplied oplog entries could contain a conflicting index build, drop the index being built, or contain writes that conflict with the index (e.g. violate unique key constraints).

Comment by Judah Schvimer [ 16/Nov/18 ]

"recoverFromOplogAsStandalone" puts the node in read-only mode, so I do not think that would be an acceptable default. We also have a startup warning that calls this out explicitly: SERVER-30464. Given that restarting the node as a replica set node brings the node back up to a consistent state, I don't think this counts as data loss since the data is still there, it's just not visible in standalone mode.

Comment by Eric Milkie [ 16/Nov/18 ]

I don't understand how the writes can be lost. Certainly, shutting down a secondary leaves it in some state with some amount of writes on it (but the exact number is not explicitly controllable by the user). I would expect users are using this mode to build indexes in a rolling fashion, or to run validate. I don't expect users to be looking at the data on a standalone – is that what you mean by "losing committed writes"? The writes aren't lost: once you put the node back in the replica set, it picks up where it left off, as long as you didn't corrupt the data while in standalone mode by performing writes.
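For reference, the rolling-maintenance pattern being discussed looks roughly like this; all paths, ports, database and index names below are illustrative examples, not a prescribed procedure:

```shell
# Illustrative rolling index build on one secondary.

# 1. Restart the member as a standalone on a different port so the
#    replica set does not route traffic to it.
mongod --dbpath /data/rs0-1 --port 27217

# 2. Build the index against the standalone (example namespace and key).
mongo --port 27217 --eval 'db.getSiblingDB("app").orders.createIndex({customerId: 1})'

# 3. Shut down cleanly, then rejoin the replica set; the member syncs
#    the writes it missed while it was out of the set.
mongo --port 27217 --eval 'db.getSiblingDB("admin").shutdownServer()'
mongod --dbpath /data/rs0-1 --replSet rs0 --port 27017
```

On rejoining, the member catches up from its sync source, which is why the writes are only invisible while in standalone mode rather than lost.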

Generated at Thu Feb 08 04:48:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.