[SERVER-50794] When enableMajorityReadConcern is false, oplog is truncated before it is majority committed Created: 08/Sep/20  Updated: 27/Oct/23  Resolved: 10/Sep/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.0.20, 4.2.9, 4.4.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tess Avitabile (Inactive) Assignee: Evin Roesle
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Sprint: Repl 2020-09-21
Participants:

 Description   

The oplog is pinned because it is needed for backup, crash recovery, and rollback (see getPinnedOplog()). When enableMajorityReadConcern is false, no oplog is needed for rollback, so the oplog is pinned only for backup and crash recovery. For crash recovery, we only require oplog back to the last stable checkpoint, which can advance beyond the majority commit point. This means the oplog can be truncated before it is majority committed.
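
For illustration, here is a minimal, self-contained C++ sketch of that pinning decision. It is not the actual server code: apart from the name getPinnedOplog(), every type, field, and value below is hypothetical, with timestamps modeled as plain integers.

{code:cpp}
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>

using Timestamp = std::uint64_t;  // stand-in for the server's Timestamp type

struct ReplState {
    Timestamp lastStableCheckpoint;   // oldest oplog needed for crash recovery
    Timestamp majorityCommitPoint;    // oldest oplog needed for rollback
    Timestamp backupPinnedTimestamp;  // oldest oplog pinned by an open backup
    bool enableMajorityReadConcern;
};

// Everything at or after the returned timestamp must be kept in the oplog;
// older entries are eligible for truncation.
Timestamp getPinnedOplog(const ReplState& s) {
    Timestamp pinned = std::min(s.lastStableCheckpoint, s.backupPinnedTimestamp);
    if (s.enableMajorityReadConcern) {
        // With eMRC=on, oplog is also retained for rollback, so the pin
        // cannot advance past the majority commit point.
        pinned = std::min(pinned, s.majorityCommitPoint);
    }
    // With eMRC=off, the majority commit point is ignored, so the pin (and
    // therefore truncation) can move past entries that are not yet majority
    // committed -- the behavior described in this ticket.
    return pinned;
}

int main() {
    ReplState s{/*lastStableCheckpoint=*/150,
                /*majorityCommitPoint=*/100,
                /*backupPinnedTimestamp=*/std::numeric_limits<Timestamp>::max(),
                /*enableMajorityReadConcern=*/false};
    std::cout << "eMRC=off pin: " << getPinnedOplog(s) << "\n";  // 150
    s.enableMajorityReadConcern = true;
    std::cout << "eMRC=on  pin: " << getPinnedOplog(s) << "\n";  // 100
    return 0;
}
{code}

In this sketch, with eMRC=off the pin follows the stable checkpoint, so entries newer than the majority commit point become eligible for truncation; with eMRC=on the pin is held back at the majority commit point.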

One result is that in a PSA set, it's easy for the secondary to fall off the back of the primary's oplog, particularly if the secondary is down for a long period of time. However, an advantage is that the primary's disk doesn't fill up with oplog, so this could also be considered a feature.



 Comments   
Comment by Evin Roesle [ 10/Sep/20 ]

After talking with support, we have not seen anything in the field regarding rollback failures that indicates this logic is a problem. I am closing this ticket.

If a number of users begin to experience this issue, please reopen this ticket.

Comment by Tess Avitabile (Inactive) [ 09/Sep/20 ]

That's a good point: since EMRC=false just maintains the oplog truncation behavior from pre-4.0, and we haven't heard any complaints, this is likely not a problem. I haven't heard of this coming up in the field, but evin.roesle was going to find out. I was specifically concerned about the scenario of a PSA set with EMRC=false, where S is down for a long time and falls off the back of the primary's oplog, so it requires a resync when it comes back up.

Comment by Daniel Gottlieb (Inactive) [ 09/Sep/20 ]

If this is worth changing in 4.0, we may want to consider backporting to 3.6. Barring some more nuanced cases that the recoverable rollback project introduced[1], eMRC=off is analogous to how MongoDB has behaved since replication rollback was a thing – there was no behavior change. It was intentional (but certainly debatable whether it was the best choice) for rollback via refetch in 4.0 to remain susceptible to failing if the common point is earlier than what the rolling-back node kept in its oplog.
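
As a toy illustration of that last point, here is a minimal C++ sketch (hypothetical names and values, not server code) of the check that makes rollback via refetch fail once the local oplog no longer reaches back to the common point:

{code:cpp}
#include <cstdint>
#include <iostream>

using Timestamp = std::uint64_t;

// Rollback via refetch walks the local oplog backwards to the common point it
// shares with the sync source. If the oldest entry still present locally is
// newer than the common point, the walk cannot complete and the node needs a
// full resync.
bool canRollBackViaRefetch(Timestamp commonPoint, Timestamp oldestLocalOplogEntry) {
    return commonPoint >= oldestLocalOplogEntry;
}

int main() {
    // The oplog has been truncated up to ts=200 (eMRC=off let the pin advance),
    // but the common point with the sync source is at ts=120.
    if (!canRollBackViaRefetch(/*commonPoint=*/120, /*oldestLocalOplogEntry=*/200)) {
        std::cout << "common point no longer in local oplog: "
                     "rollback fails, resync required\n";
    }
    return 0;
}
{code}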

In a simpler (but hopefully not too over-simplified?) world where customers don't flip eMRC=on/off willy-nilly, the motivation for holding onto more oplog than --oplogSize dictates when eMRC=on is that a process restart would need that oplog for startup replication recovery. It was deemed unacceptable to turn a restart into a possible re-initial-sync scenario. The fact that holding onto additional oplog meant that recoverable rollback could succeed in more cases was incidental. When eMRC=off, that extra oplog is only needed in the case of a rollback.

Have we observed anything in the field regarding rollback failures that makes us think this logic with eMRC=off is a problem now?

[1] In a pessimistic world where customers do flip eMRC=on/off, a node can be running with eMRC=on but not have the requisite oplog to perform a rollback (similar to upgrading from 3.6 -> 4.0 with eMRC=on, where a rollback common point may no longer exist in that node's oplog). However, that's not what I would expect to be the common manifestation of a customer flipping the flag. I would sooner expect that a node with eMRC=on would fail to roll back (putting itself into a state that requires a resync?) even though it *does* have the requisite oplog. This is because eMRC=on will never fall back to using rollback via refetch (which is a good thing in many respects; there are tremendous challenges to correctly rolling back via refetch with timestamped checkpoints. See SERVER-38925 and SERVER-48518).

Comment by Kelsey Schubert [ 08/Sep/20 ]

I think users are also familiar with the guidance that their configured oplog size should cover maintenance events, so growth is not necessary when a secondary goes down: the oplog is already adequately sized.

Comment by Bruce Lucas (Inactive) [ 08/Sep/20 ]

Running out of disk space on the primary while S is down (e.g. for maintenance) is production-impacting and recovery may be difficult, while S falling off the oplog is not immediately production-impacting and is recoverable, so I imagine customers might see the current behavior as preferable.

In any case, I think any change to this behavior would be a considerable change for customers - for example, they might need to provision more disk space to guard against running out of disk space on the primary when the secondary is down.
