[SERVER-50794] When enableMajorityReadConcern is false, oplog is truncated before it is majority committed Created: 08/Sep/20 Updated: 27/Oct/23 Resolved: 10/Sep/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 4.0.20, 4.2.9, 4.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tess Avitabile (Inactive) | Assignee: | Evin Roesle |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Sprint: | Repl 2020-09-21 |
| Participants: |
| Description |
|
Oplog is pinned because it is needed for backup, crash recovery, and rollback (see getPinnedOplog()). When enableMajorityReadConcern is false, no oplog is needed for rollback, so oplog is pinned only for backup and crash recovery. For crash recovery, we only require oplog back to the last stable checkpoint, which can advance beyond the majority commit point. This means oplog can be truncated before it is majority committed. As a result, in a PSA set it is easy for the secondary to fall off the back of the primary's oplog, particularly if the secondary is down for a long period of time. However, an advantage is that the primary's disk doesn't fill up with oplog, so this could also be considered a feature. |
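A minimal sketch of the truncation-point decision described above, assuming illustrative names throughout (OplogPinSources, its fields, and this free function are hypothetical; the real logic lives around getPinnedOplog() in the server's replication/storage layer):

```cpp
#include <algorithm>
#include <cstdint>

using Timestamp = std::uint64_t;  // stand-in for the server's Timestamp type

struct OplogPinSources {
    Timestamp backupPinTs;           // earliest oplog needed by an open backup cursor
    Timestamp lastStableCheckpoint;  // crash recovery needs oplog from here forward
    Timestamp majorityCommitPoint;   // rollback (eMRC=on) needs oplog from here forward
};

// Returns the earliest timestamp that must be kept; oplog older than this
// is eligible for truncation.
Timestamp pinnedOplogTimestamp(const OplogPinSources& s, bool enableMajorityReadConcern) {
    // Backup and crash recovery always pin the oplog.
    Timestamp pin = std::min(s.backupPinTs, s.lastStableCheckpoint);
    if (enableMajorityReadConcern) {
        // With eMRC=on, rollback may need oplog back to the majority commit
        // point, so truncation cannot advance past it.
        pin = std::min(pin, s.majorityCommitPoint);
    }
    // With eMRC=off, the last stable checkpoint can advance beyond the
    // majority commit point, so oplog may be truncated before it is
    // majority committed -- the behavior this ticket describes.
    return pin;
}
```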
| Comments |
| Comment by Evin Roesle [ 10/Sep/20 ] |
|
After talking with support, we have not seen anything in the field regarding rollback failures that indicates this logic is a problem. I am closing this ticket. If a number of users begin to experience this issue, please reopen this ticket. |
| Comment by Tess Avitabile (Inactive) [ 09/Sep/20 ] |
|
That's a good point. Since eMRC=false just maintains the oplog truncation behavior from pre-4.0 and we haven't heard any complaints, this is likely not a problem. I haven't heard of this coming up in the field, but evin.roesle was going to find out. I was specifically concerned about the scenario of a PSA set with eMRC=false, where S is down for a long time and falls off the back of the primary's oplog, so it requires a resync when it comes back up. |
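To make that failure mode concrete, here is a hypothetical sketch of the check a syncing node effectively performs (the names are illustrative, not the server's actual API): once the primary has truncated past the secondary's last applied optime, there is no common point left to resume from.

```cpp
#include <cstdint>

using Timestamp = std::uint64_t;  // stand-in for the server's Timestamp type

// Illustrative only: a secondary has "fallen off the oplog" when the oldest
// entry remaining in the primary's oplog is newer than the last entry the
// secondary applied, leaving no point to resume replication from. Recovery
// then requires a full resync (initial sync).
bool hasFallenOffOplog(Timestamp secondaryLastApplied,
                       Timestamp primaryOldestOplogEntry) {
    return secondaryLastApplied < primaryOldestOplogEntry;
}
```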
| Comment by Daniel Gottlieb (Inactive) [ 09/Sep/20 ] |
|
If this is worth changing in 4.0, we may want to consider backporting to 3.6. Barring some more nuanced cases that the recoverable rollback project introduced [1], eMRC=off is analogous to how MongoDB has behaved since replication rollback was a thing – there was no behavior change. It was intentional (but certainly debatable whether it was the best choice) for rollback via refetch in 4.0 to remain susceptible to failing if the common point is earlier than what the rolling-back node kept in its oplog. In a simpler (but hopefully not too over-simplified?) world where customers don't flip eMRC=on/off willy-nilly, the motivation for holding onto more oplog than --oplogSize dictates when eMRC=on is that a process restart would need that oplog for startup replication recovery. It was deemed unacceptable to turn a restart into a possible re-initial-sync scenario. The fact that holding onto additional oplog meant that recoverable rollback could succeed in more cases was incidental. When eMRC=off, the oplog is only needed in the case of a rollback. Have we observed anything in the field regarding rollback failures that makes us think this logic with eMRC=off is a problem now?

[1] In a pessimistic world where customers do flip eMRC=on/off, a node can be running with eMRC=on but not have the requisite oplog to perform a rollback (similar to upgrading from 3.6 -> 4.0 with eMRC=on, where a rollback common point may no longer exist in that node's oplog). However, that's not what I would expect to be the common manifestation of a customer flipping the flag. I would sooner expect that a node with eMRC=on would fail to roll back (putting itself into a state that requires a resync?) even though it *does* have the requisite oplog. This is because eMRC=on will never fall back to rollback via refetch (which is a good thing in many respects; there are tremendous challenges to correctly rolling back via refetch with timestamped checkpoints). See
| Comment by Kelsey Schubert [ 08/Sep/20 ] |
|
I think users are also familiar with the guidance that their configured oplog size should cover maintenance events, so growth is not necessary when a secondary goes down: the oplog is already adequately sized. |
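As a back-of-envelope illustration of that sizing guidance (the helper and the numbers below are purely hypothetical, not server code): the replication window is roughly the configured oplog size divided by the sustained oplog write rate, and planned maintenance must fit inside it.

```cpp
#include <cstdio>

// Rough replication-window estimate: how long a secondary can stay down
// before it falls off the oplog, assuming a steady oplog write rate.
double oplogWindowHours(double oplogSizeGB, double writeRateGBPerHour) {
    return oplogSizeGB / writeRateGBPerHour;
}

int main() {
    // For example, a 50 GB oplog at a sustained 2 GB/hour write rate
    // gives roughly a 25-hour maintenance window.
    std::printf("window: %.1f hours\n", oplogWindowHours(50.0, 2.0));
    return 0;
}
```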
| Comment by Bruce Lucas (Inactive) [ 08/Sep/20 ] |
|
Running out of disk space on the primary while S is down (e.g., for maintenance) is production-impacting and recovery may be difficult, while S falling off the oplog is not immediately production-impacting and is recoverable, so I imagine customers might see the current behavior as preferable. In any case, I think any change to this behavior would be a considerable change for customers – for example, they might need to provision more disk space to guard against running out of disk space on the primary when the secondary is down. |