[SERVER-30460] Grow the oplog when the replication commit point falls behind the back of where it would normally be truncated to Created: 01/Aug/17  Updated: 30/Oct/23  Resolved: 03/Jan/19

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: 4.1.4

Type: Improvement Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Pavithra Vetriselvan
Resolution: Fixed Votes: 0
Labels: prepare_durability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-29215 Coordinate oplog truncate point with ... Closed
Related
related to SERVER-29125 Add $changeNotification stage that al... Closed
is related to SERVER-22766 Dynamic oplog sizing for WiredTiger n... Closed
is related to SERVER-36494 Prevent oplog truncation of oplog ent... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2018-11-19, Repl 2018-12-03, Repl 2018-12-17, Repl 2019-01-14
Participants:

 Description   

If the replication majority commit point falls far enough behind the primary, the oplog entry on the primary that corresponds to the majority commit point can reach the back of the primary's oplog and be deleted. This can cause a majority of the secondaries to become too stale to continue replicating, forcing them into a full resync. With a majority of the set's nodes in initial sync, there is no healthy majority to elect a primary, so the set would suffer a prolonged loss of write availability.

One possible mitigation for this problem is to prevent the primary from deleting ops from its oplog that are at or ahead of the replication commit point, so that there is always a common point between the oplogs of the primary and a majority of the secondaries.
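The mitigation described above can be sketched as a truncation guard: delete from the oldest end of the oplog until the size cap is met, but stop (and let the oplog grow past its configured size) before deleting any entry at or ahead of the commit point. This is a minimal illustrative sketch, not MongoDB's actual implementation; the function and parameter names are invented.

```python
def truncate_oplog(oplog, max_entries, commit_point_ts):
    """Truncate an oldest-first list of (timestamp, entry) pairs down to
    max_entries, but never delete an entry at or ahead of the majority
    commit point. Returns the (possibly oversized) oplog."""
    while len(oplog) > max_entries:
        oldest_ts, _ = oplog[0]
        if oldest_ts >= commit_point_ts:
            # Deleting this entry would drop the commit point off the back
            # of the oplog; stop and allow the oplog to exceed its cap.
            break
        oplog.pop(0)
    return oplog
```

For example, with entries at timestamps 1-4, a cap of 2, and a commit point of 3, only the entries at timestamps 1 and 2 are removed; with a commit point of 1, nothing is removed and the oplog grows unbounded, which is exactly the trade-off debated in the comments below.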



 Comments   
Comment by Pavithra Vetriselvan [ 03/Jan/19 ]

Closing this ticket because SERVER-36811 should have solved the problem of nodes truncating oplog entries ahead of the commit point since the stableTimestamp will be less than or equal to the replication commit point.

This issue is only fixed when enableMajorityReadConcern=true.

Comment by Tess Avitabile (Inactive) [ 02/Jan/19 ]

When resolving this ticket, please clarify that the ticket is only fixed when enableMajorityReadConcern=true.

Comment by Daniel Gottlieb (Inactive) [ 02/Jan/19 ]

I might be confused, apologies for the clarification request. It sounded like this ticket was being repurposed to ensure the oplog has enough history for recovery when there are in-flight prepared transactions (prepared behind the commit point):

As Andy was saying, it seems like we will hit this problem when a prepared but uncommitted/unaborted transaction stays in prepare for a while.

When the stable timestamp is allowed to move in front of prepared transactions, there would still be a required change for preventing oplog truncation from deleting these in-flight prepared oplog entries. Perhaps that's being tracked elsewhere (SERVER-36494?), I was really just responding to:

If we were to not use the functionality provided by SERVER-22766 and instead specified a Timestamp to an oplog entry that we can truncate, is there work that storage would need to complete first?

As far as we know, there's no more storage work required for that.

Comment by Judah Schvimer [ 02/Jan/19 ]

Since the stableTimestamp is less than or equal to the replication commit point, it seems that this ticket (preventing nodes from truncating newer oplog entries than the commit point) was accomplished in SERVER-36811 regardless of what the truncationTimestamp is set to. daniel.gottlieb, do you agree?

Comment by Daniel Gottlieb (Inactive) [ 20/Dec/18 ]

If the only concern is for keeping around prepared, uncommitted oplog entries, support for that was put into storage with SERVER-36811. SERVER-36494 is the repl ticket for tracking the earliest oplog entry that storage must preserve in order to recover from a given stable timestamp.

A clarification: the Storage API provided with SERVER-36811 is really to express the more general contract:
setStableTimestamp(stableTimestamp, truncationTimestamp)
To recover forward in the oplog starting at the stableTimestamp, storage is guaranteeing to preserve all oplog dating back to min(stableTimestamp, truncationTimestamp).

Storage doesn't require that the truncationTimestamp is in any way related to a prepared transaction.
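The contract described above can be sketched as follows: after setStableTimestamp(stableTimestamp, truncationTimestamp), storage must preserve every oplog entry dating back to min(stableTimestamp, truncationTimestamp). This is a hedged illustration of that guarantee only; the class and method names are invented and do not match the actual SERVER-36811 API.

```python
class OplogStore:
    """Toy model of the storage-side contract: truncation never removes
    entries newer than min(stable_ts, truncation_ts)."""

    def __init__(self, entries):
        self.entries = list(entries)   # oldest-first (timestamp, entry) pairs
        self.pinned_ts = None          # earliest timestamp that must survive

    def set_stable_timestamp(self, stable_ts, truncation_ts):
        # Recovery replays forward from stable_ts, so everything back to
        # min(stable_ts, truncation_ts) has to stay in the oplog.
        self.pinned_ts = min(stable_ts, truncation_ts)

    def truncate(self, max_entries):
        while len(self.entries) > max_entries:
            oldest_ts, _ = self.entries[0]
            if self.pinned_ts is not None and oldest_ts >= self.pinned_ts:
                break
            self.entries.pop(0)
```

Note that, as the comment says, nothing here ties truncation_ts to a prepared transaction; it is simply the oldest entry the caller needs preserved.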

Comment by Pavithra Vetriselvan [ 20/Dec/18 ]

As Andy was saying, it seems like we will hit this problem when a prepared but uncommitted/unaborted transaction stays in prepare for a while. Did we ever reach a conclusion on allowing the oplog to grow unbounded in size or is that solved by SERVER-22766 (I guess by continuing to resize the oplog)?

If we were to not use the functionality provided by SERVER-22766 and instead specified a Timestamp to an oplog entry that we can truncate, is there work that storage would need to complete first?

Comment by Judah Schvimer [ 20/Sep/18 ]

Given the current design of "prepare" I want to re-investigate what is actually required by this ticket.

Comment by Andy Schwerin [ 10/Aug/17 ]

I don't believe SERVER-22766 provides the requisite functionality. What replication needs is to specify the timestamp of the newest oplog entry that may be truncated, rather than the new maximum size of the oplog. That seems at least partially like storage work. I believe it will be required in 3.8, to support prepared but uncommitted transactions that sit in prepare for a long time. It can stay on the repl backlog, but we will need to collaborate on the design.

Comment by Michael Cahill (Inactive) [ 07/Aug/17 ]

schwerin, the storage-level mechanism was added in SERVER-22766. I think the decision about when to resize (i.e., this ticket) belongs to repl. Let me know if you disagree or need more from storage to support this work.

Comment by Daniel Gottlieb (Inactive) [ 06/Aug/17 ]

Do we just want to allow the oplog to grow unbounded in size?

Comment by Andy Schwerin [ 06/Aug/17 ]

We're going to have to do this eventually, and then some. Multi-shard transactions will require that the oldest operation of any transaction on a shard in the "prepared" state must remain in the oplog until the transaction is committed or aborted by the coordinator.

Comment by Daniel Gottlieb (Inactive) [ 06/Aug/17 ]

It wasn't mandatory, but there was a ticket filed. SERVER-29215

Comment by Andy Schwerin [ 06/Aug/17 ]

Somehow, I thought this was part of the mandatory work for the WT engine for one of the storage timestamps projects. Is there not a ticket already filed to this effect, michael.cahill?

Comment by Daniel Gottlieb (Inactive) [ 01/Aug/17 ]

Straightforward for WiredTiger*

Comment by Spencer Brody (Inactive) [ 01/Aug/17 ]

I think I'm going to leave this ticket parked with the replication team for now since I think ultimately it's us who needs to decide what (if anything) we want to do about lagged secondaries. If we decide this is something we want to do, I will send this over to the storage team for the actual implementation, which according to daniel.gottlieb is pretty straightforward.

Comment by Eric Milkie [ 01/Aug/17 ]

This might be easier for storage to do, since we could just amend the capped deleter to not delete anything newer than the oldest timestamp (which will be set when setCommittedSnapshot is called).

Comment by Spencer Brody (Inactive) [ 01/Aug/17 ]

The alternative to this is to provide backpressure: slowing or failing writes on the primary when the commit point is in danger of falling off its oplog.
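The backpressure alternative could be sketched as an admission check: before accepting a write, measure how much headroom the commit point has within the oplog's timestamp window, and throttle or reject when the margin gets small. The thresholds and names here are invented for illustration and do not reflect any actual server parameter.

```python
def admission_decision(oldest_oplog_ts, commit_point_ts, newest_oplog_ts,
                       slow_fraction=0.25, fail_fraction=0.05):
    """Return 'accept', 'slow', or 'fail' based on how close the commit
    point is to falling off the back of the oplog's timestamp window."""
    window = newest_oplog_ts - oldest_oplog_ts
    if window <= 0:
        return 'accept'  # empty or degenerate window: nothing at risk yet
    headroom = (commit_point_ts - oldest_oplog_ts) / window
    if headroom < fail_fraction:
        return 'fail'    # commit point about to be truncated: reject writes
    if headroom < slow_fraction:
        return 'slow'    # getting close: throttle writes on the primary
    return 'accept'
```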

Comment by Spencer Brody (Inactive) [ 01/Aug/17 ]

If we did this we would have to figure out whether we let the oplog grow unbounded (up to the limits of available storage space), or give it some other limit, though any other limit feels arbitrary since we're already letting the oplog exceed the max size the user has configured it to have.

Generated at Thu Feb 08 04:23:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.