[SERVER-30460] Grow the oplog when the replication commit point falls behind the back of where it would normally be truncated to Created: 01/Aug/17 Updated: 30/Oct/23 Resolved: 03/Jan/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Storage |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.4 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | Pavithra Vetriselvan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | prepare_durability |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Repl 2018-11-19, Repl 2018-12-03, Repl 2018-12-17, Repl 2019-01-14 |
| Participants: | |
| Description |
|
If the replication majority commit point falls far enough behind the primary that the primary's oplog entry corresponding to the majority point reaches the back of the oplog and gets deleted, a majority of the secondaries can become too stale to continue replicating and thus require a full resync. Having a majority of a set's nodes in the process of initial sync would mean there is no healthy majority to elect a primary, so the set would then have a prolonged period of no write availability. One possible mitigation is to prevent the primary from deleting ops from its oplog that are at or ahead of the replication commit point, so that there is always a common point between the oplogs of a majority of the secondaries and the primary. |
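As an illustration only (this is not the server's actual truncation code), a minimal sketch of the proposed mitigation: truncate the oplog from the back as usual, but stop at the first entry that is at or ahead of the majority commit point, letting the oplog grow past its configured maximum instead. All names and types below are hypothetical.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical types: Timestamp is an opaque, totally ordered oplog position.
using Timestamp = std::uint64_t;

struct OplogEntry {
    Timestamp ts;
    std::size_t sizeBytes;
};

// Truncate from the back (oldest end) of the oplog until it fits within
// maxSizeBytes, but never remove an entry at or ahead of the majority commit
// point; if the commit point lags, the oplog simply grows instead.
void truncateOplog(std::deque<OplogEntry>& oplog,
                   std::size_t maxSizeBytes,
                   Timestamp majorityCommitPoint) {
    std::size_t totalBytes = 0;
    for (const auto& e : oplog)
        totalBytes += e.sizeBytes;

    // Oldest entries are at the front of the deque.
    while (totalBytes > maxSizeBytes && !oplog.empty()) {
        const OplogEntry& oldest = oplog.front();
        if (oldest.ts >= majorityCommitPoint) {
            // Removing this entry would strip history the majority still
            // needs, so stop truncating here.
            break;
        }
        totalBytes -= oldest.sizeBytes;
        oplog.pop_front();
    }
}
```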
| Comments |
| Comment by Pavithra Vetriselvan [ 03/Jan/19 ] |
|
Closing this ticket. Note that this issue is only fixed when enableMajorityReadConcern=true. |
| Comment by Tess Avitabile (Inactive) [ 02/Jan/19 ] |
|
When resolving this ticket, please clarify that the ticket is only fixed when enableMajorityReadConcern=true. |
| Comment by Daniel Gottlieb (Inactive) [ 02/Jan/19 ] |
|
I might be confused, apologies for the clarification request. It sounded like this ticket was being repurposed to ensure the oplog has enough history for recovery when there are in-flight prepared transactions (prepared behind the commit point):
When the stable timestamp is allowed to move in front of prepared transactions, there would still be a required change for preventing oplog truncation from deleting these in-flight prepared oplog entries. Perhaps that's being tracked elsewhere.
As far as we know, there's no more storage work required for that. |
| Comment by Judah Schvimer [ 02/Jan/19 ] |
|
Since the stableTimestamp is less than or equal to the replication commit point, it seems that this ticket (preventing nodes from truncating oplog entries newer than the commit point) was accomplished in |
| Comment by Daniel Gottlieb (Inactive) [ 20/Dec/18 ] |
|
If the only concern is keeping around prepared, uncommitted oplog entries, support for that was already put into storage. A clarification: the Storage API provided there doesn't require that the truncationTimestamp be in any way related to a prepared transaction. |
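A hedged sketch of the contract described in this comment, with entirely hypothetical names (the real server/WiredTiger interface differs): storage exposes a pinned truncation timestamp and guarantees that entries at or in front of it survive truncation, without caring why the caller chose that value.

```cpp
#include <cstdint>

// Hypothetical interface only; not the actual storage API.
using Timestamp = std::uint64_t;

class OplogTruncationApi {
public:
    virtual ~OplogTruncationApi() = default;

    // Pin a truncation point: entries with ts >= the pinned timestamp must
    // survive truncation. Storage does not interpret the value beyond that
    // guarantee, so it need not be tied to a prepared transaction.
    virtual void pinOplogTruncationPoint(Timestamp ts) = 0;

    // Background deleters consult this before reclaiming oplog space.
    virtual bool isTruncatable(Timestamp entryTs) const = 0;
};
```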
| Comment by Pavithra Vetriselvan [ 20/Dec/18 ] |
|
As Andy was saying, it seems like we will hit this problem when a prepared but uncommitted/unaborted transaction stays in prepare for a while. Did we ever reach a conclusion on allowing the oplog to grow unbounded in size or is that solved by If we were to not use the functionality provided by |
| Comment by Judah Schvimer [ 20/Sep/18 ] |
|
Given the current design of "prepare", I want to re-investigate what is actually required by this ticket. |
| Comment by Andy Schwerin [ 10/Aug/17 ] |
|
I don't believe |
| Comment by Michael Cahill (Inactive) [ 07/Aug/17 ] |
|
schwerin, the storage-level mechanism was added in |
| Comment by Daniel Gottlieb (Inactive) [ 06/Aug/17 ] |
|
Do we just want to allow the oplog to grow unbounded in size? |
| Comment by Andy Schwerin [ 06/Aug/17 ] |
|
We're going to have to do this eventually, and then some. Multi-shard transactions will require that the oldest operation of any transaction in the "prepared" state on a shard remain in the oplog until the transaction is committed or aborted by the coordinator. |
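To make the combined requirement concrete, here is a small hypothetical sketch (names invented for illustration, not the server's actual code) of how the two retention bounds discussed in this thread could compose: keep everything at or after the majority commit point, and also everything at or after the oldest operation of any still-prepared transaction.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>

// Hypothetical types and names, for illustration only.
using Timestamp = std::uint64_t;

struct RetentionState {
    Timestamp majorityCommitPoint;
    // Timestamp of the oldest oplog entry of each prepared, unresolved txn.
    std::set<Timestamp> activePrepareTimestamps;
};

// Oplog entries strictly behind the returned timestamp are safe to truncate;
// everything at or after it must be retained.
Timestamp computeTruncationBound(const RetentionState& state) {
    Timestamp bound = state.majorityCommitPoint;
    if (!state.activePrepareTimestamps.empty())
        bound = std::min(bound, *state.activePrepareTimestamps.begin());
    return bound;
}
```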
| Comment by Daniel Gottlieb (Inactive) [ 06/Aug/17 ] |
|
It wasn't mandatory, but there was a ticket filed. |
| Comment by Andy Schwerin [ 06/Aug/17 ] |
|
Somehow, I thought this was part of the mandatory work for the WT engine for one of the storage timestamps projects. Is there not a ticket already filed to this effect, michael.cahill? |
| Comment by Daniel Gottlieb (Inactive) [ 01/Aug/17 ] |
|
Straightforward for WiredTiger* |
| Comment by Spencer Brody (Inactive) [ 01/Aug/17 ] |
|
I think I'm going to leave this ticket parked with the replication team for now since I think ultimately it's us who needs to decide what (if anything) we want to do about lagged secondaries. If we decide this is something we want to do, I will send this over to the storage team for the actual implementation, which according to daniel.gottlieb is pretty straightforward. |
| Comment by Eric Milkie [ 01/Aug/17 ] |
|
This might be easier for storage to do, since we could just amend the capped deleter to not delete anything newer than the oldest timestamp (which will be set when setCommittedSnapshot is called). |
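A minimal sketch of the wiring suggested here, assuming hypothetical names (the real setCommittedSnapshot path is not shown): when the committed snapshot advances, the same timestamp monotonically advances a floor that the capped deleter checks before removing oplog entries.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper; real server interfaces and names differ.
using Timestamp = std::uint64_t;

class CappedDeleterBound {
public:
    // Called alongside setCommittedSnapshot: advance the floor, never move
    // it backwards.
    void onCommittedSnapshotAdvance(Timestamp committed) {
        Timestamp current = _floor.load();
        while (committed > current &&
               !_floor.compare_exchange_weak(current, committed)) {
            // current is refreshed by compare_exchange_weak on failure.
        }
    }

    // The capped deleter may only reclaim entries strictly behind the floor.
    bool mayDelete(Timestamp entryTs) const {
        return entryTs < _floor.load();
    }

private:
    std::atomic<Timestamp> _floor{0};
};
```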
| Comment by Spencer Brody (Inactive) [ 01/Aug/17 ] |
|
The alternative to this is to provide backpressure: slowing or failing writes on the primary when the commit point is in danger of falling off its oplog. |
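For illustration, a rough sketch of what that backpressure could look like, with made-up thresholds and names (nothing here reflects an actual server mechanism), treating timestamp distance as a crude proxy for position within the oplog window: estimate how much headroom remains between the back of the oplog and the majority commit point, then slow or fail writes as it shrinks.

```cpp
#include <chrono>
#include <cstdint>
#include <stdexcept>
#include <thread>

// Hypothetical types and thresholds, for illustration only.
using Timestamp = std::uint64_t;

struct OplogWindow {
    Timestamp back;                 // oldest entry still in the oplog
    Timestamp top;                  // newest entry
    Timestamp majorityCommitPoint;  // current replication commit point
};

// Fraction of the window between the back of the oplog and the newest entry
// that the commit point has covered (0.0 = about to fall off the back,
// 1.0 = fully caught up).
double commitPointHeadroom(const OplogWindow& w) {
    if (w.top <= w.back)
        return 1.0;
    if (w.majorityCommitPoint <= w.back)
        return 0.0;
    return static_cast<double>(w.majorityCommitPoint - w.back) /
           static_cast<double>(w.top - w.back);
}

// Called on the primary's write path: throttle when headroom gets low, and
// fail outright when the commit point is about to fall off the oplog.
void applyWriteBackpressure(const OplogWindow& w) {
    const double headroom = commitPointHeadroom(w);
    if (headroom < 0.05)
        throw std::runtime_error(
            "majority commit point too close to the back of the oplog");
    if (headroom < 0.20)
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
```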
| Comment by Spencer Brody (Inactive) [ 01/Aug/17 ] |
|
If we did this, we would have to figure out whether to let the oplog grow unbounded (up to the limits of available storage space) or give it some other limit, though any other limit feels arbitrary since we're already letting the oplog exceed the maximum size the user has configured for it. |