[SERVER-37897] Disable table logging for data files when enableMajorityReadConcern=false Created: 02/Nov/18  Updated: 29/Oct/23  Resolved: 10/Dec/18

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: 4.1.7

Type: Task Priority: Major - P3
Reporter: Tess Avitabile (Inactive) Assignee: Tess Avitabile (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-38041 Test single shard transactions with a... Closed
Related
is related to SERVER-38925 Rollback via refetch can cause _id du... Closed
is related to SERVER-37728 Disabling majority reads negates some... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2018-12-03, Repl 2018-12-17
Participants:
Linked BF Score: 66

 Description   

In SERVER-37227, we reintroduced the enableMajorityReadConcern:false parameter on the master branch. When majority read concern is disabled, we turn on logging for data files and take unstable checkpoints. This is incompatible with causally consistent sharded cluster backup, which requires us to be able to replay the oplog to a particular point in time. We must do the following to disable table logging for data files when enableMajorityReadConcern=false:

  • Disable table logging for data files, which was enabled here.
  • Take stable checkpoints. In addition to setting the oldest timestamp in the WiredTigerOplogManager, we will set the stable timestamp. In the WiredTigerCheckpointThread, we will take a stable checkpoint instead of a full checkpoint. Since we are no longer taking full checkpoints, we may need to pin the oplog. Since the stable timestamp is now ahead of the majority commit point, we must still use rollback-via-refetch instead of recover-to-stable-timestamp.
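
The plan in the two bullets above can be sketched as a toy model. This is only an illustration, not the server's C++ code or the WiredTiger API: ToyStorageEngine, oplog_manager_advance, and rollback_method are hypothetical names, timestamps are plain integers, and setting the oldest and stable timestamps to the same value is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class ToyStorageEngine:
    """Toy stand-in for the storage engine. Not the WiredTiger API."""

    oldest_timestamp: int = 0
    stable_timestamp: int = 0
    checkpoint_timestamp: int = 0

    def set_oldest_timestamp(self, ts: int) -> None:
        # Assumption: timestamps only move forward in this model.
        self.oldest_timestamp = max(self.oldest_timestamp, ts)

    def set_stable_timestamp(self, ts: int) -> None:
        self.stable_timestamp = max(self.stable_timestamp, ts)

    def take_stable_checkpoint(self) -> int:
        # A stable checkpoint captures data as of the stable timestamp,
        # rather than everything written so far (a full checkpoint).
        self.checkpoint_timestamp = self.stable_timestamp
        return self.checkpoint_timestamp


def oplog_manager_advance(engine: ToyStorageEngine, ts: int) -> None:
    """Hypothetical oplog manager step: set both timestamps together."""
    engine.set_stable_timestamp(ts)
    # Assumption: oldest is set to the same value for simplicity.
    engine.set_oldest_timestamp(ts)


def rollback_method(enable_majority_read_concern: bool) -> str:
    """Rollback algorithm implied by the description above."""
    if enable_majority_read_concern:
        return "recover-to-stable-timestamp"
    # The stable timestamp may be ahead of the majority commit point,
    # so the stable checkpoint may contain writes that must be undone.
    return "rollback-via-refetch"


engine = ToyStorageEngine()
oplog_manager_advance(engine, 120)
checkpoint_ts = engine.take_stable_checkpoint()
```

In this model the checkpoint timestamp follows the stable timestamp, which is why the oplog from that point onward may need to be pinned: with table logging disabled for data files, later writes would presumably be recovered by replaying the oplog from the checkpoint.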


 Comments   
Comment by Githook User [ 10/Dec/18 ]

Author:

{'name': 'Tess Avitabile', 'email': 'tess.avitabile@mongodb.com', 'username': 'tessavitabile'}

Message: SERVER-37897 Disable table logging for data files when enableMajorityReadConcern=false
Branch: master
https://github.com/mongodb/mongo/commit/6f23d1a20a7669396efab06bf26b3ee76553fd9b

Comment by Eric Milkie [ 12/Nov/18 ]

Regarding my confusion over

I imagined that we would call setStableTimestamp() from the oplog manager (which also sets the oldest timestamp). Does that make sense?

I have edited the Description to suggest that we set both the stable and oldest timestamp in the oplog manager.

Comment by William Schultz (Inactive) [ 09/Nov/18 ]

To clarify the discussion with daniel.gottlieb about keeping a "window" of history, I was only concerned with keeping a small bit of "snapshot window" history, so as to service sharded atClusterTime reads more easily. When majority reads are enabled, the storage timeline (i.e. the oplog) looks something like the following, with

O=oldest timestamp
S=stable timestamp
M=majority committed timestamp
A=all committed timestamp
L=last applied timestamp
O-------S----M---------------------------------------A----L

The interval between [O,S] is not strictly necessary to maintain, but we let the oldest_timestamp lag a bit behind the stable_timestamp so that we could service reads in this region of history if necessary. The amount of lag was calculated and determined by these parameters. When majority reads are disabled, we no longer keep history all the way back to the majority commit point i.e. we will no longer be restricting the stable timestamp to fall behind the majority commit point. The timeline may now look something like this:

-------------M--------------------------O--------S---A----L

Depending on the implementation, the stable timestamp (S) may end up always being set to the all committed timestamp (A). Although these diagrams cannot be considered "to scale", they try to illustrate the issue of setting the stable timestamp behind the majority commit point: there is a lot of history that needs to be kept between the oldest timestamp and the most recent last applied timestamp (the [O,L] interval in the first diagram). If we start setting the stable timestamp to something much closer to lastApplied/all committed, there is no need to keep all of this history. We still, however, want to keep a "snapshot window", which is the (much smaller) amount of history we keep around between the oldest timestamp and the stable timestamp. This will allow us to service snapshot reads that may lag a bit behind the lastApplied timestamp. The "snapshot window" history is what we are interested in maintaining as a part of this ticket, which we may just get for free, since we already calculate a window of lag when we set the stable timestamp in WiredTiger.
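
To make the two timelines concrete, here is a small arithmetic sketch. Everything in it is assumed for illustration: the timestamps are arbitrary integers, `window` stands in for the lag parameters mentioned above (whose names and values are not given here), and the rule that the stable timestamp is the smaller of M and A when majority reads are enabled is a simplification of "restricting the stable timestamp to fall behind the majority commit point".

```python
def oldest_for_window(stable_ts: int, window: int, current_oldest: int = 0) -> int:
    """Let the oldest timestamp trail the stable timestamp by `window`,
    without ever moving it backwards."""
    return max(current_oldest, stable_ts - window)


def history_kept(oldest_ts: int, last_applied_ts: int) -> int:
    """Size of the [O, L] interval that storage must retain."""
    return last_applied_ts - oldest_ts


# Illustrative values only: M lags far behind A and L.
majority_commit, all_committed, last_applied = 100, 950, 1000
window = 5  # hypothetical snapshot window size

# Majority reads enabled: S may not pass M.
stable_enabled = min(majority_commit, all_committed)
oldest_enabled = oldest_for_window(stable_enabled, window)
history_enabled = history_kept(oldest_enabled, last_applied)

# Majority reads disabled: S is assumed to track A.
stable_disabled = all_committed
oldest_disabled = oldest_for_window(stable_disabled, window)
history_disabled = history_kept(oldest_disabled, last_applied)

# The snapshot window [O, S] has the same size in both cases.
snapshot_window_enabled = stable_enabled - oldest_enabled
snapshot_window_disabled = stable_disabled - oldest_disabled
```

With these numbers the retained history shrinks from 905 to 55 timestamp units, while the snapshot window between O and S stays at 5 in both cases.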

Comment by William Schultz (Inactive) [ 08/Nov/18 ]

daniel.gottlieb I wasn't referring to the history we keep all the way back to the majority commit point. I was just referring to the small window of history we keep behind the stable timestamp, as dictated by these parameters.

Comment by Daniel Gottlieb (Inactive) [ 06/Nov/18 ]

Ideally, as a part of this change, we would maintain a window of history behind the stable timestamp, as we do already when majority reads are enabled.

To my knowledge, the problems surrounding memory usage and history, particularly when there's a problem moving the commit point, dictate that the purpose of this project/ticket is to not keep history back to the majority point. I don't believe replication/sharding will be able to expect any meaningful history in storage with enableMajorityReadConcern: false.

Comment by William Schultz (Inactive) [ 06/Nov/18 ]

Ideally, as a part of this change, we would maintain a window of history behind the stable timestamp, as we do already when majority reads are enabled. This is necessary to make single shard transactions against enableMajorityReadConcern:false nodes work in a sensible way. Depending on how we implement the changes for this ticket, we may get this for free, as a result of the existing logic in WiredTigerKVEngine::setStableTimestamp. If this is not possible as part of this ticket, we should make sure to do it separately.

Comment by Tess Avitabile (Inactive) [ 05/Nov/18 ]

I imagined that we would call setStableTimestamp() from the oplog manager (which also sets the oldest timestamp). Does that make sense?

Comment by Eric Milkie [ 04/Nov/18 ]

This plan sounds a lot better than any of my crazy ideas for this problem.
Q: If we stop setting the oldest timestamp in the oplog manager, where will we set it instead? I suppose we could have the checkpoint thread set it, since it will now become the thing that is consuming the oldest history?

Comment by Tess Avitabile (Inactive) [ 02/Nov/18 ]

milkie, this is the solution that daniel.gottlieb and I agreed on for causally consistent sharded cluster backups. I'm not sure whether this work belongs on the Replication team or the Storage team. I think it could reasonably be done by either team.
