[SERVER-39364] Audit uses of setLastOpToSystemLastOpTime Created: 04/Feb/19  Updated: 29/Oct/23  Resolved: 15/Jan/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.3.3

Type: Task Priority: Major - P3
Reporter: Tess Avitabile (Inactive) Assignee: Lingzhi Deng
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-43656 Return the timestamp of the top of th... Closed
is depended on by SERVER-41165 Transaction participants don't need t... Backlog
Problem/Incident
causes SERVER-45800 Unlock RSTL after calling setLastOpTo... Closed
Related
related to SERVER-45598 Complete TODO listed in SERVER-39364 Closed
related to SERVER-44820 bgSync grabs DBLock in X mode when re... Closed
is related to SERVER-30217 applyOps doesn't wait for replication... Backlog
is related to SERVER-37948 Linearizable read concern is not sati... Closed
is related to SERVER-39356 Change stream updateLookup queries wi... Closed
is related to SERVER-38906 Multi-document transactions should no... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2019-12-30, Repl 2020-01-13, Repl 2020-01-27
Participants:
Linked BF Score: 20

 Description   

It is not valid to call setLastOpToSystemLastOpTime() and then wait for majority writeConcern to ensure that any data read was majority committed. The reason is that writes update the lastApplied in an on-commit hook after the data changes are made. Consider the following scenario:

  1. Originally the lastApplied is time 1.
  2. Operation A updates the document {_id: 0} to have a=5 and writes an oplog entry with time 2.
  3. Operation B reads the document {_id: 0, a: 5}.
  4. Operation B calls setLastOpToSystemLastOpTime() and gets the lastApplied as time 1.
  5. The on-commit hook for Operation A then updates the lastApplied to time 2.
  6. Operation B waits for the majority commit point to reach time 1, which does not guarantee that the data it read was majority committed.
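A minimal, self-contained simulation of this interleaving (illustrative only; the variables model the server's concepts, not its actual code):

{code:cpp}
// Simulation of the race: the writer's data change is visible before its
// on-commit hook advances lastApplied, so a reader can observe the new data
// yet capture the old lastApplied. All names here are illustrative.
#include <atomic>
#include <cstdio>

int main() {
    std::atomic<int> lastApplied{1};  // models the node's lastApplied optime
    std::atomic<int> docA{0};         // models the document {_id: 0}

    docA.store(5);                        // Op A: data change, oplog time 2
    int readValue = docA.load();          // Op B: reads {_id: 0, a: 5}
    int captured = lastApplied.load();    // Op B: setLastOpToSystemLastOpTime()
                                          //       still observes time 1
    lastApplied.store(2);                 // Op A: on-commit hook runs last

    // Op B now waits for the majority commit point to reach time 1, which
    // proves nothing about the write at time 2 whose data it actually read.
    std::printf("read a=%d, waiting on optime %d (needed 2)\n",
                readValue, captured);
    return 0;
}
{code}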

Any callers of setLastOpToSystemLastOpTime() that use it to ensure data read was majority committed are suspect. This excludes reading transaction state in transaction operations, since they are serialized by the session checkout mechanism, so a new transaction operation cannot start until the on-commit hooks have finished for the previous operation.



 Comments   
Comment by Githook User [ 15/Jan/20 ]

Author: Lingzhi Deng <lingzhi.deng@mongodb.com> (ldennis)

Message: SERVER-39364: Get the last oplog entry from the storage engine for setLastOpToSystemLastOpTime
Branch: master
https://github.com/mongodb/mongo/commit/756a87f2e86d8f67259b5995d8f1cf7dcc27f7a6
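For orientation, a rough sketch of the shape of that change, modeled on the commit message and the plan in the comments below. LocalOplogInfo, Lock::GlobalLock, and getLatestOplogTimestamp are facilities named in this ticket; the exact signatures, accessors, and error handling here are assumptions:

{code:cpp}
// Approximate shape only; not code copied from the tree.
void ReplClientInfo::setLastOpToSystemLastOpTime(OperationContext* opCtx) {
    // Per the locking discussion below, a global MODE_IS lock should be
    // enough to reach the oplog record store via LocalOplogInfo.
    Lock::GlobalLock lk(opCtx, MODE_IS);

    auto* oplogInfo = LocalOplogInfo::get(opCtx);         // assumed accessor
    auto* recordStore = oplogInfo->getRecordStore();      // assumed accessor

    // SERVER-43656: ask the storage engine for the newest oplog timestamp
    // instead of trusting the racy in-memory lastApplied.
    Timestamp topOfOplog =
        recordStore->getLatestOplogTimestamp(opCtx).getValue();

    // Pair it with the current term and install it as this client's lastOp.
    auto term = repl::ReplicationCoordinator::get(opCtx)->getTerm();
    setLastOp(opCtx, repl::OpTime(topOfOplog, term));
}
{code}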

Comment by Tess Avitabile (Inactive) [ 17/Dec/19 ]

Your plan sounds great. Thanks for the thorough investigation. For transactions, you can repurpose SERVER-41165 to remove the noop write. I agree that the solution for change streams using timestamped reads does not need to change, and that the performance of linearizable is lower priority.

Comment by Judah Schvimer [ 16/Dec/19 ]

sgtm

Comment by Lingzhi Deng [ 16/Dec/19 ]

Talked with judah.schvimer offline: in this ticket, we will re-implement ReplClientInfo::setLastOpToSystemLastOpTime to use the WiredTigerRecordStore::getLatestOplogTimestamp interface implemented in SERVER-43656. To address jason.chan's comment, we will grab only the global MODE_IS lock when accessing the oplog collection and the record store. We can do that via LocalOplogInfo.

Once the (re)implementation is done, I will audit callers of setLastOpToSystemLastOpTime and file tickets to remove the unnecessary noop oplog writes that we added in the past for correctness in SERVER-38906. SERVER-39356 was addressed using timestamped reads for change streams, so I think we can leave that as is. And SERVER-37948 is only for linearizable reads and is probably of lower value, so we could leave that too?

tess.avitabile, judah.schvimer How does the plan sound to you? Is there anything else that I need to be careful with?

Comment by Jason Chan [ 05/Dec/19 ]

SERVER-43656 implements a method to get the lastAppliedTimestamp through the record store. This requires getting the collection object (and the appropriate collection-level locks). This ticket may need to extend that new mechanism to get the lastAppliedTimestamp while holding only the global lock in MODE_IS. Once this is available, consider replacing the Helpers::getLast call in bgsync.cpp with the new interface.
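A hypothetical before/after for that bgsync.cpp suggestion (Helpers::getLast is the helper named above; the record-store accessor and exact signatures are assumptions):

{code:cpp}
// Before: read the newest oplog document and parse its "ts" field.
BSONObj lastOp;
Helpers::getLast(opCtx, "local.oplog.rs", lastOp);
Timestamp lastTs = lastOp["ts"].timestamp();

// After (hypothetical): with the SERVER-43656 interface, the top-of-oplog
// timestamp comes straight from the record store, with no BSON parsing and
// only the global MODE_IS lock held.
Timestamp lastTs2 =
    oplogRecordStore->getLatestOplogTimestamp(opCtx).getValue();
{code}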

Comment by Judah Schvimer [ 26/Sep/19 ]

I filed SERVER-43656 to track the storage integration layer work needed for this.

Comment by Daniel Gottlieb (Inactive) [ 26/Sep/19 ]

SGTM

Comment by Judah Schvimer [ 25/Sep/19 ]

Talking to daniel.gottlieb made me realize that retryable writes are immune to this bug because of session checkout, like transactions. Other "retryable-write-like" writes that use setLastOpToSystemLastOpTime are still subject to this bug, though. setFCV is the simplest example. Sharding also makes use of this in places that, I think, do not always do storage reads.

Comment by Judah Schvimer [ 25/Sep/19 ]

I'm resuming work on this ticket.

Re item 2 of the plan below, "WT will error if MongoDB does a read at a timestamp greater than the all_committed": agorrod, was this implemented?

Re item 3.2, "WT could return the largest timestamp of any update seen in a given WT-transaction":

There are two main uses of setLastOpToSystemLastOpTime.

  1. Ensuring clients wait for write concern on any writes that could have caused their attempted write to become a "no-op write".
  2. Ensuring that when clients retry a retryable write (or some other command that is retryable), that they wait for write concern on the original write.

In the first case there is a WT transaction available, though it has been closed by the time we call setLastOpToSystemLastOpTime. We could expose a way to store the "largest timestamp of any update seen" (maybe on the OperationContext) until it is ready for use. This does not help in the second case, though: for retryable writes there is no WT transaction, just in-memory state indicating there is no work to do. In some cases we have an OpTime to wait on, but in others there is no clear OpTime to wait on.
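Such per-operation state could, for example, live on the OperationContext as a decoration. A purely hypothetical sketch (decorations are a real server mechanism, but this field and its maintenance are invented here, and the ticket ultimately took the top-of-oplog route instead):

{code:cpp}
// Hypothetical sketch: NOT what was implemented.
struct LargestSeenTimestamp {
    Timestamp value;  // largest update timestamp this operation's
                      // WT transaction has seen so far
};
const auto largestSeenTimestamp =
    OperationContext::declareDecoration<LargestSeenTimestamp>();

// The storage layer would record each update timestamp as it is seen, so
// the value outlives the WT transaction that produced it.
void noteUpdateTimestamp(OperationContext* opCtx, Timestamp ts) {
    auto& largest = largestSeenTimestamp(opCtx).value;
    if (largest < ts)
        largest = ts;
}
{code}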

I propose to fix this with the other solution: "The storage integration layer could return the timestamp of the top of the oplog." We could optimize this in the future for certain cases, but correctness is the problem here, not performance. The other solution just had some potential performance improvements we won't be able to get. daniel.gottlieb, does this still seem reasonable?

Comment by Tess Avitabile (Inactive) [ 12/Feb/19 ]

That sounds correct to me. Thank you for writing up the plan, judah.schvimer.

Comment by Judah Schvimer [ 11/Feb/19 ]

As discussed with tess.avitabile, schwerin, michael.cahill, agorrod, siyuan.zhou, and daniel.gottlieb, the ideal fix will be:

  1. MongoDB will do untimestamped reads for local and majority read concerns (SERVER-38906).
  2. WT will error if MongoDB does a read at a timestamp greater than the all_committed.
  3. MongoDB will ask WT for the timestamp it should use instead of lastApplied in setLastOpToSystemLastOpTime (and possibly change the name of the function if appropriate). This could be one of two things:
    1. The storage integration layer could return the timestamp of the top of the oplog. This should be quite cheap since the timestamp is the RecordId in the oplog and should not require parsing any BSON (see the sketch after this list). This is what is intended by the current broken mechanism.
    2. WT could return the largest timestamp of any update seen in a given WT-transaction. This timestamp is used when waiting for data that is causally related to a user request to become majority committed; thus, we only care about the timestamps of updates that the WT-transaction in the request actually saw. This would reduce the amount of time we have to wait in many cases. We currently ensure that majority reads read data from previously acknowledged majority writes on the same node and connection. To maintain this property, WT would use the durable_timestamp rather than the commit_timestamp of the updates. This prevents a case where the commitTransaction oplog entry rolls back and an ignore_prepared=true majority read misses data that was previously read with speculative majority read concern or written with write concern majority (though we are still guaranteed that the transaction will commit at the same commit_timestamp sometime in the future).
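A standalone illustration of why option 3.1 is cheap: the oplog key is the entry's timestamp packed into the 64-bit RecordId, so recovering the top-of-oplog timestamp is bit arithmetic rather than BSON parsing (the values below are invented):

{code:cpp}
#include <cstdint>
#include <cstdio>

int main() {
    // An oplog Timestamp(secs, inc) is packed into the RecordId as
    // (secs << 32) | inc, so the key of the last record *is* its timestamp.
    std::uint32_t secs = 1549324800, inc = 7;          // invented values
    std::uint64_t recordId = (std::uint64_t{secs} << 32) | inc;

    // Recovering the timestamp requires no document decoding at all.
    std::printf("top of oplog: Timestamp(%u, %u)\n",
                static_cast<std::uint32_t>(recordId >> 32),
                static_cast<std::uint32_t>(recordId & 0xffffffffu));
    return 0;
}
{code}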

 

If I missed anything or incorrectly summarized the results of the discussion, please correct me.

Comment by Daniel Gottlieb (Inactive) [ 05/Feb/19 ]

I agree that, on primaries, "last reserved" is effectively the logical clock value. The challenge is on (lagged) secondaries, whose logical clock values may be arbitrarily far in the future. Because this time is typically used to "wait until", assigning such a time to stale data would result in unnecessary waiting.

Comment by Eric Milkie [ 04/Feb/19 ]

The "last reserved optime" seems like an easy route to take; you can just call LogicalClock::getClusterTime(). That works at least on primary nodes.

Comment by Tess Avitabile (Inactive) [ 04/Feb/19 ]

Discussed this with daniel.gottlieb, and we could consider changing these uses to wait on the clusterTime being majority committed, or introducing a new notion of the "last reserved optime".
