[SERVER-39364] Audit uses of setLastOpToSystemLastOpTime Created: 04/Feb/19 Updated: 29/Oct/23 Resolved: 15/Jan/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.3 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Tess Avitabile (Inactive) | Assignee: | Lingzhi Deng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Repl 2019-12-30, Repl 2020-01-13, Repl 2020-01-27 |
| Participants: | |
| Linked BF Score: | 20 |
| Description |
|
It is not valid to call setLastOpToSystemLastOpTime() and then wait for majority writeConcern to ensure that any data read was majority committed. The reason is that writes update the lastApplied in an on-commit hook after the data changes are made. Consider the following scenario:

1. Originally the lastApplied is time 1.
2. Operation A updates the document {_id: 0} to have a=5 and writes an oplog entry with time 2.
3. Operation B reads the document {_id: 0, a: 5}.
4. Operation B calls setLastOpToSystemLastOpTime() and gets the lastApplied time as time 1.
5. The on-commit hook for Operation A then updates the lastApplied to time 2.
6. Operation B waits for the majority commit point to reach time 1, which does not guarantee that the data it read was majority committed.

Any callers of setLastOpToSystemLastOpTime() that use it to ensure data read was majority committed are suspect. This excludes reading transaction state in transaction operations, since those are serialized by the session checkout mechanism, so a new transaction operation cannot start until the on-commit hooks have finished for the previous operation. |
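The race can be restated as a small, self-contained sketch. The types below are simplified stand-ins, not the real server classes:

```cpp
// A self-contained sketch of the interleaving above. The types are simplified
// stand-ins, not the real server classes.
#include <cassert>
#include <cstdint>
#include <iostream>
#include <map>

struct FakeReplState {
    uint64_t lastApplied = 1;  // advanced by each write's on-commit hook
};

struct FakeStorage {
    std::map<int, int> docs;     // _id -> a
    uint64_t latestOplogTs = 1;  // top of the oplog, advanced when the write lands
};

int main() {
    FakeReplState repl;
    FakeStorage storage;
    storage.docs[0] = 0;

    // Operation A: update {_id: 0} to a=5 and write an oplog entry at time 2.
    // The data change is visible before A's on-commit hook has run.
    storage.docs[0] = 5;
    storage.latestOplogTs = 2;

    // Operation B: read the document, then take lastApplied as its lastOp.
    int a = storage.docs.at(0);                  // observes a=5 (time-2 data)
    uint64_t opTimeToWaitOn = repl.lastApplied;  // still 1: A's hook hasn't run yet

    // Operation A's on-commit hook finally advances lastApplied.
    repl.lastApplied = 2;

    // B waits for the majority commit point to reach time 1, which says nothing
    // about the time-2 write it actually read; that write could still roll back.
    assert(opTimeToWaitOn < storage.latestOplogTs);
    std::cout << "read a=" << a << ", waiting on optime " << opTimeToWaitOn
              << " although the read depended on optime " << storage.latestOplogTs
              << "\n";
    return 0;
}
```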
| Comments |
| Comment by Githook User [ 15/Jan/20 ] |
|
Author: Lingzhi Deng <lingzhi.deng@mongodb.com> (ldennis)
Message: |
| Comment by Tess Avitabile (Inactive) [ 17/Dec/19 ] |
|
Your plan sounds great. Thanks for the thorough investigation. For transactions, you can repurpose SERVER-41165 to remove the noop write. I agree that the solution for change streams using timestamped reads does not need to change, and that the performance of linearizable is lower priority. |
| Comment by Judah Schvimer [ 16/Dec/19 ] |
|
sgtm |
| Comment by Lingzhi Deng [ 16/Dec/19 ] |
|
Talked with judah.schvimer offline. In this ticket, we will re-implement ReplClientInfo::setLastOpToSystemLastOpTime to use WiredTigerRecordStore::getLatestOplogTimestamp implemented in
And once the (re)implementation is done, I will audit callers of setLastOpToSystemLastOpTime and file tickets to remove unnecessary noop oplog writes that we did in the past for correctness in
tess.avitabile, judah.schvimer: How does the plan sound to you? Is there anything else that I need to be careful with? |
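Concretely, the re-implementation amounts to deriving the client's lastOp from the top of the oplog in storage rather than from the in-memory lastApplied. Below is a minimal sketch under the same simplified model as the example in the description; setLastOpToSystemLastOpTime and getLatestOplogTimestamp are the only names taken from this ticket, the rest are hypothetical stand-ins:

```cpp
// Sketch only: simplified stand-in types, not the real server classes.
#include <cstdint>

struct FakeStorage {
    // Mirrors the role of WiredTigerRecordStore::getLatestOplogTimestamp():
    // advanced as soon as the oplog write lands in storage.
    uint64_t latestOplogTs = 1;
    uint64_t getLatestOplogTimestamp() const { return latestOplogTs; }
};

struct FakeReplState {
    uint64_t lastApplied = 1;  // advanced later, by the writer's on-commit hook
};

struct FakeClientInfo {
    uint64_t lastOp = 0;
};

// Old behavior: copy lastApplied, which can lag behind data this operation read.
void setLastOpToLastApplied(FakeClientInfo& client, const FakeReplState& repl) {
    client.lastOp = repl.lastApplied;
}

// Proposed behavior: copy the top of the oplog from storage, which already
// covers any write whose effects were visible to this operation's reads.
void setLastOpToSystemLastOpTime(FakeClientInfo& client, const FakeStorage& storage) {
    client.lastOp = storage.getLatestOplogTimestamp();
}
```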
| Comment by Jason Chan [ 05/Dec/19 ] |
|
|
| Comment by Judah Schvimer [ 26/Sep/19 ] |
|
I filed |
| Comment by Daniel Gottlieb (Inactive) [ 26/Sep/19 ] |
|
SGTM |
| Comment by Judah Schvimer [ 25/Sep/19 ] |
|
Talking to daniel.gottlieb made me realize that retryable writes are immune to this bug because of session checkout, like transactions. Other "retryable-write-like" writes that use setLastOpToSystemLastOpTime are still subject to this bug, though; setFCV is the simplest example. Sharding also makes use of this in places that, I think, do not always do storage reads. |
| Comment by Judah Schvimer [ 25/Sep/19 ] |
|
I'm resuming work on this ticket.
agorrod, was this implemented?
There are two main uses of setLastOpToSystemLastOpTime.
In the first case there is a WT transaction available, though it has been closed by the time we call setLastOpToSystemLastOpTime. We could expose a way to store the "largest timestamp of any update seen" (maybe on the OperationContext) until it is ready for use. This does not help in the second case though: for retryable writes there is no WT transaction, just in-memory state indicating there is no work to do. In some cases we have an OpTime to wait on, but in others there is no clear OpTime to wait on.

I propose to fix this with the other solution, that "the storage integration layer could return the timestamp of the top of the oplog." We could optimize this in the future for certain cases, but correctness is the problem here, not performance; the other solution just had some potential performance improvements we won't be able to get. daniel.gottlieb, does this still seem reasonable? |
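For contrast, here is a minimal sketch of the first option (carrying the largest timestamp of any update seen on the operation's context), using hypothetical stand-in types rather than the real server classes:

```cpp
// Sketch only: hypothetical stand-in for remembering the largest update
// timestamp on the operation's context after the WT transaction has closed.
#include <algorithm>
#include <cstdint>

struct FakeOperationContext {
    uint64_t maxTimestampSeen = 0;  // survives until we are ready to use it

    // Call whenever the operation performs or observes a timestamped update.
    void noteUpdateTimestamp(uint64_t ts) {
        maxTimestampSeen = std::max(maxTimestampSeen, ts);
    }
};

// The optime to wait on would then be maxTimestampSeen, but only operations
// that actually touched storage ever populate it, which is why this does not
// cover the retryable-write "no work to do" path described above.
```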
| Comment by Tess Avitabile (Inactive) [ 12/Feb/19 ] |
|
That sounds correct to me. Thank you for writing up the plan, judah.schvimer. |
| Comment by Judah Schvimer [ 11/Feb/19 ] |
|
As discussed with tess.avitabile, schwerin, michael.cahill, agorrod, siyuan.zhou, and daniel.gottlieb, the ideal fix will be:
If I missed anything or incorrectly summarized the results of the discussion, please correct me. |
| Comment by Daniel Gottlieb (Inactive) [ 05/Feb/19 ] |
|
I agree that, on primaries, "last reserved" is effectively the logical clock value. The challenge is on (lagged) secondaries, whose logical clocks can know of times arbitrarily far ahead of the data they have applied. Because this time is typically used to "wait until", assigning such a time to stale data would result in unnecessary waiting. |
| Comment by Eric Milkie [ 04/Feb/19 ] |
|
The "last reserved optime" seems like an easy route to take; you can just call LogicalClock::getClusterTime(). That works at least on primary nodes. |
| Comment by Tess Avitabile (Inactive) [ 04/Feb/19 ] |
|
Discussed this with daniel.gottlieb, and we could consider changing these uses to wait on the clusterTime being majority committed, or introducing a new notion of the "last reserved optime". |