[SERVER-31887] clusterTime advanced on primary without anything being written to oplog

| Created: | 09/Nov/17 | Updated: | 27/Oct/23 |
| Resolved: | 20/Jun/19 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.6.0-rc2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | Misha Tyulenev |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | |
| Sprint: | Sharding 2019-07-01 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 60 |
| Description |
Command errors such as duplicate key errors still cause the clusterTime to be advanced; however, no oplog entry is written for those failed operations. We then use the current clusterTime as the operationTime in the server's error responses for other error cases (as part of the changes from a linked ticket).

The behavior of advancing the clusterTime prior to RecoveryUnit::Change::commit() (e.g. for cases where the operation fails on WiredTiger transaction commit) may have other negative effects on causal consistency that I haven't thought about.

The impact of this bug, insofar as I have observed failures in Evergreen, is that we no longer have a guarantee of making forward progress when waiting for afterClusterTime unless the periodic no-op writer is running.
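A minimal shell sketch of the failure mode described above (the collection name and the maxTimeMS bound are illustrative, not the ticket's original script; assumes a replica-set primary started with the periodic no-op writer disabled, e.g. --setParameter writePeriodicNoops=false):

```javascript
const testDB = db.getSiblingDB("test");
testDB.repro.drop();

// First insert succeeds and writes an oplog entry.
let res = testDB.runCommand({insert: "repro", documents: [{_id: 1}]});

// Second insert hits a duplicate key error (code 11000). The logical clock is
// still advanced, but no oplog entry is written for the failed operation.
res = testDB.runCommand({insert: "repro", documents: [{_id: 1}]});
const errorOpTime = res.operationTime;

// With no further writes, this read waits for an oplog entry at least as new
// as errorOpTime, which may never appear; maxTimeMS bounds the wait so the
// shell gets an error back instead of hanging indefinitely.
const findRes = testDB.runCommand({
    find: "repro",
    readConcern: {level: "local", afterClusterTime: errorOpTime},
    maxTimeMS: 5000,
});
// Expected without the no-op writer: MaxTimeMSExpired (code 50), not a result.
```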
| Comments |
| Comment by Misha Tyulenev [ 20/Jun/19 ] |
While the reported behavior, i.e. not incrementing the optime on error, persists and is "works as designed", the command does not hang.
| Comment by Misha Tyulenev [ 22/Nov/17 ] |
I moved this ticket to our backlog. The fix is related to transactions, as in this case it may require the LogicalClock to be part of the transaction.
| Comment by Misha Tyulenev [ 09/Nov/17 ] |
Returning a clusterTime that has been rolled back is a non-blocking bug, as we have a way to make progress in production.
| Comment by Max Hirschhorn [ 09/Nov/17 ] |
Correct, I still find it odd that the server would be allowed to report an operationTime for which calling waitForAllEarlierOplogWritesToBeVisible() and then reading directly from the oplog wouldn't be guaranteed to have an entry at least as new. I think one way to get around the liveness issue when the periodic no-op writer is disabled might be to have failed operations use the last operationTime that was returned to that Client rather than using the current clusterTime.
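The server-side change suggested above isn't what shipped, but a client-side analogue can be sketched (the helper names here are hypothetical): advance the causal timestamp only from successful replies, so a failed write's inflated operationTime is never used for afterClusterTime reads.

```javascript
// Hypothetical client-side mitigation: track the operationTime only from
// replies that succeeded, mirroring the idea of waiting on the last
// operationTime of a successful operation instead of the current clusterTime.
let lastGoodOpTime = null;

function runTracked(dbObj, cmd) {
    const res = dbObj.runCommand(cmd);
    // Treat writeErrors (e.g. duplicate key) like a failed command here.
    if (res.ok === 1 && !res.writeErrors) {
        lastGoodOpTime = res.operationTime;
    }
    return res;
}

function causalFind(dbObj, collName) {
    const cmd = {find: collName};
    if (lastGoodOpTime !== null) {
        // Wait only on a timestamp known to have a corresponding oplog entry.
        cmd.readConcern = {level: "local", afterClusterTime: lastGoodOpTime};
    }
    return dbObj.runCommand(cmd);
}
```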
| Comment by Kaloian Manassiev [ 09/Nov/17 ] |
I am not very familiar with how this works, so I will leave Misha to comment. However, the way I imagine it is that the find against the primary never completes because it is waiting for the persisted logical time to advance up to the one specified in the read concern's afterClusterTime (which in turn is equal to the in-memory logical time). In such a case, ideally the logical time returned for a rolled-back request should be the persisted one instead of the in-memory one. However, I am not sure how easy that is to do, so always running the no-op writer is the closest thing available.

max.hirschhorn, you said that "the periodic no-op writer is already enabled by default". This means that this blocking only impacts tests, but it won't be hit in practice, right?
| Comment by Max Hirschhorn [ 09/Nov/17 ] |
kaloian.manassiev, the periodic no-op writer is already enabled by default. Prior conversations with Misha about how the secondary will send the "appendOplogNote" command to the node it currently believes to be primary led me to believe that the periodic no-op writer wasn't intended to be necessary to ensure forward progress. We recently decided to disable it in more test suites (in a linked ticket).
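For context, appendOplogNote is the same kind of no-op oplog write the periodic writer performs; when the writer is disabled, running it manually against the primary advances the persisted time and unblocks afterClusterTime waiters (the "data" payload below is arbitrary):

```javascript
// Must be run against the primary; writes a no-op entry to the oplog.
assert.commandWorked(db.adminCommand({
    appendOplogNote: 1,
    data: {msg: "manual no-op to unblock afterClusterTime waiters"},
}));
```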
| Comment by Kaloian Manassiev [ 09/Nov/17 ] |
In the scenario where the WT transaction rolls back, wouldn't it only be the in-memory cluster time that is advanced, whereas the persisted one would remain at whatever the last oplog entry written was? In this case I agree that the no-op writer should run at some point in order to bring the persisted time up to the in-memory one. I don't think we can make the optime be advanced only at WUOW commit, because if it rolls back, we don't know what to roll it back to (i.e., in the presence of multiple concurrent writes).

misha.tyulenev - while
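One way to observe the gap described in this comment, sketched under the assumption of direct access to the primary's local database: compare the in-memory cluster time gossiped on any reply against the timestamp of the newest oplog entry, which is the persisted time.

```javascript
// The $clusterTime gossiped on replies reflects the in-memory logical clock.
const reply = db.adminCommand({ping: 1});
const inMemory = reply.$clusterTime.clusterTime;

// The newest oplog entry's "ts" field is the persisted time.
const lastOplogEntry = db.getSiblingDB("local").oplog.rs
    .find().sort({$natural: -1}).limit(1).next();
const persisted = lastOplogEntry.ts;

// After a failed write with the no-op writer disabled, inMemory can run ahead
// of persisted until the next successful write or no-op closes the gap.
print("in-memory: " + tojson(inMemory) + ", persisted: " + tojson(persisted));
```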
| Comment by Max Hirschhorn [ 09/Nov/17 ] |
I've renamed the title of this ticket from "operationTime can move backwards (even when no operations are rolled back via replication)". I suspect it is permissible for the server to report that the operationTime moved backwards when a command only "happened after" from the client's perspective, because it didn't actually specify afterClusterTime (e.g. as was the case before with the "ping" command). I've changed the steps to reproduce to demonstrate a "find" command for which waitForReadConcern() wouldn't return, because the server has advanced its clusterTime without reflecting it in the oplog.