[SERVER-71674] Yield handlers not handling WCE thrown from restoring cursors Created: 29/Nov/22  Updated: 29/Nov/23

Status: Blocked
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Yuhong Zhang Assignee: Backlog - Storage Execution Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on WT-11871 Define whether or not cursor reset ca... Open
Related
related to SERVER-26452 `renameCollection` should handle writ... Closed
related to SERVER-65106 Don't perform eviction after prepared... Closed
related to SERVER-37861 Add writeConflictRetry to wildcard in... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Sprint: Execution Team 2023-02-20, Execution Team 2023-02-06, Execution Team 2023-03-06
Participants:
Linked BF Score: 6

 Description   

In some query stages like delete, we will need to call restoreState() in case the actual delete throws a WCE which invalidates the cursor. But restoreState() can also throw a WCE which is not handled. This can cause the following next() call lands the cursor to the beginning of the collection again, hitting the out-of-order invariant.



 Comments   
Comment by Dianna Hohensee (Inactive) [ 27/Feb/23 ]

I'm going to put this ticket back into Needs Triage to gain wider support for either

(a) removing the failpoint testing for artificial read path WCEs
(b) leaving this problem alone until such time that work is scheduled to handle WCEs in the read paths.

A pro of keeping the testing active would be to catch new use cases. Although it's unclear to me how useful the testing currently is in that regard, since we clearly have scenarios and they are rarely caught. Nor does the failure signal help us with the final solution (complete read path WCE handling).

A con of keeping the testing active is it could lead to more crazy long investigations when the source of the failure is very non-obvious.

Comment by Yuhong Zhang [ 21/Feb/23 ]

I pointed Dianna to the wrong failpoint. The one that throws WCEs is actually here

I'm unlinking SERVER-69101 because it's actually related to a different failpoint WTPrepareConflictForReads. A somewhat similar name

Comment by Dianna Hohensee (Inactive) [ 30/Jan/23 ]

Linking SERVER-69101 because it discusses the failpoint in question.

Comment by Dianna Hohensee (Inactive) [ 20/Jan/23 ]

I spoke offline with Yuhong.

The idea here is that any query stage ::restoreState() function call, like this one where the WCE arose in the linked test failure , can potentially throw a WCE. This happens because after an initial WCE A causes the query stage to request a yield, the query restore state logic ultimately calls WT::search_near() to reposition the cursor that was unpositioned by WCE A and can throw a new WCE B. This is the testing-only failpoint to randomly throw artificial WCEs. Because any WT read in theory could throw a WCE.

However, I don't believe WT ever does throw a WCE for a read today. It only happens artificially in the MDB layer for testing purposes. And I don't believe query code today is prepared to deal with a WCE. And that is what is happening in our test suite artificially, and the motivation for this ticket. Therefore the options for this ticket are,

(a) we start supporting WCE for reads in the query layer.

On first thought for (a), we retry the WCE someplace in the query operation stack, another WCE retry loop around the query ::restoreState() calls, or inside said function call. In theory an operation can throw WCEs endlessly, but that's true today for all WCE handling.

(c) we modify our testing not to throw artificial WCEs in the query read path / for search_near() calls.

Comment by Fausto Leyva (Inactive) [ 29/Nov/22 ]

When this ticket is addressed, we should request QE for a code review. 

Generated at Thu Feb 08 06:19:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.