[SERVER-61116] Audit and add assertions against using multiple WT_SESSIONs on the same thread Created: 29/Oct/21 Updated: 27/Feb/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Louis Williams | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | techdebt |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Assigned Teams: | Storage Execution |
| Sprint: | Execution Team 2022-02-21, Execution Team 2022-03-07 |
| Participants: | |
| Linked BF Score: | 20 |
| Description |
|
To guarantee cache eviction progress, WiredTiger requires MongoDB to use one WT_SESSION per thread. This is because a transaction can only be rolled back for eviction when an API call is made into its session. When each thread has only one session, WiredTiger can guarantee forward eviction progress without blocking: every operation eventually makes an API call, which allows it to be rolled back if it is blocking eviction. Using more than one session per thread risks the following deadlock:
Edit: my previous claim that this is only a problem with read-only sessions is incorrect. Every session that wishes to write must first open a cursor, which involves a cache eviction check. So the deadlock scenario is still possible. We should audit the code and add assertions that an operation in a WriteUnitOfWork (i.e. a write transaction) cannot open any new sessions. For cases where this happens, we should either find a way to stop doing so, or set the "cache_max_wait_ms" option so that the operation can time out.
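As a rough illustration of the kind of assertion proposed above, the following is a minimal sketch, not MongoDB or WiredTiger code. `liveSessionsOnThisThread` and `ScopedSessionGuard` are hypothetical names invented for this example; the guard stands in for wherever a WT_SESSION would actually be acquired. It throws so that the behaviour can be exercised in isolation, whereas the server would more likely use an invariant.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical per-thread count of live storage sessions.
// Assumption for illustration only; not a MongoDB or WiredTiger API.
thread_local int liveSessionsOnThisThread = 0;

// Hypothetical RAII guard standing in for acquiring one WT_SESSION.
// Constructing a second guard on the same thread while the first is
// still alive is rejected.
class ScopedSessionGuard {
public:
    ScopedSessionGuard() {
        if (liveSessionsOnThisThread > 0) {
            // The count is left unchanged when the check fails.
            throw std::logic_error("thread already holds a session");
        }
        ++liveSessionsOnThisThread;
    }

    ~ScopedSessionGuard() {
        --liveSessionsOnThisThread;
    }

    // One guard corresponds to exactly one session, so copying is disabled.
    ScopedSessionGuard(const ScopedSessionGuard&) = delete;
    ScopedSessionGuard& operator=(const ScopedSessionGuard&) = delete;
};
```

Because the counter is `thread_local`, sessions held by other threads do not affect the check; only a second acquisition on the same thread is flagged.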
|
| Comments |
| Comment by Josef Ahmad [ 24/Mar/22 ] |
|
I've forked the getLatestOplogTimestamp investigation into |
| Comment by Josef Ahmad [ 17/Mar/22 ] |
|
An addition regarding the last bullet point: daniel.gottlieb ran an experiment which didn't seem to detect any call site of getLatestOplogTimestamp with an open transaction. We should investigate whether we actually need to call getLatestOplogTimestamp for write operations, as in principle writes assign a timestamp to the WUOW, so it should be sufficient for the caller to do any waiting for that time. |
| Comment by Josef Ahmad [ 14/Mar/22 ] |
|
Summary of the investigation.
|
| Comment by Josef Ahmad [ 11/Feb/22 ] |
|
One complication is that some features, like prepared transactions, are designed to hold more than one WT session per thread at a given time. As daniel.gottlieb pointed out to me, we allow changing the recovery unit through the OperationContext API, which can be used to interleave multiple transactions. Enabling "cache_max_wait_ms" timeouts in these cases could turn out to be quite involved, and could introduce new failure modes. We can probably address the simpler cases in the context of this ticket by specifying the "cache_max_wait_ms" timeout, and defer the rest to another ticket. Also, it mightn't be straightforward to add an invariant at this time, due to the number of exceptions to the rule we wish to enforce. |
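For the simpler cases mentioned above, one possible shape is to request a bounded cache wait only for sessions opened while the thread already holds another. This is a sketch under assumptions: `sessionConfigFor` is a hypothetical helper, and it assumes "cache_max_wait_ms" is accepted in the configuration string passed when the session is opened or reconfigured. It only builds the string and makes no WiredTiger calls.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the configuration string to use for a new
// session. A session opened while the thread already holds one gets a
// bounded cache wait so that it can time out instead of blocking forever.
std::string sessionConfigFor(bool threadAlreadyHoldsSession, int maxWaitMs) {
    if (!threadAlreadyHoldsSession) {
        // First session on this thread: no extra configuration.
        return "";
    }
    return "cache_max_wait_ms=" + std::to_string(maxWaitMs);
}
```

The timeout value itself is a tuning decision this sketch does not address, and callers would still need to handle the error returned when the wait expires.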