[SERVER-61116] Audit and add assertions against using multiple WT_SESSIONs on the same thread
Created: 29/Oct/21  Updated: 27/Feb/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Louis Williams Assignee: Backlog - Storage Execution Team
Resolution: Unresolved Votes: 0
Labels: techdebt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to WT-8245 Fix eviction hang during importCollec... Closed
related to WT-9330 Add observability on the last thread ... Open
related to SERVER-64856 Explore reusing the caller's WT_SESSI... Closed
is related to SERVER-67514 SizeStorer load() can get stuck in pa... Closed
is related to SERVER-61097 SizeStorer can cause deadlocks with c... Closed
is related to SERVER-62650 RecordStore RecordId initialization c... Closed
is related to WT-9035 Asynchronously roll back transactions... Closed
is related to WT-7203 Add WT diagnostic mode test for confl... Backlog
is related to WT-8864 Document operations that can timeout ... Closed
Assigned Teams:
Storage Execution
Sprint: Execution Team 2022-02-21, Execution Team 2022-03-07
Participants:
Linked BF Score: 20

 Description   

In order to guarantee cache eviction progress, WiredTiger requires MongoDB to use one WT_SESSION per thread. This is because transactions can only be rolled back for eviction when API calls are made into the session. When each thread has only one session, WiredTiger can guarantee forward eviction progress without blocking, because all operations will eventually make API calls, which allows them to be rolled back if they are blocking eviction.

Using more than one session per thread risks the following deadlock:

  • Operation writes using session S1
  • Operation reads using session S2
  • S2 is blocked on cache eviction. S1 is the oldest transaction that is pinning content, and therefore needs to be rolled back. Because this operation is not actively making calls into S1, we reach a deadlock
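
The scenario above can be restated as a small toy model. This is only a sketch of the reasoning, not WiredTiger or MongoDB code: ToySession and pinnerCanBeRolledBack are hypothetical names, and the model assumes that a thread blocked inside one session makes no API calls into any other session it owns.

```cpp
#include <vector>

// Toy model only; none of these names exist in WiredTiger or MongoDB.
struct ToySession {
    int ownerThread;         // thread that owns this session
    bool pinsOldestContent;  // session holds the oldest, pinning transaction
    bool blockedOnEviction;  // session is waiting for cache space inside an API call
};

// Returns false when some pinning session cannot be rolled back because its
// owning thread is stuck waiting on eviction in a *different* session, so it
// will never make another API call into the pinning session.
bool pinnerCanBeRolledBack(const std::vector<ToySession>& sessions) {
    for (const auto& pinner : sessions) {
        if (!pinner.pinsOldestContent) {
            continue;
        }
        for (const auto& other : sessions) {
            const bool sameThread = other.ownerThread == pinner.ownerThread;
            if (&other != &pinner && sameThread && other.blockedOnEviction) {
                return false;  // S1 pins, S2 on the same thread is blocked
            }
        }
    }
    return true;
}
```

With S1 and S2 on the same thread the function returns false, which corresponds to the deadlock. If S2 belongs to a different thread it returns true, because the thread that owns S1 is still free to call into S1.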

Note that this is only a problem when a read-only session is used while also holding onto a session that has performed writes. This is not problematic with two writing sessions, two reading sessions, or when a writing session holds onto a read-only session.

Edit: my previous claim that this is only a problem with read-only sessions is incorrect. Every session that wishes to write must first open a cursor, which involves a cache eviction check. So the deadlock scenario is still possible.

We should audit and make assertions that an operation in a WriteUnitOfWork (i.e. a write transaction) cannot open any new sessions. For cases where this happens, we should find a way to stop doing so, or add the "cache_max_wait_ms" option to allow the operation to time out.
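
One possible shape for such an assertion is per-thread bookkeeping that is checked whenever a session is acquired. The sketch below is illustrative only: ThreadSessionState, ScopedWriteUnit and ScopedSession are hypothetical stand-ins rather than existing MongoDB classes, nested write units are not handled, and it throws instead of using an invariant so that the check can be exercised in a test.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical per-thread bookkeeping; not part of MongoDB.
struct ThreadSessionState {
    int openSessions = 0;            // sessions currently held by this thread
    bool inWriteUnitOfWork = false;  // true while a write transaction is open
};

inline ThreadSessionState& threadSessionState() {
    thread_local ThreadSessionState state;
    return state;
}

// RAII stand-in for a WriteUnitOfWork (no nesting support in this sketch).
class ScopedWriteUnit {
public:
    ScopedWriteUnit() { threadSessionState().inWriteUnitOfWork = true; }
    ~ScopedWriteUnit() { threadSessionState().inWriteUnitOfWork = false; }
    ScopedWriteUnit(const ScopedWriteUnit&) = delete;
    ScopedWriteUnit& operator=(const ScopedWriteUnit&) = delete;
};

// RAII stand-in for acquiring a WT_SESSION on the current thread.
class ScopedSession {
public:
    ScopedSession() {
        auto& state = threadSessionState();
        // The rule under discussion: no additional session while writing.
        if (state.inWriteUnitOfWork && state.openSessions > 0) {
            throw std::logic_error("extra session opened inside a write unit of work");
        }
        ++state.openSessions;
    }
    ~ScopedSession() {
        assert(threadSessionState().openSessions > 0);
        --threadSessionState().openSessions;
    }
    ScopedSession(const ScopedSession&) = delete;
    ScopedSession& operator=(const ScopedSession&) = delete;
};
```

Outside a write unit a thread may still hold several ScopedSession objects; only acquiring another one while a ScopedWriteUnit is alive and a session is already held is rejected. The call sites listed in the comments as exceptions would need either an exemption or the "cache_max_wait_ms" timeout instead.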

 



 Comments   
Comment by Josef Ahmad [ 24/Mar/22 ]

I've forked the getLatestOplogTimestamp investigation into SERVER-64856.

Comment by Josef Ahmad [ 17/Mar/22 ]

An addition regarding the last bullet point: daniel.gottlieb ran an experiment which didn't seem to detect any call site of getLatestOplogTimestamp with an open transaction. We should investigate whether we actually need to call getLatestOplogTimestamp for write operations, as in principle writes assign a timestamp to the WUOW, so it should be sufficient for the caller to do any waiting for that time.

Comment by Josef Ahmad [ 14/Mar/22 ]

Summary of the investigation.

  • The main follow-up item, as already described in the previous comment, is revisiting the ability to change the recovery unit from the OperationContext API, which is at odds with WiredTiger's requirement that MongoDB use one WT_SESSION per thread.
  • In a minority of other cases, we create a new session without swapping recovery units. Some examples are WiredTigerIndex::Create and WiredTigerKVEngine::createRecordStore. In these cases, setting a "cache_max_wait_ms" timeout on the additional session does not generally appear to be a viable mitigation, except maybe for specific scenarios like SERVER-62650. Per WT-8864, virtually any WiredTiger API call could return WT_CACHE_FULL. It doesn't seem viable to universally up-convert a cache read timeout to a write conflict, as that would at a minimum be surprising behaviour for operations that only expect to do reads, or operations that run in isolation by design (e.g. some startup and maintenance sequences).
  • As a special case of the previous bullet point, we've identified a possible way to avoid using a second session in getLatestOplogTimestamp. Because the oplog is a logged table, and because per WT-8601 logged tables are not timestamped, we would potentially be able to reuse the existing recovery unit's session. However, doing so would change some of the current visibility behaviour and include uncommitted writes on the operation context. We should determine whether there are consumers relying on this behaviour; it seems that waiting for write concern could potentially be affected.

 

Comment by Josef Ahmad [ 11/Feb/22 ]

One complication is that there are features that are designed to hold more than one WT session per thread at a given time, like prepared transactions. As daniel.gottlieb pointed out to me, we allow changing the recovery unit from the OperationContext API, which can be used to interleave multiple transactions. Enabling "cache_max_wait_ms" timeouts in these cases could turn out to be quite involved, and introduce new failure modes.

We can probably address the simpler cases in the context of this ticket by specifying the "cache_max_wait_ms" timeout, and defer the rest to another ticket. Also, it mightn't be straightforward to add an invariant at this time, due to the number of exceptions to the rule we wish to enforce.

Generated at Thu Feb 08 05:51:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.