[SERVER-83186] Writers can get stuck in cache eviction without rolling back transactions in order to make progress
Created: 13/Nov/23
Updated: 14/Nov/23

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Gregory Noma
Assignee: Backlog - Storage Execution Team
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-64982 Extended lack of availability caused ... Blocked
related to SERVER-44870 MongoDB to allow WiredTiger to return... Backlog
related to SERVER-61251 Ensure long running storage engine op... Backlog
Assigned Teams:
Storage Execution
Operating System: ALL
Participants:
Linked BF Score: 0

 Description   

For a workload with many threads performing medium-sized writes (unlike SERVER-61909, where a single large transaction cannot fit in cache), we can end up in a scenario where eviction cannot make meaningful progress. This is especially likely when the workload includes prepared transactions, since those cannot be rolled back. The threads performing the writes are co-opted into performing eviction and cannot return until sufficient eviction progress is made, so essentially all of them can become stuck as a result.
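
The scenario described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not WiredTiger's eviction algorithm: the `ToyCache` class, its `trigger` fraction, and the `max_passes` bound are all hypothetical names invented for illustration. In the model, bytes written by uncommitted transactions are pinned and cannot be evicted, and a writer whose write pushes the cache over the trigger must evict before returning. Once only pinned bytes remain, eviction frees nothing and the writer makes no progress until some transaction is rolled back.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCache:
    """Toy model only (hypothetical); not WiredTiger's real eviction logic."""
    capacity: int
    trigger: float = 0.8            # assumed fraction above which writers must evict
    clean: int = 0                  # bytes that eviction is able to discard
    pinned: dict = field(default_factory=dict)  # txn id -> bytes held by an uncommitted txn

    def used(self) -> int:
        return self.clean + sum(self.pinned.values())

    def over_trigger(self) -> bool:
        return self.used() > self.capacity * self.trigger

    def evict(self) -> int:
        """Discard everything evictable; pinned bytes are untouched."""
        freed, self.clean = self.clean, 0
        return freed

    def write(self, txn: int, size: int, max_passes: int = 3) -> bool:
        """Return True if the writer could return, False if it is stuck.

        The bound on passes exists only so this example terminates; the
        ticket describes threads that cannot return at all.
        """
        self.pinned[txn] = self.pinned.get(txn, 0) + size
        for _ in range(max_passes):
            if not self.over_trigger():
                return True
            self.evict()
        return not self.over_trigger()

    def rollback(self, txn: int) -> None:
        """Rolling back a transaction releases its pinned bytes."""
        self.pinned.pop(txn, None)


cache = ToyCache(capacity=100, clean=30)
# Nine transactions each perform one medium-sized write.
results = [cache.write(txn, 10) for txn in range(9)]
stuck_before = cache.over_trigger()
cache.rollback(0)                   # only a rollback frees pinned space
stuck_after = cache.over_trigger()
```

In this model, a prepared transaction would correspond to an entry in `pinned` that `rollback` is not allowed to remove, which is why such workloads are harder to unblock.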



 Comments   
Comment by Louis Williams [ 13/Nov/23 ]

This is not the same problem as SERVER-51281, but it does have the same symptoms. It's a system-wide stall where all threads are stuck in cache eviction. SERVER-51281 was caused by the fact that we were using multiple WT sessions on the same thread, as described more generally by SERVER-61116.

This ticket does not describe a full stall that grinds the system to a halt. The system is still making progress, just very slowly. There is a set of problems that culminates in this behavior. First, once operations are stuck inside WiredTiger, we can't kill them. Second, our load-shedding in this state is very limited.
