Major - P3
Storage - Ra 2022-01-24, Storage - Ra 2022-02-07
As per the comments on BF-23823, WiredTiger can hit a cache-stuck scenario via the following steps:
1. A transaction starts and writes one or more values that together are larger than the cache. The page holding them is not yet pulled into eviction.
2. The transaction calls prepare.
3. The page associated with the values is evicted, as it is too large for the cache.
4. Upon committing the prepared transaction, WiredTiger attempts to resolve the updates. To do so it reads the page back into cache, which again puts the cache over its limit. The page is then pulled back into eviction, but the application thread cannot evict its own updates because they still appear to be uncommitted.
5. The committing thread spins trying to evict the page it read in, which causes the hang.
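The deadlock in the steps above can be sketched with a toy model. This is illustrative Python only, not WiredTiger code: the class names, sizes, and the `try_evict` rule ("eviction cannot remove updates belonging to a not-yet-committed transaction") are assumptions made to mirror the described behaviour, not the real eviction logic.

```python
# Toy model of the cache-stuck scenario: a transaction whose updates
# exceed the cache cannot evict its own page while resolving a prepared
# commit, because its updates still look uncommitted to eviction.
CACHE_LIMIT = 100  # hypothetical cache size, in arbitrary units

class Txn:
    def __init__(self):
        self.prepared = False
        self.committed = False

class Cache:
    def __init__(self):
        self.pages = {}  # page_id -> (size, owning txn)

    def over_limit(self):
        return sum(size for size, _ in self.pages.values()) > CACHE_LIMIT

    def try_evict(self, page_id):
        size, owner = self.pages[page_id]
        # Eviction cannot remove updates that appear uncommitted -- the
        # prepared-but-unresolved transaction's updates fall in this bucket.
        if owner is not None and not owner.committed:
            return False
        del self.pages[page_id]
        return True

def reproduce(max_spins=10):
    cache, txn = Cache(), Txn()
    # Step 1: the transaction writes a value larger than the cache.
    cache.pages["page0"] = (150, txn)
    # Step 2: the transaction prepares.
    txn.prepared = True
    # Step 3: the page is forcibly evicted (too large for cache).
    del cache.pages["page0"]
    # Step 4: resolving the prepared commit reads the page back in,
    # putting the cache over its limit again.
    cache.pages["page0"] = (150, txn)
    # Step 5: the committing thread spins trying to evict its own page;
    # max_spins stands in for the real code's unbounded loop.
    spins = 0
    while cache.over_limit() and spins < max_spins:
        if cache.try_evict("page0"):
            break
        spins += 1  # txn.committed is still False, so eviction keeps failing
    return spins

print(reproduce())  # → 10: every eviction attempt fails until the cap
```

In the real system there is no spin cap, which is why the thread hangs indefinitely rather than giving up.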
WT-8643_replicator.py has been attached to this ticket.
We should investigate whether it is necessary to evict pages re-instantiated by the application thread when committing a prepared transaction, whether making that change would negatively impact WiredTiger behaviour, and whether it needs to be backported to older releases.
Does this affect any team outside of WT?
Yes, this can trigger cache-stuck scenarios in mongod.
How likely is it that this use case or problem will occur?
If the problem does occur, what are the consequences and how severe are they?
This can lead to cache-stuck issues where mongod hangs indefinitely unless operation_timeout_ms is set.
Is this issue urgent?
Acceptance Criteria (Definition of Done)
Determine whether the above change can be made, and if so:
- Make the change
- The provided replicator no longer hits a cache-stuck scenario when run.
- The change is also backported to all required branches.
Use the attached replicator.