- Type: Improvement
- Resolution: Fixed
- Priority: Critical - P2
- Affects Version/s: None
- Component/s: Cache and Eviction, Layered Tables
- None
- Storage Engines, Storage Engines - Transactions
- SE Transactions - 2025-10-10
- 3
During internal testing of disaggregated storage, we have hit several issues where a standby node running in follower mode has stalled due to cache pressure.
The typical scenario is that oplog application inserts/updates records that land in the ingest tables on the follower node. Because the follower can't write to shared storage, this dirty data has to remain in cache until it can either be pruned (after picking up a new checkpoint) or be written into the shared table (after stepping up to become the primary/leader).
But we currently use the same cache eviction targets and triggers in follower mode and leader mode, even though we can't evict dirty data from a follower. This means that inserting records equal to 10% of the cache (the default update trigger) will cause a follower to stall – all of the oplog applier threads get pulled in to help with eviction, but since nothing can be evicted they get stuck, and the node effectively hangs.
This ticket is intended as a short-term fix to enable more standby testing. For now, we should not try to evict dirty or update content on a follower node. Not evicting risks filling (or overfilling) the cache with dirty data, so we should also add a failure mode where we panic if the cache is full of dirty data – better to have a clear failure with a clear cause than to have the system mysteriously hang.
Definition of done:
- Application threads are not used to help with dirty or update eviction when WiredTiger is in follower mode.
- This could be implemented by dynamically adjusting the trigger/target values when the system switches between follower and leader modes (see the sketch after this list), or by changing the checks for using application threads for eviction, or something else.
- There are no changes to clean eviction behavior. WT should still evict clean pages if the cache is full.
- If the cache has a large amount of dirty or update content (95%?), WT should log a clear message about the problem and panic.
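As a rough illustration of the "dynamically adjusting the trigger/target values" option, the sketch below uses the public WT_CONNECTION::reconfigure API together with the existing eviction_dirty_trigger / eviction_updates_trigger settings to raise the dirty and update triggers on step-down and restore the defaults on step-up. The role enum, function name, and the specific percentages are illustrative assumptions, not part of this ticket's design.

```c
/*
 * Sketch only: keep application (oplog applier) threads out of dirty/update
 * eviction on a follower by reconfiguring the eviction triggers whenever the
 * node changes role. eviction_dirty_trigger and eviction_updates_trigger are
 * existing WiredTiger configuration knobs (interpreted as a percentage of the
 * cache when <= 100); the role enum and the 95% value are assumptions.
 */
#include <wiredtiger.h>

typedef enum { NODE_FOLLOWER, NODE_LEADER } node_role_t;

static int
set_eviction_triggers_for_role(WT_CONNECTION *conn, node_role_t role)
{
    /*
     * Follower: push the dirty and update triggers near the top of the cache
     * so dirty/update eviction is effectively never requested (clean eviction
     * is unaffected). Leader: restore the stock defaults (dirty trigger 20%,
     * updates trigger 10% of cache).
     */
    const char *cfg = (role == NODE_FOLLOWER) ?
      "eviction_dirty_trigger=95,eviction_updates_trigger=95" :
      "eviction_dirty_trigger=20,eviction_updates_trigger=10";

    return (conn->reconfigure(conn, cfg));
}
```

Raising the triggers like this only covers the first bullet above; the ~95% dirty-content panic would still need a check inside the engine (for example in the eviction server), since there is no public API for that behaviour.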
Note: the long-term fix here isn't obvious – there are a variety of ways we could relieve pressure on the follower – but they also have downstream consequences for things like failover time and the efficiency of ingest draining during checkpoint pickup. Hence this ticket, to allow more testing while we consider a more holistic solution.
- is related to:
  - WT-15192 Incorrect comparison between local table and metadata checkpoint orders during pruning (In Code Review)
  - WT-15608 Aggregated timestamp validation can fail with a 0 timestamped page deleted structure (Closed)
  - WT-15596 Don't review obsolete time window for readonly btree (Closed)
  - WT-15616 dist/s_all pass with failed s_test_suite_no_executable check (Closed)
  - WT-15626 Fix RTS verbose time window output (Closed)
  - WT-15622 Skip eviction walk on read only btrees if we are only look for dirty and updates data (Closed)
  - WT-10374 test_fops dirty leaf page count went negative (Open)
- related to:
  - WT-15041 Handle abandoned checkpoints in PALM (Closed)
  - WT-15534 Enable timestamp usage check for fast truncate on non-standalone build (Closed)
  - WT-15634 failed: s-outdated-fixmes on infrequent-checks [wiredtiger @ fbae136d] (Closed)