-
Type: Task
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Cache and Eviction
-
Storage Engines - Foundations
-
538.56
-
None
-
None
I am trying to create a DSI task in SERVER-119445 to do repeated failovers on mongod.
Sometimes a step-up hangs while committing the WUOW that inserts the step-up no-op entry into the oplog.
I have a patch with a repro and additional PALI/WT logs. The stack traces are in mongod.log for mongod.1. That log is attached to this ticket, and all artifacts can also be downloaded from the DSI artifacts.
I'm not sure whether this should go to WT or PALI, but here is an initial AI analysis of the stack trace:
Summary
5 SIGUSR2 dumps were captured. The step-up thread is stuck in the exact same place across all 5 dumps spanning ~82 seconds. It's still stuck at the end of the log (555+ seconds).
Timeline
| Time | Event |
|---|---|
| 00:43:09 | mongod starts (secondary, after restart) |
| 00:55:57 | Step-up WUOW commit begins on thread Disagg-12 (tid 17017) |
| 00:57:18 | 1st SIGUSR2 (81s into hang) |
| 00:57:39 | 2nd SIGUSR2 (102s) |
| 00:57:59 | 3rd SIGUSR2 (122s) |
| 00:58:20 | 4th SIGUSR2 (143s) |
| 00:58:40 | 5th SIGUSR2 (163s) |
| 01:05:12 | Log ends — still stuck at 555s (9+ minutes) |
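The elapsed figures in the timeline can be re-derived from the timestamps. A minimal sketch, assuming every timestamp falls on the same day and taking 00:55:57 (the start of the WUOW commit) as the start of the hang:

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Return whole seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


COMMIT_START = "00:55:57"
# The five SIGUSR2 dumps followed by the end of the log.
events = ["00:57:18", "00:57:39", "00:57:59", "00:58:20", "00:58:40", "01:05:12"]
offsets = [elapsed_seconds(COMMIT_START, t) for t in events]
print(offsets)  # [81, 102, 122, 143, 163, 555]

# Span covered by the five dumps (first to last).
print(elapsed_seconds(events[0], events[4]))  # 82
```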
The Stuck Thread
Thread Disagg-12 (tid 17017) — the step-up worker — has an identical stack in all 5 dumps:
finalizeStepUp
→ WriteUnitOfWork::commit
→ WiredTigerRecoveryUnit::doCommitUnitOfWork
→ __wt_txn_commit
→ __wti_evict_app_assist_worker ← STUCK HERE
→ __wt_cond_wait_signal
→ pthread_cond_timedwait
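The "identical stack in all 5 dumps" check can be done mechanically. The sketch below assumes the dumps have already been parsed into a mapping from thread id to a list of frame names. That structure and the function name are hypothetical and do not reflect the actual mongod log format.

```python
from typing import Dict, List

# Hypothetical parsed form of one dump: tid -> frames, outermost first.
Dump = Dict[int, List[str]]


def threads_with_identical_stacks(dumps: List[Dump]) -> List[int]:
    """Return tids that appear in every dump with exactly the same frames."""
    if not dumps:
        return []
    first = dumps[0]
    unchanged = []
    for tid, frames in first.items():
        if all(d.get(tid) == frames for d in dumps[1:]):
            unchanged.append(tid)
    return sorted(unchanged)


stuck_stack = [
    "finalizeStepUp",
    "WriteUnitOfWork::commit",
    "WiredTigerRecoveryUnit::doCommitUnitOfWork",
    "__wt_txn_commit",
    "__wti_evict_app_assist_worker",
    "__wt_cond_wait_signal",
    "pthread_cond_timedwait",
]
# tid 1 is an invented thread whose stack changes between dumps.
dumps = [{17017: stuck_stack, 1: [f"work_{i}"]} for i in range(5)]
print(threads_with_identical_stacks(dumps))  # [17017]
```

Idle threads also keep the same stack between dumps, so this is only a first-pass filter. The frames still have to be read to tell a stuck thread from an idle one.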
What This Means
During __wt_txn_commit(), WiredTiger detected cache pressure and forced the step-up thread to become an eviction assist worker — a mechanism where application threads help evict dirty pages from cache when the dedicated eviction threads can't keep up. The thread is then stuck waiting on a condition variable for eviction to make progress.
Why It Can't Make Progress
The critical observation is that the dedicated eviction server threads are idle (both eviction-ser 1 and eviction-ser 4 are sitting in their idle wait loops). Yet the app-assist thread is stuck waiting for eviction progress. This suggests:
The eviction servers have no pages they can evict that meet the current policy, while the app-assist thread is blocked waiting for cache levels to drop.
In a disagg setup, evicting dirty pages means writing them through PALI to page servers. If there's some condition preventing eviction writes (e.g., the Global X lock held by this same step-up thread is blocking something the PALI write path needs), you get a self-deadlock/livelock: the thread holds Global X → needs to commit → WT says "help with eviction first" → eviction needs something blocked by Global X → stuck forever.
Alternatively, if the eviction servers genuinely believe there's nothing to evict, but the cache pressure flag is still set from before, the app-assist worker will keep timing out and re-waiting on a condition that never resolves.
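The wait pattern described above can be illustrated with a toy model. This is not WiredTiger code: the class, its fields, and the bounded wait budget are all assumptions made so the example terminates. The thread in the stack traces has no such budget and keeps waiting.

```python
import threading


class ToyCache:
    """Toy stand-in for a cache with a dirty trigger; not WiredTiger code."""

    def __init__(self, dirty_pct: float, trigger_pct: float) -> None:
        self.dirty_pct = dirty_pct
        self.trigger_pct = trigger_pct
        self.cond = threading.Condition()

    def evict(self, pct: float) -> None:
        """Simulate eviction progress and wake any waiting committer."""
        with self.cond:
            self.dirty_pct = max(0.0, self.dirty_pct - pct)
            self.cond.notify_all()

    def commit_with_assist(self, wait_s: float = 0.01, max_waits: int = 5) -> bool:
        """Return True once below the trigger, False if the wait budget runs out."""
        with self.cond:
            for _ in range(max_waits):
                if self.dirty_pct < self.trigger_pct:
                    return True
                # Timed wait, analogous to the pthread_cond_timedwait frame.
                self.cond.wait(timeout=wait_s)
            return self.dirty_pct < self.trigger_pct


# Nobody evicts: every timed wait expires and the commit never proceeds.
stalled = ToyCache(dirty_pct=25.0, trigger_pct=20.0)
print(stalled.commit_with_assist())  # False

# An evictor makes progress shortly after the commit starts waiting.
healthy = ToyCache(dirty_pct=25.0, trigger_pct=20.0)
evictor = threading.Timer(0.01, healthy.evict, args=(10.0,))
evictor.start()
print(healthy.commit_with_assist(wait_s=0.05, max_waits=20))
evictor.join()
```

The first case corresponds to both hypotheses in the analysis: whether eviction writes are blocked or the servers see nothing to evict, the waiter never observes the cache dropping below the trigger.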
Other Threads of Interest
- Disagg-12 (tid 17202): Just the monitoring thread sleeping and logging "WUOW commit still in progress" every 5s
- conn4/conn5: DSI probe connections — conn4 is in poll() waiting, conn5 is in recv() — these are harmless
Root Cause
This is a WiredTiger eviction assist livelock during step-up with disaggregated storage. The step-up thread:
- Holds the Global X lock (preventing all other DB operations)
- Tries to commit a WiredTiger transaction
- WT forces it into eviction assist due to cache pressure
- Eviction can't make progress (likely because PALI write path or some other subsystem needs something that the Global X lock is blocking, or the eviction servers have stalled for a disagg-specific reason)
- The thread is stuck indefinitely
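The suspected cycle can be written down as a wait-for graph and checked for a loop. In this sketch the node names are informal labels, and the edge from the PALI write path back to the Global X lock is the hypothesis from the analysis above, not something observed in the logs.

```python
from typing import Dict, List, Optional


def find_cycle(waits_for: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a node list (first == last), or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in waits_for.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(waits_for):
        if color.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


graph = {
    "step-up thread": ["eviction progress"],
    "eviction progress": ["PALI write path"],
    "PALI write path": ["Global X lock"],  # hypothesized edge, unconfirmed
    "Global X lock": ["step-up thread"],   # lock is held by the step-up thread
}
print(find_cycle(graph))
# ['step-up thread', 'eviction progress', 'PALI write path', 'Global X lock', 'step-up thread']
```

If the hypothesized edge is removed, the graph is acyclic and `find_cycle` returns `None`, which is why confirming or ruling out that dependency matters for the diagnosis.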
Suggested Fixes
- Exempt the step-up commit from eviction assist (WT_SESSION flag to skip app-assist during this critical commit)
- Ensure the cache is below the dirty trigger before starting finalizeStepUp
- Investigate why the PALI eviction write path stalls when Global X is held
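The second suggestion amounts to a pre-check on the dirty ratio. The sketch below makes several assumptions. It takes the dirty byte count and maximum cache size as inputs (for example from cache statistics). It uses a 20% trigger, which is WiredTiger's documented default for `eviction_dirty_trigger` and may not match this deployment. The function name is hypothetical, and the sketch does not show how such a check would be wired into finalizeStepUp.

```python
def below_dirty_trigger(
    dirty_bytes: int, cache_max_bytes: int, trigger_pct: float = 20.0
) -> bool:
    """Return True if the dirty share of the cache is strictly below the trigger."""
    if cache_max_bytes <= 0:
        raise ValueError("cache_max_bytes must be positive")
    dirty_pct = 100.0 * dirty_bytes / cache_max_bytes
    return dirty_pct < trigger_pct


MIB = 1024 * 1024
cache_max = 1024 * MIB  # illustrative 1 GiB cache
print(below_dirty_trigger(150 * MIB, cache_max))  # True  (about 14.6% dirty)
print(below_dirty_trigger(300 * MIB, cache_max))  # False (about 29.3% dirty)
```

A check like this only narrows the window. The cache can cross the trigger again between the check and the commit, so it complements exempting the commit from eviction assist rather than replacing it.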
-
depends on: SERVER-129649 Exempt step-up thread from eviction assist (Closed)