-
Type: Bug
-
Resolution: Unresolved
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Reconciliation
-
Storage Engines, Storage Engines - Transactions
-
SE Transactions - 2025-06-06, SE Transactions - 2025-06-20, SE Transactions - 2025-07-04
Description:
We observed significant replication lag (lasting over an hour) in one of the help tickets. The issue resolved on its own once, but in most cases a manual host restart was required to stop the lag on the node. Lag of this kind is not acceptable in production environments.
Upon analysing the FTDC data, we found that the replication thread was stalled trying to access a page held under an exclusive lock by eviction, so the cause appears to be slow eviction. Based on the FTDC data and flame graphs collected over a 4-minute trace window, the eviction slowness appears to be caused by reconciliation moving updates to the history store (HS).
Problem:
Currently, we lack sufficient visibility into how reconciliation progresses while moving updates to the HS. This makes it difficult to diagnose and respond to performance issues such as replication lag when they occur.
Action Items:
- Define what diagnostic data should be collected (e.g., time spent, number of updates, retries).
- Add logging or expose new FTDC metrics.
- Evaluate the need for backporting.
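As a starting point for the first action item, the sketch below shows one possible shape for the diagnostic data (time spent, number of updates, retries). It is illustrative only: `HsReconcileStats`, `record_insert`, and `timed_insert` are hypothetical names, not existing WiredTiger or FTDC structures, and the real counters would presumably live in the storage engine's own statistics framework.

```python
import time
from dataclasses import dataclass


@dataclass
class HsReconcileStats:
    """Hypothetical counters for one reconciliation pass (not a real WiredTiger type)."""
    updates_moved: int = 0   # updates successfully written to the history store
    retries: int = 0         # extra attempts beyond the first, across all inserts
    failed: int = 0          # inserts that gave up after exhausting attempts
    hs_time_ns: int = 0      # total time spent in HS inserts
    max_insert_ns: int = 0   # slowest single successful HS insert

    def record_insert(self, elapsed_ns: int, attempts: int = 1) -> None:
        """Account for one successful HS insert."""
        self.updates_moved += 1
        self.retries += max(attempts - 1, 0)
        self.hs_time_ns += elapsed_ns
        self.max_insert_ns = max(self.max_insert_ns, elapsed_ns)

    def mean_insert_ns(self) -> float:
        """Average time per successful insert; 0.0 if nothing was moved."""
        return self.hs_time_ns / self.updates_moved if self.updates_moved else 0.0


def timed_insert(stats: HsReconcileStats, insert_fn, max_attempts: int = 3) -> bool:
    """Call insert_fn until it returns True, recording time and attempts.

    insert_fn stands in for whatever moves a single update to the HS.
    """
    start = time.monotonic_ns()
    attempts = 0
    ok = False
    while attempts < max_attempts and not ok:
        attempts += 1
        ok = bool(insert_fn())
    elapsed = time.monotonic_ns() - start
    if ok:
        stats.record_insert(elapsed, attempts)
    else:
        # Still charge the time and retries so stalls remain visible.
        stats.failed += 1
        stats.retries += attempts - 1
        stats.hs_time_ns += elapsed
    return ok
```

Counters of this kind could be sampled periodically or logged when a single reconciliation exceeds a time threshold. Which of the two is preferable depends on the overhead budget and is left open here.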