Type: Task
Resolution: Unresolved
Priority: Major - P3
None
Affects Version/s: None
Component/s: None
Storage Engines, Storage Engines - Persistence
SE Persistence backlog
None
In disaggregated storage (disagg), WiredTiger (WT) pages can be corrupted at several points. Today, such corruption could go unnoticed until a standby or replica attempts to read the page. Although the corrupted data can ultimately be reconstructed from the oplog, a long gap between corruption and detection could have significant operational impact.
Potential sources of WT page corruption include:
- Primary sending corrupted pages to the LogServer: Pages may already be corrupted on the primary before they are sent to the LogServer.
- Primary encryption errors: Pages may be encrypted incorrectly by the primary node, using either an incorrect Key Encryption Key (KEK) or an incorrect Data Encryption Key (DEK).
- PageMaterializer corruption: The PageMaterializer may inadvertently corrupt the page while sending it to the PageServer.
- PageServer corruption: Pages may become corrupted at the PageServer itself.
Proposal
To gain early signals about potential page corruption issues, the primary node could occasionally read back a sample of pages and validate their integrity. This proactive approach would help detect anomalies earlier in the lifecycle, reducing the chances of undetected corruption propagating to replicas. Implementing periodic readbacks and validation would supplement existing monitoring and help catch issues closer to the source.
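The readback idea could be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: it assumes the primary keeps a checksum for each page it writes (computed as early as possible, before encryption and transfer) and has some way to read a page back from the PageServer. The names `PageRecord`, `page_checksum`, `sample_and_validate`, `read_page`, and `sample_rate` are all hypothetical, and SHA-256 stands in for whatever checksum the real system would use.

```python
import hashlib
import random
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical bookkeeping entry kept by the primary for a written page."""
    page_id: int
    checksum: str  # hex digest of the page bytes as the primary produced them


def page_checksum(data: bytes) -> str:
    # SHA-256 is an assumption here; any strong checksum would do.
    return hashlib.sha256(data).hexdigest()


def sample_and_validate(
    records: Iterable[PageRecord],
    read_page: Callable[[int], Optional[bytes]],
    sample_rate: float,
    rng: random.Random,
) -> List[int]:
    """Read back a random sample of pages and return the ids that fail validation.

    `read_page` is a hypothetical callback that fetches the decrypted page
    from the PageServer, returning None if the page cannot be read.
    """
    failed = []
    for rec in records:
        # Sample each page independently with probability `sample_rate`.
        if rng.random() >= sample_rate:
            continue
        data = read_page(rec.page_id)
        # A missing page or a checksum mismatch both count as failures.
        if data is None or page_checksum(data) != rec.checksum:
            failed.append(rec.page_id)
    return failed


# Usage: an in-memory dict stands in for the PageServer.
page_server = {i: bytes([i]) * 16 for i in range(10)}
records = [PageRecord(i, page_checksum(d)) for i, d in page_server.items()]

page_server[7] = b"\x00" * 16  # simulate corruption somewhere after the primary

# sample_rate=1.0 checks every page; a real deployment would use a small rate.
print(sample_and_validate(records, page_server.get, 1.0, random.Random(0)))
```

A mismatch in this sketch only says that the page changed somewhere between the primary and the PageServer; it does not identify which of the components listed above introduced the corruption.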
Related to: SERVER-114521 [DS] Metric to Track Failed Requests to PageServer (Blocked)