[DS] Readback and validate WT pages


    • Storage Engines, Storage Engines - Persistence
    • SE Persistence backlog
    • None

      In disagg, WT pages can be corrupted at several points. Today, such corruption can go unnoticed until a standby or replica attempts to read the page. Although the corrupted data can ultimately be reconstructed from the oplog, the delay between corruption and detection could have significant operational impact.

      Potential sources of WT page corruption include:

      1. Primary sending corrupted pages to the LogServer: Pages may already be corrupted on the primary before they are sent to the LogServer.
      2. Primary encryption errors: The primary node may encrypt pages incorrectly, using either an incorrect Key Encryption Key (KEK) or an incorrect Data Encryption Key (DEK).
      3. PageMaterializer corruption: The PageMaterializer may inadvertently corrupt the page while sending it to the PageServer.
      4. PageServer corruption: Pages may become corrupted at the PageServer itself.

      Proposal

      To get an early signal of page corruption, the primary node could periodically read back a sample of pages and validate their integrity. This would supplement existing monitoring and catch anomalies closer to their source, reducing the chance that undetected corruption propagates to replicas.
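
      The readback idea can be sketched roughly as follows. This is an illustration only, not an existing WT or server API: `page_digest`, `sample_and_validate`, and the `read_back` callback are hypothetical names, SHA-256 stands in for whatever integrity check the page format actually provides, and the sketch assumes the primary kept a digest for each page when it wrote the page.

```python
import hashlib
import random
from typing import Callable, Dict, List, Optional


def page_digest(page_bytes: bytes) -> str:
    """Digest of a page image (SHA-256 is a stand-in for the real check)."""
    return hashlib.sha256(page_bytes).hexdigest()


def sample_and_validate(
    expected: Dict[int, str],
    read_back: Callable[[int], Optional[bytes]],
    sample_size: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Read back a random sample of pages and return the ids that fail.

    expected:   page id -> digest recorded when the page was written
                (assumed to be tracked by the primary).
    read_back:  hypothetical callback that fetches the stored page bytes,
                returning None if the page cannot be read.
    """
    rng = rng or random.Random()
    page_ids = sorted(expected)
    chosen = rng.sample(page_ids, min(sample_size, len(page_ids)))

    failed = []
    for page_id in chosen:
        data = read_back(page_id)
        # A missing page and a digest mismatch are both reported as failures.
        if data is None or page_digest(data) != expected[page_id]:
            failed.append(page_id)
    return sorted(failed)


if __name__ == "__main__":
    # In-memory stand-in for the remote page store.
    pages = {pid: f"page-{pid}".encode() for pid in range(8)}
    expected = {pid: page_digest(data) for pid, data in pages.items()}
    pages[3] = b"garbage"  # simulate corruption after the write
    print(sample_and_validate(expected, pages.get, sample_size=8))  # prints [3]
```

      Sampling keeps the extra read load bounded; the sample size and readback interval would need tuning against how quickly corruption should be detected.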

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Ernesto Rodriguez Reina