WiredTiger / WT-10551

Incremental backup may omit modified blocks

    • StorEng - Refinement Pipeline
    • v7.0, v6.3, v6.2, v6.0, v5.0, v4.4

      Issue Status as of May 23

      ISSUE DESCRIPTION
      WiredTiger represents the set of changes between uses of block-based incremental backup cursors as a bitmap, where each bit corresponds to an extent (16 MB by default) of the underlying file. WiredTiger saves this bitmap as a part of each checkpoint so that if the system restarts, or if the file is closed due to inactivity, and later reopened, WiredTiger can read back the bitmap to determine which parts of the file have changed and thus need to be included in a future incremental backup.
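The mapping from file offsets to bitmap bits can be illustrated with a small sketch. This is not WiredTiger's implementation: the names `EXTENT_BYTES`, `mark_modified`, and `extents_to_copy` are hypothetical, and the bitmap is modeled as a plain Python integer. Only the 16 MB default extent size comes from the description above.

```python
# Illustrative model only; names and representation are assumptions.
EXTENT_BYTES = 16 * 1024 * 1024  # one bit per 16 MB extent (default per the description)


def mark_modified(bitmap: int, offset: int, length: int) -> int:
    """Return the bitmap with a bit set for every extent touched by the write."""
    if length <= 0:
        return bitmap
    first = offset // EXTENT_BYTES
    last = (offset + length - 1) // EXTENT_BYTES
    for bit in range(first, last + 1):
        bitmap |= 1 << bit
    return bitmap


def extents_to_copy(bitmap: int) -> list[tuple[int, int]]:
    """List the (offset, length) ranges an incremental backup would need to copy."""
    ranges = []
    bit = 0
    while bitmap >> bit:
        if (bitmap >> bit) & 1:
            ranges.append((bit * EXTENT_BYTES, EXTENT_BYTES))
        bit += 1
    return ranges
```

A 4 KB write anywhere in the first 16 MB sets bit 0, so the whole first extent is copied by the next incremental backup.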

      WT-7524 introduced a bug where WiredTiger can open a file and mistakenly fail to load the incremental change bitmap for that file. In this scenario, WiredTiger assumes there have been no changes to the file and initializes a new, empty bitmap for that file. As a result, blocks that were modified before this error may not be included in a subsequent incremental backup. Updates after this error will be correctly included in the bitmap, and if those changes touch the same blocks that were lost, those blocks will be added back into the bitmap, inadvertently healing some or all of the damage.
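The failure mode and the accidental "healing" described above can be modeled with a toy simulation. The class `TrackedFile` and its methods are hypothetical and exist only to illustrate the sequence of events; the set of extent indexes stands in for the real bitmap.

```python
# Toy model of the described bug; not WiredTiger code.
class TrackedFile:
    def __init__(self):
        self.bitmap = set()        # extents modified since the last incremental backup
        self.checkpointed = set()  # copy of the bitmap saved with the last checkpoint

    def write(self, extent: int) -> None:
        self.bitmap.add(extent)

    def checkpoint(self) -> None:
        self.checkpointed = set(self.bitmap)

    def reopen(self, load_bitmap: bool = True) -> None:
        # load_bitmap=False models the bug: the saved bitmap is not read back
        # and a new, empty bitmap is used instead.
        self.bitmap = set(self.checkpointed) if load_bitmap else set()


f = TrackedFile()
f.write(1)
f.write(2)
f.checkpoint()
f.reopen(load_bitmap=False)  # bug: extents 1 and 2 are forgotten
f.write(2)                   # a later write to extent 2 "heals" that extent
f.write(5)
omitted = f.checkpointed - f.bitmap  # extents the next incremental backup would miss
```

In this run, extent 1 is the only one left out of the next incremental backup, because extent 2 was modified again after the reopen.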

      If a damaged backup is restored and used, WiredTiger’s internal checksum protection will detect any attempt to read data from the missing portion(s) of the file, causing MongoDB to crash.

      MONGODB IMPACT
      This issue can cause inconsistencies in the data files of incremental backups performed by Ops Manager and Cloud Manager clusters running MongoDB versions 4.4.8-4.4.21, 5.0.2-5.0.17, and 6.0.0-6.0.5.
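As a convenience, the affected ranges listed above can be encoded in a small check. This is a sketch under the assumption that the version string starts with `major.minor.patch` (any suffix after a hyphen is ignored); the names `AFFECTED_RANGES` and `is_affected` are hypothetical.

```python
# Affected patch ranges as stated in the impact section above.
AFFECTED_RANGES = {
    (4, 4): (8, 21),  # 4.4.8 through 4.4.21
    (5, 0): (2, 17),  # 5.0.2 through 5.0.17
    (6, 0): (0, 5),   # 6.0.0 through 6.0.5
}


def is_affected(version: str) -> bool:
    """Return True if the version falls inside one of the listed ranges."""
    core = version.split("-")[0]  # drop suffixes such as "-ent" or "-rc1"
    major, minor, patch = (int(p) for p in core.split(".")[:3])
    bounds = AFFECTED_RANGES.get((major, minor))
    return bounds is not None and bounds[0] <= patch <= bounds[1]
```

A version string with fewer than three numeric components raises `ValueError` rather than being guessed at.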

      When an affected incremental backup is restored, MongoDB crashes with checksum errors if the affected data is accessed. Full backups (performed weekly by default) are not affected. Backups performed by documented methods other than Ops Manager and Cloud Manager backups (https://www.mongodb.com/docs/manual/core/backups/) are not affected.

      DIAGNOSIS
      The issue can be diagnosed by the presence of checksum errors after restoring from an impacted incremental backup, such as:

      WT_CURSOR.next: __wt_block_read_off, 302: XXXX.wt: potential hardware corruption, read checksum error for 4096B block at offset XXXXX: block header checksum of XXXXX doesn't match expected checksum of XXXXX
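To search a log for this message, a pattern along the following lines may help. It is a sketch that assumes the message text matches the sample above; the function name `find_checksum_errors` is hypothetical, and real log lines may carry extra prefixes or JSON wrapping that this does not parse.

```python
import re

# Matches the file name, block size, and offset in the sample message above.
CHECKSUM_ERROR = re.compile(
    r"(?P<file>[^\s:]+\.wt): potential hardware corruption, "
    r"read checksum error for (?P<size>\d+)B block at offset (?P<offset>\d+)"
)


def find_checksum_errors(lines):
    """Return one dict per log line that reports a read checksum error."""
    hits = []
    for line in lines:
        match = CHECKSUM_ERROR.search(line)
        if match:
            hits.append({
                "file": match.group("file"),
                "size": int(match.group("size")),
                "offset": int(match.group("offset")),
            })
    return hits
```

The file name in each hit indicates whether a collection (collection-*.wt) or an index (index-*.wt) is affected.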
      

      WORKAROUND
      If a workaround is needed on MongoDB versions 4.4.8-4.4.21, 5.0.2-5.0.17, or 6.0.0-6.0.5, configure Ops Manager and Cloud Manager to take full backups only. Otherwise, upgrade to MongoDB version 4.4.22, 5.0.18, or 6.0.6.

      Upgrading to a fixed version of MongoDB does not correct existing backups. Incremental backups taken on an affected version may be corrupted and must be restored and validated to confirm correctness; see the Remediation section below.
      If you need to restore a backup taken on an affected version, strongly consider using the most recent full backup.

      REMEDIATION

      First, upgrade to a version containing the fix and perform a full backup. Incremental backups taken by Ops Manager and Cloud Manager after that point are safe.

      Any incremental backups taken while on a vulnerable MongoDB version could be affected, and the completeness of data in each backup is not guaranteed. Additional action is necessary as part of any restore process involving one of these backups. We recommend:

      1. If you need to restore a backup, consider the nearest full backup, as full backups are not affected by this issue.
      2. To check an incremental backup:
          a. Restore the backup to a new cluster. Do not restore an incremental backup on a live cluster until you confirm it is safe in the following steps.
          b. Run the validate command on all collections (the validate.js script is available to run validate iteratively on multiple databases/collections).
          c. If any collections or indexes fail validation due to checksum errors:
              • If any collections are affected (collection-*.wt), run MongoDB repair (mongod --repair). Repair resolves all checksum errors and removes the need to perform the next step.
              • If only indexes are affected (index-*.wt), drop and recreate the affected indexes. To determine the affected indexes, use the collStats command to find which data file each index maps to.
          d. If issues persist after running repair, consider restoring from the most recent full backup.
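The checking steps above can be sketched in Python. This is not the validate.js script mentioned above; it is a minimal illustration that assumes a PyMongo-style client object (`list_database_names`, `list_collection_names`, and `command` are PyMongo methods), and the helper names `validate_all` and `remediation_for` are hypothetical. Run it only against the newly restored cluster, never a live one.

```python
def validate_all(client):
    """Run validate on every collection; return (namespace, problem) for each failure."""
    failures = []
    for db_name in client.list_database_names():
        db = client[db_name]
        # Views cannot be validated, so only real collections are listed.
        for coll_name in db.list_collection_names(filter={"type": "collection"}):
            namespace = f"{db_name}.{coll_name}"
            try:
                result = db.command("validate", coll_name, full=True)
            except Exception as exc:  # the command itself may fail on damaged data
                failures.append((namespace, str(exc)))
                continue
            if not result.get("valid", False):
                failures.append((namespace, "; ".join(result.get("errors", []))))
    return failures


def remediation_for(damaged_files):
    """Map damaged data file names to the action described in the steps above."""
    if any(name.startswith("collection-") for name in damaged_files):
        return "repair"            # repair also covers any damaged indexes
    if any(name.startswith("index-") for name in damaged_files):
        return "rebuild-indexes"   # drop and recreate the affected indexes
    return "none"
```

With PyMongo installed, a call might look like `validate_all(MongoClient("mongodb://localhost:27017"))`, where the connection string is a placeholder for the restored cluster. An empty result means no collection reported a validation failure.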


            Assignee:
            sue.loverso@mongodb.com Susan LoVerso
            Reporter:
            sue.loverso@mongodb.com Susan LoVerso