Disable pre-images removal during data consistency check


    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: Change streams
    • None
    • Query Execution
    • Fully Compatible
    • QE 2026-03-30

      This change temporarily disables change streams pre-images removal while the replica set data consistency check in `checkDBHash` is executing. This is necessary so that the contents of the change streams pre-images collection `config.system.preimages` do not change while the consistency check is executing.
      The solution works as follows:
      1. Check whether the replica set uses replicated truncates for deletions. The consistency check only needs protection in this case. If replicated truncates are not used, all nodes in the replica set execute pre-images removal independently of each other, and inconsistencies between nodes are allowed.
      2. Turn off the pre-images removal job on the replica set primary. The next time the removal job is entered, it exits immediately without removing any pre-images, which guards against future invocations of the job.
      3. Wait until any currently executing pre-images removal job has finished. This guards the consistency check against a removal job that started before the job was disabled and is still running.
      4. Execute the data consistency check.
      5. Re-enable the pre-images removal job.
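
      The ordering of the five steps can be sketched as a small Python context manager. This is an illustration only, not the server or test-hook code: the four callables are hypothetical stand-ins for whatever the real hook uses, and re-enabling the job in a `finally` block is an assumption of this sketch rather than something stated in this ticket.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def preimage_removal_paused(
    uses_replicated_truncates: Callable[[], bool],  # hypothetical probe
    disable_removal_job: Callable[[], None],        # hypothetical
    wait_for_running_job: Callable[[], None],       # hypothetical
    enable_removal_job: Callable[[], None],         # hypothetical
) -> Iterator[None]:
    """Pause pre-images removal for the duration of the `with` body."""
    # Step 1: without replicated truncates, nodes remove pre-images
    # independently and differences are allowed, so nothing is paused.
    if not uses_replicated_truncates():
        yield
        return
    # Step 2: make future invocations of the removal job exit immediately.
    disable_removal_job()
    try:
        # Step 3: wait out a job that started before it was disabled.
        wait_for_running_job()
        # Step 4: the caller runs the consistency check in the `with` body.
        yield
    finally:
        # Step 5: re-enable removal (assumption: also when the check fails).
        enable_removal_job()


# Usage with recording stubs, to make the ordering visible.
events: List[str] = []
with preimage_removal_paused(
    uses_replicated_truncates=lambda: True,
    disable_removal_job=lambda: events.append("disable"),
    wait_for_running_job=lambda: events.append("wait"),
    enable_removal_job=lambda: events.append("enable"),
):
    events.append("check")
```

      With the stubs above, `events` ends up recording the disable, wait, check, and enable steps in that order. When the probe reports that replicated truncates are not used, only the check runs.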

      Guarding against a currently executing pre-images removal job is achieved via a new fail point that acts as an invocation counter. This fail point is always enabled when test commands are enabled. The fail point's `timesEntered` value is unconditionally incremented when the pre-images removal job code is entered, and incremented again when the job code is exited. The `timesEntered` value of the fail point therefore indicates whether the pre-images removal job is currently executing:

      • timesEntered % 2 == 0: job is currently not executing.
      • timesEntered % 2 == 1: job is currently executing.

      The consistency check waits for `timesEntered % 2 == 0` before executing the validation.
      This makes the data consistency check more robust and allows the removal of a special mechanism that was recently introduced as a stopgap measure.
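
      The parity rule and the wait can be illustrated with a short Python polling loop. This is a sketch under assumptions: `read_times_entered` is a hypothetical callable that returns the fail point's current `timesEntered` value (how that value is fetched is not covered here), and the timeout and poll interval are illustrative choices that do not come from this ticket.

```python
import time
from typing import Callable


def removal_job_is_running(times_entered: int) -> bool:
    """Odd count: the job was entered but has not exited yet."""
    return times_entered % 2 == 1


def wait_until_removal_job_idle(
    read_times_entered: Callable[[], int],  # hypothetical reader
    timeout_s: float = 30.0,                # illustrative default
    poll_interval_s: float = 0.1,           # illustrative default
) -> int:
    """Poll until `timesEntered` is even, then return the observed value."""
    deadline = time.monotonic() + timeout_s
    while True:
        times_entered = read_times_entered()
        if not removal_job_is_running(times_entered):
            return times_entered
        if time.monotonic() >= deadline:
            raise TimeoutError("pre-images removal job is still running")
        time.sleep(poll_interval_s)


# Usage: the job is mid-run for two reads (3), then exits (4).
readings = iter([3, 3, 4])
observed = wait_until_removal_job_idle(
    lambda: next(readings), poll_interval_s=0.0
)
```

      The loop is only safe to rely on after the removal job has been disabled. Otherwise a new invocation could start right after an even value is observed.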

            Assignee:
            Jan Steemann
            Reporter:
            Jan Steemann