Core Server / SERVER-48364

Omit verifying the oplog as part of full validate.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4.0-rc8, 4.7.0
    • Component/s: Storage
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.4
    • Sprint:
      Execution Team 2020-06-01
    • Linked BF Score:
      0

      Description

      Verifying the oplog as part of validate on a running node is a low-utility use case that incurs a non-trivial code maintenance cost. Because reading the oplog is no longer part of the lock hierarchy, it's the only collection where a "full" validate's collection MODE_X lock does not block readers. WT's verify results in readers that open a cursor getting an EBUSY. Handling that error case has been error-prone; it has been turned into an exception that only code paths that can read the oplog need to handle. Moreover, it's easily forgotten that oplog readers need to handle it.

      Some arguments, for posterity, on why verify on the oplog specifically is considered low-value:

      • Validation with the intent of performing a `verify` already fails to `verify` in many cases.
        • WT constrains the table states in which `verify` can proceed. In testing, we often see `verify` choosing to not run.
        • Once a `verify` returns an EBUSY, it's likely that follow-up `verify`s will also return an EBUSY. It's unclear if customers can get out of this loop without restarting.
      • It's unclear what corruption a `verify` uncovers that a reader (that is accessing the bad data) would not.
      • The integrity of the oplog is already covered in testing. Secondaries replicate via the oplog. Any error on a node's oplog has a chance of being replicated and caught in testing.
      • The oplog naturally rolls over its data, presumably destroying any corruption a `verify` could capture with it.
      • Validating the oplog will continue to error on BSON errors (and to be pedantic, presumably check data/index consistency, if users manage to create an index against the oplog).
      • AFAIK tests have never caught a `verify` error (on the oplog or any other table). The errors we have seen on the oplog are due to visibility and durability contracts across crashes. None of those bugs were ever traced back to something `verify` would detect.

      Additionally, this ticket should remove the (improperly added) oplog collection from things that testing should skip.
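
The proposed behavior can be sketched as a small decision function. This is only an illustrative model, not the server's actual code: `plan_validate` and the check names are hypothetical, and it assumes the oplog is the `local.oplog.rs` namespace and that BSON and data/index consistency checks run on every validate.

```python
# Hypothetical model of which checks a validate pass would run.
# Assumption: the oplog lives in the "local.oplog.rs" namespace.
OPLOG_NAMESPACE = "local.oplog.rs"


def plan_validate(namespace: str, full: bool) -> dict:
    """Return the checks a validate of `namespace` would perform.

    The storage-level `verify` only runs for a "full" validate, and
    under this ticket's proposal it is omitted for the oplog.
    """
    is_oplog = namespace == OPLOG_NAMESPACE
    return {
        # Still performed on the oplog, per the description above.
        "bson_checks": True,
        "index_consistency": True,
        # Omitted for the oplog so readers never observe EBUSY from it.
        "storage_verify": full and not is_oplog,
    }
```

Under this model, a full validate of `local.oplog.rs` still reports BSON errors but never triggers the storage-level `verify`, so oplog readers would no longer need to handle the resulting EBUSY.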

              People

              Assignee:
              daniel.gottlieb Daniel Gottlieb
              Reporter:
              daniel.gottlieb Daniel Gottlieb