  1. Documentation
  2. DOCS-13674

Investigate changes in SERVER-48364: Omit verifying the oplog as part of full validate.

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4.0-rc8, 4.7.0
    • Component/s: manual, Server
    • Labels:
      None

      Description

      Downstream Change Summary

      The docs for `validate` with the "full" option can note that in 4.4+, validating the oplog omits the more thorough WiredTiger checks (the WiredTiger `verify`).

      https://docs.mongodb.com/manual/reference/command/validate/

      For cloud and drivers: I don't expect this to cause any trouble, but the "warnings" field in a `validate` response can return a new string:
      `Skipping verification of the WiredTiger table for the oplog.`
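
A minimal sketch of how a client could recognize the new warning, assuming the `validate` response is a document with a "warnings" array of strings. The helper name `split_validate_warnings` and the sample response are hypothetical and only illustrate the check; the sample is hand-built, not captured server output.

```python
# The warning text quoted in this ticket.
OPLOG_SKIP_WARNING = "Skipping verification of the WiredTiger table for the oplog."


def split_validate_warnings(result):
    """Return (oplog_verify_skipped, other_warnings) for a validate response.

    `result` is assumed to be a dict-like validate response whose
    "warnings" field, when present, is a list of strings.
    """
    warnings = result.get("warnings", [])
    skipped = any(OPLOG_SKIP_WARNING in w for w in warnings)
    other = [w for w in warnings if OPLOG_SKIP_WARNING not in w]
    return skipped, other


# Hand-built stand-in for a response (illustrative values only).
sample = {
    "ns": "local.oplog.rs",
    "valid": True,
    "warnings": [OPLOG_SKIP_WARNING],
    "errors": [],
}

skipped, other = split_validate_warnings(sample)
print(skipped, other)

# With PyMongo and a reachable deployment (both assumptions), a real
# response could be fetched with something like:
#   result = client.local.command("validate", "oplog.rs", full=True)
```

Treating the string as an expected, informational warning lets tooling keep alerting on any other entries in "warnings".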

      Description of Linked Ticket

      Verifying the oplog as part of validate on a running node is a use case with low utility that incurs a non-trivial code maintenance cost. Because reading the oplog is no longer part of the lock hierarchy, the oplog is the only collection where a "full" validate's collection MODE_X lock does not block readers. While WT's verify runs, readers that open a cursor get an EBUSY. Handling that error case has been error-prone; it has been turned into an exception that only code paths that can read the oplog need to handle. Moreover, it's easily forgotten that oplog readers need to handle it.

      Some argument for posterity on why verify on the oplog specifically is considered low-value:

      • Validation with the intent of performing a `verify` already fails to `verify` in many cases.
        • WT places constraints on the state a table must be in before `verify` can proceed. In testing, we often see `verify` choose not to run.
        • Once a `verify` returns an EBUSY, it's likely that follow-up `verify`s will also return an EBUSY. It's unclear if customers can get out of this loop without restarting.
      • It's unclear what corruption a `verify` uncovers that a reader (that is accessing the bad data) would not.
      • The integrity of the oplog is already covered in testing. Secondaries replicate via the oplog. Any error on a node's oplog has a chance of being replicated and caught in testing.
      • The oplog naturally rolls over its data, presumably destroying any corruption a `verify` could capture along with it.
      • Validating the oplog will continue to error on BSON errors (and to be pedantic, presumably check data/index consistency, if users manage to create an index against the oplog).
      • AFAIK tests have never caught a `verify` error (on the oplog or any other table). The errors we have seen on the oplog are due to visibility and durability contracts across crashes. None of those bugs were ever traced back to something `verify` would detect.

      Additionally, this ticket should remove the (improperly added) oplog collection from things that testing should skip.

              People

              Assignee:
              kay.kim Kay Kim (Inactive)
              Reporter:
              backlog-server-pm Backlog - Core Eng Program Management Team