[SERVER-48364] Omit verifying the oplog as part of full validate. Created: 21/May/20  Updated: 29/Oct/23  Resolved: 28/May/20

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 4.4.0-rc8, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Daniel Gottlieb (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Documented
is documented by DOCS-13674 Investigate changes in SERVER-48364: ... Closed
Related
is related to SERVER-32704 sys-perf: Skip validating oplog as en... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Execution Team 2020-06-01
Participants:
Linked BF Score: 0

 Description   

Verifying the oplog as part of validate on a running node is a use case with low utility that incurs a non-trivial code maintenance cost. Because reading the oplog is no longer part of the lock hierarchy, it's the only collection where a "full" validate's collection MODE_X lock does not block readers. While WT's verify is running, readers that open a cursor on the oplog get an EBUSY. Handling that error case has been error-prone; it's turned into an exception that only code paths that can read the oplog need to handle. Moreover, it's easily forgotten that oplog readers need to handle it.
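
As a rough illustration of the proposed behavior (not the server's actual code), the decision could be sketched as below. The function name `should_run_storage_verify` is hypothetical; `local.oplog.rs` is the replica set oplog namespace.

```python
# Hypothetical sketch of the gating logic this ticket proposes.
# The function name is an assumption for illustration only.

OPLOG_NAMESPACE = "local.oplog.rs"


def should_run_storage_verify(namespace: str, full: bool) -> bool:
    """Return True when validate should also run the storage engine's verify."""
    if not full:
        # Only a "full" validate asks the storage engine to verify the table.
        return False
    if namespace == OPLOG_NAMESPACE:
        # Oplog readers are not blocked by the collection MODE_X lock, so a
        # verify here would surface EBUSY to them. Skip it.
        return False
    return True
```

Under this sketch, a full validate of the oplog would still run the remaining checks (such as BSON validation) and only omit the verify step.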

Some arguments, for posterity, on why verify on the oplog specifically is considered low-value:

  • Validation with the intent of performing a `verify` already fails to `verify` in many cases.
    • WT has constraints about the state of a table on whether `verify` can proceed. In testing, we often see `verify` choosing to not run.
    • Once a `verify` returns an EBUSY, it's likely that follow-up `verify`s will also return an EBUSY. It's unclear if customers can get out of this loop without restarting.
  • It's unclear what corruption a `verify` uncovers that a reader (that is accessing the bad data) would not.
  • The integrity of the oplog is already covered in testing. Secondaries replicate via the oplog. Any error on a node's oplog has a chance of being replicated and caught in testing.
  • The oplog naturally rolls over its data, presumably destroying any corruption a `verify` could capture with it.
  • Validating the oplog will continue to error on BSON errors (and to be pedantic, presumably check data/index consistency, if users manage to create an index against the oplog).
  • AFAIK tests have never caught a `verify` error (on the oplog or any other table). The errors we have seen on the oplog are due to visibility and durability contracts across crashes. None of those bugs were ever traced back to something `verify` would detect.
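
The first point above, that a validate intending to `verify` often ends up not verifying, can be pictured with a minimal sketch. Everything here is an assumption for illustration: `attempt_verify` and its callback are hypothetical names, and EBUSY is modeled as an `OSError` carrying `errno.EBUSY` rather than the storage engine's real error type.

```python
import errno


def attempt_verify(verify_fn) -> str:
    """Run a verify callback, treating EBUSY as "verify did not run".

    verify_fn is a hypothetical callable standing in for the storage
    engine's verify. It is assumed to raise OSError(errno.EBUSY, ...)
    when the table is in a state where verify cannot proceed.
    """
    try:
        verify_fn()
    except OSError as exc:
        if exc.errno == errno.EBUSY:
            # The table is busy: skip verify and let the rest of
            # validation continue without it.
            return "skipped"
        # Any other failure is a real error and is propagated.
        raise
    return "verified"
```

In this model a busy table yields "skipped" on each attempt, which mirrors the observation that follow-up `verify`s are likely to return EBUSY as well.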

Additionally, this ticket should remove the (improperly added) oplog collection from the list of things that testing should skip.



 Comments   
Comment by Githook User [ 01/Jun/20 ]

Author:

{'name': 'Daniel Gottlieb', 'email': 'daniel.gottlieb@mongodb.com', 'username': 'dgottlieb'}

Message: SERVER-48364: Omit verifying the oplog as part of the validate command.

(cherry picked from commit 84e88105afcb373f8a5653c1b294df44b270d305)
Branch: v4.4
https://github.com/mongodb/mongo/commit/c5b3d67564bac46da9f9876066c78127405477df

Comment by Githook User [ 28/May/20 ]

Author:

{'name': 'Daniel Gottlieb', 'email': 'daniel.gottlieb@mongodb.com', 'username': 'dgottlieb'}

Message: SERVER-48364: Omit verifying the oplog as part of the validate command.
Branch: master
https://github.com/mongodb/mongo/commit/84e88105afcb373f8a5653c1b294df44b270d305

Generated at Thu Feb 08 05:16:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.