Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.4.0-rc8, 4.7.0
Affects Version/s: None
Component/s: Storage
Labels: None
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4
Sprint: Execution Team 2020-06-01
Linked BF Score: 0
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Verifying the oplog as part of validate on a running node is a use case with low utility, and it incurs a non-trivial code maintenance cost. Because reading the oplog is no longer part of the lock hierarchy, it's the only collection where a "full" validate's collection MODE_X lock does not block readers. As a result, WT's verify causes readers that open a cursor to get an EBUSY. Handling that error case has been error-prone; it has turned into an exception that only code paths that can read the oplog need to handle. Moreover, it's easily forgotten that oplog readers need to handle it.

Some arguments, for posterity, on why verify on the oplog specifically is considered low-value:

- Validation with the intent of performing a `verify` already fails to `verify` in many cases.
  - WT has constraints on the state of a table that determine whether `verify` can proceed. In testing, we often see `verify` choosing to not run.
  - Once a `verify` returns an EBUSY, it's likely that follow-up `verify`s will also return an EBUSY. It's unclear if customers can get out of this loop without restarting.
- It's unclear what corruption a `verify` uncovers that a reader (that is accessing the bad data) would not.
- The integrity of the oplog is already covered in testing. Secondaries replicate via the oplog. Any error on a node's oplog has a chance of being replicated and caught in testing.
- The oplog naturally rolls over its data, presumably destroying along with it any corruption a `verify` could capture.
- Validating the oplog will continue to error on BSON errors (and to be pedantic, presumably check data/index consistency, if users manage to create an index against the oplog).
- AFAIK tests have never caught a `verify` error (on the oplog or any other table). The errors we have seen on the oplog are due to visibility and durability contracts across crashes. None of those bugs were ever traced back to something `verify` would detect.
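
To make the proposed behavior concrete, here is a minimal Python sketch of the decision described above: a "full" validate asks the storage engine to `verify` every collection except the oplog, while BSON and index checks still apply everywhere. This is not the server's implementation. `ValidatePlan` and `plan_validate` are hypothetical names invented for illustration; only the oplog namespace `local.oplog.rs` is taken from MongoDB itself.

```python
from dataclasses import dataclass

# The replica-set oplog namespace in MongoDB.
OPLOG_NAMESPACE = "local.oplog.rs"


@dataclass
class ValidatePlan:
    """Hypothetical summary of which validation steps would run."""
    check_bson: bool = True           # BSON checks still run on the oplog
    check_indexes: bool = True        # data/index consistency still applies
    run_storage_verify: bool = False  # storage-engine verify (WT verify)


def plan_validate(namespace: str, full: bool) -> ValidatePlan:
    """Hypothetical helper: decide which steps a validate call performs.

    Assumption: only a "full" validate requests a storage-engine verify,
    and the oplog is excluded so that concurrent oplog readers never hit
    EBUSY when opening a cursor.
    """
    plan = ValidatePlan()
    plan.run_storage_verify = full and namespace != OPLOG_NAMESPACE
    return plan
```

Under these assumptions, `plan_validate("local.oplog.rs", full=True)` keeps the BSON and index checks but leaves `run_storage_verify` off, so oplog readers would no longer need an EBUSY code path.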

Additionally, this ticket should remove the (improperly added) oplog collection from things that testing should skip.

is related to: SERVER-32704 sys-perf: Skip validating oplog as enabled by SERVER-32243 (Closed)

Assignee: Daniel Gottlieb (Inactive)
Reporter: Daniel Gottlieb (Inactive)
Participants: Daniel Gottlieb, Githook User
Votes: 0
Watchers: 2

Created: May 21 2020 04:12:32 PM UTC
Updated: Oct 29 2023 10:07:52 PM UTC
Resolved: May 28 2020 08:43:40 PM UTC
Confidence Status Last Update: 27/May/20 7:55 PM
