[SERVER-33641] Call checkOplogs when checkReplicatedDataHashes fails Created: 02/Mar/18  Updated: 29/Oct/23  Resolved: 22/May/18

Status: Closed
Project: Core Server
Component/s: Replication, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.0.0-rc1

Type: Improvement Priority: Major - P3
Reporter: Judah Schvimer Assignee: David Bradford (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-35403 Should not attempt to close a non-exi... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.0
Sprint: TIG 2018-05-21, TIG 2018-06-04
Participants:
Story Points: 3

 Description   

We should do the following so that the diagnostics we collect are more useful when investigating data inconsistency issues:

  1. Update ReplSetTest#stopSet() to call ReplSetTest#checkOplogs() in addition to ReplSetTest#checkReplicatedDataHashes(). Care should be taken to ensure that tests do not run significantly longer because they need to verify a large oplog when shutting down the replica set.
  2. Update the PeriodicKillSecondaries hook to run the CheckReplOplogs hook in addition to the CheckReplDBHash and ValidateCollections hooks.
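The first item could be modeled roughly as below. This is a minimal Python sketch, not the real ReplSetTest (which is JavaScript in the server's test library): `first_oplog_divergence`, `stop_set`, and the `max_entries` cap are hypothetical names introduced here for illustration. The cap is only one assumed way to keep shutdown from slowing down on a large oplog; the ticket does not say how that should be done.

```python
from itertools import islice, zip_longest


def first_oplog_divergence(oplogs, max_entries=None):
    """Compare per-node oplogs position by position and report the first mismatch.

    `oplogs` maps a node name to its list of oplog entries, newest first.
    `max_entries` (hypothetical) caps how many entries are compared per node;
    None means compare everything.
    Returns (position, {node: entry}) or None if no difference is found.
    In this toy model a node with a shorter oplog counts as divergent.
    """
    names = sorted(oplogs)
    columns = [islice(oplogs[name], max_entries) for name in names]
    for position, entries in enumerate(zip_longest(*columns)):
        if any(entry != entries[0] for entry in entries[1:]):
            return position, dict(zip(names, entries))
    return None


def stop_set(oplogs, data_hashes, max_entries=1000):
    """Model of a shutdown path: oplog check first, then data hash check."""
    divergence = first_oplog_divergence(oplogs, max_entries)
    if divergence is not None:
        position, entries = divergence
        raise AssertionError(f"oplogs differ at position {position}: {entries}")
    # `data_hashes` maps a node name to a stand-in for its dbhash result.
    if len(set(data_hashes.values())) > 1:
        raise AssertionError(f"data hashes differ: {data_hashes}")
    return "stopped"
```

Running the oplog comparison first means a divergence in the oplog is reported before the data hash mismatch it would likely cause.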
Original description

We now save all of the data files, but it would be great if the test could check the oplogs automatically and note any differences.



 Comments   
Comment by Githook User [ 22/May/18 ]

Author:

{'username': 'dbradf', 'name': 'David Bradford', 'email': 'david.bradford@mongodb.com'}

Message: SERVER-33641: Check oplogs in tests on stop repl set

(cherry picked from commit 018f33cb0e0f64880295b6d910060365c117a835)
Branch: v4.0
https://github.com/mongodb/mongo/commit/4f2b182876959f740ae257cf9394fbb091386720

Comment by Githook User [ 22/May/18 ]

Author:

{'username': 'dbradf', 'name': 'David Bradford', 'email': 'david.bradford@mongodb.com'}

Message: SERVER-33641: Check oplogs in tests on stop repl set
Branch: master
https://github.com/mongodb/mongo/commit/018f33cb0e0f64880295b6d910060365c117a835

Comment by Judah Schvimer [ 23/Apr/18 ]

We currently also do not check oplog consistency in the kill_secondaries passthrough or in ReplSetTest#stopSet().

Comment by Judah Schvimer [ 06/Mar/18 ]

I agree it's less necessary, but it certainly would still be helpful to not have to download the logs and set up the cluster again and rerun the checks myself.

Comment by Max Hirschhorn [ 06/Mar/18 ]

It sounds like maybe you really want to have a mode for running all of the data consistency checks and getting all of their output. Does this request become less relevant if we were to archive the data files any time ReplSetTest#checkOplogs() or ReplSetTest#checkReplicatedDataHashes() fails and not just when they are called by resmoke.py's CheckReplOplogs and CheckReplDBHash hooks, respectively? I imagine there's still some value to seeing the possibly multiple failure messages in the logs before downloading the data files.
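A "run every consistency check and report all of their output" mode might look something like the sketch below. It is written in Python with a hypothetical `run_all_checks` helper and stand-in check functions. None of these names come from resmoke.py or ReplSetTest, and the failure messages are invented for the example.

```python
def run_all_checks(checks):
    """Run every check, even after one fails, and return all failure messages.

    `checks` is an ordered list of (name, callable) pairs. A check signals
    failure by raising an exception.
    """
    failures = []
    for name, check in checks:
        try:
            check()
        except Exception as err:  # keep going so later checks still get to report
            failures.append(f"{name}: {err}")
    return failures


def failing(message):
    """Build a stand-in check that always fails with `message`."""
    def check():
        raise AssertionError(message)
    return check


if __name__ == "__main__":
    # Stand-in checks; real ones would talk to the replica set.
    results = run_all_checks([
        ("checkOplogs", lambda: None),
        ("checkReplicatedDataHashes", failing("dbhash mismatch")),
        ("validate", failing("_id index inconsistent")),
    ])
    for line in results:
        print(line)
```

Collecting failures instead of stopping at the first one is what would let a validate error appear in the logs next to the DB hash mismatch it explains.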

Comment by Judah Schvimer [ 05/Mar/18 ]

Yes that would be sufficient.

Additionally, when checkReplicatedDataHashes fails, running validate and logging the output would be useful. Index corruption on the _id index can manifest as a DB Hash mismatch, which is misleading.

Comment by Max Hirschhorn [ 02/Mar/18 ]

judah.schvimer, we currently run the CheckReplOplogs hook (which calls the ReplSetTest#checkOplogs() function) before running the CheckReplDBHash hook (which calls ReplSetTest#checkReplicatedDataHashes()), on the reasoning that the oplogs must be consistent across nodes for there to be any chance of the data being consistent across nodes. Should we instead change ReplSetTest#stopSet() to call ReplSetTest#checkOplogs() when the replica set is being terminated by the test, before it calls ReplSetTest#checkReplicatedDataHashes()?

hooks:
# The CheckReplDBHash hook waits until all operations have replicated to and have been applied
# on the secondaries, so we run the ValidateCollections hook after it to ensure we're
# validating the entire contents of the collection.
- class: CheckPrimary
- class: CheckReplOplogs
- class: CheckReplDBHash
- class: ValidateCollections
- class: CleanEveryN
  n: 20
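The ordering constraint in that hook list (oplog check before DB hash check) could be guarded with a small helper like the one below. This is an illustrative Python sketch: `hook_runs_before` is a hypothetical function, and `HOOKS` simply mirrors the list above as plain dictionaries rather than loading it the way resmoke.py does.

```python
def hook_runs_before(hooks, first, second):
    """Return True if hook class `first` is listed before `second`.

    Raises ValueError if either class is missing from `hooks`.
    """
    classes = [hook["class"] for hook in hooks]
    return classes.index(first) < classes.index(second)


# The hook list from the comment above, written out as Python data.
HOOKS = [
    {"class": "CheckPrimary"},
    {"class": "CheckReplOplogs"},
    {"class": "CheckReplDBHash"},
    {"class": "ValidateCollections"},
    {"class": "CleanEveryN", "n": 20},
]

if __name__ == "__main__":
    print(hook_runs_before(HOOKS, "CheckReplOplogs", "CheckReplDBHash"))
```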
