[SERVER-33641] Call checkOplogs when checkReplicatedDataHashes fails Created: 02/Mar/18 Updated: 29/Oct/23 Resolved: 22/May/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 4.0.0-rc1 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | David Bradford (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Backport Requested: |
v4.0
|
||||||||||||
| Sprint: | TIG 2018-05-21, TIG 2018-06-04 | ||||||||||||
| Participants: | |||||||||||||
| Story Points: | 3 | ||||||||||||
| Description |
|
We should do the following to improve the relevance of diagnostics we have in the face of data inconsistency issues:
Original description: We now save all of the data files, but it would be great if the test could check the oplogs automatically and note any differences. |
| Comments |
| Comment by Githook User [ 22/May/18 ] | ||||||||||
|
Author: {'username': 'dbradf', 'name': 'David Bradford', 'email': 'david.bradford@mongodb.com'} Message: (cherry picked from commit 018f33cb0e0f64880295b6d910060365c117a835) | ||||||||||
| Comment by Githook User [ 22/May/18 ] | ||||||||||
|
Author: {'username': 'dbradf', 'name': 'David Bradford', 'email': 'david.bradford@mongodb.com'} Message: | ||||||||||
| Comment by Judah Schvimer [ 23/Apr/18 ] | ||||||||||
|
We currently also do not check oplog consistency in the kill_secondaries passthrough or in ReplSetTest stopSet. | ||||||||||
| Comment by Judah Schvimer [ 06/Mar/18 ] | ||||||||||
|
I agree it's less necessary, but it certainly would still be helpful to not have to download the logs and set up the cluster again and rerun the checks myself. | ||||||||||
| Comment by Max Hirschhorn [ 06/Mar/18 ] | ||||||||||
It sounds like maybe you really want to have a mode for running all of the data consistency checks and getting all of their output. Does this request become less relevant if we were to archive the data files any time ReplSetTest#checkOplogs() or ReplSetTest#checkReplicatedDataHashes() fails and not just when they are called by resmoke.py's CheckReplOplogs and CheckReplDBHash hooks, respectively? I imagine there's still some value to seeing the possibly multiple failure messages in the logs before downloading the data files. | ||||||||||
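A minimal sketch of the "run all of the data consistency checks and get all of their output" mode raised above, written in Python, the language resmoke.py is written in. The name `run_all_checks` and the check callables are hypothetical stand-ins for illustration; they are not resmoke.py or ReplSetTest APIs.

```python
from typing import Callable, Dict, List, Tuple


def run_all_checks(checks: Dict[str, Callable[[], None]]) -> List[Tuple[str, str]]:
    """Run every check, even after one fails, and return (name, message) per failure.

    Each value in `checks` is a hypothetical callable that raises on inconsistency,
    standing in for things like an oplog check or a dbhash check.
    """
    failures: List[Tuple[str, str]] = []
    for name, check in checks.items():
        try:
            check()
        except Exception as err:  # keep going so later checks still report
            failures.append((name, str(err)))
    return failures
```

With this shape, a single failing check no longer hides the output of the checks that would have run after it, so the possibly multiple failure messages appear in the logs together.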
| Comment by Judah Schvimer [ 05/Mar/18 ] | ||||||||||
|
Yes, that would be sufficient. Additionally, when checkReplicatedDataHashes fails, running validate and logging the output would be useful. Index corruption on the _id index can manifest as a DB Hash mismatch, which is misleading. | ||||||||||
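The "run extra diagnostics when the hash check fails" idea can be sketched as follows, in Python. This is only an illustration under stated assumptions: `check_hashes_with_diagnostics`, `check_data_hashes`, and the entries in `diagnostics` are hypothetical callables standing in for checkReplicatedDataHashes, checkOplogs, and validate; none of them are real ReplSetTest or resmoke.py signatures.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("consistency_checks")


def check_hashes_with_diagnostics(
    check_data_hashes: Callable[[], None],
    diagnostics: Dict[str, Callable[[], object]],
) -> None:
    """Run the hash check; on failure, log each follow-up diagnostic, then re-raise.

    `check_data_hashes` and the `diagnostics` callables are hypothetical stand-ins.
    """
    try:
        check_data_hashes()
    except Exception:
        for name, diagnostic in diagnostics.items():
            try:
                logger.error("%s output: %s", name, diagnostic())
            except Exception as diag_err:
                # A failing diagnostic is itself useful information; log it and continue.
                logger.error("%s failed: %s", name, diag_err)
        raise  # the original hash mismatch remains the reported failure
```

Re-raising the original error keeps the test failure attributed to the hash mismatch, while the logged diagnostics help distinguish, for example, an oplog divergence from index corruption.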
| Comment by Max Hirschhorn [ 02/Mar/18 ] | ||||||||||
|
judah.schvimer, we currently run the CheckReplOplogs hook (which calls the ReplSetTest#checkOplogs() function) before running the CheckReplDBHash hook (which calls ReplSetTest#checkReplicatedDataHashes()), on the reasoning that the oplogs must be consistent across nodes for there to be any chance of the data being consistent across nodes. Should we instead change ReplSetTest#stopSet() to call ReplSetTest#checkOplogs() when the replica set is being terminated by the test, before it calls ReplSetTest#checkReplicatedDataHashes()?
|