[SERVER-48934] Investigate replica set data mismatch from Jepsen testing Created: 12/Jun/20 Updated: 29/Oct/23 Resolved: 22/Jun/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Storage |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Alexander Gorrod | Assignee: | Dianna Hohensee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Execution Team 2020-06-29 |
| Participants: | |
| Linked BF Score: | 91 |
| Description |
Our internal testing of MongoDB 4.4 using the Jepsen testing methodology is reporting that different members of a replica set have different data at the end of a test. The issue appeared to start with this drop of WiredTiger - though the failure is sporadic, so that's not a strong indicator that a bug was introduced in that set of changes. The Jepsen test simulates different failure modes in a distributed system - so it's possible that this failure is related to recovery, rollback to stable or replication rollback via refetch. We should dig into the failure, and try to find a way to isolate the root cause.
| Comments |
| Comment by Githook User [ 22/Jun/20 ] |
Author: {'name': 'Dianna Hohensee', 'email': 'dianna.hohensee@mongodb.com', 'username': 'DiannaHohensee'} Message:
| Comment by Dianna Hohensee (Inactive) [ 22/Jun/20 ] |
Created SERVER-49017 to add a regression test.
| Comment by Alexander Gorrod [ 19/Jun/20 ] |
4.2.6 never resuscitates timestamps; it just ignores them. So I believe your analysis is correct that this problem is specific to 4.4.
| Comment by Dianna Hohensee (Inactive) [ 18/Jun/20 ] |
It looks like Lingzhi already eliminated that AutoGetCollectionForRead call in v4.5 with his introduction of the AutoGetOplog helper in
| Comment by Dianna Hohensee (Inactive) [ 18/Jun/20 ] |
Looking back at the CR in which I put in the AutoGetCollectionForRead (
| Comment by Judah Schvimer [ 18/Jun/20 ] |
Interesting. I'm curious why we call the AutoGetCollectionForRead helper there at all. It's only used for an invariant that the oplog is present, which feels unnecessary to me. Great find everyone, and thanks william.schultz for clarifying why we're only just seeing this now!
| Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ] |
I agree william.schultz. There needs to be an AutoGetCollection to "poison" the recovery unit's read source to use lastApplied, but I didn't hunt down where that was happening.
| Comment by William Schultz (Inactive) [ 18/Jun/20 ] |
daniel.gottlieb Your explanation sounds correct, but to add some additional clarification, I believe that this issue only arises when the truncateAfterPoint is <= the stable timestamp, due to this condition. If we enter that block, we call the AutoGetCollectionForRead helper, which sets the problematic read timestamp as you point out. That should explain why we only saw this occurring when the truncateAfterPoint was equal to the stable timestamp. In fact, it's not clear to me why we're even making that particular call to AutoGetCollectionForRead, other than to check an invariant.
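To make the described path easier to follow, here is a rough reconstruction of its shape, assuming 4.4-era helper names. The linked condition and surrounding function are not reproduced in this export, so the function name and exact structure below are illustrative only:

```cpp
// Hypothetical reconstruction of the startup truncation path described above;
// not the actual server source. The point is only that the branch taken when
// the truncate-after-point is already covered by the stable timestamp acquires
// the oplog via AutoGetCollectionForRead, and that helper can switch the
// recovery unit's read source (e.g. to lastApplied) as a side effect.
void maybeTruncateOplogAtStartup(OperationContext* opCtx,
                                 const Timestamp& truncateAfterPoint,
                                 const Timestamp& stableTimestamp) {
    if (truncateAfterPoint <= stableTimestamp) {
        // The acquisition is only used to invariant that the oplog exists...
        AutoGetCollectionForRead oplog(opCtx, NamespaceString::kRsOplogNamespace);
        invariant(oplog.getCollection());

        // ...but any subsequent oplog scan on this opCtx now reads at the
        // recovered lastApplied (== recovery/checkpoint timestamp), so
        // journaled entries written after that checkpoint are invisible to
        // the truncation code.
    }
}
```

The essential point is that the lock helper, rather than the truncation logic itself, is what installs the timestamped read source while the node is still in startup.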
| Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ] |
A more thoughtful explanation of what's happening. Since MDB 4.0, with the inclusion of "reading at last applied" for secondaries, AutoGetCollection changes the "read source" (what timestamp a transaction reads at). The logic for reading at last applied gets engaged when a node is in the startup state. When it becomes time to begin a transaction on the "startup" opCtx/recovery unit, lastApplied has been initialized to the recovery timestamp. The transaction truncating the oplog is therefore reading at the checkpoint/recovery timestamp for what should be visible, resulting in the truncation code not finding anything. This is effectively the root cause of the symptom. Once the oplog is not actually truncated, everything else is kosher to me.

I claim the same sequence is not a problem on 4.2 (with one possible caveat, though we absolutely should verify). Prior to durable history, timestamp information in WT is an in-memory-only thing; it is never persisted in a checkpoint. If MongoDB wants timestamp structure across a restart, replication recovery has to replay the writes. Thus I posit that in 4.2, while we may be applying a WT read timestamp, the entire oplog is visible at startup. However, in 4.4+, durable history does preserve timestamps across restart, and those timestamps are used for visibility. The one caveat is a downgrade from 4.4 -> 4.2.6+: will 4.2.6 read 4.4 checkpoint data and reconstruct timestamps from it, alexander.gorrod?

Ultimately, I think my patch is the right fix here. I think the real, real fix is that reading at last applied should be opt-in only (never implicit). The latter is obviously much more complicated.
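To make the direction of the fix concrete, here is a minimal sketch of the read-source override idea. This is not the actual patch referenced in these comments; the wrapper function and the truncateOplogEntriesAfter helper are invented for illustration, and the RecoveryUnit read-source calls are assumed to behave roughly as they do in the 4.4 codebase:

```cpp
// Sketch only: make the startup oplog-truncation work read without a
// timestamp, so every journaled oplog entry is visible regardless of the
// lastApplied/recovery timestamp an earlier helper may have installed.
void truncateOplogWithoutReadTimestamp(OperationContext* opCtx,
                                       const Timestamp& truncateAfterPoint) {
    auto* ru = opCtx->recoveryUnit();

    // Remember whatever read source is currently installed (e.g. kLastApplied
    // while the node is still in startup) so it can be restored afterwards.
    const auto previousReadSource = ru->getTimestampReadSource();

    ru->abandonSnapshot();  // drop any snapshot opened at the old read source
    ru->setTimestampReadSource(RecoveryUnit::ReadSource::kNoTimestamp);

    // With no read timestamp, a scan of the oplog can see the entries at and
    // after truncateAfterPoint, including ones newer than the last stable
    // checkpoint, and actually delete them.
    truncateOplogEntriesAfter(opCtx, truncateAfterPoint);  // hypothetical helper

    // Restore the original read source for whatever runs next on this opCtx.
    ru->abandonSnapshot();
    ru->setTimestampReadSource(previousReadSource);
}
```

Whether the actual patch restores the previous read source or simply requires that none was set is not shown in this ticket; the key idea is that the truncation scan must not be pinned to the recovered lastApplied/checkpoint timestamp.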
| Comment by Louis Williams [ 18/Jun/20 ] |
We may want to consider scheduling
| Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ] |
The following patch fixes william.schultz's repro:
| Comment by William Schultz (Inactive) [ 18/Jun/20 ] |
Attaching a diff of a local repro (wt-6411-repro.diff).
| Comment by Alexander Gorrod [ 18/Jun/20 ] |
Thanks daniel.gottlieb - looks like we raced. Let us know if we can help. keith.smith, it would be great if you could get up to speed on Jepsen and the failure - it's useful to know how Jepsen works regardless of where the root cause of this particular problem lies. If it's better to unlink this ticket and continue work in the BF directly, that's fine with me.
| Comment by Alexander Gorrod [ 18/Jun/20 ] |
I have not made much progress on this in the last day, so I've reached out to keith.smith for some help to try to move it forward. I'm also going to move it to a server ticket - there is no strong indication at the moment that this is a bug in WiredTiger. We don't have other failures with a similar signature. We have a few failures where a collection and index get de-synchronised, but I'm not aware of failures where different members of a replica set diverge (as is reported here). william.schultz, are you available tomorrow if Keith has questions, either about the specifics of what is happening or about how to run the Jepsen tests locally?
| Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ] |
Assigning the ticket to execution. The NY server team typically diagnoses BFs in the BF itself. I'll let milkie decide the best JIRA workflow here.
| Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ] |
I don't think this is a WT bug, but rather something off with replicate before journaling/replication recovery.

First, my claim that this is not a WT bug: the WT problems we've been seeing are with data going in and out of cache overflow. When a bug like that happens, some pieces of the transaction (e.g. the index entry) survive while others get lost (e.g. the record itself). Grepping a dump of the collection + index + history store shows the problem _id ObjectId does not exist in any of them:
(Fun fact: when dumping WT data with -x for hex, ObjectId strings show up without requiring any BSON parsing.)

Following on from william.schultz's latest observation, the problem document in the patch I looked at:
Noting the timestamp: Timestamp(1592343121, 617) (hereafter "617").
The oplog truncate-after-point logic at startup:
Then the node applies from the recovery timestamp to the top of the oplog (the node finds there's nothing to do):
The node is now up and begins its replication process. The first thing it does is query the top of the oplog to start tailing from a sync source. This finds "617", which was supposed to have been truncated above:
Replication thinks it should start tailing the oplog at entries that come after 617, which explains why 617 never gets applied:
| Comment by William Schultz (Inactive) [ 18/Jun/20 ] |
Not really sure if there's any significance to this yet, but when I was poking around the data files from some of the failures, I noticed that the timestamps for the inserts that were missing all appeared to be 1 oplog operation away from the timestamp of a stable checkpoint that was taken. Here are examples from 3 failures, with output from grepping the test results and querying the data files:
This could certainly be a red herring, but it seemed like it could be some kind of lead, so I figured I'd post my findings.
| Comment by William Schultz (Inactive) [ 16/Jun/20 ] |
I also ran a repro with higher storage level logging verbosity (level 2 for all of STORAGE component).
| Comment by William Schultz (Inactive) [ 16/Jun/20 ] |
Posting my debugging steps here in more detail for anyone who might continue this investigation. From my patch build with the repros, I looked at this failure. Jepsen runs a client process which runs operations against the replica set and injects failures; the logs for that process are contained in the "Jepsen Test Results" directory, which can be downloaded under the "Files" pane in Evergreen. Inside that directory there are a number of files, but one directory level down there is a jepsen.log file, which contains the log output for the Jepsen client, including the output from any dbhash checks. Searching for "dbhash mismatch between primary and secondary" in that file should take you to the offending log message. We also log the exact document that was missing on a particular node and caused the dbhash mismatch:
The logs for that node can also be found inside the "Jepsen Test Results" directory in ip-10-122-21-101/20002/mongod.log. In my patch build, I modified the Evergreen configuration so that we also save the data files for each node. Those data files are saved in a separate directory, which can be downloaded as "Jepsen mongod Logs" from the Evergreen "Files" pane. After downloading and untar'ing that directory, I started a mongod on the data directory for the node that was missing a document:
Then I connected to the node from the mongo shell:
and tried to see if the document that was missing in the dbhash check was actually gone. In this particular test, all data is inserted into the jepsen.read-concern-majority collection, so we look in there for the document. Note that this test workload is insert-only: it simply inserts documents with an increasing numeric 'value' field. Here is what the shell turns up:
The document was missing, as we saw in the dbhash check, but the insert that created it was still present in the oplog, and we can see its timestamp. Perhaps correlating the timestamp of the insert with a particular checkpoint might give some clues about where the corruption exists.
| Comment by Susan LoVerso [ 16/Jun/20 ] |
alexander.gorrod, I took a quick look at the data directories - there are 5: ip-10-122-20-144:2000[01234]. Looking at the turtle files for each one, the checkpoint LSNs are vastly different. Directories 0, 3, and 4 have LSNs in log file 2; directory 1 is in log file 14 and directory 2 is in log file 3. All the directories have different collection and index file names.
| Comment by Susan LoVerso [ 16/Jun/20 ] |
Thanks. I had downloaded the Jepsen test results and a couple others. I have them and see the directories now.
| Comment by William Schultz (Inactive) [ 16/Jun/20 ] |
sue.loverso, if you download "Jepsen mongod Logs" and untar it, you should get a set of directories, one for each node. The mongo data directory will live inside a directory like ip-10-122-21-101:20000/data.
| Comment by Susan LoVerso [ 16/Jun/20 ] |
william.schultz, which of the Files contains the data files? I downloaded a few that seemed logical and cannot find the database directories to look at the data files.
| Comment by William Schultz (Inactive) [ 16/Jun/20 ] |
After downloading the data files for one of the failed runs in my repro patch, I am able to observe, from within a mongo shell, the same data mismatch reported by the dbhash check, i.e. one of the secondaries is missing a document that is present on the other nodes. This would seem to imply there is some kind of corruption of the data on disk. Because of this, there might be more info to be gleaned about where the corruption occurred/exists by looking at the data files from the WT layer.
| Comment by William Schultz (Inactive) [ 16/Jun/20 ] |
I started to look at this and ran a patch build on the tip of v4.4 to see if I could repro it and capture data files for each mongod, which we don't appear to do by default. Looks like it repro'd a couple of times so I am going to see if the data files might shed any light on the root cause.
| Comment by Alexander Gorrod [ 16/Jun/20 ] |
I did some spelunking on the Evergreen failure; this might be useful when it comes to reproducing the problem. The test is running:
The mongod_verbose.conf file seems to contain:
It seems as though nodes.txt refers to:
but I can't see where that file comes from, or whether Jepsen expects those nodes to already be running and configured.
| Comment by Alexander Gorrod [ 16/Jun/20 ] |
tess.avitabile and robert.guo, the Storage Engines team doesn't know how to get started on chasing this failure. Could you recommend someone we could work with to start digging into what might be going wrong? I have a few specific questions:
I'm not sure what an error code of 252 is; a Google search doesn't give any insight into what that error code means. Is that failure mode familiar to you?
| Comment by Susan LoVerso [ 16/Jun/20 ] |
alexander.gorrod, only one of my 4 runs failed (the 50% 'half bisect' one). I agree that does not seem related to a specific commit. I cannot tell what actually happened because loading the raw output hung my browser. I am going to put this back on the backlog, as I have no idea how to chase this.
| Comment by Alexander Gorrod [ 15/Jun/20 ] |
I went back through the history of the job, and I'm not sure my initial analysis was correct. The Jepsen tests weren't completing for a long time (from at least mid-April). The first run that finished was on May 15th; that failure has a different signature from the dbhash mismatch error we are chasing here. There were then 5 successful runs of the Jepsen test, followed by a failure on May 20th with a signature matching this ticket. Since that first failure there have been 27 successful test runs and 13 failed runs (counts are approximate, based on scrolling through and counting), so approximately 1 in 3 runs are failing. It's possible that this issue has been present since Jepsen testing started being run on the 4.4 branch of MongoDB.