[SERVER-48934] Investigate replica set data mismatch from Jepsen testing Created: 12/Jun/20  Updated: 29/Oct/23  Resolved: 22/Jun/20

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: 4.4.0

Type: Bug Priority: Major - P3
Reporter: Alexander Gorrod Assignee: Dianna Hohensee (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File wt-6411-repro.diff    
Issue Links:
Depends
is depended on by SERVER-49017 Add a test to ensure on startup that ... Backlog
Related
related to SERVER-34091 Oplog visibility rules can cause capp... Closed
related to SERVER-41388 Truncate the oplog after the stable t... Closed
Backwards Compatibility: Fully Compatible
Sprint: Execution Team 2020-06-29
Participants:
Linked BF Score: 91

 Description   

Our internal testing of MongoDB 4.4 using the Jepsen testing methodology is reporting that different members of a replica set have different data at the end of a test.

The issue appeared to start with this drop of WiredTiger - though the failure is sporadic, so that's not a strong indicator that a bug was introduced in that set of changes.

The Jepsen test simulates different failure modes in a distributed system - so it's possible that this failure is related to recovery, rollback to stable or replication rollback via refetch.

We should dig into the failure, and try to find a way to isolate the root cause.



 Comments   
Comment by Githook User [ 22/Jun/20 ]

Author:

{'name': 'Dianna Hohensee', 'email': 'dianna.hohensee@mongodb.com', 'username': 'DiannaHohensee'}

Message: SERVER-48934 Remove unneeded AutoGetCollectionForRead call during oplog truncation that assigns an unwanted ReadSource
Branch: v4.4
https://github.com/mongodb/mongo/commit/96d0a3a985d383473af889dd9a089c44d5f492e2

Comment by Dianna Hohensee (Inactive) [ 22/Jun/20 ]

Created SERVER-49017 to add a regression test.

Comment by Alexander Gorrod [ 19/Jun/20 ]

The one caveat is if one downgrades from 4.4 -> 4.2.6+: will 4.2.6 read 4.4 checkpoint data and reconstruct timestamps from it?

4.2.6 never resuscitates timestamps, it just ignores them. So I believe your analysis is correct that this problem is specific to 4.4.

Comment by Dianna Hohensee (Inactive) [ 18/Jun/20 ]

It looks like Lingzhi already eliminated that AutoGetCollectionForRead call in v4.5 with his introduction of the AutoGetOplog helper in SERVER-45626, so v4.5 is all set.

Comment by Dianna Hohensee (Inactive) [ 18/Jun/20 ]

Looking back at the CR in which I added the AutoGetCollectionForRead call (SERVER-41388), the need to access the oplog collection there was refactored out after the first patch set, so its continued presence looks like an oversight. There's no reason for it, and I think it can be eliminated safely.

Comment by Judah Schvimer [ 18/Jun/20 ]

Interesting. I'm curious why we call the AutoGetCollectionForRead helper there at all. It's just used for an invariant that the oplog is present, which feels unnecessary to me. Great find, everyone, and thanks william.schultz for clarifying why we're only just seeing this now!

Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ]

I agree william.schultz. There needs to be an AutoGetCollection to "poison" the recovery unit's read source to use lastApplied, but I didn't hunt down where that was happening.

Comment by William Schultz (Inactive) [ 18/Jun/20 ]

daniel.gottlieb Your explanation sounds correct, but to add some additional clarification, I believe that this issue only arises when the truncateAfterPoint is <= the stable timestamp, due to this condition. If we enter that block, we call the AutoGetCollectionForRead helper, which sets the problematic read timestamp as you point out. That should explain why we only saw this occurring when the truncateAfterPoint was equal to the stable timestamp. In fact, it's not clear to me why we're even making that particular call to AutoGetCollectionForRead, other than to check an invariant.
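
To illustrate that control flow, here is a minimal, self-contained C++ sketch. The names (chooseTruncatePoint, acquireOplogForRead, ReadSource) are hypothetical stand-ins rather than the real server code; the sketch only shows that the collection acquisition, and therefore the read-source change, happens exclusively in the branch taken when the truncate-after point is at or before the stable timestamp, matching the observation that failures only appeared when the two coincided.

#include <cstdint>
#include <iostream>
#include <optional>
#include <utility>

using Timestamp = uint64_t;

enum class ReadSource { kNoTimestamp, kLastApplied };

// Stand-in for the side effect of acquiring the oplog collection for read on a
// startup operation context: the read source gets switched to lastApplied.
ReadSource acquireOplogForRead() {
    return ReadSource::kLastApplied;
}

// Decide where to truncate; returns the effective truncate-after point and the
// read source the operation is left with afterwards.
std::pair<Timestamp, ReadSource> chooseTruncatePoint(Timestamp truncateAfterPoint,
                                                     std::optional<Timestamp> stableTimestamp,
                                                     ReadSource source) {
    if (stableTimestamp && truncateAfterPoint <= *stableTimestamp) {
        // Only this branch acquired the collection (just to check an invariant),
        // so only this branch ends up changing the read source.
        source = acquireOplogForRead();
        truncateAfterPoint = *stableTimestamp;  // "truncating after the stable timestamp instead"
    }
    return {truncateAfterPoint, source};
}

int main() {
    // truncateAfterPoint == stableTimestamp == 616: branch taken, read source disturbed.
    auto [point1, src1] = chooseTruncatePoint(616, Timestamp{616}, ReadSource::kNoTimestamp);
    std::cout << "case 1: truncate after " << point1 << ", lastApplied read source: "
              << (src1 == ReadSource::kLastApplied) << "\n";

    // truncateAfterPoint (617) > stableTimestamp (616): branch skipped, read source untouched.
    auto [point2, src2] = chooseTruncatePoint(617, Timestamp{616}, ReadSource::kNoTimestamp);
    std::cout << "case 2: truncate after " << point2 << ", lastApplied read source: "
              << (src2 == ReadSource::kLastApplied) << "\n";
}

In the real code the acquisition was only there to check an invariant that the oplog exists, which is why the eventual fix could remove it outright.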

Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ]

A more thoughtful explanation of what's happening.

Since MDB 4.0, with the inclusion of "reading at last applied" for secondaries, AutoGetCollection changes the "read source" (the timestamp a transaction reads at). The read-at-last-applied logic gets engaged when a node is in the startup state, and by the time a transaction begins on the "startup" opCtx/recovery unit, lastApplied has already been initialized to the recovery timestamp.

The transaction truncating the oplog is therefore reading at the checkpoint/recovery timestamp to decide what should be visible, so the truncation code finds nothing to remove. This is effectively the root cause of the symptom. Once the oplog is not actually truncated, everything else that follows looks kosher to me.

I claim the same sequence is not a problem on 4.2 (with one possible caveat, though we absolutely should verify). Prior to durable history, timestamp information in WT is an in-memory-only thing; it is never persisted in a checkpoint. If MongoDB wants timestamp structure across a restart, replication recovery has to replay the writes. Thus I posit that in 4.2, while we may be applying a WT read timestamp, the entire oplog is visible at startup. In 4.4+, however, durable history *does* preserve timestamps across restart, and those timestamps are used for visibility.

The one caveat is if one downgrades from 4.4 -> 4.2.6+: will 4.2.6 read 4.4 checkpoint data and reconstruct timestamps from it, alexander.gorrod?

Ultimately, I think my patch is the right fix here. I think the *real* fix, though, is that reading at last applied should be opt-in only (never implicit). The latter is obviously much more complicated.
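
To make the mechanism concrete, here is a minimal, self-contained C++ model of the visibility effect. RecoveryUnit, ReadSource, and firstEntryAfter are hypothetical stand-ins, not the actual MongoDB classes; the model only demonstrates that once the startup recovery unit reads at lastApplied, which equals the recovery timestamp, the truncation scan cannot see the journaled-but-unapplied entry at 617, whereas an untimestamped read would find it.

#include <cstdint>
#include <iostream>
#include <map>
#include <optional>

using Timestamp = uint64_t;

enum class ReadSource { kNoTimestamp, kLastApplied };

struct RecoveryUnit {
    ReadSource source = ReadSource::kNoTimestamp;
    Timestamp lastApplied = 0;  // initialized to the recovery/checkpoint timestamp at startup
};

// The oplog on disk: 616 is the stable/recovery timestamp, 617 is journaled but unapplied.
const std::map<Timestamp, const char*> kOplog = {
    {615, "applied op"}, {616, "applied op"}, {617, "unapplied op"}};

// Look for an entry newer than the truncate-after point, honoring the read source:
// with kLastApplied, nothing newer than lastApplied (== the recovery timestamp) is visible.
std::optional<Timestamp> firstEntryAfter(const RecoveryUnit& ru, Timestamp truncateAfter) {
    for (const auto& entry : kOplog) {
        const Timestamp ts = entry.first;
        if (ts <= truncateAfter)
            continue;
        if (ru.source == ReadSource::kLastApplied && ts > ru.lastApplied)
            continue;  // not visible at this read timestamp
        return ts;
    }
    return std::nullopt;
}

int main() {
    RecoveryUnit startupRu;
    startupRu.lastApplied = 616;  // lastApplied was initialized to the recovery timestamp

    // Read source switched to lastApplied by an implicit collection acquisition:
    startupRu.source = ReadSource::kLastApplied;
    std::cout << "lastApplied source sees an entry after 616: "
              << firstEntryAfter(startupRu, 616).has_value() << "\n";  // 0 -> nothing gets truncated

    // Reading without a timestamp, as the startup truncation pass intends:
    startupRu.source = ReadSource::kNoTimestamp;
    std::cout << "untimestamped source sees an entry after 616: "
              << firstEntryAfter(startupRu, 616).has_value() << "\n";  // 1 -> 617 would be found and removed
}

This is the same effect the one-line db.cpp patch below addresses by pinning the startup opCtx to ReadSource::kNoTimestamp.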

Comment by Louis Williams [ 18/Jun/20 ]

We may want to consider scheduling SERVER-48452 now (or doing the work in this ticket). This would ensure internal operations always default to reading without a timestamp. Surprisingly, this isn't the default. Because of SERVER-34192, AutoGetCollectionForRead sometimes changes an operation's ReadSource to "lastApplied" internally.

Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ]

The following patch fixes william.schultz's repro:

dgottlieb@chimichurri ~/xgen/mongo44[v4.4]$ git diff src/mongo/db/db.cpp
diff --git a/src/mongo/db/db.cpp b/src/mongo/db/db.cpp
index 598408025f..e7e8c78f49 100644
--- a/src/mongo/db/db.cpp
+++ b/src/mongo/db/db.cpp
@@ -435,6 +435,7 @@ ExitCode _initAndListen(ServiceContext* serviceContext, int listenPort) {
     }
 
     auto startupOpCtx = serviceContext->makeOperationContext(&cc());
+    startupOpCtx->recoveryUnit()->setTimestampReadSource(RecoveryUnit::ReadSource::kNoTimestamp);
 
     bool canCallFCVSetIfCleanStartup =
         !storageGlobalParams.readOnly && (storageGlobalParams.engine != "devnull");

Comment by William Schultz (Inactive) [ 18/Jun/20 ]

Attaching a diff of a local repro (wt-6411-repro.diff, applied to the v4.4 branch) that appears to exercise the same issue: a primary and a secondary both apply 3 ops and then take a stable checkpoint. The primary then does one more write; the secondary writes down the oplog entry for it and advances its oplogTruncateAfterPoint, but crashes before applying the op. Upon recovery the secondary appears to erroneously skip proper truncation and never re-applies the 4th op, leading to a missing document on the secondary.

Comment by Alexander Gorrod [ 18/Jun/20 ]

Thanks daniel.gottlieb - looks like we raced. Let us know if we can help. keith.smith, it would be great if you could get up to speed on Jepsen and this failure - it's useful to know how Jepsen works regardless of where the root cause of this particular problem lies.

If it's better to unlink this ticket and continue work in the BF directly that's fine with me.

Comment by Alexander Gorrod [ 18/Jun/20 ]

I have not made much progress on this in the last day, so I've reached out for some help from keith.smith to try to move it forward.

I'm also going to move it to a server ticket - there is no strong indication that this is a bug in WiredTiger at the moment. We don't have other failures with a similar signature. We have a few failures where a collection and index get de-synchronised, but I'm not aware of failures where different members of a replica set diverge (as is reported here).

william.schultz are you available tomorrow if Keith has questions either about specifics of what is happening or how to run the Jepsen tests locally?

Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ]

Assigning the ticket to execution. The NY server team typically diagnoses BFs in the BF itself. I'll let milkie decide the best JIRA workflow here.

Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ]

I don't think this is a WT bug, but rather something off with replicate-before-journaling/replication recovery. First, to support my claim that this is not a WT bug: the WT problems we've been seeing involve data going in and out of cache overflow, and when a bug like that happens, some pieces of the transaction (e.g. the index entry) survive while others get lost (e.g. the record itself). Grepping a dump of the collection + index + history store shows the problem _id ObjectId does not exist in any of them:

dgottlieb@chimichurri ~/xgen/jepsen$ grep 5ee93a3e3f8ab40afae76ebd index collection history printlog 
dgottlieb@chimichurri ~/xgen/jepsen$ 

(Fun fact, when dumping WT data with -x for hex, ObjectId strings will show up without requiring any BSON parsing)

Following on from william.schultz's latest observation, here is the problem document in the patch build I looked at:

test> oplog.findOne({"o._id": ObjectId("5ee93a3e3f8ab40afae76ebd")})
{
	"op" : "i",
	"ns" : "jepsen.read-concern-majority",
	"ui" : UUID("72ac3b86-6959-4d79-a637-7dbaf1829e03"),
	"o" : {
		"_id" : ObjectId("5ee93a3e3f8ab40afae76ebd"),
		"value" : NumberLong(6339)
	},
	"ts" : Timestamp(1592343121, 617),
	"t" : NumberLong(15),
	"wall" : ISODate("2020-06-16T21:31:42.589Z"),
	"v" : NumberLong(2)
}

Note the timestamp: Timestamp(1592343121, 617) (referred to below as "617").
The process that should have inserted that document gets started up on a checkpoint at "616":

2020-06-16T21:31:53.561+00:00 I  RECOVERY [initandlisten] WiredTiger recoveryTimestamp{"recoveryTimestamp":{"$timestamp":{"t":1592343121,"i":616}}}

The oplog truncate after point logic at startup:

2020-06-16T21:31:53.578+00:00 I  REPL     [initandlisten] The oplog truncation point is equal to or earlier than the stable timestamp, so truncating after the stable timestamp instead{"truncatePoint":{"$timestamp":{"t":1592343121,"i":616}},"stableTimestamp":{"$timestamp":{"t":1592343121,"i":616}}}
2020-06-16T21:31:53.578+00:00 I  REPL     [initandlisten] Removing unapplied oplog entries after oplogTruncateAfterPoint{"oplogTruncateAfterPoint":{"":{"$timestamp":{"t":1592343121,"i":616}}}}
2020-06-16T21:31:53.579+00:00 I  REPL     [initandlisten] Truncating oplog from truncateAfterOplogEntryTimestamp (non-inclusive){"truncateAfterOplogEntryTimestamp":{"$timestamp":{"t":1592343121,"i":616}},"oplogTruncateAfterPoint":{"$timestamp":{"t":1592343121,"i":616}}}
2020-06-16T21:31:53.579+00:00 D2 STORAGE  [initandlisten] Updating the oplogReadTimestamp.{"newOplogReadTimestamp":{"$timestamp":{"t":1592343121,"i":616}}}
2020-06-16T21:31:53.579+00:00 D1 STORAGE  [initandlisten] truncation new read timestamp: {"$timestamp":{"t":1592343121,"i":616}}

Then the node applies from the recovery timestamp to the top of oplog (the node finds there's nothing to do):

2020-06-16T21:31:53.580+00:00 I  REPL     [initandlisten] Recovering from stable timestamp{"stableTimestamp":{"$timestamp":{"t":1592343121,"i":616}},"topOfOplog":{"ts":{"$timestamp":{"t":1592343121,"i":616}},"t":15},"appliedThrough":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1},"oplogTruncateAfterPoint":{"$timestamp":{"t":0,"i":0}}}
2020-06-16T21:31:53.580+00:00 I  REPL     [initandlisten] Starting recovery oplog application at the stable timestamp{"stableTimestamp":{"$timestamp":{"t":1592343121,"i":616}}}
2020-06-16T21:31:53.580+00:00 I  REPL     [initandlisten] No oplog entries to apply for recovery. Start point is at the top of the oplog

The node is now up and begins its replication process. The first thing it does is query the top of the oplog in order to start tailing the oplog from a sync source. This finds "617", which was supposed to have been truncated above:

2020-06-16T21:31:53.611+00:00 D1 REPL     [ReplCoord-2] Ignoring commit point with different term than my lastApplied, since it may not be on the same oplog branch as mine{"committedOpTime":"{ ts: Timestamp(1592343121, 668), t: 16 }, 2020-06-16T21:31:53.255+00:00","myLastAppliedOpTimeAndWallTime":"{ ts: Timestamp(1592343121, 616), t: 15 }, 2020-06-16T21:31:42.574+00:00"}
2020-06-16T21:31:53.611+00:00 D1 REPL     [BackgroundSync] Successfully read last entry of oplog while starting bgsync{"lastOplogEntry":{"op":"i","ns":"jepsen.read-concern-majority","ui":{"$uuid":"72ac3b86-6959-4d79-a637-7dbaf1829e03"},"o":{"_id":{"$oid":"5ee93a3e3f8ab40afae76ebd"},"value":6339},"ts":{"$timestamp":{"t":1592343121,"i":617}},"t":15,"wall":{"$date":"2020-06-16T21:31:42.589Z"},"v":2}}

Replication thinks it should start tailing the oplog at entries that come after 617, which explains why 617 never gets applied:

2020-06-16T21:31:54.632+00:00 I  REPL     [BackgroundSync] Scheduling fetcher to read remote oplog{"syncSource":"ip-10-122-19-8:20004","lastOpTimeFetched":{"ts":{"$timestamp":{"t":1592343121,"i":617}},"t":15}}

Comment by William Schultz (Inactive) [ 18/Jun/20 ]

Not really sure if there's any significance to this yet, but when I was poking around the data files from some of the failures, I noticed that the timestamps for the inserts that were missing all appeared to be 1 oplog operation away from the timestamp of a stable checkpoint that was taken. Here are examples from 3 failures, with output from grepping the test results and querying the data files:

Failure 1

$ grep -A 1 "missing on the" mongodb_test-read-concern-majority_storageEngine-wiredTiger_protocolVersion-1/20200616T212729.000Z/jepsen.log                                                          
The following documents are missing on the secondary ip-10-122-19-8:20001:
{  "_id" : ObjectId("5ee93a3e3f8ab40afae76ebd"),  "value" : NumberLong(6339) }
 
williams-ubuntu:29000(mongod-4.4.0) local> db.oplog.rs.find({op:"i", "o.value": 6339})
{
  "op": "i",
  "ns": "jepsen.read-concern-majority",
  "ui": UUID("72ac3b86-6959-4d79-a637-7dbaf1829e03")),
  "o": {
    "_id": ObjectId("5ee93a3e3f8ab40afae76ebd"),
    "value": NumberLong("6339")
  },
  "ts": Timestamp(1592343121, 617),
  "t": NumberLong("15"),
  "wall": ISODate("2020-06-16T21:31:42.589Z"),
  "v": NumberLong("2")
}
Fetched 1 record(s) in 42ms
 
$ grep ag "Performing stable" ip-10-122-19-8:20001/mongod.log
...
{"t":{"$date":"2020-06-16T21:31:42.593+00:00"},"s":"D2", "c":"RECOVERY", "id":23986,   "ctx":"WTCheckpointThread","msg":"Performing stable checkpoint. StableTimestamp: {stableTimestamp}, OplogNeededForRollback: {oplogNeededForRollback}","attr":{"stableTimestamp":{"$timestamp":{"t":1592343121,"i":616}},"oplogNeededForRollback":"Timestamp(1592343121, 616)"}}
...

Failure 2

$ grep -A 1 "missing on the" mongodb_test-read-concern-majority_storageEngine-wiredTiger_protocolVersion-1/20200616T212710.000Z/jepsen.log                                                         !10692
The following documents are missing on the secondary ip-10-122-30-22:20001:
{  "_id" : ObjectId("5ee93a2b0c3b3d0b12729d9b"),  "value" : NumberLong(6237) }
 
 
williams-ubuntu:29000(mongod-4.4.0) local> db.oplog.rs.find({op:"i", "o.value": 6237})
{
  "op": "i",
  "ns": "jepsen.read-concern-majority",
  "ui": UUID("25f9a7ff-ee08-4fc0-a18b-258710151905")),
  "o": {
    "_id": ObjectId("5ee93a2b0c3b3d0b12729d9b"),
    "value": NumberLong("6237")
  },
  "ts": Timestamp(1592343083, 1),
  "t": NumberLong("15"),
  "wall": ISODate("2020-06-16T21:31:23.021Z"),
  "v": NumberLong("2")
}
Fetched 1 record(s) in 36ms
 
williams-ubuntu:29000(mongod-4.4.0) local> db.oplog.rs.find({"o.value": 6235})
{
  "op": "i",
  "ns": "jepsen.read-concern-majority",
  "ui": UUID("25f9a7ff-ee08-4fc0-a18b-258710151905")),
  "o": {
    "_id": ObjectId("5ee93a2a0c3b3d0b12729d99"),
    "value": NumberLong("6235")
  },
  "ts": Timestamp(1592343082, 11),
  "t": NumberLong("15"),
  "wall": ISODate("2020-06-16T21:31:22.951Z"),
  "v": NumberLong("2")
}
Fetched 1 record(s) in 38ms
 
$ grep "Performing stable" ip-10-122-30-22:20001/mongod.log
...
{"t":{"$date":"2020-06-16T21:31:23.027+00:00"},"s":"D2", "c":"RECOVERY", "id":23986,   "ctx":"WTCheckpointThread","msg":"Performing stable checkpoint. StableTimestamp: {stableTimestamp}, OplogNeededForRollback: {oplogNeededForRollback}","attr":{"stableTimestamp":{"$timestamp":{"t":1592343082,"i":11}},"oplogNeededForRollback":"Timestamp(1592343082, 11)"}}
...

Failure 3

$ grep -A 1 "missing on the" mongodb_test-read-concern-majority_storageEngine-wiredTiger_protocolVersion-1/20200616T161702.000Z/jepsen.log                                                         !10693
The following documents are missing on the secondary ip-10-122-21-101:20002:
{  "_id" : ObjectId("5ee8f3268afc480b1a5df61d"),  "value" : NumberLong(17226) }
 
williams-ubuntu:29000(mongod-4.4.0) local> db.oplog.rs.find({op:"i", "o.value": 17226})
{
  "op": "i",
  "ns": "jepsen.read-concern-majority",
  "ui": UUID("e0900a23-8374-4a0a-8257-44dcc788c341")),
  "o": {
    "_id": ObjectId("5ee8f3268afc480b1a5df61d"),
    "value": NumberLong("17226")
  },
  "ts": Timestamp(1592324921, 658),
  "t": NumberLong("40"),
  "wall": ISODate("2020-06-16T16:28:22.567Z"),
  "v": NumberLong("2")
}
Fetched 1 record(s) in 35ms
 
$ grep "Performing stable" ip-10-122-21-101:20002/mongod.log
...
{"t":{"$date":"2020-06-16T16:28:22.594+00:00"},"s":"D2", "c":"RECOVERY", "id":23986,   "ctx":"WTCheckpointThread","msg":"Performing stable checkpoint. StableTimestamp: {stableTimestamp}, OplogNeededForRollback: {oplogNeededForRollback}","attr":{"stableTimestamp":{"$timestamp":{"t":1592324921,"i":657}},"oplogNeededForRollback":"Timestamp(1592324921, 657)"}}
...

This could certainly be a red herring, but it seemed like it could be some kind of lead, so I figured I'd post my findings.

Comment by William Schultz (Inactive) [ 16/Jun/20 ]

I also ran a repro with higher storage level logging verbosity (level 2 for all of STORAGE component).

Comment by William Schultz (Inactive) [ 16/Jun/20 ]

Posting my debugging steps here in more detail for anyone who might continue this investigation.

From my patch build with the repros, I looked at this failure. Jepsen runs a client process that executes operations against the replica set and injects failures; the logs for that process are contained in the "Jepsen Test Results" directory, which can be downloaded under the "Files" pane in Evergreen. Inside that directory there are a bunch of files, but if you go down one directory level there is a jepsen.log file containing the log output for the Jepsen client, including the output from any dbhash checks. Searching for "dbhash mismatch between primary and secondary" in that file should take you to the offending log message. We also log the exact document that was missing on a particular node and caused the dbhash mismatch:

The following documents are missing on the secondary ip-10-122-21-101:20002:
{  "_id" : ObjectId("5ee8f3268afc480b1a5df61d"),  "value" : NumberLong(17226) }

The logs for that node can also be found inside the "Jepsen Test Results" directory in ip-10-122-21-101/20002/mongod.log.

In my patch build, I modified the Evergreen configuration so that we also save the data files for each node. Those data files are saved in a separate directory, which can be downloaded as "Jepsen mongod Logs" from the Evergreen "Files" pane. After downloading and untar'ing that directory, I started a mongod on the data directory for the node that was missing a document:

./build/install/bin/mongod --port 31000 --dbpath ~/mongodb/scratchpad/WT-6411/ip-10-122-21-101:20002/data

Then I connected to the node from the mongo shell:

./build/install/bin/mongo --port 31000

and tried to see if the document that was missing in the dbhash check was actually gone. In this particular test, all data is inserted into the jepsen.read-concern-majority collection, so we look in there for the document. Note that this test workload is insert only: it simply inserts documents with an increasing numeric 'value' field. Here is what the shell turns up:

williams-ubuntu:31000(mongod-4.4.0) test> show dbs
admin  → 0.000GB
config → 0.000GB
jepsen → 0.000GB
local  → 0.001GB
williams-ubuntu:31000(mongod-4.4.0) test> use jepsen
switched to db jepsen
williams-ubuntu:31000(mongod-4.4.0) jepsen> show tables
read-concern-majority → 0.395MB / 0.184MB
williams-ubuntu:31000(mongod-4.4.0) jepsen> db["read-concern-majority"].find({value: 17226})
Fetched 0 record(s) in 37ms
williams-ubuntu:31000(mongod-4.4.0) jepsen> use local
switched to db local
williams-ubuntu:31000(mongod-4.4.0) local> db.oplog.rs.find({op: "i", "o.value":17226})
{
  "op": "i",
  "ns": "jepsen.read-concern-majority",
  "ui": UUID("e0900a23-8374-4a0a-8257-44dcc788c341")),
  "o": {
    "_id": ObjectId("5ee8f3268afc480b1a5df61d"),
    "value": NumberLong("17226")
  },
  "ts": Timestamp(1592324921, 658),
  "t": NumberLong("40"),
  "wall": ISODate("2020-06-16T16:28:22.567Z"),
  "v": NumberLong("2")
}
Fetched 1 record(s) in 39ms

The document was missing, as we saw in the dbhash check, but the oplog entry for the insert that created it is still present, and we can see its timestamp. Correlating the timestamp of the insert with a particular checkpoint might give some clues about where the corruption exists.

Comment by Susan LoVerso [ 16/Jun/20 ]

alexander.gorrod I took a quick look at the data directories - there are 5, ip-10-122-20-144:2000[01234]. Looking at the turtle files for each one, the checkpoint LSNs are vastly different. Directories 0, 3, and 4 have LSNs in log file 2, directory 1 is in log file 14, and directory 2 is in log file 3. All the directories have different collection and index file names.

Comment by Susan LoVerso [ 16/Jun/20 ]

Thanks. I had downloaded the Jepsen test results and a couple others. I have them and see the directories now.

Comment by William Schultz (Inactive) [ 16/Jun/20 ]

sue.loverso If you download "Jepsen mongod Logs" and untar it you should get a set of directories, one for each node. The mongo data directory will live inside a directory like ip-10-122-21-101:20000/data.

Comment by Susan LoVerso [ 16/Jun/20 ]

william.schultz which of the Files contains the data files? I downloaded a few that seemed logical and cannot find the database directories to look at the data files.

Comment by William Schultz (Inactive) [ 16/Jun/20 ]

After downloading the data files for one of the failed runs in my repro patch, I am able to observe the same data mismatch reported by the dbhash check from within a mongo shell, i.e. one of the secondaries is missing a document that is present on the other nodes. This would seem to imply some kind of corruption of the data on disk. Because of this, there may be more information to be gleaned about where the corruption occurred by examining the data files at the WT layer.

Comment by William Schultz (Inactive) [ 16/Jun/20 ]

I started to look at this and ran a patch build on the tip of v4.4 to see if I could repro it and capture data files for each mongod, which we don't appear to do by default. Looks like it repro'd a couple of times so I am going to see if the data files might shed any light on the root cause.

Comment by Alexander Gorrod [ 16/Jun/20 ]

I did some spelunking on the Evergreen failure; this might be useful when it comes to reproducing the problem. The test runs:

lein run test --test read-concern-majority \
              --mongodb-dir ../ \
              --working-dir /data/mci/ede1f4aff94a11de99caae9e31459b40/src/jepsen-workdir \
              --clock-skew faketime \
              --libfaketime-path /data/mci/ede1f4aff94a11de99caae9e31459b40/src/libfaketime/build/libfaketime.so.1 \
              --mongod-conf mongod_verbose.conf \
              --virtualization none \
              --nodes-file ../nodes.txt \
              --key-time-limit 15 \
              --protocol-version 1 \
              --storage-engine wiredTiger \
              --time-limit 1200 \
              --write-concern w1

The mongod_verbose.conf file seems to contain:

systemLog:
  component:
    replication:
      rollback:
        verbosity: 2
    storage:
      recovery:
        verbosity: 2
  destination: file
  logAppend: true
  path: {{ path-prefix }}/mongod.log
  verbosity: 1
 
net:
  bindIp: 0.0.0.0
 
setParameter:
  enableTestCommands: true
 
storage:
  dbPath: {{ path-prefix }}/data
  journal:
    enabled: true
#   commitIntervalMs: 1000 # might want to verify journal loss windows
  engine: {{ storage-engine }}
  {% ifequal storage-engine "mmapv1" %}
  mmapv1:
    # Disable data file preallocation. We nuke the entire database between runs
    # so this improves startup speed.
    preallocDataFiles: false
    # Reduce journal size from 1G to 128M
    smallFiles: true
  {% endifequal %}
 
replication:
  replSetName: jepsen
  enableMajorityReadConcern: {{ enable-majority-read-concern }}

It seems as though nodes.txt refers to:

:nodes
["ip-10-122-64-239:20000"
 "ip-10-122-64-239:20001"
 "ip-10-122-64-239:20002"
 "ip-10-122-64-239:20003"
 "ip-10-122-64-239:20004"]

but I can't see where that file comes from, or whether Jepsen expects those nodes to already be running and configured.

Comment by Alexander Gorrod [ 16/Jun/20 ]

tess.avitabile and robert.guo the Storage Engines team doesn't know how to get started on chasing this failure. Could you recommend someone we could work with to start digging into what might be going wrong?

I have a few specific questions:

  1. Do you believe it's likely that running the test locally will be worthwhile?
  2. Do you have any hints about how to trace what is leading to the failure? i.e. what set of replica set transitions is happening that leads to the inconsistency?
  3. It is certainly possible that a WiredTiger bug is the root cause here, but the evidence isn't conclusive. Would it make sense for a server engineer to work on this failure as well? If you think so, could you help find someone?
  4. Finally, the first error reported is:

    [2020/06/15 21:21:09.539] ERROR [2020-06-15 21:21:09,538] clojure-agent-send-off-pool-125 - jepsen.mongodb.dbhash dbhash check failed returned non-zero exit status 252 on . STDOUT:

I'm not sure what an exit status of 252 means; a Google search doesn't give any insight into it. Is that failure mode familiar to you?

Comment by Susan LoVerso [ 16/Jun/20 ]

alexander.gorrod only one of my 4 runs failed (the 50% 'half bisect' one). I agree that is not something that seems related to a specific commit. I cannot tell what actually happened because loading the raw output hung my browser. I am going to put this back on the backlog as I have no idea how to chase this.

Comment by Alexander Gorrod [ 15/Jun/20 ]

I went back through the history of the job, and I'm not sure my initial analysis was correct. The Jepsen tests weren't completing for a long time (from at least mid-April). The first run that finished was on May 15th, and that failure has a different signature from the dbhash mismatch error we are chasing here.

There were then 5 successful runs of the Jepsen test, and then a failure with a signature matching that in this ticket on May 20th.

Since that first failure there have been 27 successful test runs and 13 failed runs (counts are approximate based on scrolling through and counting). So approximately 1 in 3 runs are failing. It's possible that this issue has been present since Jepsen testing started being run on the 4.4 branch of MongoDB.

Generated at Thu Feb 08 05:18:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.