[SERVER-66250] WT_PANIC when non-fsynced writes are lost Created: 05/May/22  Updated: 27/Oct/23  Resolved: 02/Jun/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kyle Kingsbury Assignee: Keith Smith
Resolution: Works as Designed Votes: 0
Labels: Crash, WiredTiger
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongod.tar.bz2    
Operating System: ALL
Steps To Reproduce:

I'm afraid a straightforward repro case will have to wait for the release of LazyFS (hopefully in a few weeks), but if you have the opportunity to stub out the loss of un-fsynced writes yourself, that might be one possible way to reproduce this!
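
To make the fault model concrete in the meantime, here's a minimal sketch in Python of the kind of stub I have in mind. Everything here is hypothetical illustration, not LazyFS code: writes land in a volatile cache and only survive a simulated power loss if an fsync covered them.

# Minimal sketch of the fault model. All names are hypothetical;
# this is not LazyFS code.
class LossyFile:
    def __init__(self):
        self.durable = bytearray()  # what "disk" holds as of the last fsync
        self.cache = bytearray()    # in-memory view, including unsynced writes

    def write(self, offset, data):
        end = offset + len(data)
        if end > len(self.cache):
            self.cache.extend(b"\x00" * (end - len(self.cache)))
        self.cache[offset:end] = data  # applied in memory only

    def fsync(self):
        self.durable = bytearray(self.cache)  # now, and only now, durable

    def lose_unsynced_writes(self):
        self.cache = bytearray(self.durable)  # simulated power loss

f = LossyFile()
f.write(0, b"synced")
f.fsync()
f.write(6, b", not synced")
f.lose_unsynced_writes()
assert bytes(f.cache) == b"synced"  # the un-fsynced suffix is gone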

Participants:

 Description   

When mongod 4.4.9's data directory loses writes which were not explicitly fsynced, attempts to restart mongod can throw WT_PANIC during the WAL recovery process.

LazyFS is a new FUSE filesystem (presently unreleased, but we hope to publish it shortly) which simulates the effects of power loss by maintaining an in-memory page cache for all writes, and only flushing pages to its backing store when explicitly asked to fsync. We've integrated LazyFS into Jepsen by mounting /var/lib/mongodb on LazyFS; after killing mongod, we ask LazyFS to drop any un-fsynced writes. Here's an example of this test in action. Nodes n1, n3, and n6 all panicked on restart. Here's n3 beginning recovery:

{"t":{"$date":"2022-05-05T10:59:26.822-04:00"},"s":"W",  "c":"STORAGE",  "id":22302,   "ctx":"initandlisten","msg":"Recovering data f
rom the last clean checkpoint."}
...
 {"t":{"$date":"2022-05-05T10:59:59.581-04:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger messag
e","attr":{"message":"[1651762799:581115][1200823:0x7f257df12cc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 1 through 
2"}}
{"t":{"$date":"2022-05-05T11:00:00.043-04:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger messag
e","attr":{"message":"[1651762800:43098][1200823:0x7f257df12cc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 2 through 2
"}}
{"t":{"$date":"2022-05-05T11:00:01.001-04:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger messag
e","attr":{"message":"[1651762801:623][1200823:0x7f257df12cc0], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Main reco
very loop: starting at 1/92800 to 2/256"}}

Roughly 35 seconds later, the process crashed, complaining of pread failing to read a byte at a specific offset:

{"t":{"$date":"2022-05-05T11:00:01.475-04:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":-31802,"message":"[1651762801:475745][1200823:0x7f257df12cc0], file:WiredTiger.wt, WT_CURSOR.next: __posix_file_read, 428: /var/lib/mongodb/WiredTiger.turtle: handle-read: pread: failed to read 1 bytes at offset 1456: WT_ERROR: non-specific WiredTiger error"}}
{"t":{"$date":"2022-05-05T11:00:01.475-04:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":-31809,"message":"[1651762801:475901][1200823:0x7f257df12cc0], file:WiredTiger.wt, WT_CURSOR.next: __wt_turtle_read, 387: WiredTiger.turtle: fatal turtle file read error: WT_TRY_SALVAGE: database corruption detected"}}
{"t":{"$date":"2022-05-05T11:00:01.475-04:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":-31804,"message":"[1651762801:475935][1200823:0x7f257df12cc0], file:WiredTiger.wt, WT_CURSOR.next: __wt_turtle_read, 387: the process must exit and restart: WT_PANIC: WiredTiger library panic"}} 

Since this is a new filesystem and we haven't sanded off all the bugs yet, I can't rule out that LazyFS has introduced some sort of unexpected corruption beyond what the loss of unsynced writes alone would cause. Still, it seems odd that the whole process had to crash, given that at least some of the WAL was apparently readable. Does this sound like expected behavior?



 Comments   
Comment by Keith Smith [ 02/Jun/22 ]

Glad to help.  As I said, LazyFS looks like a useful tool.  I'll be curious to hear how your future testing goes.

I'll go ahead and close this as you suggest.

Comment by Kyle Kingsbury [ 01/Jun/22 ]

Excellent find! The LazyFS team and I have confirmed that this is, in fact, a bug in LazyFS. They've fixed this issue (and several more), but our newly expanded test suite for LazyFS itself still shows some bugs remaining in the filesystem, which we'd like to address before testing Mongo again. Let's go ahead and close this issue.

Thank you!

Comment by Keith Smith [ 24/May/22 ]

Thanks, aphyr@jepsen.io, for following up and providing the contents of the MongoDB database directory from your test.

Here's what the end of the WiredTiger.turtle file looks like:

$ hexdump -c -s 0x550 WiredTiger.turtle 
0000550   _   t   x   n   =   -   1   1   ,   p   r   e   p   a   r   e
0000560   =   0   ,   w   r   i   t   e   _   g   e   n   =   1   2   ,
0000570   r   u   n   _   w   r   i   t   e   _   g   e   n   =   9   )
0000580   )   ,   c   h   e   c   k   p   o   i   n   t   _   b   a   c
0000590   k   u   p   _   i   n   f   o   =   ,   c   h   e   c   k   p
00005a0   o   i   n   t   _   l   s   n   =   (   2   ,   2   5   6   )
00005b0  \n  \0 

The newline character at position 0x5b0 is the last thing that WiredTiger writes into the file. So that's where the file should end.  The additional null character suggests a problem somewhere that is causing the file size to be one byte longer than it should be.
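
For reference, here is a quick sketch (in Python) of the check I'm describing; the path is the one from the logs above:

# Sketch: check whether WiredTiger.turtle ends at the final newline,
# or has trailing bytes after it.
with open("/var/lib/mongodb/WiredTiger.turtle", "rb") as f:
    data = f.read()

print("size:", len(data), "bytes; last byte:", repr(data[-1:]))
if data.endswith(b"\n"):
    print("file ends at the newline, as expected")
else:
    print("unexpected trailing byte(s) after the final newline")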

Since you provided a pointer to LazyFS, I downloaded and installed it on a test system. In playing with it a little from the shell, it does look like there is a bug in how it handles rename operations.  Here is an example:

$ cd /mnt/lazyfs
$ echo "This is a small file" > small
$ echo "This is a slightly larger file" > large
$ ls -l
total 8
-rw-r--r-- 1 admin admin 31 May 24 18:28 large
-rw-r--r-- 1 admin admin 21 May 24 18:28 small
$ mv small large
$ ls -l
total 4
-rw-r--r-- 1 admin admin 31 May 24 18:36 large
$ cat large
This is a slightly larger file

The issue here is that after renaming small to large, LazyFS still shows the original size and contents of large, rather than the size and contents of small.  This isn't the same symptom that we're seeing with the WiredTiger.turtle file, but it does seem that the LazyFS rename implementation is not behaving in the way our software expects. Note that I am not injecting any faults here.
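
For completeness, here is the same check as a small Python script, so it's easy to rerun. It assumes the /mnt/lazyfs mount point above and, again, injects no faults:

# Sketch: reproduce the rename check from the shell session above.
import os

base = "/mnt/lazyfs"
small = os.path.join(base, "small")
large = os.path.join(base, "large")

with open(small, "w") as f:
    f.write("This is a small file\n")            # 21 bytes
with open(large, "w") as f:
    f.write("This is a slightly larger file\n")  # 31 bytes

os.rename(small, large)

# POSIX rename semantics: "large" must now have the size and contents
# that "small" had before the rename.
assert os.stat(large).st_size == 21
with open(large) as f:
    assert f.read() == "This is a small file\n"
print("rename behaved as expected")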

I built and installed LazyFS following the directions on GitHub. The only change I made was to set the fifo_path parameter in config/default.toml to point to a location on my system. I see this behavior both with the head of the LazyFS main branch and at the commit Jepsen seems to be using.

Please let me know if I am misunderstanding something about how I should be using LazyFS or about the behavior it is supposed to provide.

Comment by Kyle Kingsbury [ 23/May/22 ]
1. Yep, this is Linux: specifically Debian 11, kernel 5.10.0-13-cloud-amd64 (Debian 5.10.106-1, 2022-03-17), x86_64 GNU/Linux, running in both AWS VMs and in local (also Debian) LXC containers.

2. Sure looks like the file sizes are the same...

 

admin@ip-172-31-9-175:/var/lib/mongodb$ ls -l WiredTiger.turtle 
-rw------- 1 mongodb nogroup 1458 May 23 18:23 WiredTiger.turtle
admin@ip-172-31-9-175:/var/lib/mongodb$ wc -c < WiredTiger.turtle 
1458

 

Here's the whole data dir, in case it's helpful: mongod.tar.bz2

3. I'm afraid I don't have before-and-after snapshots, and I'm not sure how to get those reliably: LazyFS doesn't support consistent snapshots the way ZFS does, MongoDB is presumably writing data the whole time, and I don't know in advance which nodes are going to be corrupted, haha. That said, I do have reproduction instructions now, and the LazyFS repo is now public, so you can run these tests yourself! Clone https://github.com/jepsen-io/mongodb, check out 44af5b7a43d8a51fe29ac907e4842efd807805b1, and run:

 

lein run test --username admin --nodes-file ~/nodes --hidden 1 --read-concern majority --txn-read-concern snapshot --write-concern majority --txn-write-concern majority --nemesis kill --lazyfs --time-limit 120 --nemesis-interval 20 --leave-db-running

 

That'll leave the database running so you can log in to nodes and see exactly what's going on after the fact.

It kinda looks like it's trying to read the offset just after the end of the file: when a node logs `failed to read 1 bytes at offset 1456`, for example, the WiredTiger.turtle file is exactly 1456 bytes in size.
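
That's consistent with POSIX pread semantics, where a read at an offset equal to the file size returns zero bytes. A quick Python sketch (the filename is made up):

# Sketch: os.pread at an offset equal to the file size returns b"" (EOF),
# which matches "failed to read 1 bytes at offset 1456" on a 1456-byte file.
import os

with open("demo.bin", "wb") as f:
    f.write(b"x" * 1456)

fd = os.open("demo.bin", os.O_RDONLY)
try:
    print(os.pread(fd, 1, 1455))  # b'x': the last byte of the file
    print(os.pread(fd, 1, 1456))  # b'': zero bytes, i.e. end-of-file
finally:
    os.close(fd)
os.remove("demo.bin")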

4. And no, I'm afraid it wasn't inspired by Torturing Databases for Fun and Profit. I actually had this in mind when I first designed Jepsen back in 2013, but it's taken a while to make time to work on it! I know FoundationDB did testing like this (with actual physical power faults, too!), and there have been several filesystem-level fault-injection papers with related work in recent years as well, this one among them.

Comment by Keith Smith [ 18/May/22 ]

Thanks again, aphyr@jepsen.io, for reporting this.

To start with your final question: no, it's our intention that the system recover from any combination of unsynced writes making it to disk (or not making it to disk) before a failure. In the worst case, the state of the system should reflect the last successful fsync call.

With that in mind, I'm seeing two different error messages in the logs you provided, and I'd like to work with you to understand them:

  • Nodes n1 and n3 are both encountering the "__posix_file_read, 428: /var/lib/mongodb/WiredTiger.turtle: handle-read: pread: failed to read 1 bytes at offset 1456: WT_ERROR: non-specific WiredTiger error" you reported above. This looks like the file system returned EOF when WiredTiger was still expecting one more byte of data in the WiredTiger.turtle file. 
  • The n6 node is encountering "__wt_block_checkpoint_resolve, 928: the process must exit and restart: WT_PANIC: WiredTiger library panic". In this case WiredTiger is failing when it tries to parse a configuration string. The string appears to have been truncated and is missing a closing paren, i.e., a single character. This makes me suspect that in this case the file system returned one too few bytes from the WiredTiger.turtle file.

These look like two different symptoms of a mismatch between the file size and the contents of the WiredTiger.turtle file. Because this file typically changes size by only a couple of bytes in either direction, I suspect that when we update the file we are somehow winding up with the new contents and an old size, or vice versa.

The problem with this hypothesis is that it doesn’t match the way WiredTiger updates the turtle file. WiredTiger updates this file atomically by first creating a temporary file, writing the updated content to that file in one or two write calls, performing a flush and fsync, and then renaming the temporary file to WiredTiger.turtle. So if there is a problem with the file size on the new version of the turtle file, I would expect to see either a file size of zero, or the size after the first write, which would be a few tens of bytes.
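
In rough pseudocode, that update sequence looks like this (a Python sketch of the general idiom, not WiredTiger's actual code; the temporary-file name is a placeholder):

# Sketch of the atomic-replace idiom described above.
import os

def atomic_replace(path, data):
    tmp = path + ".tmp"  # placeholder name for the temporary file
    with open(tmp, "wb") as f:
        f.write(data)             # one or two writes of the new contents
        f.flush()
        os.fsync(f.fileno())      # new contents are durable before the rename
    os.rename(tmp, path)          # readers see either the old or the new file
    # Fsyncing the directory makes the rename itself durable as well
    # (part of the usual idiom, not spelled out in the description above).
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)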

I’m not sure the delayed application of un-fsynced writes is involved here. During recovery, WiredTiger reads the turtle file, applies the WAL, rewrites the turtle file, then proceeds with normal startup. Based on the MongoDB log files you provided, the failures are all happening after WiredTiger has completed recovery and rewritten the turtle file. I.e., after the "WiredTiger opened" message. So it appears that there was no intervening crash between writing and reading back the turtle file.

Since LazyFS is a new file system, is it possible it has a bug in tracking the size of small files or updating file sizes across a rename call, or is doing something else differently that we don't expect during the rename?

If not, we will need some additional information to better understand what is happening here.

Can you:

  1. confirm that you are running on Linux (which I infer from the fact that you’re using FUSE).
  2. confirm that the size of /var/lib/mongodb/WiredTiger.turtle reported by ls -l matches the number of bytes in the file reported by something like wc -c < /var/lib/mongodb/WiredTiger.turtle.
  3. and provide both the WiredTiger.turtle and WiredTiger.wt files before and after a reproduction of the issue? I.e., after your simulated power failure, and again after you hit the error trying to restart MongoDB.

BTW, LazyFS is a great idea for a testing tool. It reminds me of the Torturing databases for fun and profit paper from a few years ago. I don't suppose that was an inspiration?

Comment by Edwin Zhou [ 11/May/22 ]

Hi aphyr@jepsen.io, thank you for reporting this ticket. We will pass this along to the storage engines team to further investigate this issue.

Best,
Edwin
