[SERVER-19879] WT_SESSION.checkpoint: WiredTiger.wt read error: failed to read 4096 bytes at offset 32768: WT_ERROR: non-specific WiredTiger error Created: 12/Aug/15 Updated: 24/Aug/15 Resolved: 18/Aug/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Chung-yen Chang | Assignee: | Chung-yen Chang |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Operating System: | ALL | ||||
| Steps To Reproduce: | Run YCSB load phase (I suspect a workload that does a large number of inserts with similar sized documents: ~1K would do the same) on a standalone mongod. |
||||
| Participants: | |||||
| Linked BF Score: | 0 | ||||
| Description |
|
While running the load phase of a YCSB test, wiredTiger encountered an error and panic. This has happened 3 out of 4 times so it seems fairly reproducible. The workload is doing a large number of inserts of documents that are 1K each. The mongod log reports some slow inserts (>100 msec) and then showed the following line about wiredTiger. This was seen on an EC2 instance running Enterprise version of mongod (3.1.7-pre, git version: 1b5b55cb48ff9daca74177b86d336613d207c5d1).
One the console where the YCSB test were run, these error repeated many times around the problem and are mostly related. Part of this error message that is interesting is where it says "Caused by: java.io.EOFException), which could be related to the error in mongod log.
|
| Comments |
| Comment by Chung-yen Chang [ 18/Aug/15 ] | |||||||||||||||||
|
I tried the suggestion by michael.cahill to use "killall -w" in the script. After that I have rerun the same script multiple times without hitting this issue again. At this point, I think it is safe to assume that the problem was a duplicate of the earlier issue that Michael has mentioned, that the mongod being killed didn't die cleanly and somehow messed up the new one that is being started. | |||||||||||||||||
| Comment by Chung-yen Chang [ 14/Aug/15 ] | |||||||||||||||||
|
michael.cahill Mm, that sounds possible but I do have some sleep time between steps because of some other issues I ran into before. But let me give this a try and hopefully that's it. I will convert the killall in my script to include "-w". Fingers crossed and I will update when I am back on line. Thanks. | |||||||||||||||||
| Comment by Michael Cahill (Inactive) [ 14/Aug/15 ] | |||||||||||||||||
|
chung-yen.chang, we have seen errors similar to this if the old mongod has not actually shut down before the new one starts. An old mongod can overwrite WiredTiger.turtle because the manual rm -rf has removed the lock file that normally protects against this kind of thing. Can you check if killall -w avoids this error? | |||||||||||||||||
| Comment by Keith Bostic (Inactive) [ 14/Aug/15 ] | |||||||||||||||||
|
chung-yen.chang, are you running a standard YCSB workload, or is this something a little different? Also, would it be possible for me to get ssh access to your instance, run it myself? Thanks! | |||||||||||||||||
| Comment by Michael Cahill (Inactive) [ 13/Aug/15 ] | |||||||||||||||||
|
Thanks for the additional information, chung-yen.chang. From the log, it looks like WiredTiger recovery ran for 15 minutes on startup, but MongoDB didn't detect an unclean shutdown. That is a little odd, but doesn't explain what is happening later. Then the workload runs for over two hours (meaning probably ~100 checkpoints complete successfully), before a checkpoint fails. This failure indicates that as part of a checkpoint, WiredTiger is expecting to read a block from its metadata (the "WiredTiger.wt" file). The file is 32KB and we're trying to read a block of 4KB that starts at the 32KB offset, so the read fails. This indicates that either WiredTiger's metadata has become corrupted (specifically, WiredTiger.turtle is out of sync with WiredTiger.wt) or that the WiredTiger.wt file has been truncated incorrectly since the last checkpoint completed. keith.bostic, how would you suggest tracking this down? Could chung-yen.chang turn on some categories of verbose message while reproducing? | |||||||||||||||||
| Comment by Chung-yen Chang [ 13/Aug/15 ] | |||||||||||||||||
|
michael.cahill, this turned out to be uglier than I thought. I was about to give up because I haven't seen this problem for 5 tries in a row, then it hit again. These transient problems are the worst to debug. You have also asked for the size of the files and here is the "ls -l" output of the database directory. There is about 64G of space left on the volume.
My next step is to try and reproduce this again, and try to capture a core dump when it happens (just learned how to enable that after this last attempt). If there are other information that can help, please let me know. | |||||||||||||||||
| Comment by Chung-yen Chang [ 12/Aug/15 ] | |||||||||||||||||
|
michael.cahill, after hitting this 3 out of 4 times, it stopped showing. I tired to create a simple insert script for a repro. It didn't hit. I re-ran the original YCSB load twice, and it didn't hit. When I hit it 3 out of 4 times, I thought it is easy to reproduce but now it's proving otherwise. I will try it again and see if I can reproduce, and then capture the log and check the file size when it happens. | |||||||||||||||||
| Comment by Michael Cahill (Inactive) [ 12/Aug/15 ] | |||||||||||||||||
|
chung-yen.chang, if you can reproduce this, the full mongod.log would be helpful to start with. In particular, was there ever an unclean shutdown? It would also be good to know the size of the WiredTiger.wt file after this failure: is it 32KB or less in size? |