[SERVER-48389] WT read checksum error, what could be the cause? Created: 23/May/20 Updated: 27/May/20 Resolved: 27/May/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Adrien Jarthon | Assignee: | Carl Champain (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Participants: |
| Description |
|
Hello! On May 20 at 14:33 (UTC+2) my primary server crashed suddenly (no specific job or load) with this error:
The failover fortunately went well: one of the secondaries became primary. I had a look at this error, which looks like a read of corrupted data (not sure whether it's from disk or memory?). The server is fairly new; I ran some tests on the disk and memory but didn't find any problem. Can you provide a bit more detail about what could cause this error? Is there any way I could determine whether it's hardware or software? Anything to investigate in the MongoDB code itself? After that (May 20 at 22:21 UTC+2) I tried restarting mongo (without clearing the data dir) to see whether it could recover by itself, and it ended up in a weird state: mongo seemed to start normally according to the logs, and I didn't see anything suspicious related to corrupted data (maybe you'll find something). The log says the server went into rollback state and then secondary, but then talks about rollback again. Here is an extract:
But at this point, it was not catching up or doing anything visible (no CPU or disk activity), and I could NOT connect to it: when I started the mongo shell, it would initiate the connection but hang before displaying the prompt (no error). In the logs I only see a bunch of (probably from your monitoring daemon?):
It was impossible to get a shell and the server didn't move. Even running "sudo systemctl stop mongodb" did NOT stop the server; it hung. I had to send a SIGTERM to stop it (I think systemctl also sends SIGTERM, so I'm not sure why that worked; maybe it was just slow and I got lucky on the timing). I think there was not enough headway on the other servers for a catch-up sync, so it may be expected that this server stayed stuck, but I did not expect to be unable to get a prompt and send commands. Is this expected? After that I decided to clear the data dir and do an initial sync, so unfortunately I don't have the diagnostic.data any more, but I have the complete log if you want it. |
| Comments |
| Comment by Carl Champain (Inactive) [ 27/May/20 ] |
|
Yes, this error is most likely about disk failure. |
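For readers unfamiliar with read checksums, the general idea can be sketched as follows. This is a minimal illustration only, not WiredTiger's actual block format or checksum algorithm: `zlib.crc32` is used as a stand-in, and the names `write_block`, `read_block`, and `flip_bit` are hypothetical. A checksum is stored with the data at write time and recomputed at read time; a mismatch only shows that the bytes changed after they were written.

```python
import zlib


def write_block(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte CRC32 computed at write time."""
    checksum = zlib.crc32(payload)
    return checksum.to_bytes(4, "big") + payload


def read_block(block: bytes) -> bytes:
    """Recompute the checksum on read and fail if it differs from the stored one."""
    stored = int.from_bytes(block[:4], "big")
    payload = block[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("read checksum error: stored and computed checksums differ")
    return payload


def flip_bit(block: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Simulate physical corruption by flipping one bit of the stored block."""
    corrupted = bytearray(block)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)
```

Reading an intact block returns the payload unchanged, while reading a block passed through `flip_bit` raises the error, which is the general situation a read checksum failure reports.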
| Comment by Adrien Jarthon [ 27/May/20 ] |
|
OK, thanks. So this error is more about disk failure than RAM? As I don't have the data dir (I have most of the rest, though), let's not spend our time trying to dig further, as it'll probably lead nowhere. If I see this again I'll definitely try to keep the data dir. |
| Comment by Carl Champain (Inactive) [ 27/May/20 ] |
|
The WT read checksum error leads us to suspect some form of physical corruption. If you encounter this problem again in the future, holding onto a copy of the database's $dbpath directory would be helpful. Our ability to determine the source of corruption would also depend on your ability to provide:
We recommend you keep an eye out for that host.
The ideal resolution is to perform a clean resync from an unaffected node. Otherwise unexpected behavior may occur and the corruption problem won't be fixed. I will now close this ticket. Kind regards, |
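One possible way to hold onto a copy of the $dbpath directory before clearing it for a resync is sketched below. This is an assumption-laden sketch, not a procedure given in this ticket: the function name `archive_dbpath` and any paths passed to it are hypothetical, and it assumes the mongod process has already been stopped so that the files are not changing while they are copied.

```python
import tarfile
from pathlib import Path


def archive_dbpath(dbpath: str, dest: str) -> Path:
    """Pack the whole dbpath directory into a gzip-compressed tar archive.

    Assumes mongod is stopped and that `dest` lies outside `dbpath`.
    """
    src = Path(dbpath)
    if not src.is_dir():
        raise FileNotFoundError(f"dbpath is not a directory: {src}")
    out = Path(dest)
    with tarfile.open(out, "w:gz") as tar:
        # arcname keeps paths in the archive relative to the directory name.
        tar.add(src, arcname=src.name)
    return out
```

For example, `archive_dbpath("/path/to/dbpath", "/path/to/backup/dbpath.tar.gz")` (both paths are placeholders) would leave the original files untouched and produce an archive that can be kept for later analysis.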