[SERVER-66467] WiredTiger checksum error Created: 14/May/22 Updated: 02/Jun/22 Resolved: 18/May/22
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.2.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ahmed Nasr | Assignee: | Chris Kelly |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
After repairing the db due to disk corruption, the db kept failing due to this error:

2022-05-14T10:19:35.986+0000 E STORAGE [conn1137] WiredTiger error (0) [1652523575:986492][1:0x7f62e8b4d700], file:collection-1792-6427612125299872100.wt, WT_CURSOR.next: __wt_block_read_off, 283: collection-1792-6427612125299872100.wt: read checksum error for 12288B block at offset 14536716288: block header checksum of 0xf0e3571e doesn't match expected checksum of 0xd8f2e206
| Comments |
| Comment by Ahmed Nasr [ 19/May/22 ] |
Hi Chris,

Best regards,
| Comment by Chris Kelly [ 18/May/22 ] |
Hi Ahmed,

Thanks for providing a good timeline of events on this. Without logs we can't discern much more, but there is a chance that the primary node is sustaining some sort of persistent issue that occurs irregularly. This could have led to latent corruption on the node that only became a problem when the secondary required an initial sync, requiring the primary to visit pages with corrupt data (and then causing the error you mentioned).

Among corruption cases, checksum failures are usually caused by file system or disk issues, so we are somewhat suspicious of glusterfs. There appear to be ways this can lead to data loss in certain situations. We can't speak to it in much detail, but there appear to be a few past examples of corruption affecting users running OpenShift with glusterfs, including a mention of it taking place in

Since we don't have much more information to go on, and you have already resolved the issue, I'm going to go ahead and close this ticket for now.

Regards,
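For anyone wanting to check for latent corruption of this kind before an initial sync trips over it, a minimal sketch using the collection validate command might look like the following; the database and collection names are placeholders, not values from this ticket.

```sh
# Sketch: run a full validation of a suspect collection to surface latent
# on-disk corruption. "mydb" and "mycollection" are placeholder names.
mongo "mongodb://localhost:27017/mydb" --quiet \
  --eval 'printjson(db.runCommand({ validate: "mycollection", full: true }))'
# A result with "valid: false" (or a non-empty "errors" array) points at the
# kind of block-level damage reported in this ticket.
```

Full validation reads every page of the collection, so it is I/O-heavy and best run in a maintenance window.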
| Comment by Ahmed Nasr [ 17/May/22 ] |
Hi Chris,

Our cluster consists of 3 nodes: 1 primary, 1 secondary and 1 arbiter. The cluster is deployed on OpenShift with glusterfs as the underlying file system. What happened is:

1. The secondary node failed due to a similar error.
2. The whole volume of the secondary was removed and the node was started again with a clean volume.
3. The secondary joined the cluster and started syncing from the primary.
4. During the sync the primary failed with the corruption error.
5. The secondary then failed as it hadn't synced enough data to be able to act as primary.
6. The primary was restarted automatically and was healthy.
7. The cycle from 3 to 6 then kept repeating.
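As a side note for anyone reproducing this cycle, the member states during steps 3 to 6 can be watched by polling rs.status(); the host name below is a placeholder for one of the cluster's pods.

```sh
# Sketch: print each replica set member's state (PRIMARY, SECONDARY, ARBITER,
# STARTUP2 during initial sync, etc.). The host name is a placeholder.
mongo --host mongo-0.example.svc --port 27017 --quiet \
  --eval 'rs.status().members.forEach(function (m) { print(m.name + "  " + m.stateStr + "  health=" + m.health); })'
```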
| Comment by Chris Kelly [ 17/May/22 ] |
It's hard to say without logs, which would include exact information about your setup. Corruption can happen in any number of ways. If you still have logs of this event, please upload them to the support link if you can. We would be especially interested in figuring out why your node's initial sync failed. Did you get the exact same error you reported mid-sync on the fresh node? Or did it happen on the other node at any point? If you could provide a clearer timeline of the events, that would be helpful for future issues.

However, we do have some guidelines that should cover some common reasons for this. To avoid a problem like this in the future, it is our strong recommendation to:

Regards,
| Comment by Ahmed Nasr [ 17/May/22 ] |
Hi Chris,

Thanks for your support. Unfortunately it was a major incident and we couldn't afford to wait. We tried repairing, but unfortunately it stated that it had to sync from another node, and that node was down. In fact, that node was still resyncing from this one and failed halfway through due to that error. We had to restore a previous backup onto a clean new replica set.

For future reference, what do you think might have caused such corruption, so we can watch for it in our new deployment?
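The ticket does not say how the backup was taken; if it was a mongodump archive, restoring it onto the new replica set could look roughly like the sketch below, where the connection URI and archive path are placeholders.

```sh
# Sketch: restore a mongodump archive into the new, clean replica set.
# The URI and archive path are placeholders, not values from this ticket.
mongorestore --uri "mongodb://mongo-0.example.svc:27017,mongo-1.example.svc:27017/?replicaSet=rs0" \
  --gzip --archive=/backups/backup-2022-05-13.archive --drop
```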
| Comment by Chris Kelly [ 17/May/22 ] |
The ideal resolution is to perform a clean resync from an unaffected node. In your case, I'd recommend that next if you are running a replica set. You can also try mongod --repair using the latest patch of your version (in your case, 4.2.20). In the event that running --repair using 4.2.20 is unsuccessful, then please provide the following:
Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Regards,
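To make the two recovery paths and the requested diagnostics concrete, a rough sketch follows. It assumes a dbpath of /data/db, a 4.2.20 mongod binary on the PATH, and that the damaged member's process can be stopped and started directly; every path, host name and file name is a placeholder rather than something taken from this ticket.

```sh
# Option 1 (sketch): attempt a repair of the damaged node with the latest
# 4.2.x binary. mongod must be stopped first; /data/db is a placeholder dbpath.
mongod --dbpath /data/db --repair

# Option 2 (sketch): clean resync of the damaged member from an unaffected node.
# Stop the member, set its data files aside, and restart it with its usual
# configuration so it performs a fresh initial sync from the healthy node.
systemctl stop mongod              # or however the pod/process is managed
mv /data/db /data/db.corrupt       # keep the old files until the sync finishes
mkdir -p /data/db
systemctl start mongod

# Diagnostics (sketch): archive the mongod logs and the diagnostic.data
# directory for upload to the support portal.
tar czvf server-66467-diagnostics.tar.gz /var/log/mongodb/mongod.log* /data/db/diagnostic.data
```

Of the two, the clean resync is generally the safer option, since --repair discards any data it cannot salvage.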