[SERVER-66467] WiredTiger checksum error Created: 14/May/22  Updated: 02/Jun/22  Resolved: 18/May/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ahmed Nasr Assignee: Chris Kelly
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File WiredTiger.turtle     File WiredTiger.wt    
Issue Links:
Duplicate
is duplicated by SERVER-66468 WiredTiger checksum error Closed
Operating System: ALL
Participants:

 Description   

After repairing the db due to disk corruption, it kept failing with this error:

2022-05-14T10:19:35.986+0000 E  STORAGE  [conn1137] WiredTiger error (0) [1652523575:986492][1:0x7f62e8b4d700], file:collection-1792-6427612125299872100.wt, WT_CURSOR.next: __wt_block_read_off, 283: collection-1792-6427612125299872100.wt: read checksum error for 12288B block at offset 14536716288: block header checksum of 0xf0e3571e doesn't match expected checksum of 0xd8f2e206 Raw: [1652523575:986492][1:0x7f62e8b4d700], file:collection-1792-6427612125299872100.wt, WT_CURSOR.next: __wt_block_read_off, 283: collection-1792-6427612125299872100.wt: read checksum error for 12288B block at offset 14536716288: block header checksum of 0xf0e3571e doesn't match expected checksum of 0xd8f2e206
2022-05-14T10:19:35.986+0000 E  STORAGE  [conn1137] WiredTiger error (0) [1652523575:986814][1:0x7f62e8b4d700], file:collection-1792-6427612125299872100.wt, WT_CURSOR.next: __wt_bm_corrupt_dump, 135: {14536716288, 12288, 0xd8f2e206}: (chunk 1 of 12): 00 00 00 00 00 00 00 00 71 5a 80 00 00 00 00 00 b7 1a 01 00 06 00 00 00 07 05 00 00 00 40 00 00 1e 57 e3 f0 01 00 00 00 11 e3 3f c9 17 80 e2 1f 05 85 3f 00 00 07 5f 69 64 00 61 c0 dc dd f7 b7 23 3d 00 00 00 00 00 00 f7 b4 04 f0 3c 08 1f 5c 7f 83 65 02 75 73 65 72 6e 61 6d 65 00 0f 00 00 00 32 39 30 31 31 30 37 31 35 30 30 30 39 3.........
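
For reference, a minimal sketch of how a collection could be checked for further corrupt blocks once the node is able to start, assuming the legacy 4.2 mongo shell and hypothetical host, database, and collection names:

    # Hypothetical names and host; full validation reads every block and surfaces checksum errors.
    mongo --host localhost --port 27017 --eval '
      printjson(db.getSiblingDB("mydb").runCommand({ validate: "mycoll", full: true }));
    '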



 Comments   
Comment by Ahmed Nasr [ 19/May/22 ]

Hi Chris,
Thanks for your help and insights. I wish you the best.

Best regards,
Ahmed

Comment by Chris Kelly [ 18/May/22 ]

Hi Ahmed,

Thanks for providing a good timeline of events on this. Without logs, we can't discern much more, but there is a chance that the primary node is experiencing some sort of persistent issue that occurs intermittently. That could have led to latent corruption on the node that only became a problem when the secondary required an initial sync, forcing the primary to read pages containing corrupt data (and producing the error you mentioned).

Among corruption cases, checksum failures are usually caused by file system or disk issues, so we are somewhat suspicious of glusterfs. There appear to be ways it can lead to data loss in certain situations. We can't speak to it in detail, but there have been a few past reports of corruption from users running OpenShift with glusterfs, including a mention in SERVER-40088 involving other database systems.

Since we don't have much more information to go on, and you have already resolved the issue, I'm going to close this ticket for now.

Regards,
Christopher

Comment by Ahmed Nasr [ 17/May/22 ]

Hi Chris,

Our cluster consists of 3 nodes: 1 primary, 1 secondary, and 1 arbiter.

The cluster is deployed on OpenShift with glusterfs as the underlying file system.

What happened is:

1- The secondary node failed due to a similar error.

2- The whole volume of the secondary was removed and the node was started again with a clean volume.

3- The secondary joined the cluster and started syncing from the primary.

4- During the sync, the primary failed with the corruption error.

5- The secondary then failed because it hadn't synced enough data to be able to act as primary.

6- The primary was restarted automatically and was healthy.

7- The cycle from steps 3 to 6 then kept repeating.
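
For anyone watching a cycle like this in the future, a minimal sketch of checking member states from the shell (hypothetical host name, 4.2-era mongo shell):

    # Hypothetical host; prints each member's state so the fail/resync cycle is visible over time.
    mongo --host mongodb-primary.example.local --port 27017 --eval '
      rs.status().members.forEach(function (m) {
        print(m.name + " : " + m.stateStr + " (health=" + m.health + ")");
      });
    '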

Comment by Chris Kelly [ 17/May/22 ]

Hi ahmed.nasr@fixedmea.com,

It's hard to say without logs, which would include exact information about your setup. Corruption can happen in any number of ways. If you still have logs of this event, please upload them via the support link.

We would be especially interested in figuring out why your node's initial sync failed. Did you get the exact same error you reported mid-sync on the fresh node, or did it happen on the other node at any point? If you could provide a clearer timeline of the events, that would be helpful for future issues.

However, we do have some guidelines that should cover some common reasons for this.

To avoid a problem like this in the future, it is our strong recommendation to:

 Regards,
Christopher

Comment by Ahmed Nasr [ 17/May/22 ]

Hi Chris,

Thanks for your support. Unfortunately it was a major incident and we couldn't afford to wait. We tried repairing, but it stated that the node had to sync from another node, and that node was down. In fact, that node was still resyncing from this one and failed halfway through due to that error. We had to restore a previous backup onto a clean new replica set.

But for future reference, what do you think might have caused such corruption, so we can watch for it in our new deployment?

Comment by Chris Kelly [ 17/May/22 ]

Hi ahmed.nasr@fixedmea.com,

The ideal resolution is to perform a clean resync from an unaffected node. In your case, I'd recommend that next if you are running a replica set.

You can also try mongod --repair using the latest patch of your version (in your case, 4.2.20).
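
A minimal sketch of the two options above, assuming a hypothetical dbpath and service setup (adjust to how the node is actually managed, e.g. a container restart on OpenShift):

    # Option 1: clean resync of the affected member (hypothetical paths; mongod managed by systemd here).
    systemctl stop mongod
    rm -rf /data/db/*          # wipe the data files so the member performs a fresh initial sync on restart
    systemctl start mongod     # the member rejoins the replica set and initial-syncs from a healthy node

    # Option 2: offline repair with the latest 4.2 patch binary (mongod must be stopped first).
    mongod --dbpath /data/db --repair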

In the event that running --repair with 4.2.20 is unsuccessful, please provide the following:

  • The logs leading up to the first occurrence of any issue.
  • The logs of the repair operation.
  • The logs of any attempt to start mongod after the repair operation completed.

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?
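
A sketch of the archiving step, with hypothetical log and dbpath locations:

    # Hypothetical paths; adjust to your log location and dbpath before uploading.
    tar czf mongod-diagnostics.tar.gz /var/log/mongodb/mongod.log* /data/db/diagnostic.data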

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Regards,
Christopher
