[SERVER-81835] potential hardware corruption, read checksum error: block header checksum doesn't match the expected checksum. Created: 04/Oct/23  Updated: 07/Nov/23  Resolved: 07/Nov/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 6.0.3, 6.0.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: 비 서 Assignee: Noopur Gupta
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 7.9


Operating System: ALL
Participants:

 Description   

A total of 12 servers are configured as replica sets to operate a sharded cluster, with each server having three nodes grouped together.

Over the past month, approximately five secondary nodes have encountered issues with the message "potential hardware corruption, read checksum error: block header checksum doesn't match the expected checksum." Attempts to resolve the problem using the repair command have been unsuccessful, and the issue has persisted. Ultimately, the only effective solution was to delete the data and perform a resynchronization.

  • Repair :  Link
  • Delete Data and Resync : Link

However, deleting the data and resyncing is not a practical solution due to the large data capacity of around 25TB. Determining the root cause of this issue has proven to be challenging.

How can I resolve this issue?



 Comments   
Comment by Noopur Gupta [ 07/Nov/23 ]

Closing this ticket since there has been no activity. Feel free to reopen the ticket if the issue still persists.

Comment by Noopur Gupta [ 30/Oct/23 ]

We still need additional information to diagnose the problem. If this is still an issue for you, would you please provide the logs requested earlier after following the resync steps?

Comment by Noopur Gupta [ 09/Oct/23 ]

Hi,

This error message leads us to suspect some form of physical corruption. Please make a complete copy of the database's $dbpath directory as a safeguard, so that you can work off of the current $dbpath.

Since this is a replica set, the ideal resolution is to perform a clean resync from an unaffected node.

You can also try mongod --repair using the latest version of MongoDB.
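
As a rough illustration of the backup and repair steps above, the following Python sketch copies a dbpath and builds the repair command line. It assumes the mongod that owns the dbpath has already been shut down cleanly and that the backup volume has enough free space. The helper names (backup_dbpath, repair_command) and the example path /data/db are hypothetical and not taken from this ticket.

```python
import shutil
from pathlib import Path
from typing import List


def backup_dbpath(dbpath: str, backup_dir: str) -> Path:
    """Copy the whole dbpath tree into backup_dir and return the copy's path.

    Assumption: the mongod using this dbpath is stopped, so the files
    are not changing while they are copied.
    """
    src = Path(dbpath)
    dst = Path(backup_dir) / (src.name + ".bak")
    # copytree raises if dst already exists, so an earlier backup
    # is never silently overwritten.
    shutil.copytree(src, dst)
    return dst


def repair_command(dbpath: str) -> List[str]:
    """Build the argv for an offline repair of the given dbpath."""
    return ["mongod", "--repair", "--dbpath", dbpath]


if __name__ == "__main__":
    # Hypothetical path; substitute the real dbpath before use.
    print(" ".join(repair_command("/data/db")))
```

The command is only printed here, not executed, so it can be reviewed first and then run by hand as the same OS user that normally runs mongod.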

If the issue with --repair still persists after the above steps are performed, then please also provide:

  • The logs leading up to the first occurrence of any issue.
  • The logs of the repair operation.
  • The logs of any attempt to start mongod after the repair operation completed.
  • The command issued to repair the nodes.
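
One possible way to collect "the logs leading up to the first occurrence" is sketched below in Python. It streams a log and returns the lines just before the first one containing a given substring, such as the "read checksum error" text quoted in the description. The function name, the default context size of 200 lines, and the log path in the usage comment are assumptions for illustration; the matching is a plain substring test and does not rely on any particular log format.

```python
from collections import deque
from typing import Iterable, List


def lines_up_to_first_match(lines: Iterable[str], needle: str,
                            context: int = 200) -> List[str]:
    """Return up to `context` lines preceding the first line that
    contains `needle`, followed by that line. Returns [] if no line matches.
    """
    # A bounded deque keeps only the most recent `context` lines,
    # so large log files are never held in memory in full.
    recent = deque(maxlen=context)
    for line in lines:
        if needle in line:
            return list(recent) + [line]
        recent.append(line)
    return []


# Usage sketch (hypothetical log path):
#
#   with open("/var/log/mongodb/mongod.log", errors="replace") as f:
#       for line in lines_up_to_first_match(f, "read checksum error"):
#           print(line, end="")
```

Attaching the full log files remains preferable where their size allows; an extract like this is mainly useful for locating the first occurrence quickly.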

 

Thanks

Noopur

Generated at Thu Feb 08 06:47:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.