[SERVER-33188] WiredTiger.wt, connection: read checksum error for 4096B block Created: 08/Feb/18  Updated: 06/May/18  Resolved: 09/Feb/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.2.11
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Paule Lecuyer Assignee: Mark Agarunov
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian Stretch


Attachments: File broken_wt_db.tgz     Text File mongodb.log     Text File mongodb_files_list.txt     File repair-SERVER-33188.tar.gz     File sizeStorer.wt    
Participants:

 Description   

Same problem as SERVER-28555 reported by Edgars Beigarts.

Our system went down due to an electrical blackout. Since then I cannot restart mongo: on startup I get the error "WiredTiger.wt, connection: read checksum error for 4096B block".

mongod --repair fails with the same error.
I also tried to repair the WiredTiger.wt file with the following command:
./wt -v -R -h/data/mongodb salvage WiredTiger.wt
But it also fails with the same error.
It looks like the WiredTiger.wt file is truncated.

Is there any tool capable of fixing this? Can you help me repair this file?

Thanks
Paule



 Comments   
Comment by Ramon Fernandez Marina [ 06/May/18 ]

Apologies for the radio silence plecuyer, we let this one fall through the cracks. Given the corruption you experienced, unfortunately there was no way to do any further recovery of the data. SERVER-19815 is open to attempt to do more in these cases, but the recommendations above are still the best approach.

Regards,
Ramón.

Comment by Paule Lecuyer [ 10/Feb/18 ]

Hello Mark,

We'll apply your recommendations in the future, but unfortunately our current backups are too old, and we haven't configured replication for it.
Is there really no way to try to recover the db, even if it requires many operations?

Thanks,
Paule

Comment by Mark Agarunov [ 09/Feb/18 ]

Hello plecuyer,

Unfortunately, this error indicates that there was corruption on the disk, most often caused by a faulty storage layer. In this situation, our best recommendation would be to resync the affected node or restore from a backup if possible.

To prevent this type of problem in the future, please take note of the following guidelines to help mitigate any issues related to unreliable storage layers or server failures.

Thanks,
Mark

Comment by Paule Lecuyer [ 09/Feb/18 ]

Thanks for your quick answer, Mark.

I did as you said, but got another error. It seems that the sizeStorer.wt file is also corrupted.
I've attached mongodb.log and sizeStorer.wt.

Have a nice day.
Paule

Comment by Mark Agarunov [ 08/Feb/18 ]

Hello plecuyer,

Thank you for your report. I've attached a repair attempt of the files you provided. Please extract these files and replace them in your $dbpath and let us know if it resolves the issue. If you are still seeing errors after replacing these files, please provide the complete logs from the affected node(s) so that we can further investigate. Using systemd to stop mongod should perform a clean shutdown.

Thanks,
Mark

Comment by Paule Lecuyer [ 08/Feb/18 ]

After further investigation, it seems that the mongodb corruption did not happen after the electrical blackout, but after a "normal" system reboot. Just before the reboot, mongodb was consuming a lot of CPU.

Does the "mongodb.service" command kill the mongo process uncleanly, causing such corruption?

Paule.

Comment by Paule Lecuyer [ 08/Feb/18 ]

I have attached the list of all files in the mongodb data dir, and the WiredTiger* files.
Paule.

Generated at Thu Feb 08 04:32:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.