[SERVER-27224] file:WiredTiger.wt read checksum error, mongodb won't start Created: 30/Nov/16  Updated: 13/Aug/18  Resolved: 14/Mar/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: rudolp esquilon Assignee: David Hows
Resolution: Done Votes: 0
Labels: envm, rfi, rpu, trcf, wtc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File WiredTiger.turtle     File WiredTiger.wt     File WiredTigerLAS.wt     File repair_attempt.tar.gz    
Operating System: Linux
Participants:

 Description   

Hi,

I'm hitting the same issue as this one:
https://jira.mongodb.org/browse/SERVER-25285

thank you!



 Comments   
Comment by rudolp esquilon [ 14/Mar/17 ]

I see... anyway, thank you.

Comment by David Hows [ 14/Mar/17 ]

Hi Rudolph,

Sorry for the large delay in getting back to you.

I've tried to get a repair of your data-set going, and got to the point you did. I then made some minor journal modifications to work around the issue you were seeing. From there, I started looking at the integrity of all your files. None of the database files were able to validate or be read, all were corrupted.

I also did a manual examination of one of the internal catalog tables (_mdb_catalog.wt), which has a fixed BSON String format for data (unlike the data-containing collections, which have no fixed schema). In this file I found strings from JSON documents and other things that looked highly out of place. Given this, I believe that something external to MongoDB (likely the filesystem) has caused corruption in your database files.
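The manual examination described above amounts to pulling printable strings out of the file and checking whether they belong there. A minimal sketch of that technique (the helper and the sample bytes are illustrative, not the actual catalog contents):

```python
import re

def printable_runs(blob: bytes, min_len: int = 6):
    # Crude analogue of the Unix `strings` tool: extract runs of printable
    # ASCII of at least min_len bytes. Scanning a fixed-format file such as
    # _mdb_catalog.wt this way makes out-of-place JSON fragments stand out.
    pattern = re.compile(rb"[ -~]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(blob)]

# Hypothetical blob: BSON-like bytes with a stray JSON fragment mixed in.
blob = b"\x16\x00\x00\x00\x02name\x00" + b'{"unexpected": "json"}' + b"\x00\x05"
print(printable_runs(blob))  # -> ['{"unexpected": "json"}']
```

Short runs like field names fall below the threshold, so only suspiciously long foreign strings surface.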

Comment by Kelsey Schubert [ 20/Dec/16 ]

Hi rud0lp20,

Thanks for uploading the files. Unfortunately, this type of post-mortem requires a significant effort to analyze the files. We'll update this ticket after we conclude our investigation.

Kind regards,
Thomas

Comment by rudolp esquilon [ 15/Dec/16 ]

Hi,

is there any news?

Comment by rudolp esquilon [ 08/Dec/16 ]

Hi,

I've already uploaded all required files...
thanks

Comment by rudolp esquilon [ 07/Dec/16 ]

ok thank you

However, we don't have backup files yet...

Comment by Kelsey Schubert [ 06/Dec/16 ]

Hi rud0lp20,

Unfortunately, since the repair attempt was unsuccessful, my advice would be to perform an initial sync or restore from a backup.

Before doing so, would you be able to provide the complete $dbpath for us to investigate this issue?

I've created a secure upload portal where you can provide logs following the repair attempt and data files. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

Thank you for your help,
Thomas

Comment by rudolp esquilon [ 06/Dec/16 ]

Bump... any news?
Thanks

Comment by rudolp esquilon [ 01/Dec/16 ]

Hi,

1. Actually, we rebooted the server; we didn't manually stop it
2. We increased CPU and RAM
3. After rebooting and upgrading the server (mongodb was not running), we just moved (mv command) the db directory contents to another location
4. No, the logs show the same checksum error

thanks

Comment by Kelsey Schubert [ 01/Dec/16 ]

Hi rud0lp20,

Thank you for the answers, I have a few follow up questions to help us in our investigation.

  1. Would you please clarify whether you hit an OOM killer during the index build? If not, how did you shut down the mongod?
  2. What was your upgrade process?
  3. How did you move the db files?
  4. Was the repair attempt successful?

Thanks again,
Thomas

Comment by rudolp esquilon [ 01/Dec/16 ]

Hi Thomas,

Here are my answers:
1. Before this happened, we were having trouble completing an indexing process due to lack of RAM on our server. After we upgraded and rebooted, the issue showed up when we tried to start the mongodb server. Also, before that, we had moved the db files to another directory/disk with more space.
2. No, we haven't.
3. We are using an SSD attached to a Digital Ocean server (as part of a storage system, also known as a SAN, Storage Area Network: a cluster of storage drives in a network environment).
4. We have, and it's in good condition.

Comment by Kelsey Schubert [ 30/Nov/16 ]

Hi rud0lp20,

I've attempted a repair of the uploaded files. Please extract them and replace them in your dbpath.

I have a few questions to get a better understanding of what happened here, but please understand that in cases like this we may not be able to identify the root cause from the information you provide.

  1. Preceding the corruption, were there any other server errors logged? Did an unclean shutdown occur? If so, what was the cause?
  2. Have you recently run out of disk space?
  3. What kind of underlying storage mechanism are you using? Are the storage devices attached locally or over the network? Are the disks SSDs or HDDs? What kind of RAID and/or volume management system are you using?
  4. Would you please consider checking the integrity of your disks?

Thank you,
Thomas

Comment by rudolp esquilon [ 30/Nov/16 ]

FYI

[1480529755:805623][22119:0x7f5b50b16740], file:WiredTiger.wt, WT_CURSOR.next: read checksum error for 24576B block at offset 57344: calculated block checksum of 4001194994 doesn't match expected checksum of 4050905503
[1480529755:805795][22119:0x7f5b50b16740], file:WiredTiger.wt, WT_CURSOR.next: WiredTiger.wt: encountered an illegal file format or internal value
[1480529755:805818][22119:0x7f5b50b16740], file:WiredTiger.wt, WT_CURSOR.next: the process must exit and restart: WT_PANIC: WiredTiger library panic
[1480529755:805839][22119:0x7f5b50b16740], txn-recover: Recovery failed: WT_PANIC: WiredTiger library panic


2016-11-30T10:55:50.367-0500 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=18G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2016-11-30T10:55:50.422-0500 E STORAGE [initandlisten] WiredTiger (2) [1480521350:421985][17221:0x7f960a3fbcc0], txn-recover: /data/db2/default/journal/WiredTigerLog.0000001585: handle-open: open: No such file or directory
2016-11-30T10:55:50.422-0500 E STORAGE [initandlisten] WiredTiger (0) [1480521350:422073][17221:0x7f960a3fbcc0], txn-recover: WiredTiger is unable to read the recovery log.
2016-11-30T10:55:50.422-0500 E STORAGE [initandlisten] WiredTiger (0) [1480521350:422088][17221:0x7f960a3fbcc0], txn-recover: This may be due to the log files being encrypted, being from an older version or due to corruption on disk
2016-11-30T10:55:50.422-0500 E STORAGE [initandlisten] WiredTiger (0) [1480521350:422097][17221:0x7f960a3fbcc0], txn-recover: You should confirm that you have opened the database with the correct options including all encryption and compression options
2016-11-30T10:55:50.422-0500 E STORAGE [initandlisten] WiredTiger (0) [1480521350:422283][17221:0x7f960a3fbcc0], file:WiredTiger.wt, WT_CURSOR.next: read checksum error for 24576B block at offset 57344: calculated block checksum of 4001194994 doesn't match expected checksum of 4050905503
2016-11-30T10:55:50.422-0500 E STORAGE [initandlisten] WiredTiger (0) [1480521350:422304][17221:0x7f960a3fbcc0], file:WiredTiger.wt, WT_CURSOR.next: WiredTiger.wt: encountered an illegal file format or internal value
2016-11-30T10:55:50.422-0500 E STORAGE [initandlisten] WiredTiger (-31804) [1480521350:422318][17221:0x7f960a3fbcc0], file:WiredTiger.wt, WT_CURSOR.next: the process must exit and restart: WT_PANIC: WiredTiger library panic
2016-11-30T10:55:50.422-0500 I - [initandlisten] Fatal Assertion 28558
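The failure mode in these logs is WiredTiger recomputing a per-block checksum on read and finding it no longer matches the value stored when the block was written, which is why it refuses to proceed. A minimal sketch of that verify-on-read idea (WiredTiger actually uses CRC32C, the Castagnoli polynomial; `zlib.crc32` here is only a stand-in with a different polynomial, and the 24576-byte block size is taken from the log line above):

```python
import zlib

def verify_block(data: bytes, expected: int) -> bool:
    # Recompute the checksum over the block as read from disk and compare
    # it to the checksum stored at write time. A mismatch, as in the log,
    # means the bytes changed on disk after they were written.
    return (zlib.crc32(data) & 0xFFFFFFFF) == expected

block = b"\x00" * 24576                      # same size as the failing block
stored = zlib.crc32(block) & 0xFFFFFFFF      # checksum recorded at write time
print(verify_block(block, stored))           # -> True: block intact
print(verify_block(block[:-1] + b"\x01", stored))  # -> False: one flipped byte
```

A single corrupted byte anywhere in the 24 KB block is enough to fail the check, which is why this error almost always points at the storage layer rather than at MongoDB itself.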

Generated at Thu Feb 08 04:14:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.