[SERVER-34743] How to recover from "WiredTiger.wt, WT_CURSOR.next: read checksum error" ? Created: 29/Apr/18  Updated: 27/Jul/18  Resolved: 01/May/18

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Daniel Froz Costa Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: envns, rge, rpu, trcf, wtc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File WiredTiger     File WiredTiger.turtle     File WiredTiger.turtle     File WiredTiger.turtle     File WiredTiger.wt     File WiredTigerLAS.wt     File repair_attempt.tar.gz     File sizeStorer.wt     File storage.bson    
Operating System: ALL
Participants:

 Description   

Hi,

Today I ran into a problem that many others have already reported: the server crashed and the database is corrupted. I can't bring it back up.

I've tried starting mongod with --repair, with no result.

How can I recover from this? I can see from many posts that you generate repaired files, after which mongod can be started. But how can we recover from this problem ourselves, without opening a ticket or depending on someone else to fix the checksum in the WiredTiger file?

I am attaching the files for analysis. BTW, I run my mongod instance with --directoryperdb.

– LOGS

2018-04-29T18:37:03.912-0300 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=493M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2018-04-29T18:37:04.330-0300 E STORAGE  [initandlisten] WiredTiger error (0) [1525037824:330136][1126:0x7f34f3033cc0], file:WiredTiger.wt, WT_CURSOR.next: read checksum error for 16384B block at offset 643072: block header checksum of 4091829093 doesn't match expected checksum of 2132028001
2018-04-29T18:37:04.330-0300 E STORAGE  [initandlisten] WiredTiger error (0) [1525037824:330189][1126:0x7f34f3033cc0], file:WiredTiger.wt, WT_CURSOR.next: WiredTiger.wt: encountered an illegal file format or internal value
2018-04-29T18:37:04.330-0300 E STORAGE  [initandlisten] WiredTiger error (-31804) [1525037824:330199][1126:0x7f34f3033cc0], file:WiredTiger.wt, WT_CURSOR.next: the process must exit and restart: WT_PANIC: WiredTiger library panic
2018-04-29T18:37:04.330-0300 I -        [initandlisten] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 361
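The checksum error line carries enough detail to locate the damaged block. Below is a minimal Python sketch, assuming the message format shown in the log above; the regular expression and the `parse_checksum_error` helper are illustrative and not part of any MongoDB or WiredTiger tool.

```python
import re

# Pattern for the WiredTiger "read checksum error" message as it appears in the log above.
CHECKSUM_RE = re.compile(
    r"file:(?P<file>[^,\s]+), \S+: read checksum error for (?P<size>\d+)B block "
    r"at offset (?P<offset>\d+): block header checksum of (?P<found>\d+) "
    r"doesn't match expected checksum of (?P<expected>\d+)"
)


def parse_checksum_error(line):
    """Return the file name and numeric fields of a checksum error line, or None."""
    m = CHECKSUM_RE.search(line)
    if m is None:
        return None
    info = {k: int(v) for k, v in m.groupdict().items() if k != "file"}
    info["file"] = m.group("file")
    return info


line = (
    "2018-04-29T18:37:04.330-0300 E STORAGE  [initandlisten] WiredTiger error (0) "
    "[1525037824:330136][1126:0x7f34f3033cc0], file:WiredTiger.wt, WT_CURSOR.next: "
    "read checksum error for 16384B block at offset 643072: block header checksum "
    "of 4091829093 doesn't match expected checksum of 2132028001"
)
info = parse_checksum_error(line)
print(info["file"], info["size"], info["offset"])
```

The offset 643072 equals 157 × 4096, so the damaged block starts on a 4KB boundary, which matches the allocation_size=4KB reported in the WiredTiger.turtle contents posted elsewhere in this ticket.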



 Comments   
Comment by Kelsey Schubert [ 01/May/18 ]

Hi daniel.froz,

Unfortunately, this error indicates that there was corruption on the disk, most often caused by a faulty storage layer. In this situation, our best recommendation would be to resync the affected node or restore from a backup if possible.

To prevent this type of problem in the future please take note of the following guidelines to help mitigate any issues related to unreliable storage layers or server failures.

Thank you,
Kelsey

Comment by Daniel Froz Costa [ 01/May/18 ]

Hi Kelsey,

Now it's failing with error 100.

about to fork child process, waiting until server is ready for connections.
forked process: 1252
ERROR: child process failed, exited with error number 100

From the logs I can see the following:

2018-05-01T10:45:18.111-0300 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=256M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),

2018-05-01T10:45:19.994-0300 E STORAGE  [initandlisten] WiredTiger error (-31802) [1525182319:990528][1252:0x7f933903ecc0], file:sizeStorer.wt, txn-recover: unable to read root page from file:sizeStorer.wt: WT_ERROR: non-specific WiredTiger error

2018-05-01T10:45:19.998-0300 E STORAGE  [initandlisten] WiredTiger error (-31802) [1525182319:998130][1252:0x7f933903ecc0], file:sizeStorer.wt, txn-recover: operation apply failed during recovery: operation type 4 at LSN 111/18702464: WT_ERROR: non-specific WiredTiger error

2018-05-01T10:45:19.998-0300 E STORAGE  [initandlisten] WiredTiger error (0) [1525182319:998213][1252:0x7f933903ecc0], file:sizeStorer.wt, txn-recover: WiredTiger is unable to read the recovery log.

2018-05-01T10:45:19.998-0300 E STORAGE  [initandlisten] WiredTiger error (0) [1525182319:998226][1252:0x7f933903ecc0], file:sizeStorer.wt, txn-recover: This may be due to the log files being encrypted, being from an older version or due to corruption on disk

2018-05-01T10:45:19.998-0300 E STORAGE  [initandlisten] WiredTiger error (0) [1525182319:998232][1252:0x7f933903ecc0], file:sizeStorer.wt, txn-recover: You should confirm that you have opened the database with the correct options including all encryption and compression options

2018-05-01T10:45:19.998-0300 E STORAGE  [initandlisten] WiredTiger error (-31802) [1525182319:998249][1252:0x7f933903ecc0], file:sizeStorer.wt, txn-recover: Recovery failed: WT_ERROR: non-specific WiredTiger error

2018-05-01T10:45:20.018-0300 I -        [initandlisten] Assertion: 28595:-31802: WT_ERROR: non-specific WiredTiger error src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp 267

2018-05-01T10:45:20.029-0300 I STORAGE  [initandlisten] exception in initAndListen: 28595 -31802: WT_ERROR: non-specific WiredTiger error, terminating
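The recovery failure in the log above names a log sequence number, `LSN 111/18702464`. Below is a hedged Python sketch for pulling it out of the message. It assumes an LSN is written as `<log file number>/<byte offset>` and that journal files are named `WiredTigerLog.<10-digit number>`; both are assumptions to verify against your own journal directory. The helper names `parse_lsn`, `parse_checkpoint_lsn`, and `journal_file_for` are hypothetical.

```python
import re

LSN_RE = re.compile(r"LSN (\d+)/(\d+)")
CHECKPOINT_LSN_RE = re.compile(r"checkpoint_lsn=\((\d+),(\d+)\)")


def parse_lsn(message):
    """Extract (file_number, offset) from a 'LSN a/b' fragment, or None."""
    m = LSN_RE.search(message)
    return (int(m.group(1)), int(m.group(2))) if m else None


def parse_checkpoint_lsn(metadata):
    """Extract (file_number, offset) from a 'checkpoint_lsn=(a,b)' fragment, or None."""
    m = CHECKPOINT_LSN_RE.search(metadata)
    return (int(m.group(1)), int(m.group(2))) if m else None


def journal_file_for(lsn):
    """Assumed journal file name for the LSN's file number (naming is an assumption)."""
    return "WiredTigerLog.%010d" % lsn[0]


message = "operation apply failed during recovery: operation type 4 at LSN 111/18702464"
metadata = "checkpoint_lsn=(111,18702464),checksum=uncompressed"

failed_lsn = parse_lsn(message)
checkpoint_lsn = parse_checkpoint_lsn(metadata)
print(failed_lsn, journal_file_for(failed_lsn), failed_lsn == checkpoint_lsn)
```

The failing LSN is the same pair as the checkpoint_lsn=(111,18702464) value in the WiredTiger.turtle contents quoted in this ticket. This suggests that recovery failed on the first record it tried to replay after the last checkpoint.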

 

I also tried running with the --repair option, without success.

Cheers

Daniel Froz

Comment by Kelsey Schubert [ 30/Apr/18 ]

Hi daniel.froz,

Thank you for the report. I've attached a repair attempt, repair_attempt.tar.gz, of the files you provided. Please extract these files and replace them in your $dbpath and let us know if it resolves the issue. If you are still seeing errors after replacing these files, please provide the complete logs from the affected node so that we can further investigate.
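The extract-and-replace step can be scripted so that the original files are kept in case the repair attempt does not help. Below is a minimal Python sketch, not an official procedure. It assumes mongod is stopped, that the archive holds plain files which belong directly in the dbpath (the layout of repair_attempt.tar.gz is not described in this ticket), and it uses a hypothetical helper name, `replace_with_backup`.

```python
import shutil
import tarfile
from pathlib import Path


def replace_with_backup(archive, dbpath, backup_dir):
    """Copy every regular file in `archive` into `dbpath`, first saving any
    file that would be overwritten into `backup_dir`. Returns the file names written."""
    dbpath, backup_dir = Path(dbpath), Path(backup_dir)
    written = []
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            # Assumption: files are placed in dbpath by base name only,
            # ignoring any directory prefix inside the archive.
            name = Path(member.name).name
            target = dbpath / name
            if target.exists():
                backup_dir.mkdir(parents=True, exist_ok=True)
                shutil.copy2(target, backup_dir / name)
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(name)
    return written
```

A call would look like `replace_with_backup("repair_attempt.tar.gz", "/path/to/dbpath", "/path/to/backup")`, where both paths are placeholders. Restoring means copying the files in the backup directory back into the dbpath.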

Thank you,
Kelsey

Comment by Daniel Froz Costa [ 30/Apr/18 ]

Hi,

I am having trouble attaching the file as requested: "File XXXX was not uploaded. An internal error has occurred. Please contact your administrator..."

I can see it's a text file, so I am copying the content below:

– START HERE

WiredTiger version string

WiredTiger 2.9.2: (December 23, 2016)

WiredTiger version

major=2,minor=9,patch=2

file:WiredTiger.wt

access_pattern_hint=none,allocation_size=4KB,app_metadata=,block_allocation=best,block_compressor=,cache_resident=false,checkpoint=(WiredTigerCheckpoint.100331=(addr="01c00781e406748afbc01481e4a0963c8bc01581e4c85ddb83808080e30c4fc0e305efc0",order=100331,time=1525036325,size=401408,write_gen=536294)),checkpoint_lsn=(111,18702464),checksum=uncompressed,collator=,columns=,dictionary=0,encryption=(keyid=,name=),format=btree,huffman_key=,huffman_value=,id=0,ignore_in_memory_cache_size=false,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=S,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=0,log=(enabled=true),memory_page_max=5MB,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=4,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=75,value_format=S,version=(major=1,minor=1)

– END HERE
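The metadata entry pasted above is a comma-separated key=value string with nested parentheses, so splitting on commas alone is not enough. Below is a minimal Python sketch for reading it, assuming only the syntax visible in the pasted text. `split_top_level` and `parse_config` are hypothetical helpers, not WiredTiger APIs, and the sample string is a subset of the keys quoted above.

```python
from datetime import datetime, timezone


def split_top_level(config):
    """Split on commas that sit outside parentheses and double quotes."""
    parts, depth, in_quotes, start = [], 0, False, 0
    for i, ch in enumerate(config):
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                parts.append(config[start:i])
                start = i + 1
    parts.append(config[start:])
    return [p for p in parts if p]


def parse_config(config):
    """Return {key: raw value} for the top-level keys of a config string."""
    result = {}
    for part in split_top_level(config):
        key, _, value = part.partition("=")
        result[key] = value
    return result


# Subset of the file:WiredTiger.wt entry quoted above.
entry = (
    "access_pattern_hint=none,allocation_size=4KB,"
    'checkpoint=(WiredTigerCheckpoint.100331=(addr="01c00781e406748afbc01481e4a0963c'
    '8bc01581e4c85ddb83808080e30c4fc0e305efc0",order=100331,time=1525036325,'
    "size=401408,write_gen=536294)),"
    "checkpoint_lsn=(111,18702464),checksum=uncompressed,leaf_page_max=32KB"
)

top = parse_config(entry)
# Strip one layer of parentheses at a time to reach the checkpoint fields.
name, body = next(iter(parse_config(top["checkpoint"][1:-1]).items()))
fields = parse_config(body[1:-1])
when = datetime.fromtimestamp(int(fields["time"]), tz=timezone.utc)
print(name, fields["size"], when.isoformat())
```

The `time` field decodes to 2018-04-29 21:12:05 UTC (18:12:05 -0300), about 25 minutes before the failed startup at 18:37 -0300 recorded in the ticket description.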

Many thanks for the prompt reply! Really appreciated!

Cheers

Daniel Froz

Comment by Ramon Fernandez Marina [ 30/Apr/18 ]

daniel.froz, if you upload the WiredTiger.turtle file we can attempt to repair the catalog files.

Regards,
Ramón.
