[SERVER-29705] mongod start fail with error: read checksum error Created: 17/Jun/17  Updated: 05/Jul/17  Resolved: 19/Jun/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: chenyuliang Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive 01.zip     Zip Archive 02.zip     File WiredTiger.turtle     File WiredTiger.wt     File WiredTigerLAS.wt     Text File mongod.log     File repair_attempt-01.tar.gz     File repair_attempt-02.tar.gz     File repair_attempt.tar.gz    
Issue Links:
Duplicate
is duplicated by SERVER-29706 machine power off leads to read check... Closed
Participants:

 Description   

Hi. I have a problem with mongod. I have a replica set with one primary and two secondaries, and none of the three mongod instances can start. Each fails with this error:

2017-06-17T23:30:21.525+0800 E STORAGE  [initandlisten] WiredTiger error (0) [1497713421:525505][10001:0x7f9cc3cd0ce0], file:WiredTiger.wt, WT_CURSOR.insert: read checksum error for 24576B block at offset 233472: block header checksum of 3121098103 doesn't match expected checksum of 3702490606
2017-06-17T23:30:21.525+0800 E STORAGE  [initandlisten] WiredTiger error (0) [1497713421:525580][10001:0x7f9cc3cd0ce0], file:WiredTiger.wt, WT_CURSOR.insert: WiredTiger.wt: encountered an illegal file format or internal value
2017-06-17T23:30:21.525+0800 E STORAGE  [initandlisten] WiredTiger error (-31804) [1497713421:525604][10001:0x7f9cc3cd0ce0], file:WiredTiger.wt, WT_CURSOR.insert: the process must exit and restart: WT_PANIC: WiredTiger library panic
2017-06-17T23:30:21.525+0800 I -        [initandlisten] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 361
2017-06-17T23:30:21.525+0800 I -        [initandlisten]
 
***aborting after fassert() failure

How can I fix this problem? Thank you!
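
When comparing the three members, it can help to pull the affected file, block size, offset, and checksums out of each node's log. The following is a minimal sketch that assumes the message layout shown in the log above; the function name `parse_checksum_error` is hypothetical, not part of mongod or WiredTiger.

```python
import re

# Pattern for the "read checksum error" message as it appears in the log above.
# \s+ tolerates the message being wrapped across lines by the terminal.
CHECKSUM_RE = re.compile(
    r"file:(?P<file>\S+?),.*?read checksum error\s+"
    r"for (?P<size>\d+)B block at offset (?P<offset>\d+):\s+"
    r"block header checksum of (?P<found>\d+)\s+"
    r"doesn't match expected checksum of (?P<expected>\d+)",
    re.DOTALL,
)


def parse_checksum_error(log_text):
    """Return a dict describing the first checksum error in log_text, or None."""
    match = CHECKSUM_RE.search(log_text)
    if match is None:
        return None
    return {
        "file": match.group("file"),
        "size": int(match.group("size")),
        "offset": int(match.group("offset")),
        "found": int(match.group("found")),
        "expected": int(match.group("expected")),
    }
```

Running this over the `mongod.log` of each member shows whether the same file and offset are affected on every node or whether each node was damaged differently.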



 Comments   
Comment by Kelsey Schubert [ 05/Jul/17 ]

Hi chenyuliang01,

I've attached repair attempts for the files from the other nodes.

Kind regards,
Thomas

Comment by chenyuliang [ 21/Jun/17 ]

Hi Thomas,
There is no backup. I have three members in this replica set. Could you please try to repair the WiredTiger.wt files of the other two members, which I have attached? I will then try to restart them.

Thank you!

Comment by Kelsey Schubert [ 20/Jun/17 ]

Hi chenyuliang01,

Unfortunately, that's correct – the error indicates the collection file has been corrupted. In cases like this, there is very little mongod can do. Therefore, I would recommend performing a clean initial sync or, if that is not possible, restoring from a backup.

Kind regards,
Thomas
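
A clean initial sync starts the affected member with an empty data directory so that it copies all data from another member. This requires at least one healthy member to sync from. The sketch below moves the old files aside instead of deleting them; it assumes mongod is stopped, and the function name `empty_dbpath` and all paths are hypothetical.

```python
import shutil
from pathlib import Path


def empty_dbpath(dbpath, quarantine_dir):
    """Move everything out of dbpath into quarantine_dir.

    Assumes mongod is NOT running on this dbpath. Returns the moved entry names.
    """
    dbpath = Path(dbpath).resolve()
    quarantine = Path(quarantine_dir).resolve()

    # The quarantine directory must live outside dbpath, otherwise the
    # data directory would not end up empty.
    if quarantine == dbpath or dbpath in quarantine.parents:
        raise ValueError("quarantine_dir must be outside dbpath")

    # Refuse to reuse an existing quarantine directory, to avoid mixing files
    # from different attempts.
    quarantine.mkdir(parents=True, exist_ok=False)

    moved = []
    for entry in sorted(dbpath.iterdir()):
        shutil.move(str(entry), str(quarantine / entry.name))
        moved.append(entry.name)
    return moved
```

After the directory is empty, restarting mongod with its usual replica set configuration lets the member perform the initial sync. Keeping the old files in quarantine preserves them in case a later repair attempt is needed.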

Comment by chenyuliang [ 20/Jun/17 ]

Hi @Thomas Schubert,

Thank you for your reply. However, when I tried to restart mongod after replacing these files, I got errors like this:

2017-06-20T11:11:24.129+0800 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=32768M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=zlib),file_manager=(close_idle_time=100000),checkpoint=(wait=1,log_size=2GB),statistics_log=(wait=0),verbose=(recovery_progress),
2017-06-20T11:11:24.938+0800 I STORAGE  [initandlisten] WiredTiger message [1497928284:938186][17387:0x7f8036e28ce0], txn-recover: Main recovery loop: starting at 2528/12392320
2017-06-20T11:11:24.938+0800 I STORAGE  [initandlisten] WiredTiger message [1497928284:938534][17387:0x7f8036e28ce0], txn-recover: Recovering log 2528 through 2536
2017-06-20T11:11:24.938+0800 E STORAGE  [initandlisten] WiredTiger error (-31802) [1497928284:938905][17387:0x7f8036e28ce0], file:local/collection-0--34980363597192622.wt, txn-recover: unable to read root page from file:local/collection-0--34980363597192622.wt: WT_ERROR: non-specific WiredTiger error
2017-06-20T11:11:24.938+0800 E STORAGE  [initandlisten] WiredTiger error (-31802) [1497928284:938951][17387:0x7f8036e28ce0], file:local/collection-0--34980363597192622.wt, txn-recover: operation apply failed during recovery: operation type 4 at LSN 2528/12392320: WT_ERROR: non-specific WiredTiger error
2017-06-20T11:11:24.938+0800 E STORAGE  [initandlisten] WiredTiger error (0) [1497928284:938968][17387:0x7f8036e28ce0], file:local/collection-0--34980363597192622.wt, txn-recover: WiredTiger is unable to read the recovery log.
2017-06-20T11:11:24.938+0800 E STORAGE  [initandlisten] WiredTiger error (0) [1497928284:938983][17387:0x7f8036e28ce0], file:local/collection-0--34980363597192622.wt, txn-recover: This may be due to the log files being encrypted, being from an older version or due to corruption on disk
2017-06-20T11:11:24.939+0800 E STORAGE  [initandlisten] WiredTiger error (0) [1497928284:939016][17387:0x7f8036e28ce0], file:local/collection-0--34980363597192622.wt, txn-recover: You should confirm that you have opened the database with the correct options including all encryption and compression options
2017-06-20T11:11:24.939+0800 E STORAGE  [initandlisten] WiredTiger error (-31802) [1497928284:939038][17387:0x7f8036e28ce0], file:local/collection-0--34980363597192622.wt, txn-recover: Recovery failed: WT_ERROR: non-specific WiredTiger error
2017-06-20T11:11:24.941+0800 I -        [initandlisten] Assertion: 28595:-31802: WT_ERROR: non-specific WiredTiger error src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp 273
2017-06-20T11:11:24.941+0800 I STORAGE  [initandlisten] exception in initAndListen: 28595 -31802: WT_ERROR: non-specific WiredTiger error, terminating
2017-06-20T11:11:24.941+0800 I NETWORK  [initandlisten] shutdown: going to close listening sockets...

mongod still fails to restart. Does this mean that my collection data file is corrupted too?

Thank you!

Comment by Kelsey Schubert [ 19/Jun/17 ]

Hi chenyuliang01,

Thank you for clarifying how this corruption occurred. I've attached a repair attempt of the files; would you please replace these files in your dbpath and restart mongod? If there is no additional corruption, mongod should be able to restart successfully.

Thank you,
Thomas
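
Swapping the repaired files into the dbpath is safer if the files being overwritten are saved first. The following is a minimal sketch, assuming mongod is stopped and the repair attempt archive has already been extracted to a directory; the function name `replace_with_repaired` and all directory names are hypothetical.

```python
import shutil
from pathlib import Path


def replace_with_repaired(dbpath, repaired_dir, backup_dir):
    """Copy each file from repaired_dir into dbpath.

    Any file that would be overwritten is first copied to backup_dir.
    Assumes mongod is NOT running. Returns the names of the files copied in.
    """
    dbpath = Path(dbpath)
    repaired_dir = Path(repaired_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    replaced = []
    for src in sorted(repaired_dir.iterdir()):
        if not src.is_file():
            continue  # only top-level files are handled in this sketch
        dst = dbpath / src.name
        if dst.exists():
            # Keep the original so the swap can be undone.
            shutil.copy2(dst, backup_dir / src.name)
        shutil.copy2(src, dst)
        replaced.append(src.name)
    return replaced
```

If mongod still fails after the swap, the files in the backup directory can be copied back to restore the original state.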

Comment by chenyuliang [ 19/Jun/17 ]

This problem was caused by a power outage at our IDC, so it is not a bug.
