[SERVER-50788] mongod can not start : file:WiredTiger.wt, connection: read checksum error Created: 08/Sep/20  Updated: 11/Sep/20  Resolved: 10/Sep/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: lxh lxh Assignee: Dmitry Agranat
Resolution: Done Votes: 0
Labels: FA_28558
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

x86_64


Attachments: WiredTiger.lock, WiredTiger.turtle, WiredTiger.wt, WiredTigerLAS.wt, repair_attempt_SERVER-50788.zip, sizeStorer.wt
Participants:

 Description   

2020-09-08T14:25:38.319+0800 I CONTROL [initandlisten] MongoDB starting : pid=25998 port=27018 dbpath=/data/db 64-bit host=server
2020-09-08T14:25:38.320+0800 I CONTROL [initandlisten] db version v3.4.9
2020-09-08T14:25:38.320+0800 I CONTROL [initandlisten] git version: 876ebee8c7dd0e2d992f36a848ff4dc50ee6603e
2020-09-08T14:25:38.320+0800 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2020-09-08T14:25:38.320+0800 I CONTROL [initandlisten] allocator: tcmalloc
2020-09-08T14:25:38.320+0800 I CONTROL [initandlisten] modules: none
2020-09-08T14:25:38.320+0800 I CONTROL [initandlisten] build environment:
2020-09-08T14:25:38.320+0800 I CONTROL [initandlisten] distmod: rhel70
2020-09-08T14:25:38.320+0800 I CONTROL [initandlisten] distarch: x86_64
2020-09-08T14:25:38.320+0800 I CONTROL [initandlisten] target_arch: x86_64
2020-09-08T14:25:38.320+0800 I CONTROL [initandlisten] options: { config: "/usr/local/mongodb/bin/rs3_member.conf", net: { bindIp: "10.62.124.60", port: 27018, ssl: { CAFile: "/usr/local/mongodb/CAfiles/root.pem", PEMKeyFile: "/usr/local/mongodb/CAfiles/server.pem", allowInvalidHostnames: true, mode: "requireSSL" } }, processManagement: { fork: true }, repair: true, replication: { enableMajorityReadConcern: true, oplogSizeMB: 20480, replSetName: "rs3" }, security: { authorization: "enabled", clusterAuthMode: "x509", keyFile: "/usr/local/mongodb/authentication/keyFile" }, sharding: { archiveMovedChunks: false, clusterRole: "shardsvr" }, storage: { dbPath: "/data/db", directoryPerDB: true, engine: "wiredTiger", journal: { enabled: false } }, systemLog: { destination: "file", logAppend: true, logRotate: "rename", path: "/data/log/mongodb.log", verbosity: 0 } }
2020-09-08T14:25:38.490+0800 I STORAGE [initandlisten] Detected WT journal files. Running recovery from last checkpoint.
2020-09-08T14:25:38.490+0800 I STORAGE [initandlisten] journal to nojournal transition config: create,cache_size=63749M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2020-09-08T14:25:38.502+0800 E STORAGE [initandlisten] WiredTiger error (0) [1599546338:502169][25998:0x7f66aa37de40], file:WiredTiger.wt, connection: read checksum error for 4096B block at offset 401408: block header checksum of 1071605299 doesn't match expected checksum of 1242809853
2020-09-08T14:25:38.502+0800 E STORAGE [initandlisten] WiredTiger error (0) [1599546338:502193][25998:0x7f66aa37de40], file:WiredTiger.wt, connection: WiredTiger.wt: encountered an illegal file format or internal value
2020-09-08T14:25:38.502+0800 E STORAGE [initandlisten] WiredTiger error (-31804) [1599546338:502203][25998:0x7f66aa37de40], file:WiredTiger.wt, connection: the process must exit and restart: WT_PANIC: WiredTiger library panic
2020-09-08T14:25:38.502+0800 I - [initandlisten] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 361
2020-09-08T14:25:38.502+0800 I - [initandlisten]

***aborting after fassert() failure



 Comments   
Comment by lxh lxh [ 11/Sep/20 ]

I understand your suggestion. In any case, I really appreciate your timely help!

Comment by Dmitry Agranat [ 10/Sep/20 ]

Hi 1554154677@qq.com,

The error message you are receiving indicates that there is an additional corruption. Unfortunately, we do not have any automated process to recover data from this situation.

To avoid a problem like this in the future, it is our strong recommendation to:

Regards,
Dima

Comment by lxh lxh [ 10/Sep/20 ]

Thank you for your help! Replacing the two files did not work, but the error message changed:

2020-09-10T18:47:24.930+0800 E STORAGE [initandlisten] WiredTiger error (0) [1599734844:930719][22429:0x7f7cb61e8e40], file:sizeStorer.wt, WT_SESSION.open_cursor: read checksum error for 4096B block at offset 24576: block header checksum of 394736567 doesn't match expected checksum of 1046701883
2020-09-10T18:47:24.930+0800 E STORAGE [initandlisten] WiredTiger error (0) [1599734844:930762][22429:0x7f7cb61e8e40], file:sizeStorer.wt, WT_SESSION.open_cursor: sizeStorer.wt: encountered an illegal file format or internal value
2020-09-10T18:47:24.930+0800 E STORAGE [initandlisten] WiredTiger error (-31804) [1599734844:930776][22429:0x7f7cb61e8e40], file:sizeStorer.wt, WT_SESSION.open_cursor: the process must exit and restart: WT_PANIC: WiredTiger library panic
2020-09-10T18:47:24.930+0800 I - [initandlisten] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 361
2020-09-10T18:47:24.930+0800 I - [initandlisten]

***aborting after fassert() failure 

 

The failing file is now sizeStorer.wt, which I have added to the attachments. Can it be repaired?
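
To compare the two failures side by side, the checksum error lines can be parsed into their fields. The following is a minimal sketch, assuming the message wording matches the log lines quoted in this ticket (other WiredTiger versions may phrase it differently); the function name parse_checksum_error and the field names are illustrative, not part of any MongoDB or WiredTiger API.

```python
import re

# Pattern for the "read checksum error" message as quoted in this ticket.
# Assumption: the wording is stable; other versions may differ.
CHECKSUM_ERROR = re.compile(
    r"file:(?P<file>[^,]+), (?P<operation>[^:]+): "
    r"read checksum error for (?P<size>\d+)B block at offset (?P<offset>\d+): "
    r"block header checksum of (?P<found>\d+) "
    r"doesn't match expected checksum of (?P<expected>\d+)"
)


def parse_checksum_error(line):
    """Return the fields of a checksum error line as a dict, or None if absent."""
    match = CHECKSUM_ERROR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("size", "offset", "found", "expected"):
        fields[key] = int(fields[key])
    return fields


# The two error messages from this ticket, with timestamps and thread ids trimmed.
FIRST = ("file:WiredTiger.wt, connection: read checksum error for 4096B block "
         "at offset 401408: block header checksum of 1071605299 doesn't match "
         "expected checksum of 1242809853")
SECOND = ("file:sizeStorer.wt, WT_SESSION.open_cursor: read checksum error for "
          "4096B block at offset 24576: block header checksum of 394736567 "
          "doesn't match expected checksum of 1046701883")

for line in (FIRST, SECOND):
    info = parse_checksum_error(line)
    print(info["file"], info["operation"], info["offset"])
```

Parsed this way, the two messages differ in the file (WiredTiger.wt versus sizeStorer.wt), the failing operation, and the block offset, which is consistent with a second, separate corrupted file rather than the same error recurring.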

Comment by Dmitry Agranat [ 10/Sep/20 ]

Hi 1554154677@qq.com,

I've attached a repair attempt of the files you provided as repair_attempt_SERVER-50788.zip. Please extract these files, replace them in your $dbpath, and let us know if it resolves the issue.

Thanks,
Dima

Comment by lxh lxh [ 10/Sep/20 ]

The Secondary is a new node and it needs to resync data from the Primary. So the key point is to restore the Primary.

Also, running --repair would require upgrading to 4.0+, and doing that for the whole MongoDB cluster may take a long time. I cannot wait that long.

Regarding the log message "file:WiredTiger.wt, connection: read checksum error for 4096B block at offset 401408: block header checksum of 1071605299 doesn't match expected checksum of 1242809853", could you help repair the WiredTiger.wt file in the attachments, as was done with repair_attempt.tar.gz in https://jira.mongodb.org/browse/SERVER-46728?

Thx.

Comment by Dmitry Agranat [ 09/Sep/20 ]

Basically, you need to do a Maintenance on a Replica Set Member where you start a member as a standalone, do maintenance (in this case, a --repair) and restart it as a Replica Set member. Please let me know how it goes.
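
The standalone step of this procedure can be sketched as a command line. The snippet below is only an illustration that assembles the arguments and prints them; it does not start mongod. The helper name standalone_repair_command is hypothetical, the dbpath and directoryPerDB values come from the startup log in this ticket, and port 37018 is an assumed alternate port chosen so that other members and clients do not connect during maintenance.

```python
def standalone_repair_command(dbpath, port, directory_per_db=False):
    """Build a mongod argv that runs --repair on a member started as a standalone.

    No --replSet option is passed, so the process does not act as a
    replica set member while the repair runs.
    """
    argv = ["mongod", "--dbpath", dbpath, "--port", str(port), "--repair"]
    if directory_per_db:
        # Must match how the data files were laid out
        # (directoryPerDB: true in this ticket's configuration).
        argv.append("--directoryperdb")
    return argv


# /data/db and directoryPerDB are taken from the ticket's startup log;
# 37018 is a hypothetical port, not one mentioned in the ticket.
command = standalone_repair_command("/data/db", 37018, directory_per_db=True)
print(" ".join(command))
```

Once the repair finishes, the member would be restarted with its usual replica set configuration, as the procedure describes.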

Also, based on the provided logs, it seems that only Primary hit this issue. I did not see any issues with the Secondary.

Comment by lxh lxh [ 09/Sep/20 ]

Yes, I did, but the replica set cannot start, so the node cannot be removed with the command "rs.remove". Is there any other way to remove the node from the replica set? Thanks.

In case they help, all the mongod log files have been uploaded.


Comment by Dmitry Agranat [ 09/Sep/20 ]

Yes, --repair should be done against a standalone node which is being removed from a replica set for this procedure. Did you try doing this?
Can you upload compressed mongod logs from all members to the provided secure upload portal?

Comment by lxh lxh [ 09/Sep/20 ]

Hi Dima,

Because the mongod processes on all three nodes failed to start after an unexpected shutdown, a resync from the primary node is not possible.

As for mongod --repair, I understood that it cannot be used on a replica set. Is my understanding wrong?

Just in case, I tried to restore the WiredTiger files with the WiredTiger tool, using the command: ./wt -v -h /data/bak -C "extensions=[./ext/compressors/snappy/.libs/libwiredtiger_snappy.so]" -R salvage collection-38878--7827210234374637134.wt

The files have now been uploaded.

By the way, I saw a resolved ticket describing the same problem as mine:

https://jira.mongodb.org/browse/SERVER-46728

 

Thanks

lxh

Comment by Dmitry Agranat [ 09/Sep/20 ]

Hi 1554154677@qq.com,

As MongoDB 3.4 has reached EOL, we can try to assist you as a one-time exception.

Your configuration shows a 3-node replica set. The ideal resolution is to perform a clean resync from an unaffected node. If the resync of the failed member fails, please provide the logs covering the resync period.

You can also try mongod --repair using the latest version of MongoDB.

In the event that a --repair operation is unsuccessful, then please also provide:

  • The logs of the repair operation.
  • The logs of any attempt to start mongod after the repair operation completed.

When you said:

I had tried to restore the WiredTiger files but failed

Could you please clarify what "restore" means here (detailed steps)?

In case you need to upload mongod logs, I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thanks,
Dima

Comment by lxh lxh [ 09/Sep/20 ]

Thank you! I had tried to restore the WiredTiger files but failed, so I urgently need your help with the broken production server.

Comment by Tim Fogarty [ 08/Sep/20 ]

Hi 1554154677@qq.com, I'm moving this ticket to the SERVER project where we deal with errors related to mongod.
