[SERVER-37117] WiredTiger library panic Created: 13/Sep/18 Updated: 21/Sep/18 Resolved: 19/Sep/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.4.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Irene Lee [X] | Assignee: | Nick Brewer |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | Windows |
| Participants: |
| Description |
|
How to repair my DB with WT library panic? 2018-09-13T17:44:38.137+0800 I CONTROL [main] ***** SERVER RESTARTED ***** , operationProfiling: { slowOpThresholdMs: 1000 }, service: true, storage: { dbPath: "F:\Program Files\Microsoft Advanced Threat Analytics\Center\MongoDB\bin\data", journal: { enabled: false }, syncPeriodSecs: 10.0, wiredTiger: { engineConfig: { configString: "direct_io = (data)" }} }, systemLog: { destination: "file", logAppend: true, path: "F:\Program Files\Microsoft Advanced Threat Analytics\Center\MongoDB\bin\log\MongoDB.log" } } ***aborting after fassert() failure |
| Comments |
| Comment by Nick Brewer [ 21/Sep/18 ] |
|
jolmedo I'm glad you were able to get up and running again. Performing an initial sync is the recommended way - alternatively you can manually copy over a file snapshot as outlined here. Please note that SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-user group. -Nick |
| Comment by Jorge Olmedo [ 21/Sep/18 ] |
|
Hi Nick. First of all, thanks for your time and attention. We are up and running again after a huge repair (27 hours) of the MongoDB instance affected by the corrupted WiredTiger data file. Everything works fine now. I just have one question, and I hope you can help me. There is a gap (in time) between the primary and secondary replica set members (the primary is the affected one), so resyncing the affected node is the approach we have to take. We chose an initial sync: removing the dbpath contents of the secondary while it is stopped, and then starting it up. Is that the best choice? Thanks in advance. Jorge |
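The step described above (emptying the dbpath of a stopped secondary so that it performs an initial sync on restart) can be sketched as follows. This is only an illustration: the function name `set_aside_dbpath` and the backup directory are hypothetical, and moving the files aside instead of deleting them is a choice made for this sketch, not something stated in the ticket. The sketch does not stop or start mongod; it assumes the process is already shut down.

```python
import shutil
from pathlib import Path
from typing import List


def set_aside_dbpath(dbpath: str, backup_dir: str) -> List[str]:
    """Move every entry out of dbpath into backup_dir and return the names moved.

    Hypothetical helper: the caller must make sure mongod is stopped first.
    """
    src = Path(dbpath).resolve()
    dst = Path(backup_dir).resolve()
    if not src.is_dir():
        raise FileNotFoundError(f"dbpath does not exist: {src}")
    # Refuse a backup location inside the dbpath, which would move it into itself.
    if dst == src or src in dst.parents:
        raise ValueError("backup_dir must be outside dbpath")
    # Fail if the backup directory already exists, to avoid mixing old copies.
    dst.mkdir(parents=True, exist_ok=False)
    moved = []
    for entry in sorted(src.iterdir()):
        shutil.move(str(entry), str(dst / entry.name))
        moved.append(entry.name)
    return moved
```

After the dbpath is empty, starting the secondary again lets it copy its data from the other replica set members, as described in the comment above.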
| Comment by Nick Brewer [ 19/Sep/18 ] |
|
Irene Glad to hear you got it working - I'll go ahead and close this ticket. Some considerations to prevent storage-related issues in the future:
-Nick |
| Comment by Nick Brewer [ 19/Sep/18 ] |
|
jolmedo Sorry to hear you're running into problems after an unclean shutdown. In your case, the best option is going to be to resync the affected node. That said, if you're still running into any error messages related to WiredTiger corruption once you've performed a resync, please feel free to open a separate ticket. -Nick |
| Comment by Irene Lee [X] [ 19/Sep/18 ] |
|
Hi Nick, Irene |
| Comment by Jorge Olmedo [ 19/Sep/18 ] |
|
Hi Nick, I'm facing this exact issue in my deployment. I'm running MongoDB 3.4.10, a sharded cluster with 5 primary nodes and their replica sets. One of them suffered an unclean shutdown. We are still finding out why, but my guess is a VMware ESX process that checks availability on all virtual machines and moves machines from one place to another. I remember reading some notes in the MongoDB online documentation about avoiding this kind of thing, but the people responsible for this task did not take my opinion into account when they deployed the MongoDB cluster to new hardware. So here I am, begging for help. If I didn't misunderstand what I read above, you have a procedure to repair these files. In my case, the WiredTiger.turtle and WiredTiger.wt files are OK; the issue happened in another file, named collection-4–xxxx.wt. Its size is about 23 GB, so I think there's no way to upload it. Do you think it would be possible for you to let me know how you perform the repair procedure? It would be a great help. Thanks in advance for your time. Jorge.
|
| Comment by Irene Lee [X] [ 13/Sep/18 ] |
|
Hi Nick, The cause of the failure (power failure, unclean shutdown, file corruption, etc.) >> I think it is file corruption. The platform (virtual machine, container, native hardware) Irene |
| Comment by Nick Brewer [ 13/Sep/18 ] |
|
Irene The process only takes a few minutes. It's unlikely the root cause here is a bug - it looks like you have journaling disabled, which, as the documentation states, is not recommended on production systems. However, we do track all reported instances of WiredTiger corruption and would like to collect as much information on this as we can. -Nick |
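The journaling setting mentioned above appears in the startup options logged in the description (`storage: { ..., journal: { enabled: false }, ... }`). As a minimal sketch, assuming those options have already been parsed into a Python dict, a check for this setting might look like the following. The function name `journaling_warnings` and the warning text are hypothetical and are not part of any MongoDB tool.

```python
from typing import Any, Dict, List


def journaling_warnings(parsed_options: Dict[str, Any]) -> List[str]:
    """Return warnings for a mongod options dict that turns journaling off."""
    storage = parsed_options.get("storage", {})
    journal = storage.get("journal", {})
    warnings = []
    # Only an explicit `enabled: false` is flagged; a missing key is left
    # alone because this sketch makes no claim about version defaults.
    if journal.get("enabled") is False:
        warnings.append(
            "storage.journal.enabled is false: "
            "not recommended on production systems"
        )
    return warnings


# Options mirroring the relevant part of the startup log in the description.
options = {
    "storage": {
        "journal": {"enabled": False},
        "syncPeriodSecs": 10.0,
    }
}
for message in journaling_warnings(options):
    print(message)
```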
| Comment by Irene Lee [X] [ 13/Sep/18 ] |
|
Hi Nick, Besides, is it a bug? |
| Comment by Nick Brewer [ 13/Sep/18 ] |
|
Irene Yes, we would provide repaired files to be used in place of the current ones. -Nick |
| Comment by Irene Lee [X] [ 13/Sep/18 ] |
|
Hi Nick Thanks, |
| Comment by Nick Brewer [ 13/Sep/18 ] |
|
Irene If you upload the WiredTiger.wt and WiredTiger.turtle files from your dbpath, we can perform a repair attempt. Before doing so we need to confirm:
Thanks, |