[SERVER-37117] WiredTiger library panic Created: 13/Sep/18  Updated: 21/Sep/18  Resolved: 19/Sep/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Irene Lee [X] Assignee: Nick Brewer
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Windows
Participants:

 Description   

How to repair my DB with WT library panic?

2018-09-13T17:44:38.137+0800 I CONTROL [main] ***** SERVER RESTARTED *****
2018-09-13T17:44:38.589+0800 I CONTROL [main] Trying to start Windows service 'MongoDB'
2018-09-13T17:44:38.590+0800 I CONTROL [initandlisten] MongoDB starting : pid=1332 port=27017 dbpath=F:\Program Files\Microsoft Advanced Threat Analytics\Center\MongoDB\bin\data 64-bit host=conan-211
2018-09-13T17:44:38.590+0800 I CONTROL [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2018-09-13T17:44:38.590+0800 I CONTROL [initandlisten] db version v3.4.10
2018-09-13T17:44:38.590+0800 I CONTROL [initandlisten] git version: 078f28920cb24de0dd479b5ea6c66c644f6326e9
2018-09-13T17:44:38.590+0800 I CONTROL [initandlisten] allocator: tcmalloc
2018-09-13T17:44:38.590+0800 I CONTROL [initandlisten] modules: none
2018-09-13T17:44:38.590+0800 I CONTROL [initandlisten] build environment:
2018-09-13T17:44:38.591+0800 I CONTROL [initandlisten] distmod: 2008plus
2018-09-13T17:44:38.591+0800 I CONTROL [initandlisten] distarch: x86_64
2018-09-13T17:44:38.591+0800 I CONTROL [initandlisten] target_arch: x86_64
2018-09-13T17:44:38.591+0800 I CONTROL [initandlisten] options: { config: "F:\Program Files\Microsoft Advanced Threat Analytics\Center\MongoDB\bin\mongod.cfg", net: { bindIp: "127.0.0.1", port: 27017 }, operationProfiling: { slowOpThresholdMs: 1000 }, service: true, storage: { dbPath: "F:\Program Files\Microsoft Advanced Threat Analytics\Center\MongoDB\bin\data", journal: { enabled: false }, syncPeriodSecs: 10.0, wiredTiger: { engineConfig: { configString: "direct_io = (data)" } } }, systemLog: { destination: "file", logAppend: true, path: "F:\Program Files\Microsoft Advanced Threat Analytics\Center\MongoDB\bin\log\MongoDB.log" } }
2018-09-13T17:44:38.603+0800 I - [initandlisten] Detected data files in F:\Program Files\Microsoft Advanced Threat Analytics\Center\MongoDB\bin\data created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2018-09-13T17:44:38.604+0800 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=15871M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=10,log_size=2GB),statistics_log=(wait=0),direct_io = (data),log=(enabled=false),
2018-09-13T17:44:38.641+0800 E STORAGE [initandlisten] WiredTiger error (0) [1536831878:640687][1332:140709419767520], file: WiredTiger.wt, connection: WiredTiger.turtle: encountered an illegal file format or internal value
2018-09-13T17:44:38.641+0800 E STORAGE [initandlisten] WiredTiger error (-31804) [1536831878:641689][1332:140709419767520], file:WiredTiger.wt, connection: the process must exit and restart: WT_PANIC: WiredTiger library panic
2018-09-13T17:44:38.641+0800 I - [initandlisten] Fatal Assertion 28558 at src\mongo\db\storage\wiredtiger\wiredtiger_util.cpp 361
2018-09-13T17:44:38.641+0800 I - [initandlisten]

***aborting after fassert() failure
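
For anyone triaging a similar startup failure, the relevant lines in the log above can be pulled out mechanically. The following is a minimal sketch, not an official tool: it assumes the 3.4-style log line layout shown in this ticket, and the function name summarize_startup_failure is made up for illustration.

```python
import re

# Assumed layout (taken from the log in this ticket):
#   <timestamp> E STORAGE [<thread>] WiredTiger error (<code>) <message>
WT_ERROR = re.compile(
    r"^(\S+) E STORAGE\s+\[[^\]]+\] WiredTiger error \((-?\d+)\) (.*)$"
)
FASSERT = re.compile(r"Fatal Assertion (\d+)")


def summarize_startup_failure(lines):
    """Collect WiredTiger error codes and fatal assertion ids from log lines.

    Hypothetical helper: it only pattern-matches text and does not
    inspect or repair any data files.
    """
    wt_errors = []   # (timestamp, error code, message)
    fasserts = []    # fatal assertion ids
    for line in lines:
        match = WT_ERROR.match(line)
        if match:
            wt_errors.append(
                (match.group(1), int(match.group(2)), match.group(3))
            )
            continue
        match = FASSERT.search(line)
        if match:
            fasserts.append(int(match.group(1)))
    return {
        "wt_errors": wt_errors,
        "fasserts": fasserts,
        # True when any WiredTiger error message mentions WT_PANIC.
        "panic": any("WT_PANIC" in msg for _, _, msg in wt_errors),
    }
```

Feeding it the lines of MongoDB.log (for example via open(path) and iterating the file) gives a short summary to attach to a report alongside the failure cause and platform.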



 Comments   
Comment by Nick Brewer [ 21/Sep/18 ]

jolmedo I'm glad you were able to get up and running again. Performing an initial sync is the recommended approach. Alternatively, you can manually copy over a file snapshot as outlined here.

Please note that SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-user group.

-Nick

Comment by Jorge Olmedo [ 21/Sep/18 ]

Hi Nick. First of all, thanks for your time & attention.

We are up and running again after a lengthy repair (27 hours) of the MongoDB instance affected by the WiredTiger data file corruption. Everything works fine now. I just have one question and I hope you can help me. There is a gap (in time) between the primary and secondary replica set members (the primary is the affected one), so a resync is the option we have to pursue. We chose an initial sync: remove the dbpath contents of the secondary while it is stopped, then start it up. Is that the best choice? Thanks in advance.

Jorge
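
The gap between members and the progress of a resync can be read from the output of the replSetGetStatus command. The sketch below is an illustration only: it works on a plain dict shaped like that output (using the members, name, stateStr and optimeDate fields), the host names are hypothetical, and the helper names replication_lag and resync_finished are invented for this example. It assumes a member that is still performing its initial sync reports a state other than SECONDARY (typically STARTUP2).

```python
from datetime import datetime, timedelta


def replication_lag(status):
    """Return {member name: time behind the primary} for each SECONDARY."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def resync_finished(status, name):
    """True once the named member reports the SECONDARY state."""
    member = next(m for m in status["members"] if m["name"] == name)
    return member["stateStr"] == "SECONDARY"


# Hypothetical sample shaped like replSetGetStatus output.
_NOW = datetime(2018, 9, 21, 12, 0, 0)
SAMPLE_STATUS = {
    "members": [
        {"name": "node-a:27017", "stateStr": "PRIMARY", "optimeDate": _NOW},
        {"name": "node-b:27017", "stateStr": "SECONDARY",
         "optimeDate": _NOW - timedelta(seconds=5)},
        {"name": "node-c:27017", "stateStr": "STARTUP2",
         "optimeDate": datetime(1970, 1, 1)},
    ]
}
```

With a driver such as pymongo, the same dict could be obtained from the admin database's replSetGetStatus command; that step is left out here so the sketch stays self-contained.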

Comment by Nick Brewer [ 19/Sep/18 ]

Irene Glad to hear you got it working - I'll go ahead and close this ticket.

Some considerations to prevent storage-related issues in the future:

-Nick

Comment by Nick Brewer [ 19/Sep/18 ]

jolmedo Sorry to hear you're running into problems after an unclean shutdown. In your case, the best option is going to be to resync the affected node.

That said, if you're still running into any error messages related to WiredTiger corruption once you've performed a resync, please feel free to open a separate ticket.

-Nick

Comment by Irene Lee [X] [ 19/Sep/18 ]

Hi Nick,
After discussing internally, we restored the VM to repair MongoDB. Everything works well. Thanks a lot.

Irene

Comment by Jorge Olmedo [ 19/Sep/18 ]

Hi Nick

I'm facing this exact issue in my deployment. I'm running MongoDB 3.4.10, a sharded cluster with 5 primary nodes and their replica sets. One of them suffered an unclean shutdown. We are still finding out why, but my guess is a VMware ESX process that checks availability on all virtual machines and moves machines from one host to another. I remember reading some notes in the MongoDB online documentation about avoiding this kind of thing, but the people responsible for this task did not take my opinion into account when they deployed the MongoDB cluster to new hardware. So, here I am, asking for help.

If I didn't misunderstand what I read above, you have a procedure to repair these files. In my case, the files WiredTiger.turtle and WiredTiger.wt are OK; the issue happened in another file, named collection-4--xxxx.wt. Its size is about 23 GB, so I think there's no way to upload it. Do you think it would be possible for you to let me know how you perform the repair procedure? It would be a great help.

Thanks in advance for your time.

Jorge.

 

Comment by Irene Lee [X] [ 13/Sep/18 ]

Hi Nick,
Thanks for your quick reply.
I will discuss this with my team and get back to you.
The information and files I need to provide are the WiredTiger.wt and WiredTiger.turtle files, plus the following:

The cause of the failure (power failure, unclean shutdown, file corruption, etc) >> I think it is file corruption.

The platform (virtual machine, container, native hardware)

Irene

Comment by Nick Brewer [ 13/Sep/18 ]

Irene The process only takes a few minutes.

It's unlikely the root cause here is a bug - it looks like you have journaling disabled which, as the documentation states, is not recommended on production systems. However we do track all reported instances of WiredTiger corruption and would like to collect as much information on this as we can.

-Nick
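
Whether journaling is effectively on can also be checked from the "wiredtiger_open config:" line in the startup log, which in this ticket contains log=(enabled=true,...) followed later by log=(enabled=false). The sketch below rests on an assumption: that when a key repeats in a WiredTiger configuration string, the later setting wins. The function name journaling_enabled is hypothetical.

```python
import re

# Match a top-level "log=(enabled=...)" entry, but not "statistics_log=(...)".
_LOG_SETTING = re.compile(r"(?:^|,)\s*log=\(enabled=(true|false)")


def journaling_enabled(wt_open_config):
    """Return True/False from the last log=(enabled=...) entry, or None.

    Assumption: later settings in the config string override earlier ones.
    """
    values = _LOG_SETTING.findall(wt_open_config)
    if not values:
        return None
    return values[-1] == "true"
```

Applied to the config string logged at 17:44:38.604 in the description, this returns False, which is consistent with journal: { enabled: false } in the options line.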

Comment by Irene Lee [X] [ 13/Sep/18 ]

Hi Nick,
Thanks for the reply.
May I know how long it will take you to repair the files after I provide the information and the broken files?

Besides, is it a bug?

Comment by Nick Brewer [ 13/Sep/18 ]

Irene Yes, we would provide repaired files to be used in place of the current ones.

-Nick

Comment by Irene Lee [X] [ 13/Sep/18 ]

Hi Nick
May I know what your repair action will be?
Will you provide me with two repaired WT files to replace the old ones? Or...?

Thanks,
Irene

Comment by Nick Brewer [ 13/Sep/18 ]

Irene If you upload the WiredTiger.wt and WiredTiger.turtle files from your dbpath, we can perform a repair attempt. Before doing so we need to confirm:

  • The cause of the failure (power failure, unclean shutdown, file corruption, etc)
  • The platform (virtual machine, container, native hardware)

Thanks,
-Nick
