[SERVER-23346] WiredTiger.wt File corrupted (yet another one) Created: 25/Mar/16  Updated: 13/Aug/18  Resolved: 13/Apr/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: CK Lee Assignee: Unassigned
Resolution: Done Votes: 0
Labels: docker, envc, rge, rpu, trcf, wtc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File WiredTiger.turtle     File WiredTiger.wt     Zip Archive metrics.zip     File repair_attempt.tgz     Text File screendump.txt    
Operating System: Linux
Participants:

 Description   

I am running MongoDB 3.2.4 in a Docker container hosted in a Kubernetes cluster.

After a Kubernetes node restart, mongodb failed to start.
I have tried the repair command:

mongod --port 27000 --dbpath /data/dbfix --repair

But I am still getting the error below.

2016-03-25T07:59:26.865+0000 E STORAGE  [initandlisten] WiredTiger (-31802) [1458892766:865859][78:0x7f191a3afc80], file:WiredTiger.wt, connection: unable to read root page from file:WiredTiger.wt: WT_ERROR: non-specific WiredTiger error
2016-03-25T07:59:26.866+0000 E STORAGE  [initandlisten] WiredTiger (0) [1458892766:866236][78:0x7f191a3afc80], file:WiredTiger.wt, connection: WiredTiger has failed to open its metadata
2016-03-25T07:59:26.866+0000 E STORAGE  [initandlisten] WiredTiger (0) [1458892766:866365][78:0x7f191a3afc80], file:WiredTiger.wt, connection: This may be due to the database files being encrypted, being from an older version or due to corruption on disk
2016-03-25T07:59:26.866+0000 E STORAGE  [initandlisten] WiredTiger (0) [1458892766:866607][78:0x7f191a3afc80], file:WiredTiger.wt, connection: You should confirm that you have opened the database with the correct options including all encryption and compression options
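A minimal sketch of running the repair against a copy rather than the original files, given that a later comment on this ticket reports --repair starting to drop collections. It assumes the damaged files live in /data/db (the path named later in this ticket) and that /data/dbfix is a scratch directory. The function name copy_dbpath_for_repair is hypothetical.

```shell
# Copy a dbpath into a scratch directory so that a repair attempt
# never touches the original files.
# Usage: copy_dbpath_for_repair SRC DEST
copy_dbpath_for_repair() (
    src=$1
    dest=$2
    [ -d "$src" ] || { echo "no such dbpath: $src" >&2; exit 1; }
    # Refuse to overwrite an existing, non-empty scratch directory.
    if [ -d "$dest" ] && [ -n "$(ls -A "$dest")" ]; then
        echo "refusing to overwrite non-empty $dest" >&2
        exit 1
    fi
    mkdir -p "$dest" || exit 1
    # -R copies the whole tree, -p preserves modification times.
    cp -Rp "$src"/. "$dest"/
)

# Example (paths are assumptions taken from this ticket):
#   copy_dbpath_for_repair /data/db /data/dbfix &&
#       mongod --port 27000 --dbpath /data/dbfix --repair
```

The copy is refused when the destination already holds files, so an earlier repair attempt is not silently mixed with a fresh copy.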



 Comments   
Comment by basharat tamboli [X] [ 16/Mar/17 ]

Hi David,
thanks,

You need the entire dbpath as it was when you took the backup. MongoDB with WiredTiger performs consistency checks on boot by attempting to open each collection within the system. The list of collections is stored in both the WiredTiger.wt and _mdb_catalog.wt files, which also have integrity checking. If any one of the collections in the system at time of backup is not present or fails initial checksum validation then your instance will abort during boot.

This is verified: I tried copying the files related to a particular collection (the collection-*.wt file, the index-*.wt file and the _mdb_catalog.wt file), but it fails.

Now I take a backup of the full dbpath. If I restore it to the same database, even after a few changes such as dropping a collection, everything works fine and all processes are up and running. But the moment I try a partial restore (mostly a single-collection restore) by copying a few files from the backup, one of two things happens:

  1. The processes won't come up due to an error. (I now know the error comes from the consistency checks at process boot, thanks to David.)
  2. The processes come up fine, but my database does not show up. I can see all the information about that database in the config database, but nowhere else. (This happens when I copy the collection-*.wt and index-*.wt files for that collection, the _mdb_catalog.wt file and the WiredTiger*.wt files to the dbpath as the restore.)

Note: I am using a sharded cluster, but I also tried a standalone mongod process and the same things happen.

If you wish to restore a single collection, then you would need to first restore the whole dbpath into a stand-alone instance, then use database commands to drop the unwanted collections

I want to restore it back to the same database. Also, logically dropping all the unwanted collections would be a real pain and not a flexible solution, don't you think?

Comment by David Hows [ 16/Mar/17 ]

Hi Basharat,

I was trying to restore a specific collection using its data file. I wanted to know which files are related to a particular collection and need to be backed up in order to get that collection restored.

You need the entire dbpath as it was when you took the backup. MongoDB with WiredTiger performs consistency checks on boot by attempting to open each collection within the system. The list of collections is stored in both the WiredTiger.wt and _mdb_catalog.wt files, which also have integrity checking. If any one of the collections in the system at time of backup is not present or fails initial checksum validation then your instance will abort during boot.

If you wish to restore a single collection, then you would need to first restore the whole dbpath into a stand-alone instance, then use database commands to drop the unwanted collections.
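Because boot aborts when any collection present at backup time is missing, one way to sanity-check a restored dbpath is to compare its file names against the backup. The following is a minimal sketch under that assumption; the function name missing_from_restore is hypothetical, and it only checks that files are present, not that their checksums are valid.

```shell
# Print the name of every WiredTiger file that exists in BACKUP
# but not in RESTORED. Returns 0 only when nothing is missing.
# Usage: missing_from_restore BACKUP RESTORED
missing_from_restore() (
    backup=$1
    restored=$2
    status=0
    for f in "$backup"/*.wt "$backup"/WiredTiger.turtle; do
        # Skip an unmatched glob or an absent turtle file.
        [ -e "$f" ] || continue
        name=$(basename "$f")
        if [ ! -e "$restored/$name" ]; then
            echo "$name"
            status=1
        fi
    done
    exit "$status"
)

# Example (directory names are placeholders):
#   missing_from_restore /backup/dbpath /data/restore || echo "restore is incomplete"
```

An empty listing only says the file set matches the backup; mongod still performs its own integrity checks when it opens the files.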

Comment by basharat tamboli [X] [ 15/Mar/17 ]

First of all, thank you very much for this information; very little information is available about this. (I am looking forward to more of it. If you could point me to any documents related to this, that would be really nice.)

I was trying to restore a specific collection using its data file. I wanted to know which files are related to a particular collection and need to be backed up in order to get that collection restored.

I tried copying collection-*.wt and its related index file along with the _mdb_catalog.wt file, but it doesn't work.

Comment by Alexander Gorrod [ 15/Mar/17 ]

WiredTiger.wt contains information that tracks the state of the different tables in a WiredTiger database, including information about the most recent stable written data. WiredTiger flushes that file to disk regularly because it is necessary to be certain about its contents in order to re-open the database safely. To translate that into MongoDB: each MongoDB instance has a single WiredTiger database behind it, and at the moment each collection and index resides in a different WiredTiger table.

WiredTiger.turtle contains the metadata for the WiredTiger.wt metadata file, i.e. where WiredTiger.wt contains information about the content of each table in a database, WiredTiger.turtle contains information about the content of WiredTiger.wt.

sizeStorer.wt is a WiredTiger table populated by MongoDB that contains information about the size of collections and indexes. This is done as an optimization - because it is expensive to retrieve accurate document counts and size from WiredTiger (or most storage engines).
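As a quick first look at a dbpath, the three files described above can be checked for presence and size. This is only a minimal sketch: the function name report_metadata_files is hypothetical, and a file being present with a non-zero size says nothing about whether its contents are intact.

```shell
# Report whether each metadata file is present in DBPATH and how large it is.
# Usage: report_metadata_files DBPATH
report_metadata_files() (
    dbpath=$1
    for name in WiredTiger.turtle WiredTiger.wt sizeStorer.wt; do
        if [ -f "$dbpath/$name" ]; then
            # tr strips the padding some wc implementations add.
            size=$(wc -c < "$dbpath/$name" | tr -d ' ')
            echo "$name present $size bytes"
        else
            echo "$name MISSING"
        fi
    done
)

# Example (path is a placeholder):
#   report_metadata_files /data/db
```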

Comment by basharat tamboli [X] [ 15/Mar/17 ]

More information on WiredTiger.wt, WiredTiger.turtle and sizeStorer.wt would be a big help.

Comment by Alexander Gorrod [ 18/Apr/16 ]

chaokoon@gmail.com

My observation is that WiredTiger.wt (metadata) corrupts a lot more often than the collection.wt. But I'm not sure whether that is worth looking into or it is just unique to my case?

I believe I can explain why you see corruption in the WiredTiger metadata more often than in other collections. The WiredTiger metadata needs to be flushed to stable storage each time a change is made to add or remove a collection/index, as well as periodically for other reasons, whereas other collections only have their data flushed periodically. So the metadata is likely to be receiving a lot of flush operations. If the underlying disk storage subsystem does not provide reliable flush operations, corruption is more likely to occur in the metadata file than in the collection and index files.

Comment by CK Lee [ 16/Apr/16 ]

For manual recovery of WiredTiger.wt, these are the steps I used. They are for reference for other users who face a similar issue and have no better way to recover the database.
I managed to recover my database this way without losing any data.

1) Make a backup of your corrupted database first if you have not already done so. This recovery is very risky as it may wipe out your entire database.
2) Obtain the latest working copy of WiredTiger.wt and WiredTiger.turtle from your backup. (Note down the last-modified time of the files.)
3) Install wiredtiger-2.7.0 with the snappy plugin; see http://www.alexbevi.com/blog/2016/02/10/recovering-a-wiredtiger-collection-from-a-corrupt-mongodb-installation/
4) Copy the working copies of WiredTiger.wt and WiredTiger.turtle into your dbpath, overwriting the corrupted ones.
5) List the collection .wt files that have been modified since the last-modified time of the WiredTiger.wt backup file.
6) Run wt salvage only for the collection .wt files that have been modified since the last-modified time of the backup file.
e.g.: ./wt -v -h ../data -C "extensions=[./ext/compressors/snappy/.libs/libwiredtiger_snappy.so]" -R load -f ../collection.dump -r collection-2-880383588247732034
7) Run mongod --repair to fix the checksum error of sizeStorer.wt and rebuild the indices.
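The listing step above (finding collection files modified after the backed-up WiredTiger.wt) can be sketched with find. This is a minimal sketch, not the exact procedure used here: the function name list_tables_newer_than is hypothetical, and it assumes the backup copy of WiredTiger.wt still carries its original modification time (for example, it was copied with cp -p).

```shell
# Print every collection-*.wt file under DBPATH whose modification time
# is later than that of REFERENCE (the backed-up WiredTiger.wt).
# Usage: list_tables_newer_than DBPATH REFERENCE
list_tables_newer_than() (
    dbpath=$1
    reference=$2
    [ -f "$reference" ] || { echo "reference file not found: $reference" >&2; exit 1; }
    # -newer compares modification times; sort keeps the output stable.
    find "$dbpath" -name 'collection-*.wt' -newer "$reference" | sort
)

# Example (paths are placeholders):
#   list_tables_newer_than /data/db /backup/WiredTiger.wt
```

The output is a list of candidate files only. Each one still has to be handled individually with the wt utility as described in the steps.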

Comment by CK Lee [ 16/Apr/16 ]

Thanks anonymous.user. I have restored from my backup and, based on your recommendation, migrated the database from the Kubernetes cluster to run on Docker with the volume mounted locally. I also downgraded to 3.2.3.

I have tried to reproduce the scenario by terminating the pod unexpectedly, but it wasn't very consistent. My observation is that WiredTiger.wt (metadata) corrupts a lot more often than the collection.wt. But I'm not sure whether that is worth looking into or it is just unique to my case?

Comment by Kelsey Schubert [ 13/Apr/16 ]

Hi ck-lee,

Unfortunately, this behavior indicates that the corruption was not limited to the WiredTiger.wt file. In this circumstance, my best recommendation would be to restore using your last good backup.

There is very little any database can do about disk-level corruption, other than detect there is a problem. In particular, this issue appears to be the result of using NFS improperly. I would strongly recommend considering a different setup with local storage.

If you do choose to continue to run in this environment, I would recommend at least setting up a replica set, which would provide a level of fault tolerance against the loss of a single database server.

For MongoDB-related support discussion please post on the mongodb-users group or Stack Overflow with the mongodb tag. Questions regarding MongoDB setup would be best posted on the mongodb-users group.

Kind regards,
Thomas

Comment by CK Lee [ 29/Mar/16 ]

Thanks Ramon. I am still having issues when I try to start or repair the database. I will attach the screen dump. When I did the --repair, it started to drop all my collections.

Comment by Ramon Fernandez Marina [ 28/Mar/16 ]

ck-lee, I've uploaded the repair_attempt.tgz file with the result of a repair attempt. Please extract it on your dbpath and try again.

Please note the limitations for running on NFS in our documentation, and make sure your provider follows the necessary requirements.

SERVER-19815 is open to improve what --repair can do in this and other situations, so feel free to watch it and/or vote for it.

Lastly, if you are able to reproduce these problems in a simple environment (e.g.: local nfs mount) and want to share your results with us that would be of great help, as it would help with SERVER-19815.

Thanks,
Ramón.

Comment by CK Lee [ 25/Mar/16 ]

Can you help me recover the WiredTiger.wt and WiredTiger.turtle files? I'm stuck now, as I realized my last good backup is from the 9th of March. I'm happy to try it myself if you can provide a guide on how you helped with https://jira.mongodb.org/browse/SERVER-23122

Comment by CK Lee [ 25/Mar/16 ]

The setup in which WiredTiger.wt gets corrupted is:

  • A standalone mongodb instance. (Not a replica set)
  • The mongodb daemon runs on a kubernetes pod using the official mongodb:3.2.4 docker image.
  • The files in /data/db are persisted using a kubernetes persistent volume (nfs mount), hosted on a separate file server (coreos).
  • The pod can be moved to different nodes without causing issues if the pod is shut down properly.

I had an outage on a node, which caused the pod not to shut down properly. When kubernetes then tried to start the mongodb pod on a different node, it failed because WiredTiger.wt is corrupted. Only the files in /data/db are persisted, so I don't have other log files. But I do have the diagnostic.data folder. I hope that helps.

In my case, I believe it is an operational problem that corrupted the WiredTiger.wt file, because mongod did not exit properly. But the issue is that the database cannot be recovered by running with the --repair flag.

Would you like me to try and reproduce this to help determine if there is a bug that corrupts WiredTiger.wt on an unclean exit? I am keen to help to prevent this in the future.

Comment by Ramon Fernandez Marina [ 25/Mar/16 ]

ck-lee, can you please provide some more details on the setup and, specifically, how this node was restarted? Being able to see the logs a few minutes before and after the restart should help us determine if this is a bug in the server or an operational problem.

Thanks,
Ramón.

Comment by CK Lee [ 25/Mar/16 ]

I have attached my WiredTiger.wt and WiredTiger.turtle files. Can this be recovered? Thank you.

Comment by CK Lee [ 25/Mar/16 ]

This issue is similar to https://jira.mongodb.org/browse/SERVER-23122

Generated at Thu Feb 08 04:03:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.