[SERVER-19815] Improved mongod --repair option for WiredTiger Created: 06/Aug/15  Updated: 22/Jun/22  Resolved: 18/Sep/18

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.5
Fix Version/s: 4.0.3, 4.1.4

Type: Improvement Priority: Major - P3
Reporter: Michael Cahill (Inactive) Assignee: Louis Williams
Resolution: Done Votes: 37
Labels: nyc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Documented
is documented by DOCS-12088 Docs for SERVER-19815: Make repair mo... Closed
Duplicate
is duplicated by SERVER-23532 WT Library Panic Closed
is duplicated by SERVER-22816 Corrupt metadata after unexpected shu... Closed
is duplicated by SERVER-32451 Cannot start mongod with a missing wi... Closed
is duplicated by TOOLS-1496 provide tool to repair corruputed dat... Closed
is duplicated by SERVER-29555 Make repair more robust, or optionall... Closed
is duplicated by SERVER-29557 Allow healthy databases to skip repairs Closed
Related
is related to SERVER-18640 Wiredtiger does not recover from uncl... Closed
is related to SERVER-26924 Cannot start or --repair mongod becau... Closed
is related to SERVER-36633 Use WiredTiger log file salvage to re... Closed
Backwards Compatibility: Minor Change
Backport Requested:
v4.0
Sprint: Storage NYC 2018-06-18, Storage NYC 2018-09-10, Storage NYC 2018-09-24

 Description   
Issue Status as of October 1st, 2018

ISSUE DESCRIPTION AND IMPACT
The mongod --repair option was originally introduced for the MMAPv1 storage engine. When it is used with WiredTiger, attempts to recover a corrupted dbpath via mongod --repair can fail in a number of specific scenarios.
Enhanced repair functionality allows mongod --repair to successfully recover from a wider variety of faulty conditions that previously would have resulted in a repair failure. It’s important to note that these changes do not allow the mongod to recover otherwise unretrievable data; instead, they ensure that the data set is returned to a working state with as much data as the process was able to salvage.
In addition to a more robust repair mechanism, this change adds the following new behavior:

  • If the repair operation modifies data on a node in a replica set, that node will not be able to rejoin the replica set until it has been fully resynced. This behavior is designed to prevent a situation in which a node holding only the partial data recovered via mongod --repair could become a replica set primary, as this would result in data effectively going missing.
  • If a repair operation fails for any reason, the node will not be able to start up again without the mongod --repair option. This precaution is included to prevent instances where the mongod is repeatedly restarted with a broken data set, potentially resulting in additional data corruption.

DIAGNOSIS AND AFFECTED VERSIONS
This issue is exhibited whenever a mongod --repair command fails to start the mongod and instead returns an error message. Several error messages can be returned; two of the most common are:

Fatal Assertion 28558 at src\mongo\db\storage\wiredtiger\wiredtiger_util.cpp 

WiredTiger.wt: encountered an illegal file format or internal value

These are only some of the most common messages; most mongod --repair operations that fail to start the mongod exhibit this issue.
This issue affects MongoDB versions 3.0 - 4.0.2 that use the WiredTiger storage engine.
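
As a rough aid for diagnosis, the two messages quoted above can be searched for in a mongod log. The following is a minimal sketch, not an official tool: the function name find_repair_failures is hypothetical, the signature list contains only the two messages quoted in this ticket, and real logs may show other failure messages.

```python
# Signatures quoted in this ticket; other failure messages are possible.
KNOWN_REPAIR_FAILURE_SIGNATURES = (
    "Fatal Assertion 28558",
    "encountered an illegal file format or internal value",
)


def find_repair_failures(log_lines):
    """Return (line_number, line) pairs that contain a known signature.

    log_lines is any iterable of strings, for example an open log file.
    """
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if any(sig in line for sig in KNOWN_REPAIR_FAILURE_SIGNATURES):
            hits.append((number, line.rstrip("\n")))
    return hits
```

It could be called with an open file handle for whatever log file the mongod was configured to write.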

REMEDIATION AND WORKAROUNDS
Currently, the only available workarounds are to resync from a healthy node in a replica set, to restore the dbpath from an earlier backup, or to open a SERVER project ticket requesting a manual repair attempt of the WiredTiger metadata files.

FIX VERSIONS
This issue is fixed in MongoDB 4.0.3 as well as in 4.1.4, and will be available in the 4.2 production release.

Original description

The repair loop should be more forgiving about failures such as missing files, and it should handle collections or indexes missing from the catalog by emitting a big warning message.



 Comments   
Comment by Kalyan Kumar A [ 16/Jul/20 ]

Hi Louis Williams 

Thanks. I have a similar issue with version 4.0.19. Are the above fixes applicable to my version? Can you please update?

Thanks

Comment by Louis Williams [ 21/Sep/18 ]

All changes described in my previous comment have been backported to 4.0, with the exception of the removal of the repairDatabase command. This command's behavior remains unchanged in 4.0.

Comment by Louis Williams [ 18/Sep/18 ]

Starting mongod with --repair on a WiredTiger data directory now handles and recovers from the following scenarios:

  • Corrupt .wt files (existing behavior)
    • Collections are salvaged by discarding corrupt data
    • Indexes are unconditionally rebuilt
  • Missing .wt data files (for both collections and indexes)
  • Unsalvageable collection data files
    • SERVER-35782 Repair moves aside unsalvageable data files and creates empty ones in their place
  • Corrupt WiredTiger metadata files
    • SERVER-35629 Salvage corrupt WiredTiger.wt/WiredTiger.turtle files by discarding corrupt data
  • “Orphaned” data files
    • SERVER-28734 Recover collection files missing from the WiredTiger metadata, but present in the _mdb_catalog
    • SERVER-35696 Recover collection files missing from the _mdb_catalog, but present in WiredTiger
    • Note: there is no support for "importing" files that are missing from both metadata sources

Additionally, --repair has the following new behavior:

  • SERVER-35731 If a repair operation modifies data, the node will not be able to rejoin a replica set without a full resync
    • Note: if a repair operation fails for any reason, the node will be unable to start up again without the --repair option.
  • SERVER-28990 MongoDB will not bind to a port when started with --repair.
  • SERVER-36208 The repairDatabase command has been removed in 4.1
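
The "orphaned" data file cases above can be pictured as set comparisons between the two metadata sources and the files on disk. This is only an illustrative sketch and not how mongod implements repair: the function classify_idents, its parameters, and the result keys are all hypothetical names.

```python
def classify_idents(catalog_idents, wt_metadata_idents, files_on_disk):
    """Group data-file identifiers by which metadata source knows them.

    All three arguments are iterables of identifier strings. The names
    used here are illustrative and are not taken from the server code.
    """
    catalog = set(catalog_idents)
    wt_meta = set(wt_metadata_idents)
    on_disk = set(files_on_disk)
    return {
        # Known to both sources and present on disk.
        "consistent": on_disk & catalog & wt_meta,
        # In _mdb_catalog but not in WiredTiger metadata (SERVER-28734).
        "missing_from_wt_metadata": (on_disk & catalog) - wt_meta,
        # In WiredTiger metadata but not in _mdb_catalog (SERVER-35696).
        "missing_from_catalog": (on_disk & wt_meta) - catalog,
        # Unknown to both sources: no "import" support exists.
        "unknown_to_both": on_disk - catalog - wt_meta,
        # Referenced by some metadata source but absent from disk.
        "missing_data_file": (catalog | wt_meta) - on_disk,
    }
```

Only the second and third groups correspond to the orphan recovery described above; the fourth group is the case the note says is unsupported.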

Comment by Kelsey Schubert [ 04/Jul/17 ]

Hi ccornel,

This issue would be best addressed in a new SERVER ticket. Would you please open one so we can investigate?

Thank you,
Thomas

Comment by Carlos Cornel [ 04/Jul/17 ]

Hi
I tried the previous steps, but I have had no luck using the wt tools to repair the wiredtiger.wt file. What else would you recommend doing to extract the data?

I hope someone can help please...

my files
https://drive.google.com/open?id=0BwyEAdclOsc_X0Nfa3RRd3RMT0k

Regards

Carlos

Comment by Alexander Gorrod [ 08/Jan/17 ]

"Once I have .wt files, I should be able to recreate the database. Now there is no way I can recreate the database from the .wt files if I don't have the other files like WiredTiger .."

That is correct - you need the other files from the MongoDB database directory. Those files contain important metadata information that is necessary for WiredTiger to know which data is where, and how to access the data in those files.

If the files in your database directory have become corrupted, there are several steps to take to get back online. The following options are listed in priority order:

  1. Re-sync from another node in an active replica set.
  2. Restore the full database from a recent backup copy.
  3. Start MongoDB with the repair option, and have it attempt to automatically recover data.
  4. Manually retrieve data from the corrupted files, and attempt to reconstruct the original data set.

The goal of this ticket is to move the line between options 3 and 4, so that automatic repair is possible in more cases.
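
The priority order above can be written down as a small helper that lists the applicable options, most preferred first. This is a sketch under stated assumptions: the function recovery_plan and its two boolean parameters are hypothetical, and whether repair or manual retrieval succeeds can only be learned by trying them.

```python
def recovery_plan(has_healthy_replica_member, has_recent_backup):
    """Return the applicable recovery options, most preferred first."""
    plan = []
    if has_healthy_replica_member:
        plan.append("Re-sync from another node in an active replica set")
    if has_recent_backup:
        plan.append("Restore the full database from a recent backup copy")
    # The last two options need no outside copy of the data, so they
    # are always listed, in the same priority order as above.
    plan.append("Start MongoDB with the repair option")
    plan.append("Manually retrieve data from the corrupted files")
    return plan
```

For a standalone node without backups, recovery_plan(False, False) leaves only the last two options, which is the case this ticket aims to improve.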

Comment by Muhammad Haris NP [ 08/Jan/17 ]

Once I have .wt files, I should be able to recreate the database. Now there is no way I can recreate the database from the .wt files if I don't have the other files like WiredTiger ..

Comment by Stefan Rogin [ 23/Jun/16 ]

This is an important feature: the ability to ignore corrupt collections when recovering a database that has many collections, of which only a tiny or recoverable one is corrupt or out of alignment.

Comment by Asya Kamsky [ 01/Jun/16 ]

Users have requested the ability to start mongod and either manually or automatically "drop" entities that don't have files/directories present.

Comment by Alexander Gorrod [ 19/Oct/15 ]

Thanks david.hows. The categorization is taking shape now. Can you think of a way to test the different scenarios? Preferably as part of the regular MongoDB test suite.

Comment by David Hows [ 14/Oct/15 ]

Database files missing

  • An entry for the file exists in the catalogue, but the file is gone from disk
  • This is impossible to recover from, so remove the entry from the catalogue
  • Warn the user strongly about this (error message)

Database files corrupted

  • An entry for the file exists in the catalogue, but the file on disk cannot be opened
  • Attempt to rename the collection with WiredTiger to a new table whose name indicates that it is corrupted
  • Re-create the collection under the same name (in order to continue the repair)
  • Warn the user strongly about this problem and about the creation of the new collection

Index files missing

  • An entry exists in the catalogue, but the file is gone from disk
  • Build the index as part of repair

Index files corrupted

  • An entry exists in the catalogue, but the file on disk cannot be opened
  • Drop, then rebuild the index as part of repair

MongoDB catalogue metadata may be out of alignment with the WT files on disk

  • When something is missing on disk, this should be resolved by the changes above
  • When something is missing from the catalogue metadata but exists as a WT table on disk, we have no recourse. We would need a user-accessible import function
  • If the WiredTiger metadata is corrupt, then the database is corrupt
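
The categorization above amounts to a lookup from (kind of file, state) to a list of actions. The sketch below restates this comment's proposal as a table, which may also help when writing tests for each scenario. It is not the behavior that eventually shipped, and every name in it is hypothetical.

```python
# Proposal from this comment, keyed by (kind, state). Illustrative only.
PROPOSED_ACTIONS = {
    ("collection", "missing"): [
        "remove the entry from the catalogue",
        "warn the user",
    ],
    ("collection", "corrupted"): [
        "rename the table to a name that marks it as corrupted",
        "re-create the collection under the same name",
        "warn the user",
    ],
    ("index", "missing"): [
        "build the index",
    ],
    ("index", "corrupted"): [
        "drop the index",
        "rebuild the index",
    ],
}


def proposed_actions(kind, state):
    """Return the proposed repair actions for one catalogue entry."""
    try:
        return list(PROPOSED_ACTIONS[(kind, state)])
    except KeyError:
        raise ValueError(f"no proposal for {kind!r} in state {state!r}")
```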