  1. Core Server
  2. SERVER-15109

An error shutting down can prevent restart despite nothing being wrong

Details

    • Bug
    • Status: Closed
    • Major - P3
    • Resolution: Won't Fix
    • None
    • None
    • MMAPv1, Storage
    • None
    • Storage Execution
    • ALL
    • 0

    Description

      If mongod encounters an error during shutdown after the journal files have been cleaned up but before mongod.lock is cleared, it will refuse to start on the next attempt. If the same error occurs before the journal is deleted, mongod will start correctly.

      This leads to the bizarre operational condition that, when journaling is enabled, mongod is more likely to start after a SIGKILL (kill -9) than after a SIGTERM (kill -15).

      mongod halts at startup if it finds a non-empty mongod.lock file but cannot locate any journal files. During shutdown, mongod deletes the journal files after flushing data to disk and before clearing mongod.lock. However, certain other tasks are carried out in between these two operations, and clearing the content of mongod.lock is the last thing done. The content of mongod.lock is therefore cleared too late in the shutdown sequence to be a reliable indicator of whether the journal files were applied successfully (and then deleted) as opposed to going missing.
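
      The startup check described above can be modelled with a minimal Python sketch. This is only an illustration of the behaviour as reported in this ticket, not mongod's actual code: the function names, the helper that builds a throwaway dbpath, and the journal file name are all assumptions.

```python
import os
import tempfile


def make_dbpath(lock_content, journal_files):
    """Hypothetical helper: build a throwaway dbpath for the demonstration."""
    dbpath = tempfile.mkdtemp()
    with open(os.path.join(dbpath, "mongod.lock"), "w") as f:
        f.write(lock_content)
    os.mkdir(os.path.join(dbpath, "journal"))
    for name in journal_files:
        with open(os.path.join(dbpath, "journal", name), "w") as f:
            f.write("placeholder")
    return dbpath


def will_start(dbpath):
    """Model of the startup check as described in this ticket (an assumption)."""
    lock_path = os.path.join(dbpath, "mongod.lock")
    journal_dir = os.path.join(dbpath, "journal")
    lock_is_empty = not os.path.exists(lock_path) or os.path.getsize(lock_path) == 0
    if lock_is_empty:
        # An empty lock file means the previous shutdown completed.
        return True
    # Non-empty lock: the previous run did not finish shutting down, so
    # startup is only allowed if there is a journal to recover from.
    return os.path.isdir(journal_dir) and len(os.listdir(journal_dir)) > 0


# SIGKILL while running: lock still populated, journal still on disk.
after_sigkill = make_dbpath("12345\n", ["journal-file"])
# SIGTERM where shutdown failed after journal deletion but before lock clearing.
after_failed_sigterm = make_dbpath("12345\n", [])
# Shutdown that ran to completion.
after_clean_shutdown = make_dbpath("", [])

print(will_start(after_sigkill))         # True
print(will_start(after_failed_sigterm))  # False
print(will_start(after_clean_shutdown))  # True
```

      In this model the failed SIGTERM is the only case that refuses to start, even though the data had already been flushed to disk.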

      1. The error message printed when starting mongod states that "this is likely human error or filesystem corruption". However, there is no human error or filesystem corruption in this case.
      2. The recovery procedure documented at http://dochub.mongodb.org/core/repair indicates that it is for the case where journaling is turned off. We hit this with journaling on.

      Possible alternative:
      Given that replaying the journal files is idempotent, mongod could leave a "journal is clear" signal file indicating that the journal has been fully applied (i.e. "data is stable"). The deletion of the journal files can then proceed. Should mongod crash or halt for whatever reason after this point, either the journal files will persist or the signal file indicating that the journal was already applied will persist. Either way, mongod can unambiguously determine the stability of the data files the next time it is started.
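
      A minimal Python sketch of how this alternative might behave is given below. It is an illustration under stated assumptions, not a proposed patch: the signal file name `journal.clear`, the function names, the placeholder journal file name, and the exact ordering of the simulated shutdown steps are hypothetical. It also does not model when the signal file would be removed again, which the proposal above leaves open.

```python
import os
import tempfile
from enum import Enum

MARKER = "journal.clear"  # hypothetical name; the ticket does not specify one


class Startup(Enum):
    CLEAN = "previous shutdown completed"
    REPLAY_JOURNAL = "replay journal files"
    DATA_STABLE = "journal already applied, safe to start"
    REFUSE = "journal missing without explanation"


def make_running_dbpath():
    """Hypothetical helper: a dbpath as it might look while mongod is running."""
    dbpath = tempfile.mkdtemp()
    os.mkdir(os.path.join(dbpath, "journal"))
    with open(os.path.join(dbpath, "mongod.lock"), "w") as f:
        f.write("12345\n")
    with open(os.path.join(dbpath, "journal", "journal-file"), "w") as f:
        f.write("placeholder")
    return dbpath


def simulate_shutdown(dbpath, fail_after_journal_delete):
    """Shutdown order under the proposal; flushing data is not modelled."""
    journal_dir = os.path.join(dbpath, "journal")
    # Write the signal file first, once the data is stable.
    open(os.path.join(dbpath, MARKER), "w").close()
    # Only then delete the journal files.
    for name in os.listdir(journal_dir):
        os.remove(os.path.join(journal_dir, name))
    # Other shutdown tasks run here and may fail.
    if fail_after_journal_delete:
        return
    # Clearing mongod.lock is still the last step.
    open(os.path.join(dbpath, "mongod.lock"), "w").close()


def decide_startup(dbpath):
    lock_path = os.path.join(dbpath, "mongod.lock")
    journal_dir = os.path.join(dbpath, "journal")
    if not os.path.isfile(lock_path) or os.path.getsize(lock_path) == 0:
        return Startup.CLEAN
    if os.path.isdir(journal_dir) and os.listdir(journal_dir):
        # Replay is idempotent, so this is safe even if the marker exists.
        return Startup.REPLAY_JOURNAL
    if os.path.exists(os.path.join(dbpath, MARKER)):
        return Startup.DATA_STABLE
    return Startup.REFUSE


dbpath = make_running_dbpath()
simulate_shutdown(dbpath, fail_after_journal_delete=True)
print(decide_startup(dbpath))  # Startup.DATA_STABLE
```

      With the signal file written before the journal is deleted, the failed shutdown from this ticket is distinguishable from a journal that has genuinely gone missing, which would still be refused.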

            People

              backlog-server-execution Backlog - Storage Execution Team
              andrew.ryder@mongodb.com Andrew Ryder (Inactive)
              Votes: 1
              Watchers: 7