[SERVER-15109] An error shutting down can prevent restart despite nothing being wrong Created: 02/Sep/14  Updated: 06/Dec/22  Resolved: 14/Sep/18

Status: Closed
Project: Core Server
Component/s: MMAPv1, Storage
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Andrew Ryder (Inactive) Assignee: Backlog - Storage Execution Team
Resolution: Won't Fix Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-32091 Powercycle - remove mongod.lock file ... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Participants:
Case:
Linked BF Score: 0

 Description   

If MongoD encounters an error during shutdown after the journal files have been cleaned up but before mongod.lock has been cleared, it will subsequently refuse to start. If the same error occurs before the journal is deleted, MongoD will subsequently start correctly.

This leads to the bizarre operational condition that, when journaling is enabled, MongoD is more likely to start after a SIGKILL (kill -9) than after a SIGTERM (kill -15).

MongoD halts if it finds a non-empty mongod.lock file but cannot locate journal files. During shutdown, MongoD deletes the journal files after flushing data to disk and before clearing mongod.lock. However, certain other tasks are carried out in between these operations, and clearing the content of mongod.lock is the last thing done. The content of mongod.lock is therefore cleared too late in the shutdown sequence to be a reliable indicator of whether the journal files were applied successfully (and were deleted) as opposed to going missing.
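
The failure window described above can be illustrated with a small model. This is only a sketch of the behaviour as reported here, not MongoD's actual code: the step names, the `Startup` enum and both functions are hypothetical, and the rule that an empty mongod.lock means a clean start is an assumption.

```python
from enum import Enum
from typing import Optional, Tuple


class Startup(Enum):
    CLEAN = "clean start"
    REPLAY_JOURNAL = "replay journal, then start"
    REFUSE = "refuse to start"


# Shutdown order as described in this ticket; the step names are hypothetical.
SHUTDOWN_STEPS = ("flush_data", "delete_journal", "other_tasks", "clear_lock")


def state_after_shutdown(fails_at: Optional[str]) -> Tuple[bool, bool]:
    """Return (lock_nonempty, journal_present) when shutdown aborts at `fails_at`.

    The failing step is treated as not having happened. None means the
    shutdown ran to completion.
    """
    lock_nonempty, journal_present = True, True
    for step in SHUTDOWN_STEPS:
        if step == fails_at:
            break
        if step == "delete_journal":
            journal_present = False
        elif step == "clear_lock":
            lock_nonempty = False
    return lock_nonempty, journal_present


def startup_decision(lock_nonempty: bool, journal_present: bool) -> Startup:
    """Startup rule as described: non-empty lock without journal files halts."""
    if not lock_nonempty:
        return Startup.CLEAN  # assumption: an empty lock means a clean shutdown
    if journal_present:
        return Startup.REPLAY_JOURNAL
    return Startup.REFUSE


if __name__ == "__main__":
    for step in SHUTDOWN_STEPS + (None,):
        print(step, "->", startup_decision(*state_after_shutdown(step)).value)
```

In this model a SIGKILL corresponds to aborting at `flush_data` (nothing has been cleaned up yet), which leaves the journal in place and allows a restart. An error in `other_tasks` or `clear_lock` leaves a non-empty lock with no journal, which is the state that refuses to start.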

1. The error message shown when starting mongod states that "this is likely human error or filesystem corruption.". However, in this case there was no human error or filesystem corruption.
2. The recovery procedure documented at http://dochub.mongodb.org/core/repair indicates that it is for the case where journaling is turned off. We hit this with journaling on.

Possible alternative:
Given that the journal files are idempotent, MongoD could leave a "journal is clear" signal file indicating that the journal was cleared down correctly (i.e. "data is stable"). The deletion of the journal files can then proceed. Should MongoD crash or halt for whatever reason after this point, either the journal files will persist or the signal file indicating that the journal was already applied will persist. Either way, MongoD can unambiguously determine the stability of the data files the next time it is started.
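
A minimal sketch of the proposed ordering follows, run against a temporary directory. Everything in it is illustrative: the marker file name `journal.clean`, the step names and the helper functions are hypothetical, and the journal file name and lock content are placeholders. A real implementation would also need to make the marker durable (for example with fsync) before deleting the journal, and to remove the marker once journaling resumes on the next start.

```python
import tempfile
from pathlib import Path
from typing import Optional

MARKER_NAME = "journal.clean"  # hypothetical name, not a real mongod file
STEPS = ("write_marker", "delete_journal", "other_tasks", "clear_lock")


def make_running_dbpath(root: Path) -> Path:
    """Create a dbpath resembling a running server: non-empty lock, one journal file."""
    (root / "journal").mkdir()
    (root / "journal" / "j._0").write_bytes(b"journal data")
    (root / "mongod.lock").write_text("12345\n")
    return root


def shutdown(dbpath: Path, fails_at: Optional[str] = None) -> None:
    """Run the proposed shutdown order, aborting just before `fails_at`."""
    for step in STEPS:
        if step == fails_at:
            return
        if step == "write_marker":
            # Written only after data has been flushed (flush not modelled here).
            (dbpath / MARKER_NAME).write_text("data is stable\n")
        elif step == "delete_journal":
            for journal_file in (dbpath / "journal").iterdir():
                journal_file.unlink()
        elif step == "clear_lock":
            (dbpath / "mongod.lock").write_text("")
        # "other_tasks" is a placeholder and changes nothing on disk.


def startup_decision(dbpath: Path) -> str:
    """Decide how to start from what is found on disk."""
    lock = dbpath / "mongod.lock"
    lock_nonempty = lock.exists() and lock.stat().st_size > 0
    journal_dir = dbpath / "journal"
    journal_present = journal_dir.is_dir() and any(journal_dir.iterdir())
    marker_present = (dbpath / MARKER_NAME).exists()

    if not lock_nonempty or marker_present:
        # The marker says the journal was already applied, so leftover
        # journal files (if any) are not needed for recovery.
        return "start"
    if journal_present:
        return "replay journal, then start"
    return "refuse to start"


def decision_after(fails_at: Optional[str]) -> str:
    """Simulate a shutdown that aborts at `fails_at`, then a restart."""
    with tempfile.TemporaryDirectory() as tmp:
        dbpath = make_running_dbpath(Path(tmp))
        shutdown(dbpath, fails_at)
        return startup_decision(dbpath)


if __name__ == "__main__":
    for step in STEPS + (None,):
        print(step, "->", decision_after(step))
```

Under these assumptions no abort point in the shutdown sequence produces "refuse to start". That outcome remains reserved for the case where the lock is non-empty and both the journal and the marker are missing, which is the case that would actually suggest human error or filesystem corruption.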


Generated at Thu Feb 08 03:36:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.