[SERVER-4607] cases of memory corruption in mongod and mongos on exit Created: 03/Jan/12  Updated: 06/Dec/22  Resolved: 15/Nov/16

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Aaron Staple Assignee: Backlog - Storage Execution Team
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Storage Execution
Operating System: ALL
Participants:

 Description   

Analysis below is relatively speculative, and I haven't proven any of it via testing. Hopefully this ticket is at least useful as a survey of cases where we've seen memory corruption during shutdown, often as a cascade failure after an earlier failure triggered the shutdown itself.

It looks like memory corruption can be triggered after an unclean shutdown, and potentially after a clean shutdown too. From the examples I've seen, it looks like double frees occur because global objects whose members manage their own heap memory are destroyed as the process exits, and the heap memory of those members is then freed a second time as a result of actions taken by a thread that is still running.
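One common way to avoid this class of problem is sketched below. It is only an illustration under stated assumptions, not code from the server: UsageStats, globalStats, StatsWorker and runWorkerBriefly are hypothetical names. The global is heap-allocated and intentionally never deleted, so no destructor runs on it at exit, and the worker thread is joined before its owner goes away.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for a global whose members own heap memory.
struct UsageStats {
    std::mutex mu;
    std::map<std::string, long> usage;
};

// Allocated once and intentionally never deleted: static destruction at
// process exit cannot free the map while a thread may still be using it.
UsageStats& globalStats() {
    static UsageStats* const instance = new UsageStats();
    return *instance;
}

// Hypothetical background worker that keeps touching the global.
class StatsWorker {
public:
    void start() {
        _thread = std::thread([this] { run(); });
    }

    // Ask the worker to stop, then wait until it has really finished.
    void stopAndJoin() {
        _stop.store(true);
        if (_thread.joinable()) {
            _thread.join();
        }
    }

    ~StatsWorker() {
        stopAndJoin();
    }

private:
    void run() {
        while (!_stop.load()) {
            {
                std::lock_guard<std::mutex> lk(globalStats().mu);
                ++globalStats().usage["ticks"];
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    std::atomic<bool> _stop{false};
    std::thread _thread;
};

// Runs the worker until it has recorded at least one tick, then stops it
// and returns the number of ticks observed after the join.
long runWorkerBriefly() {
    StatsWorker worker;
    worker.start();
    for (;;) {
        {
            std::lock_guard<std::mutex> lk(globalStats().mu);
            if (globalStats().usage["ticks"] > 0) {
                break;
            }
        }
        std::this_thread::yield();
    }
    worker.stopAndJoin();
    std::lock_guard<std::mutex> lk(globalStats().mu);
    long ticks = globalStats().usage["ticks"];
    assert(ticks > 0);
    return ticks;
}
```

The leak is deliberate: the operating system reclaims the memory when the process exits. The trade-off is that the object's destructor never runs, so this only suits globals whose destructors have no required side effects.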

Here are some historical Jira cases where I've seen this occur:

1)

In Top::cloneMap, called by SnapshotData::takeSnapshot. Potentially caused by the global statsSnapshots variable being destroyed, along with one of its SnapshotData objects, after which takeSnapshot is called on that SnapshotData and tries to reassign its _usage map. If the _usage map was already destroyed and its heap memory freed, the reassignment would attempt to free that heap memory again. Stack traces that might be related to this can be found in:

FREE-3600
SERVER-2695
SERVER-4190

It looks like SnapshotThread::run checks inShutdown(), but if a shutdown begins after the inShutdown() check and before or during the call to takeSnapshot(), the same double free might occur as in the unclean shutdown cases.
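The gap between the inShutdown() check and the call to takeSnapshot() is a check-then-act race. A minimal sketch of one way to close it follows, assuming a hypothetical ShutdownGate class (not something that exists in the server as far as I know): the check and the work run under the same mutex, and shutdown must take that mutex before setting the flag, so it cannot begin halfway through a snapshot.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical gate: the shutdown check and the guarded work share one lock.
class ShutdownGate {
public:
    // Runs work() only if shutdown has not begun. The lock is held for the
    // duration of work(), so beginShutdown() blocks until work() returns.
    template <typename Work>
    bool runUnlessShutdown(Work&& work) {
        std::lock_guard<std::mutex> lk(_mu);
        if (_inShutdown) {
            return false;
        }
        work();
        return true;
    }

    // Marks shutdown as started; waits for any in-progress work first.
    void beginShutdown() {
        std::lock_guard<std::mutex> lk(_mu);
        _inShutdown = true;
    }

private:
    std::mutex _mu;
    bool _inShutdown = false;
};

// Hypothetical stand-in for the snapshot storage.
struct Snapshots {
    std::vector<int> taken;
};

// Takes one snapshot before shutdown and attempts one after it;
// returns how many snapshots were actually taken.
int demoGate() {
    ShutdownGate gate;
    Snapshots snaps;
    bool first = gate.runUnlessShutdown([&] { snaps.taken.push_back(1); });
    gate.beginShutdown();
    bool second = gate.runUnlessShutdown([&] { snaps.taken.push_back(2); });
    assert(first && !second);
    return static_cast<int>(snaps.taken.size());
}
```

This only helps if the code that tears down the globals calls beginShutdown() first. It does nothing for an unclean exit that bypasses the gate entirely.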

2)

Maybe when an immediate exit occurs during a shutdown, global objects required for the shutdown are destroyed while still in use?

SERVER-3869

3)

Here it looks like a global is freed or left in a bad state, and then another exit call attempts to free it again.

CS-501
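If a second exit call really is repeating cleanup that already ran, one possible mitigation is to make the cleanup path idempotent. The sketch below is an assumption-laden illustration, not the server's actual exit code: ShutdownOnce and demoShutdownOnce are hypothetical names. An atomic flag lets only the first caller perform the cleanup.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical guard that lets cleanup run at most once per process.
class ShutdownOnce {
public:
    // Returns true only for the caller that actually performed the cleanup.
    // Later callers return false immediately; note that they do NOT wait
    // for an in-progress cleanup to finish.
    template <typename Cleanup>
    bool run(Cleanup&& cleanup) {
        if (_started.exchange(true)) {
            return false;
        }
        cleanup();
        return true;
    }

private:
    std::atomic<bool> _started{false};
};

// Calls the guarded cleanup twice and returns how many times it ran.
int demoShutdownOnce() {
    ShutdownOnce once;
    int cleanups = 0;
    bool first = once.run([&] { ++cleanups; });
    bool second = once.run([&] { ++cleanups; });
    assert(first && !second);
    return cleanups;
}
```

Whether a losing caller should return, block, or terminate the process immediately is a design choice that depends on what the second exit path expects. The sketch simply returns.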

4)

Mystery failure after a clean shutdown

SERVER-414 (very old mongo version)

5)

On mongos these failures may potentially have occurred in similar situations:

SERVER-3082
SERVER-4367
CS-1903
SERVER-2930
FREE-3696
SERVER-4576

I would recommend that we do a closer examination of 2-5 above, do an audit for additional cases, and then fix all known cases.

