[SERVER-4607] cases of memory corruption in mongod and mongos on exit Created: 03/Jan/12 Updated: 06/Dec/22 Resolved: 15/Nov/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Internal Code |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Aaron Staple | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: |
Storage Execution
|
| Operating System: | ALL |
| Participants: |
| Description |
|
The analysis below is relatively speculative, and I haven't proven any of it via testing. Hopefully this ticket is at least useful as a survey of cases where we've seen memory corruption during shutdown, often as a cascade failure after an earlier failure triggered the shutdown itself.

It looks like memory corruption can be triggered after an unclean shutdown, and potentially after a clean shutdown too. From the examples I've seen, it looks like double frees are occurring because global objects with members that manage their own heap memory are destroyed as the process is exiting, and then the heap memory of these members is freed again as a result of actions taken by a thread that's still running.

Here are some historical jira cases where I've seen this occur:

1) In Top::cloneMap, called by SnapshotData::takeSnapshot. Potentially caused by the global statsSnapshots variable getting destroyed, and one of its SnapshotData objects also getting destroyed, but then having takeSnapshot called on it, which tries to reassign its _usage map. Potentially the _usage map was already destroyed and its heap memory freed, and the reassignment attempts to free that heap memory again. Stack traces that might be related to this are found in: FREE-3600. It looks like SnapshotThread::run checks inShutdown(), but if a shutdown occurs after the inShutdown() check and before or during the call to takeSnapshot(), the same double free might occur as in the unclean shutdown cases.
2) Maybe when an immediate exit occurs during a shutdown, global objects required for shutdown are destroyed while still in use?
3) Here it looks like a global is freed or left in a bad state, and then another exit call attempts to free it again. CS-501
4) Mystery failure after a clean shutdown
5) On mongos these failures may have occurred in similar situations:

I would recommend that we examine cases 2-5 above more closely, audit for additional cases, and then fix all known cases. |
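The check-then-act window described in case 1 (an inShutdown() check followed by a separate call to takeSnapshot()) can be illustrated with a minimal C++ sketch. This is not MongoDB code: SnapshotStore, beginShutdown(), and runShutdownDemo() are hypothetical names made up for illustration. The sketch assumes one possible mitigation, which is to perform the shutdown check and the map reassignment under the same mutex, and to join the background thread before the object it uses is destroyed.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for a global such as statsSnapshots (not the real type).
class SnapshotStore {
public:
    // The shutdown check and the reassignment happen under one lock, so a
    // shutdown cannot slip in between the check and the write to _usage.
    bool takeSnapshot(int tick) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_shuttingDown) {
            return false;  // refuse to touch members once shutdown has begun
        }
        std::map<std::string, int> fresh;
        fresh["tick"] = tick;
        _usage = std::move(fresh);  // the reassignment that is unsafe after destruction
        ++_taken;
        return true;
    }

    void beginShutdown() {
        std::lock_guard<std::mutex> lk(_mutex);
        _shuttingDown = true;
    }

    int taken() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _taken;
    }

private:
    mutable std::mutex _mutex;
    bool _shuttingDown = false;
    std::map<std::string, int> _usage;
    int _taken = 0;
};

// Starts a snapshot thread, shuts down in a safe order, and returns true if
// no snapshot was accepted after shutdown began.
bool runShutdownDemo() {
    SnapshotStore store;  // a local here; the ticket concerns globals

    std::thread worker([&store] {
        int tick = 0;
        // The loop ends as soon as takeSnapshot() observes the shutdown flag.
        while (store.takeSnapshot(tick++)) {
            std::this_thread::yield();
        }
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    store.beginShutdown();
    worker.join();  // the thread is finished before store can be destroyed
    assert(!worker.joinable());

    const int before = store.taken();
    const bool refused = !store.takeSnapshot(-1);
    return refused && store.taken() == before;
}
```

Calling runShutdownDemo() from a test or from main() should return true. The ordering is the point of the sketch: the flag is set first, the thread is joined second, and only then does the object go out of scope, so no running thread can reassign a member that has already been destroyed.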