cases of memory corruption in mongod and mongos on exit


    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Internal Code
    • Storage Execution
    • ALL

      Analysis below is relatively speculative, and I haven't proven any of it via testing. Hopefully this ticket is at least useful as a survey of cases where we've seen memory corruption during shutdown, often as a cascade failure after an earlier failure triggered the shutdown itself.

      It looks like memory corruption can be triggered after an unclean shutdown, and potentially after a clean shutdown too. From the examples I've seen, it looks like double frees occur because global objects whose members manage their own heap memory are destroyed as the process exits, and then the heap memory of those members is freed again as a result of actions taken by a thread that is still running.
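      The suspected mechanism can be sketched in a few lines. This is a minimal illustration, not code from the server: UsageStats and globalStats() are hypothetical names. The hazardous form is a plain global whose destructor runs during exit() while another thread may still touch it. One common way to sidestep that is to heap-allocate the object once and never delete it, so no destructor runs at exit:

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a global stats object that owns heap memory.
struct UsageStats {
    std::mutex mutex;
    std::map<std::string, long long> usage;

    void record(const std::string& ns, long long micros) {
        std::lock_guard<std::mutex> lk(mutex);
        usage[ns] += micros;
    }

    long long get(const std::string& ns) {
        std::lock_guard<std::mutex> lk(mutex);
        auto it = usage.find(ns);
        return it == usage.end() ? 0 : it->second;
    }
};

// Hazardous form (left as a comment on purpose):
//   UsageStats globalStats;
// Its destructor runs during exit(), freeing the map's nodes. A thread that
// is still running and later assigns to `usage` would touch freed memory.

// Leaked-singleton form: the object is allocated once and never deleted, so
// no destructor runs at exit and a late thread cannot observe a destroyed map.
UsageStats& globalStats() {
    static UsageStats* instance = new UsageStats();  // intentionally leaked
    return *instance;
}
```

      The trade-off is that the memory is reclaimed by the OS rather than by a destructor, which is usually acceptable for process-lifetime state but hides the object from leak checkers unless it is suppressed.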

      Here are some historical jira cases where I've seen this occur:

      1)

      Seen when Top::cloneMap is called by SnapshotData::takeSnapshot. Potentially caused by the global statsSnapshots variable being destroyed, along with one of its SnapshotData objects, after which takeSnapshot is called on that SnapshotData and tries to reassign its _usage map. If the _usage map was already destroyed and its heap memory freed, the reassignment attempts to free that memory again. Stack traces that might be related to this are found in:

      FREE-3600
      SERVER-2695
      SERVER-4190

      It looks like SnapshotThread::run checks inShutdown(), but if a shutdown occurs after the inShutdown() check and before or during the call to takeSnapshot(), the same double free might occur as in the unclean shutdown cases.
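      One way to close the window between the inShutdown() check and the takeSnapshot() call is to make the check and the work a single critical section, and to have shutdown join the thread before any globals are destroyed. The sketch below uses a hypothetical SnapshotWorker class; it is not the real SnapshotThread, and takeSnapshot() here is only a counter:

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the snapshot thread; not the real SnapshotThread.
class SnapshotWorker {
public:
    ~SnapshotWorker() { shutdown(); }

    void start() {
        _thread = std::thread([this] { run(); });
    }

    // Sets the flag under the same mutex the worker holds while taking a
    // snapshot, then waits for the worker to exit. After this returns, the
    // worker can no longer touch shared state.
    void shutdown() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _inShutdown = true;
        }
        if (_thread.joinable()) _thread.join();
    }

    int snapshotsTaken() const { return _snapshots.load(); }

private:
    void run() {
        for (;;) {
            {
                std::lock_guard<std::mutex> lk(_mutex);
                if (_inShutdown) return;  // check and use share one lock
                takeSnapshot();
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    void takeSnapshot() { ++_snapshots; }  // placeholder for the real work

    std::mutex _mutex;
    bool _inShutdown = false;
    std::atomic<int> _snapshots{0};
    std::thread _thread;
};
```

      This only helps if every exit path actually calls shutdown() before static destructors run; an immediate exit that skips it would leave the original race in place.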

      2)

      Maybe when an immediate exit occurs during a shutdown, global objects required for shutdown are destroyed while still in use?

      SERVER-3869

      3)

      Here it looks like a global is freed or left in a bad state, and then another exit call attempts to free it again.

      CS-501
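      If a second exit call really is re-running cleanup, a once-only guard around the shutdown path is one possible defence. The following is a sketch under that assumption; shutdownOnce(), runCleanupTasks(), and the counters are hypothetical names, not server code:

```cpp
#include <atomic>

// Hypothetical flag recording that some caller has already begun shutdown.
std::atomic<bool> shutdownStarted{false};

// Counts cleanup runs so the guard's effect is observable in this sketch.
int cleanupRuns = 0;

// Placeholder for the work that frees or tears down global state.
void runCleanupTasks() { ++cleanupRuns; }

// Returns true if this call performed the cleanup, false if an earlier call
// had already claimed it. exchange() is atomic, so only one caller wins even
// when several threads call this concurrently.
bool shutdownOnce() {
    if (shutdownStarted.exchange(true)) {
        return false;
    }
    runCleanupTasks();
    return true;
}
```

      A losing caller returns immediately here; whether it should instead block until the winner finishes cleanup depends on what the caller does next.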

      4)

      Mystery failure after a clean shutdown

      SERVER-414 (very old mongo version)

      5)

      On mongos, these failures may have occurred in similar situations:

      SERVER-3082
      SERVER-4367
      CS-1903
      SERVER-2930
      FREE-3696
      SERVER-4576

      I recommend that we examine cases 2-5 above more closely, audit for additional cases, and then fix all known cases.

            Assignee:
            [DO NOT USE] Backlog - Storage Execution Team
            Reporter:
            Aaron Staple (Inactive)