Core Server / SERVER-4607

cases of memory corruption in mongod and mongos on exit

Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • None
    • None
    • Internal Code
    • None
    • Storage Execution
    • ALL

    Description

      Analysis below is relatively speculative, and I haven't proven any of it via testing. Hopefully this ticket is at least useful as a survey of cases where we've seen memory corruption during shutdown, often as a cascade failure after an earlier failure triggered the shutdown itself.

      Memory corruption appears to be triggered after an unclean shutdown, and may occur after a clean shutdown too. In the examples I've seen, it looks like double frees occur because global objects with members that manage their own heap memory are destroyed as the process is exiting, and then the heap memory of these members is freed again as a result of actions taken by a thread that's still running.
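
To make the suspected sequence concrete, here is a minimal C++ model of it. This is a sketch only: ModelHeap, ModelGlobal, and the two functions are hypothetical names that do not exist in the server code, and exit-time destruction is modeled with an explicit call rather than a real destructor so that the example stays free of undefined behavior. The bookkeeping heap counts invalid frees instead of corrupting memory.

```cpp
#include <set>

// Hypothetical bookkeeping heap: records live blocks and counts invalid
// frees instead of corrupting memory.
struct ModelHeap {
    std::set<int> live;
    int nextId = 1;
    int doubleFrees = 0;

    int allocate() {
        live.insert(nextId);
        return nextId++;
    }

    void release(int id) {
        if (live.erase(id) == 0) {
            ++doubleFrees;  // the block was already freed
        }
    }
};

// Stand-in for a global object whose member manages its own heap memory.
struct ModelGlobal {
    ModelHeap& heap;
    int block;

    explicit ModelGlobal(ModelHeap& h) : heap(h), block(h.allocate()) {}

    // What destruction at process exit does: free the member's memory.
    // The stale handle is left behind in `block`.
    void destroyAtExit() { heap.release(block); }

    // What a still-running thread does: replace the member's contents,
    // which frees the old block before allocating a new one.
    void reassign() {
        heap.release(block);
        block = heap.allocate();
    }
};

// Exit-time destruction followed by a late reassignment from another thread.
int doubleFreesWhenThreadOutlivesGlobal() {
    ModelHeap heap;
    ModelGlobal g(heap);
    g.destroyAtExit();
    g.reassign();
    return heap.doubleFrees;
}

// The same operations in the safe order: the thread finishes first.
int doubleFreesWhenThreadStopsFirst() {
    ModelHeap heap;
    ModelGlobal g(heap);
    g.reassign();
    g.destroyAtExit();
    return heap.doubleFrees;
}
```

In this model the first ordering records one double free and the second records none, which matches the hypothesis that the corruption depends on a thread outliving the globals it uses.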

      Here are some historical jira cases where I've seen this occur:

      1)

      In Top::cloneMap, when called by SnapshotData::takeSnapshot. Potentially caused by the global statsSnapshots variable being destroyed, along with one of its SnapshotData objects, after which takeSnapshot is called on that object and tries to reassign its _usage map. Potentially the _usage map was already destroyed and its heap memory freed, but the reassignment attempts to free the heap memory again. Stack traces that might be related to this are found in:

      FREE-3600
      SERVER-2695
      SERVER-4190

      It looks like SnapshotThread::run checks inShutdown(), but if a shutdown occurs after the inShutdown() check and before or during the call to takeSnapshot(), the same double free might occur as in the unclean shutdown cases.
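
One possible way to close this check-then-act window is to perform the shutdown check and the snapshot under a single lock that the shutdown path also takes. The sketch below is an assumption for illustration, not the server's implementation: SnapshotGuard, takeSnapshotIfRunning, beginShutdown, and lateSnapshotIsRefused are hypothetical names, and a counter stands in for the real takeSnapshot() work.

```cpp
#include <mutex>

// Hypothetical guard that makes "check for shutdown, then snapshot" atomic
// with respect to the start of shutdown.
struct SnapshotGuard {
    std::mutex m;
    bool shuttingDown = false;
    int snapshotsTaken = 0;

    // Check and act under one lock, so shutdown cannot begin between the
    // check and the snapshot.
    bool takeSnapshotIfRunning() {
        std::lock_guard<std::mutex> lk(m);
        if (shuttingDown) {
            return false;
        }
        ++snapshotsTaken;  // stand-in for the real takeSnapshot() work
        return true;
    }

    // Called by the shutdown path before any global is torn down. It blocks
    // until an in-progress snapshot has finished.
    void beginShutdown() {
        std::lock_guard<std::mutex> lk(m);
        shuttingDown = true;
    }
};

// A snapshot attempted after shutdown has begun is refused.
bool lateSnapshotIsRefused() {
    SnapshotGuard g;
    bool first = g.takeSnapshotIfRunning();  // runs normally
    g.beginShutdown();
    bool late = g.takeSnapshotIfRunning();   // must not touch shared state
    return first && !late && g.snapshotsTaken == 1;
}
```

The trade-off is that shutdown waits for a snapshot that is already in progress. This only helps if the guard itself is still alive when the thread uses it, for example because the thread is joined before globals are destroyed.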

      2)

      Maybe when an immediate exit occurs during a shutdown, global objects required for shutdown are destroyed while still in use?

      SERVER-3869

      3)

      Here it looks like a global is freed or left in a bad state, and then another exit call attempts to free it again.

      CS-501
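
If a second exit call really is re-running cleanup, one common guard is to make the cleanup path idempotent. The following is a minimal sketch under that assumption. ExitCleanup, runOnce, and cleanupsAfterTwoExitCalls are hypothetical names, and the counter stands in for freeing the global's resources.

```cpp
#include <atomic>

// Hypothetical once-only cleanup guard for the exit path.
struct ExitCleanup {
    std::atomic<bool> started{false};
    int cleanupsRun = 0;

    // Returns true only for the caller that actually performed the cleanup.
    bool runOnce() {
        if (started.exchange(true)) {
            return false;  // an earlier exit call already claimed the cleanup
        }
        ++cleanupsRun;  // stand-in for freeing the global's resources
        return true;
    }
};

// Two exit calls result in a single cleanup.
int cleanupsAfterTwoExitCalls() {
    ExitCleanup c;
    c.runOnce();
    c.runOnce();
    return c.cleanupsRun;
}
```

The atomic exchange ensures that only one caller performs the cleanup, even if two threads reach the exit path at the same time. It does not make a second caller wait for the first to finish, so it addresses the repeated free but not use of a half-destroyed global.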

      4)

      Mystery failure after a clean shutdown

      SERVER-414 (very old mongo version)

      5)

      On mongos these failures may potentially have occurred in similar situations:

      SERVER-3082
      SERVER-4367
      CS-1903
      SERVER-2930
      FREE-3696
      SERVER-4576

      I would recommend that we do a closer examination of 2-5 above, do an audit for additional cases, and then fix all known cases.

          People

            backlog-server-execution Backlog - Storage Execution Team
            aaron Aaron Staple