  1. Core Server
  2. SERVER-56167

Guarantee hang analyzer collects core dumps for sharded clusters, at minimum

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Sprint:
      STM 2021-06-14
    • Story Points:
      2

      Description

      Attaching gdb and collecting diagnostics for all processes in a sharded cluster continues to time out after 15 minutes. BF-20581 is a recent example, where gdb attached to only 6 of the 9 mongod processes. As a result, server engineers may have to rely on luck, or on having multiple occurrences of the same failure, to work out the cause of a hang.

      We should consider either (1) reordering the steps in the hang analyzer so that a core dump is captured for every mongod process even if diagnostics against the live process cannot be collected, or (2) sending a SIGABRT to any process that gcore has not been run on before the 15 minutes expire.
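
      The two options could be combined roughly as in the sketch below. It is only an illustration under stated assumptions, not the hang analyzer's actual code: run_hang_analysis, gcore_dump, send_sigabrt, and the reserve_secs parameter are hypothetical names, gcore is assumed to be on PATH, and SIGABRT only produces a core file if core dumps are enabled on the host.

```python
import os
import signal
import subprocess
import time


def gcore_dump(pid):
    """Write a core file for a live process with gcore; True on success."""
    # Assumption: gcore (shipped with gdb) is on PATH. Output is core.<pid>.
    result = subprocess.run(["gcore", "-o", "core", str(pid)], check=False)
    return result.returncode == 0


def send_sigabrt(pid):
    """Abort the process; the kernel writes a core only if cores are enabled."""
    os.kill(pid, signal.SIGABRT)


def run_hang_analysis(pids, budget_secs, dump_core, collect_live, abort,
                      reserve_secs=60, clock=time.monotonic):
    """Hypothetical ordering: cores first, live diagnostics second, SIGABRT last.

    The last reserve_secs of the budget are kept free so the SIGABRT
    fallback still runs before the overall timeout expires.
    """
    deadline = clock() + budget_secs - reserve_secs
    dumped, diagnosed, aborted = [], [], []

    # Phase 1 (option 1): try to capture a core dump for every process.
    for pid in pids:
        if clock() >= deadline:
            break
        if dump_core(pid):
            dumped.append(pid)

    # Phase 2: spend whatever time is left on live-process diagnostics.
    for pid in pids:
        if clock() >= deadline:
            break
        if collect_live(pid):
            diagnosed.append(pid)

    # Phase 3 (option 2): abort any process that still has no core dump.
    for pid in pids:
        if pid not in dumped:
            abort(pid)
            aborted.append(pid)

    return {"core_dumped": dumped, "diagnosed": diagnosed, "aborted": aborted}
```

      A real invocation might look like run_hang_analysis(mongod_pids, 15 * 60, gcore_dump, collect_gdb_diagnostics, send_sigabrt), where collect_gdb_diagnostics stands in for whatever the hang analyzer runs against a live process. Note that the sketch checks the deadline only between steps, so a single slow gcore or gdb call could still overrun the budget unless each call is given its own timeout.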

            People

            Assignee:
            mikhail.shchatko Mikhail Shchatko
            Reporter:
            robert.guo Robert Guo
            Participants:
            Votes:
            2
            Watchers:
            5
