Core Server / SERVER-26923

OOM Killer Terminates All 3 Nodes in a Shard Using WiredTiger


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Duplicate
    • Affects Version/s: 3.0.11, 3.2.10
    • Fix Version/s: None
    • Component/s: Text Search
    • Labels: None
    • Operating System: ALL
    • Steps To Reproduce:

      1. Configure the shard's nodes to use the WiredTiger storage engine (an illustrative command line is shown after these steps).
      2. Wait an indefinite period of time while automated tests are running (6-12 hours).
      3. Identify the oom-killer and the resulting crash in the server's logs.

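      For reference, this is roughly how each data-bearing mongod is started and how we spot the oom-killer afterwards. The paths, replica set name, and cache size below are illustrative placeholders, not our exact production settings:

          # Illustrative only: start a shard member on WiredTiger with an explicit cache limit.
          # --wiredTigerCacheSizeGB is optional; we initially relied on the default sizing.
          mongod --shardsvr --replSet rs0 --dbpath /data/db \
                 --storageEngine wiredTiger --wiredTigerCacheSizeGB 4 \
                 --logpath /var/log/mongodb/mongod.log --fork

          # After a crash, the oom-killer leaves a trace in the kernel log:
          dmesg | grep -iE 'out of memory|oom-killer|killed process'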

      Description

      Hello,

      We recently upgraded our MongoDB deployment to 3.2.10. As part of this upgrade we intended to migrate the storage engine to WiredTiger but ran into stability issues: seemingly at random throughout the day, all data-bearing nodes would crash after being terminated by the oom-killer.

      There are many memory-leak issues with WiredTiger in JIRA, most of them already fixed. The one we hoped would help us was fixed in 3.2.10 (WT-2796), but we still ran into the same problem while running automated tests (not particularly stressful ones) against our cluster.
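      In case it helps triage, the WiredTiger cache usage can be compared against the resident set to tell whether the growth is inside or outside the cache. A quick check from a shell (the host name is a placeholder):

          # Illustrative: compare resident memory with WiredTiger's own cache accounting.
          mongo --host shard0-a.example.net --eval '
            var s = db.serverStatus();
            print("resident MB:    " + s.mem.resident);
            print("WT cache bytes: " + s.wiredTiger.cache["bytes currently in the cache"]);
            print("WT cache max:   " + s.wiredTiger.cache["maximum bytes configured"]);
          '

      If the resident set keeps climbing while the cache stays at or below its configured maximum, the growth is happening outside the cache.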

      We have a multi-environment deployment, and the problem presented itself in all lower environments, causing us to reverse the decision to migrate to WiredTiger until we find a way to stabilize it.

      We are using the 1.11 C# driver in our application. The higher environments are both sharded clusters with 5 shards and 3 data-bearing nodes in each shard's replica set. The config servers are not configured as a replica set and will not be migrated to WiredTiger at this point. Our application is hosted in AWS, and the number of servers running mongos.exe locally scales up and down automatically according to the load on our service queues.
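      For completeness, each application server runs a local mongos pointed at the three mirrored (non-replica-set) config servers, roughly like this (host names and ports are placeholders):

          # Illustrative: mongos using three mirrored config servers (not a config server replica set).
          mongos --configdb cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019 --port 27017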

      Please let me know if I can provide any further information. This bug is marked as critical because it involves a severe memory leak, per the table of priorities.

      Thanks,
      Shy


    People

    • Votes: 0
    • Watchers: 5
