Primary crash after N hours of running as primary


    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.0.6
    • Component/s: JavaScript
    • Operating System: ALL

I manage a sharded cluster for my company. The cluster is offered to clients as a free tier: they provision a database and can use it (with some limitations) from their applications.

I moved from 2.6 to 3.0.6 a week ago (on Thursday 2015-09-24), and ever since I have seen this strange behavior: after being elected primary, a node lasts a few hours (between 2 and 5) and then crashes.
      The crash is a segmentation fault.

We have systemd restarting the node automatically; in the meantime, a new node is elected primary, runs for a few hours, and then crashes, another one is elected primary, and so on.
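
      For reference, a minimal sketch of the systemd unit we rely on for the automatic restart (the unit name, user, and config path are specific to our setup, not MongoDB defaults):

        [Unit]
        Description=MongoDB data node
        After=network.target

        [Service]
        User=mongodb
        ExecStart=/usr/bin/mongod --config /etc/mongod.conf
        Restart=always
        RestartSec=5s

        [Install]
        WantedBy=multi-user.target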

The cluster is composed of 3 config servers, 3 mongos, and 5 mongod, all in a single replica set serving a single shard.
      The 5 mongod are 2 arbiters and 3 data nodes.
      Of the 3 data nodes, 1 runs MMAPv1 and 2 run WiredTiger.
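
      For reference, a rough sketch of how I check the topology from the mongo shell (run against the current primary; the comments reflect what my cluster reports):

        // Print each replica set member and its current role.
        rs.status().members.forEach(function (m) {
            print(m.name + " : " + m.stateStr);  // PRIMARY, SECONDARY, or ARBITER
        });

        // On each data-bearing node, show which storage engine it runs.
        print(db.serverStatus().storageEngine.name);  // "mmapv1" on 1 node, "wiredTiger" on 2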

All 3 data nodes crash a few hours after being elected primary.

I attached the log of a primary; it starts 30 seconds before the segfault happens.

      /sys/kernel/mm/transparent_hugepage/defrag does not exist on 2 of the 3 servers, and I set it to "never" on the third one.
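
      For completeness, this is roughly how I checked and set that flag (run as root; on some RHEL-based kernels the path is /sys/kernel/mm/redhat_transparent_hugepage instead, which may be why the file is missing on 2 of the servers):

        # Show the current defrag setting; brackets mark the active value.
        cat /sys/kernel/mm/transparent_hugepage/defrag

        # Disable defrag until the next reboot.
        echo never > /sys/kernel/mm/transparent_hugepage/defrag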

        1. primary-crash.log
          61 kB
        2. mongod-ldd.txt
          2 kB
        3. mapReduce-crash.log
          21 kB
        4. crash.log
          31 kB
        5. full-crash.log
          1.63 MB
        6. mongodb-build-server.log
          2.24 MB

            Assignee:
            Unassigned
            Reporter:
            Julien Durillon
            Votes:
            0
            Watchers:
            4

              Created:
              Updated:
              Resolved: