  1. Core Server
  2. SERVER-15005

Connections Spike on Secondary, Load Jumps, Server Becomes Unresponsive

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Incomplete
    • Affects Version/s: 2.6.3, 2.6.4
    • Fix Version/s: None
    • Component/s: Indexing
    • Labels:
      None
    • Environment:
      Amazon Linux AMI release 2014.03
      r3.2xlarge instances
    • Operating System:
      Linux

      Description

      We have a 3-shard cluster, with each shard consisting of a primary, a secondary, and a hidden secondary (for EBS snapshots). All of the nodes are identical.

      We saw the issue described below once every 2-3 weeks on 2.6.3. After upgrading to 2.6.4, we saw it at least hourly, sometimes as frequently as every 5-10 minutes. When it occurs, our production system goes down.

      The symptom is a sudden spike in the number of connections to the visible secondary on our first shard. We haven't seen it occur on the primary, nor on any of the other shards.

      The connections all seem to deadlock: I/O on the machine drops dramatically when this occurs. I've attached a screenshot of the machine's New Relic report that shows this, with user CPU spiking while disk I/O goes to 0. I've also attached the MMS reports, which show the connections spiking while the number of operations falls dramatically.
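
      For anyone trying to pin down the moment the spike begins, here is a minimal sketch of one way to flag it from periodic connection counts. It is not part of the original report. The function name find_spike, the 3x threshold, and the 5-sample window are all assumptions chosen for illustration; the samples could come, for example, from polling serverStatus and reading connections.current.

```python
from typing import List, Optional, Tuple

# (unix timestamp, open connection count); how the samples are collected is
# left open here. One option is polling serverStatus for connections.current.
Sample = Tuple[float, int]


def find_spike(samples: List[Sample],
               factor: float = 3.0,
               window: int = 5) -> Optional[float]:
    """Return the timestamp of the first sample whose connection count exceeds
    `factor` times the mean of the preceding `window` samples, or None.

    The threshold and window are illustrative defaults, not tuned values.
    """
    for i in range(window, len(samples)):
        # Baseline is the average of the `window` samples just before this one.
        baseline = sum(count for _, count in samples[i - window:i]) / window
        ts, count = samples[i]
        if baseline > 0 and count > factor * baseline:
            return ts
    return None
```

      The returned timestamp can then be lined up against the mongod log to see what was happening just before the connections piled up.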

      Interestingly, the lock spikes during this time as well, and all of it comes from the local database.

      There is not much of interest in the logs, and certainly no smoking gun. I've attached the log; the spike in connections appears to occur at 2014-08-22T18:38:10.730+0000.

      Finally, restarting the affected mongod immediately resolves the issue.

        Attachments

        1. mms.pdf
          236 kB
        2. mongo-log.log.gz
          318 kB
        3. Screen Shot 2014-08-22 at 11.59.52 AM.png
          96 kB
        4. slow_queries.png
          144 kB

              People

              • Votes:
                0
                Watchers:
                5
