High CPU caused by dead/hanging connections

XMLWordPrintableJSON

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major - P3
    • None
    • Affects Version/s: 3.2.5
    • Component/s: None
    • None
    • ALL
    • Hide

      none that we're aware of...

      Show
      none that we're aware of...
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      We use mongo as recovery storage where data is inserted and deleted millions of times but read only very rarely. Data is first buffered in the application and then transfered in chunks to the server (via C# Drivers DeleteManyAsync/InsertManyAsync).
      Every few months however the mongod process starts using up most of the CPU and becomes very unresponsive. It can happen that we're still able to run queries from the mongo shell but patience is needed - 3 out of 4 connection request via mongo / mongostat / mongotop time out.

      By this time we usually need to shutdown our application and close all pending operations - but even with closed application mongod does not recover and still use 6 out of 12 CPU sockets.
      When we're able to connect and run db.currentOp we can still see prior insert/delete operations running (see below)

      Is there anything we need to update in the configuration, is there anything we can do to prevent this problem or is this a bug in mongod?

      VERSION:
      3.2.5 on Windows

      CONFIGURATION:

      storage:
          dbPath: D:\company\DataFiles\MongoDB\v3\Server1
          wiredTiger.engineConfig.cacheSizeGB: 1
      
      systemLog:
          destination: file
          path: D:\company\LogFiles\MongoDB\v3\Server1\mongod_Server1.log
      
      net.port: 27017
      
      security.authorization: enabled
      

      DUMPFILE:
      A memory dump of the process was created as well. I can provide you call stacks and thread usages but I have to clear with management if I can provide you the dump itself.

      LOGFILE (after db.setLogLevel(5))

      2017-03-23T00:49:16.415+0100 D COMMAND  [conn6648] run command admin.$cmd { isMaster: 1 }
      2017-03-23T00:49:16.415+0100 I COMMAND  [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:16.415+0100 D COMMAND  [conn6648] run command admin.$cmd { buildInfo: 1 }
      2017-03-23T00:49:16.415+0100 I COMMAND  [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:21.094+0100 D COMMAND  [conn6973] run command admin.$cmd { ping: 1 }
      2017-03-23T00:49:21.094+0100 I COMMAND  [conn6973] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:37 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:24.943+0100 D COMMAND  [PeriodicTaskRunner] task: DBConnectionPool-cleaner took: 0ms
      2017-03-23T00:49:24.943+0100 D -        [PeriodicTaskRunner] cleaning up unused lock buckets of the global lock manager
      2017-03-23T00:49:24.943+0100 D COMMAND  [PeriodicTaskRunner] task: UnusedLockCleaner took: 0ms
      2017-03-23T00:49:24.943+0100 D COMMAND  [PeriodicTaskRunner] task: DBConnectionPool-cleaner took: 0ms
      2017-03-23T00:49:25.273+0100 D NETWORK  [HostnameCanonicalizationWorker] Hostname Canonicalizer is acquiring host FQDNs
      2017-03-23T00:49:25.273+0100 D NETWORK  [HostnameCanonicalizationWorker] Hostname Canonicalizer acquired FQDNs
      2017-03-23T00:49:26.420+0100 D COMMAND  [conn6648] run command admin.$cmd { isMaster: 1 }
      2017-03-23T00:49:26.420+0100 I COMMAND  [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:26.421+0100 D COMMAND  [conn6648] run command admin.$cmd { buildInfo: 1 }
      2017-03-23T00:49:26.421+0100 I COMMAND  [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:28.462+0100 D COMMAND  [conn6973] run command admin.$cmd { ismaster: 1 }
      2017-03-23T00:49:28.462+0100 I COMMAND  [conn6973] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:36.095+0100 D COMMAND  [conn6973] run command admin.$cmd { ping: 1 }
      2017-03-23T00:49:36.095+0100 I COMMAND  [conn6973] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:37 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:36.416+0100 D COMMAND  [conn6648] run command admin.$cmd { isMaster: 1 }
      2017-03-23T00:49:36.416+0100 I COMMAND  [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:36.416+0100 D COMMAND  [conn6648] run command admin.$cmd { buildInfo: 1 }
      2017-03-23T00:49:36.417+0100 I COMMAND  [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:46.421+0100 D COMMAND  [conn6648] run command admin.$cmd { isMaster: 1 }
      2017-03-23T00:49:46.421+0100 I COMMAND  [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:46.422+0100 D COMMAND  [conn6648] run command admin.$cmd { buildInfo: 1 }
      2017-03-23T00:49:46.422+0100 I COMMAND  [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:51.095+0100 D COMMAND  [conn6973] run command admin.$cmd { ping: 1 }
      2017-03-23T00:49:51.095+0100 I COMMAND  [conn6973] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:37 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:56.416+0100 D COMMAND  [conn6648] run command admin.$cmd { isMaster: 1 }
      2017-03-23T00:49:56.416+0100 I COMMAND  [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:56.417+0100 D COMMAND  [conn6648] run command admin.$cmd { buildInfo: 1 }
      2017-03-23T00:49:56.417+0100 I COMMAND  [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms
      2017-03-23T00:49:58.965+0100 D COMMAND  [conn6973] run command admin.$cmd { ismaster: 1 }
      2017-03-23T00:49:58.966+0100 I COMMAND  [conn6973] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
      2017-03-23T00:50:06.096+0100 D COMMAND  [conn6973] run command admin.$cmd { ping: 1 }
      

      MONGOTOP

                                       ns    total    read    write    2017-03-22T23:45:20+01:00
        Recovery.BookmarkRegistrationData      0ms     0ms      0ms
              Recovery.BookmarkStatusData      0ms     0ms      0ms
                      Recovery.IpcMessage      0ms     0ms      0ms
          Recovery.WorkflowDefinitionData      0ms     0ms      0ms
      Recovery.WorkflowInstanceHeaderData      0ms     0ms      0ms
       Recovery.WorkflowInstanceValueData      0ms     0ms      0ms
              Recovery.XmlAppearedContext      0ms     0ms      0ms
                    Recovery.XmlBatchInfo      0ms     0ms      0ms
          Recovery.XmlMaterialFlowContext      0ms     0ms      0ms
                  Recovery.system.profile      0ms     0ms      0ms
      

      MONGOSTAT

      insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn                      time
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:37+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:38+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:39+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:40+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:41+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:42+01:00
          *0    *0     *0     *0       0     3|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  196b    19k   21 2017-03-22T22:52:43+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:44+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:45+01:00
          *0    *0     *0     *0       0     3|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  196b    19k   21 2017-03-22T22:52:46+01:00
      insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn                      time
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:47+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:48+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:49+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:50+01:00
          *0    *0     *0     *0       0     6|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  635b    19k   21 2017-03-22T22:52:51+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:52+01:00
          *0    *0     *0     *0       0     3|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  196b    19k   21 2017-03-22T22:52:53+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:54+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:55+01:00
          *0    *0     *0     *0       0     3|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  196b    19k   21 2017-03-22T22:52:56+01:00
      insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn                      time
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:57+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:58+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:59+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:53:00+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:53:01+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:53:02+01:00
          *0    *0     *0     *0       0     3|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  196b    19k   21 2017-03-22T22:53:03+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:53:04+01:00
          *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:53:05+01:00
          *0    *0     *0     *0       0     8|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  752b    20k   21 2017-03-22T22:53:06+01:00
      

      db.currentOp() (at this time all applications are closed)

      {
              "inprog" : [
                      {
                              "desc" : "conn6975",
                              "threadId" : "24484",
                              "connectionId" : 6975,
                              "client" : "127.0.0.1:56704",
                              "active" : true,
                              "opid" : 17839001,
                              "secs_running" : 0,
                              "microsecs_running" : NumberLong(47),
                              "op" : "command",
                              "ns" : "admin.$cmd",
                              "query" : {
                                      "currentOp" : 1
                              },
                              "numYields" : 0,
                              "locks" : {
      
                              },
                              "waitingForLock" : false,
                              "lockStats" : {
      
                              }
                      },
                      {
                              "desc" : "conn6904",
                              "threadId" : "2360",
                              "connectionId" : 6904,
                              "client" : "12.34.56.78:60398",
                              "active" : true,
                              "opid" : 17756520,
                              "secs_running" : 106674,
                              "microsecs_running" : NumberLong("106674215199"),
                              "op" : "insert",
                              "ns" : "Recovery.BookmarkStatusData",
                              "query" : {
                                      "InstanceId" : BinData(3,"7Hw4VhndWU6ItN4YHJpHGA=="),
                                      "BookmarkName" : "StartOnHostMessage_b43fbd1b-e24f-4fb1-8477-e12ee16adcdd",
                                      "BookmarkStatus" : 2,
                                      "_id" : "56387cec-dd19-4e59-88b4-de181c9a4718StartOnHostMessage_b43fbd1b-e24f-4fb1-8477-e12ee16adcdd"
                              },
                              "numYields" : 0,
                              "locks" : {
                                      "Global" : "w",
                                      "Database" : "w",
                                      "Collection" : "w"
                              },
                              "waitingForLock" : false,
                              "lockStats" : {
                                      "Global" : {
                                              "acquireCount" : {
                                                      "r" : NumberLong(1),
                                                      "w" : NumberLong(1)
                                              }
                                      },
                                      "Database" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      },
                                      "Collection" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      }
                              }
                      },
                      {
                              "desc" : "conn6893",
                              "threadId" : "23732",
                              "connectionId" : 6893,
                              "client" : "12.34.56.78:57772",
                              "active" : true,
                              "opid" : 17756481,
                              "secs_running" : 106683,
                              "microsecs_running" : NumberLong("106683958128"),
                              "op" : "remove",
                              "ns" : "Recovery.WorkflowInstanceHeaderData",
                              "query" : {
                                      "$msg" : "query not recording (too large)"
                              },
                              "numYields" : 0,
                              "locks" : {
                                      "Global" : "w",
                                      "Database" : "w",
                                      "Collection" : "w"
                              },
                              "waitingForLock" : false,
                              "lockStats" : {
                                      "Global" : {
                                              "acquireCount" : {
                                                      "r" : NumberLong(1),
                                                      "w" : NumberLong(1)
                                              }
                                      },
                                      "Database" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      },
                                      "Collection" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      }
                              }
                      },
                      {
                              "desc" : "conn6899",
                              "threadId" : "21736",
                              "connectionId" : 6899,
                              "client" : "12.34.56.78:59030",
                              "active" : true,
                              "opid" : 17756478,
                              "secs_running" : 106684,
                              "microsecs_running" : NumberLong("106684033208"),
                              "op" : "remove",
                              "ns" : "Recovery.XmlMaterialFlowContext",
                              "query" : {
                                      "$msg" : "query not recording (too large)"
                              },
                              "numYields" : 0,
                              "locks" : {
                                      "Global" : "w",
                                      "Database" : "w",
                                      "Collection" : "w"
                              },
                              "waitingForLock" : false,
                              "lockStats" : {
                                      "Global" : {
                                              "acquireCount" : {
                                                      "r" : NumberLong(1),
                                                      "w" : NumberLong(1)
                                              }
                                      },
                                      "Database" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      },
                                      "Collection" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      }
                              }
                      },
                      {
                              "desc" : "conn6900",
                              "threadId" : "26656",
                              "connectionId" : 6900,
                              "client" : "12.34.56.78:59368",
                              "active" : true,
                              "opid" : 17756502,
                              "secs_running" : 106680,
                              "microsecs_running" : NumberLong("106680377639"),
                              "op" : "insert",
                              "ns" : "Recovery.BookmarkRegistrationData",
                              "query" : {
                                      "$msg" : "query not recording (too large)"
                              },
                              "numYields" : 0,
                              "locks" : {
                                      "Global" : "w",
                                      "Database" : "w",
                                      "Collection" : "w"
                              },
                              "waitingForLock" : false,
                              "lockStats" : {
                                      "Global" : {
                                              "acquireCount" : {
                                                      "r" : NumberLong(1),
                                                      "w" : NumberLong(1)
                                              }
                                      },
                                      "Database" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      },
                                      "Collection" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      }
                              }
                      },
                      {
                              "desc" : "conn6909",
                              "threadId" : "316",
                              "connectionId" : 6909,
                              "client" : "12.34.56.78:60817",
                              "active" : true,
                              "opid" : 17756477,
                              "secs_running" : 106684,
                              "microsecs_running" : NumberLong("106684277826"),
                              "op" : "insert",
                              "ns" : "Recovery.WorkflowInstanceValueData",
                              "query" : {
                                      "$msg" : "query not recording (too large)"
                              },
                              "numYields" : 0,
                              "locks" : {
                                      "Global" : "w",
                                      "Database" : "w",
                                      "Collection" : "w"
                              },
                              "waitingForLock" : false,
                              "lockStats" : {
                                      "Global" : {
                                              "acquireCount" : {
                                                      "r" : NumberLong(3),
                                                      "w" : NumberLong(3)
                                              }
                                      },
                                      "Database" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(3)
                                              }
                                      },
                                      "Collection" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(3)
                                              }
                                      }
                              }
                      }
              ],
              "ok" : 1
      }
      

            Assignee:
            Mark Agarunov (Inactive)
            Reporter:
            Klaus Linzner
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated:
              Resolved: