[SERVER-28441] High CPU caused by dead/hanging connections Created: 23/Mar/17  Updated: 21/Jun/17  Resolved: 22/May/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.2.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Klaus Linzner Assignee: Mark Agarunov
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

none that we're aware of...

Participants:

 Description   

We use mongo as recovery storage where data is inserted and deleted millions of times but read only very rarely. Data is first buffered in the application and then transfered in chunks to the server (via C# Drivers DeleteManyAsync/InsertManyAsync).
Every few months however the mongod process starts using up most of the CPU and becomes very unresponsive. It can happen that we're still able to run queries from the mongo shell but patience is needed - 3 out of 4 connection request via mongo / mongostat / mongotop time out.

By this time we usually need to shutdown our application and close all pending operations - but even with closed application mongod does not recover and still use 6 out of 12 CPU sockets.
When we're able to connect and run db.currentOp we can still see prior insert/delete operations running (see below)

Is there anything we need to update in the configuration, is there anything we can do to prevent this problem or is this a bug in mongod?

VERSION:
3.2.5 on Windows

CONFIGURATION:

storage:
    dbPath: D:\company\DataFiles\MongoDB\v3\Server1
    wiredTiger.engineConfig.cacheSizeGB: 1
 
systemLog:
    destination: file
    path: D:\company\LogFiles\MongoDB\v3\Server1\mongod_Server1.log
 
net.port: 27017
 
security.authorization: enabled

DUMPFILE:
A memory dump of the process was created as well. I can provide you call stacks and thread usages but I have to clear with management if I can provide you the dump itself.

LOGFILE (after db.setLogLevel(5))

2017-03-23T00:49:16.415+0100 D COMMAND  [conn6648] run command admin.$cmd { isMaster: 1 }
2017-03-23T00:49:16.415+0100 I COMMAND  [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
2017-03-23T00:49:16.415+0100 D COMMAND  [conn6648] run command admin.$cmd { buildInfo: 1 }
2017-03-23T00:49:16.415+0100 I COMMAND  [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms
2017-03-23T00:49:21.094+0100 D COMMAND  [conn6973] run command admin.$cmd { ping: 1 }
2017-03-23T00:49:21.094+0100 I COMMAND  [conn6973] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:37 locks:{} protocol:op_query 0ms
2017-03-23T00:49:24.943+0100 D COMMAND  [PeriodicTaskRunner] task: DBConnectionPool-cleaner took: 0ms
2017-03-23T00:49:24.943+0100 D -        [PeriodicTaskRunner] cleaning up unused lock buckets of the global lock manager
2017-03-23T00:49:24.943+0100 D COMMAND  [PeriodicTaskRunner] task: UnusedLockCleaner took: 0ms
2017-03-23T00:49:24.943+0100 D COMMAND  [PeriodicTaskRunner] task: DBConnectionPool-cleaner took: 0ms
2017-03-23T00:49:25.273+0100 D NETWORK  [HostnameCanonicalizationWorker] Hostname Canonicalizer is acquiring host FQDNs
2017-03-23T00:49:25.273+0100 D NETWORK  [HostnameCanonicalizationWorker] Hostname Canonicalizer acquired FQDNs
2017-03-23T00:49:26.420+0100 D COMMAND  [conn6648] run command admin.$cmd { isMaster: 1 }
2017-03-23T00:49:26.420+0100 I COMMAND  [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
2017-03-23T00:49:26.421+0100 D COMMAND  [conn6648] run command admin.$cmd { buildInfo: 1 }
2017-03-23T00:49:26.421+0100 I COMMAND  [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms
2017-03-23T00:49:28.462+0100 D COMMAND  [conn6973] run command admin.$cmd { ismaster: 1 }
2017-03-23T00:49:28.462+0100 I COMMAND  [conn6973] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
2017-03-23T00:49:36.095+0100 D COMMAND  [conn6973] run command admin.$cmd { ping: 1 }
2017-03-23T00:49:36.095+0100 I COMMAND  [conn6973] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:37 locks:{} protocol:op_query 0ms
2017-03-23T00:49:36.416+0100 D COMMAND  [conn6648] run command admin.$cmd { isMaster: 1 }
2017-03-23T00:49:36.416+0100 I COMMAND  [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
2017-03-23T00:49:36.416+0100 D COMMAND  [conn6648] run command admin.$cmd { buildInfo: 1 }
2017-03-23T00:49:36.417+0100 I COMMAND  [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms
2017-03-23T00:49:46.421+0100 D COMMAND  [conn6648] run command admin.$cmd { isMaster: 1 }
2017-03-23T00:49:46.421+0100 I COMMAND  [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
2017-03-23T00:49:46.422+0100 D COMMAND  [conn6648] run command admin.$cmd { buildInfo: 1 }
2017-03-23T00:49:46.422+0100 I COMMAND  [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms
2017-03-23T00:49:51.095+0100 D COMMAND  [conn6973] run command admin.$cmd { ping: 1 }
2017-03-23T00:49:51.095+0100 I COMMAND  [conn6973] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:37 locks:{} protocol:op_query 0ms
2017-03-23T00:49:56.416+0100 D COMMAND  [conn6648] run command admin.$cmd { isMaster: 1 }
2017-03-23T00:49:56.416+0100 I COMMAND  [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
2017-03-23T00:49:56.417+0100 D COMMAND  [conn6648] run command admin.$cmd { buildInfo: 1 }
2017-03-23T00:49:56.417+0100 I COMMAND  [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms
2017-03-23T00:49:58.965+0100 D COMMAND  [conn6973] run command admin.$cmd { ismaster: 1 }
2017-03-23T00:49:58.966+0100 I COMMAND  [conn6973] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
2017-03-23T00:50:06.096+0100 D COMMAND  [conn6973] run command admin.$cmd { ping: 1 }

MONGOTOP

                                 ns    total    read    write    2017-03-22T23:45:20+01:00
  Recovery.BookmarkRegistrationData      0ms     0ms      0ms
        Recovery.BookmarkStatusData      0ms     0ms      0ms
                Recovery.IpcMessage      0ms     0ms      0ms
    Recovery.WorkflowDefinitionData      0ms     0ms      0ms
Recovery.WorkflowInstanceHeaderData      0ms     0ms      0ms
 Recovery.WorkflowInstanceValueData      0ms     0ms      0ms
        Recovery.XmlAppearedContext      0ms     0ms      0ms
              Recovery.XmlBatchInfo      0ms     0ms      0ms
    Recovery.XmlMaterialFlowContext      0ms     0ms      0ms
            Recovery.system.profile      0ms     0ms      0ms

MONGOSTAT

insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn                      time
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:37+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:38+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:39+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:40+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:41+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:42+01:00
    *0    *0     *0     *0       0     3|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  196b    19k   21 2017-03-22T22:52:43+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:44+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:45+01:00
    *0    *0     *0     *0       0     3|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  196b    19k   21 2017-03-22T22:52:46+01:00
insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn                      time
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:47+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:48+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:49+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:50+01:00
    *0    *0     *0     *0       0     6|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  635b    19k   21 2017-03-22T22:52:51+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:52+01:00
    *0    *0     *0     *0       0     3|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  196b    19k   21 2017-03-22T22:52:53+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:54+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:55+01:00
    *0    *0     *0     *0       0     3|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  196b    19k   21 2017-03-22T22:52:56+01:00
insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn                      time
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:57+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:58+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:52:59+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:53:00+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:53:01+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:53:02+01:00
    *0    *0     *0     *0       0     3|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  196b    19k   21 2017-03-22T22:53:03+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:53:04+01:00
    *0    *0     *0     *0       0     1|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5   79b    18k   21 2017-03-22T22:53:05+01:00
    *0    *0     *0     *0       0     8|0     1.1    9.6       0  2.8G 2.8G   0|0   0|5  752b    20k   21 2017-03-22T22:53:06+01:00

db.currentOp() (at this time all applications are closed)

{
        "inprog" : [
                {
                        "desc" : "conn6975",
                        "threadId" : "24484",
                        "connectionId" : 6975,
                        "client" : "127.0.0.1:56704",
                        "active" : true,
                        "opid" : 17839001,
                        "secs_running" : 0,
                        "microsecs_running" : NumberLong(47),
                        "op" : "command",
                        "ns" : "admin.$cmd",
                        "query" : {
                                "currentOp" : 1
                        },
                        "numYields" : 0,
                        "locks" : {
 
                        },
                        "waitingForLock" : false,
                        "lockStats" : {
 
                        }
                },
                {
                        "desc" : "conn6904",
                        "threadId" : "2360",
                        "connectionId" : 6904,
                        "client" : "12.34.56.78:60398",
                        "active" : true,
                        "opid" : 17756520,
                        "secs_running" : 106674,
                        "microsecs_running" : NumberLong("106674215199"),
                        "op" : "insert",
                        "ns" : "Recovery.BookmarkStatusData",
                        "query" : {
                                "InstanceId" : BinData(3,"7Hw4VhndWU6ItN4YHJpHGA=="),
                                "BookmarkName" : "StartOnHostMessage_b43fbd1b-e24f-4fb1-8477-e12ee16adcdd",
                                "BookmarkStatus" : 2,
                                "_id" : "56387cec-dd19-4e59-88b4-de181c9a4718StartOnHostMessage_b43fbd1b-e24f-4fb1-8477-e12ee16adcdd"
                        },
                        "numYields" : 0,
                        "locks" : {
                                "Global" : "w",
                                "Database" : "w",
                                "Collection" : "w"
                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1),
                                                "w" : NumberLong(1)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                },
                                "Collection" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                }
                        }
                },
                {
                        "desc" : "conn6893",
                        "threadId" : "23732",
                        "connectionId" : 6893,
                        "client" : "12.34.56.78:57772",
                        "active" : true,
                        "opid" : 17756481,
                        "secs_running" : 106683,
                        "microsecs_running" : NumberLong("106683958128"),
                        "op" : "remove",
                        "ns" : "Recovery.WorkflowInstanceHeaderData",
                        "query" : {
                                "$msg" : "query not recording (too large)"
                        },
                        "numYields" : 0,
                        "locks" : {
                                "Global" : "w",
                                "Database" : "w",
                                "Collection" : "w"
                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1),
                                                "w" : NumberLong(1)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                },
                                "Collection" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                }
                        }
                },
                {
                        "desc" : "conn6899",
                        "threadId" : "21736",
                        "connectionId" : 6899,
                        "client" : "12.34.56.78:59030",
                        "active" : true,
                        "opid" : 17756478,
                        "secs_running" : 106684,
                        "microsecs_running" : NumberLong("106684033208"),
                        "op" : "remove",
                        "ns" : "Recovery.XmlMaterialFlowContext",
                        "query" : {
                                "$msg" : "query not recording (too large)"
                        },
                        "numYields" : 0,
                        "locks" : {
                                "Global" : "w",
                                "Database" : "w",
                                "Collection" : "w"
                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1),
                                                "w" : NumberLong(1)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                },
                                "Collection" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                }
                        }
                },
                {
                        "desc" : "conn6900",
                        "threadId" : "26656",
                        "connectionId" : 6900,
                        "client" : "12.34.56.78:59368",
                        "active" : true,
                        "opid" : 17756502,
                        "secs_running" : 106680,
                        "microsecs_running" : NumberLong("106680377639"),
                        "op" : "insert",
                        "ns" : "Recovery.BookmarkRegistrationData",
                        "query" : {
                                "$msg" : "query not recording (too large)"
                        },
                        "numYields" : 0,
                        "locks" : {
                                "Global" : "w",
                                "Database" : "w",
                                "Collection" : "w"
                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1),
                                                "w" : NumberLong(1)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                },
                                "Collection" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                }
                        }
                },
                {
                        "desc" : "conn6909",
                        "threadId" : "316",
                        "connectionId" : 6909,
                        "client" : "12.34.56.78:60817",
                        "active" : true,
                        "opid" : 17756477,
                        "secs_running" : 106684,
                        "microsecs_running" : NumberLong("106684277826"),
                        "op" : "insert",
                        "ns" : "Recovery.WorkflowInstanceValueData",
                        "query" : {
                                "$msg" : "query not recording (too large)"
                        },
                        "numYields" : 0,
                        "locks" : {
                                "Global" : "w",
                                "Database" : "w",
                                "Collection" : "w"
                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(3),
                                                "w" : NumberLong(3)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(3)
                                        }
                                },
                                "Collection" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(3)
                                        }
                                }
                        }
                }
        ],
        "ok" : 1
}



 Comments   
Comment by Mark Agarunov [ 22/May/17 ]

Hello linky,

Thank you for providing this data. Unfortunately, I am unable to find anything that would explain the behavior you're describing or reproduce the issue. My suggestion would be to upgrade to a newer version of MongoDB as we collect far more detailed diagnostic data in more recent versions as well as many fixes that may be relevant to the issue you're seeing. If the issue is still present with an up to date version of MongoDB, please let us know and we will further investigate the problem.

Thanks,
Mark

Comment by Klaus Linzner [ 27/Mar/17 ]

Sorry - that wasn't the concern...
I compressed the full directory and uploaded it. please let me know if there's anything else I can do.

BR, Klaus

Comment by Mark Agarunov [ 27/Mar/17 ]

Hello linky,

Thank you for providing these files. Unfortunately the diagnostic data you've provided appears to be corrupt. Please archive and upload the entire diagnostic.data folder if possible. The diagnostic data does not contain any user information if that is a concern. The server collects the output of the following commands at a regular interval, which you can run yourself and examine the output of:

serverStatus: db.serverStatus({tcmalloc: true})
replSetGetStatus: rs.status()
collStats for local.oplog.rs: db.getSiblingDB('local').oplog.rs.stats()
getCmdLineOpts: db.adminCommand({getCmdLineOpts: true})
buildInfo: db.adminCommand({buildInfo: true})
hostInfo: db.adminCommand({hostInfo: true})

Thanks,
Mark

Comment by Klaus Linzner [ 27/Mar/17 ]

Hi Mark,
Sorry I missed formatting...
I've uploaded diagnostics and logs for the timeframe.

Thanks a lot, Klaus

Comment by Mark Agarunov [ 23/Mar/17 ]

Hello linky,

Thank you for the report. Looking over the output you've provided, this looks like it may be related to cursors staying open. To better diagnose this issue, please archive and upload your $dbpath/diagnostic.data directory. I've created a secure upload portal so that you can send us the files. Additionally, please provide the complete logs from mongod for the timeperiod that this behavior is present.

Thanks,
Mark

Generated at Thu Feb 08 04:18:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.