-
Type:
Bug
-
Resolution: Incomplete
-
Priority:
Major - P3
-
None
-
Affects Version/s: 3.2.5
-
Component/s: None
-
None
-
ALL
-
-
None
-
None
-
None
-
None
-
None
-
None
-
None
We use mongo as recovery storage where data is inserted and deleted millions of times but read only very rarely. Data is first buffered in the application and then transfered in chunks to the server (via C# Drivers DeleteManyAsync/InsertManyAsync).
Every few months however the mongod process starts using up most of the CPU and becomes very unresponsive. It can happen that we're still able to run queries from the mongo shell but patience is needed - 3 out of 4 connection request via mongo / mongostat / mongotop time out.
By this time we usually need to shutdown our application and close all pending operations - but even with closed application mongod does not recover and still use 6 out of 12 CPU sockets.
When we're able to connect and run db.currentOp we can still see prior insert/delete operations running (see below)
Is there anything we need to update in the configuration, is there anything we can do to prevent this problem or is this a bug in mongod?
VERSION:
3.2.5 on Windows
CONFIGURATION:
storage: dbPath: D:\company\DataFiles\MongoDB\v3\Server1 wiredTiger.engineConfig.cacheSizeGB: 1 systemLog: destination: file path: D:\company\LogFiles\MongoDB\v3\Server1\mongod_Server1.log net.port: 27017 security.authorization: enabled
DUMPFILE:
A memory dump of the process was created as well. I can provide you call stacks and thread usages but I have to clear with management if I can provide you the dump itself.
LOGFILE (after db.setLogLevel(5))
2017-03-23T00:49:16.415+0100 D COMMAND [conn6648] run command admin.$cmd { isMaster: 1 } 2017-03-23T00:49:16.415+0100 I COMMAND [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms 2017-03-23T00:49:16.415+0100 D COMMAND [conn6648] run command admin.$cmd { buildInfo: 1 } 2017-03-23T00:49:16.415+0100 I COMMAND [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms 2017-03-23T00:49:21.094+0100 D COMMAND [conn6973] run command admin.$cmd { ping: 1 } 2017-03-23T00:49:21.094+0100 I COMMAND [conn6973] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:37 locks:{} protocol:op_query 0ms 2017-03-23T00:49:24.943+0100 D COMMAND [PeriodicTaskRunner] task: DBConnectionPool-cleaner took: 0ms 2017-03-23T00:49:24.943+0100 D - [PeriodicTaskRunner] cleaning up unused lock buckets of the global lock manager 2017-03-23T00:49:24.943+0100 D COMMAND [PeriodicTaskRunner] task: UnusedLockCleaner took: 0ms 2017-03-23T00:49:24.943+0100 D COMMAND [PeriodicTaskRunner] task: DBConnectionPool-cleaner took: 0ms 2017-03-23T00:49:25.273+0100 D NETWORK [HostnameCanonicalizationWorker] Hostname Canonicalizer is acquiring host FQDNs 2017-03-23T00:49:25.273+0100 D NETWORK [HostnameCanonicalizationWorker] Hostname Canonicalizer acquired FQDNs 2017-03-23T00:49:26.420+0100 D COMMAND [conn6648] run command admin.$cmd { isMaster: 1 } 2017-03-23T00:49:26.420+0100 I COMMAND [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms 2017-03-23T00:49:26.421+0100 D COMMAND [conn6648] run command admin.$cmd { buildInfo: 1 } 2017-03-23T00:49:26.421+0100 I COMMAND [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms 2017-03-23T00:49:28.462+0100 D COMMAND [conn6973] run command admin.$cmd { ismaster: 1 } 2017-03-23T00:49:28.462+0100 I COMMAND [conn6973] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms 2017-03-23T00:49:36.095+0100 D COMMAND [conn6973] run command admin.$cmd { ping: 1 } 2017-03-23T00:49:36.095+0100 I COMMAND [conn6973] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:37 locks:{} protocol:op_query 0ms 2017-03-23T00:49:36.416+0100 D COMMAND [conn6648] run command admin.$cmd { isMaster: 1 } 2017-03-23T00:49:36.416+0100 I COMMAND [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms 2017-03-23T00:49:36.416+0100 D COMMAND [conn6648] run command admin.$cmd { buildInfo: 1 } 2017-03-23T00:49:36.417+0100 I COMMAND [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms 2017-03-23T00:49:46.421+0100 D COMMAND [conn6648] run command admin.$cmd { isMaster: 1 } 2017-03-23T00:49:46.421+0100 I COMMAND [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms 2017-03-23T00:49:46.422+0100 D COMMAND [conn6648] run command admin.$cmd { buildInfo: 1 } 2017-03-23T00:49:46.422+0100 I COMMAND [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms 2017-03-23T00:49:51.095+0100 D COMMAND [conn6973] run command admin.$cmd { ping: 1 } 2017-03-23T00:49:51.095+0100 I COMMAND [conn6973] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:37 locks:{} protocol:op_query 0ms 2017-03-23T00:49:56.416+0100 D COMMAND [conn6648] run command admin.$cmd { isMaster: 1 } 2017-03-23T00:49:56.416+0100 I COMMAND [conn6648] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms 2017-03-23T00:49:56.417+0100 D COMMAND [conn6648] run command admin.$cmd { buildInfo: 1 } 2017-03-23T00:49:56.417+0100 I COMMAND [conn6648] command admin.$cmd command: buildInfo { buildInfo: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:1016 locks:{} protocol:op_query 0ms 2017-03-23T00:49:58.965+0100 D COMMAND [conn6973] run command admin.$cmd { ismaster: 1 } 2017-03-23T00:49:58.966+0100 I COMMAND [conn6973] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms 2017-03-23T00:50:06.096+0100 D COMMAND [conn6973] run command admin.$cmd { ping: 1 }
MONGOTOP
ns total read write 2017-03-22T23:45:20+01:00 Recovery.BookmarkRegistrationData 0ms 0ms 0ms Recovery.BookmarkStatusData 0ms 0ms 0ms Recovery.IpcMessage 0ms 0ms 0ms Recovery.WorkflowDefinitionData 0ms 0ms 0ms Recovery.WorkflowInstanceHeaderData 0ms 0ms 0ms Recovery.WorkflowInstanceValueData 0ms 0ms 0ms Recovery.XmlAppearedContext 0ms 0ms 0ms Recovery.XmlBatchInfo 0ms 0ms 0ms Recovery.XmlMaterialFlowContext 0ms 0ms 0ms Recovery.system.profile 0ms 0ms 0ms
MONGOSTAT
insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn time *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:37+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:38+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:39+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:40+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:41+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:42+01:00 *0 *0 *0 *0 0 3|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 196b 19k 21 2017-03-22T22:52:43+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:44+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:45+01:00 *0 *0 *0 *0 0 3|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 196b 19k 21 2017-03-22T22:52:46+01:00 insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn time *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:47+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:48+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:49+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:50+01:00 *0 *0 *0 *0 0 6|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 635b 19k 21 2017-03-22T22:52:51+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:52+01:00 *0 *0 *0 *0 0 3|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 196b 19k 21 2017-03-22T22:52:53+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:54+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:55+01:00 *0 *0 *0 *0 0 3|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 196b 19k 21 2017-03-22T22:52:56+01:00 insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn time *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:57+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:58+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:52:59+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:53:00+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:53:01+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:53:02+01:00 *0 *0 *0 *0 0 3|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 196b 19k 21 2017-03-22T22:53:03+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:53:04+01:00 *0 *0 *0 *0 0 1|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 79b 18k 21 2017-03-22T22:53:05+01:00 *0 *0 *0 *0 0 8|0 1.1 9.6 0 2.8G 2.8G 0|0 0|5 752b 20k 21 2017-03-22T22:53:06+01:00
db.currentOp() (at this time all applications are closed)
{ "inprog" : [ { "desc" : "conn6975", "threadId" : "24484", "connectionId" : 6975, "client" : "127.0.0.1:56704", "active" : true, "opid" : 17839001, "secs_running" : 0, "microsecs_running" : NumberLong(47), "op" : "command", "ns" : "admin.$cmd", "query" : { "currentOp" : 1 }, "numYields" : 0, "locks" : { }, "waitingForLock" : false, "lockStats" : { } }, { "desc" : "conn6904", "threadId" : "2360", "connectionId" : 6904, "client" : "12.34.56.78:60398", "active" : true, "opid" : 17756520, "secs_running" : 106674, "microsecs_running" : NumberLong("106674215199"), "op" : "insert", "ns" : "Recovery.BookmarkStatusData", "query" : { "InstanceId" : BinData(3,"7Hw4VhndWU6ItN4YHJpHGA=="), "BookmarkName" : "StartOnHostMessage_b43fbd1b-e24f-4fb1-8477-e12ee16adcdd", "BookmarkStatus" : 2, "_id" : "56387cec-dd19-4e59-88b4-de181c9a4718StartOnHostMessage_b43fbd1b-e24f-4fb1-8477-e12ee16adcdd" }, "numYields" : 0, "locks" : { "Global" : "w", "Database" : "w", "Collection" : "w" }, "waitingForLock" : false, "lockStats" : { "Global" : { "acquireCount" : { "r" : NumberLong(1), "w" : NumberLong(1) } }, "Database" : { "acquireCount" : { "w" : NumberLong(1) } }, "Collection" : { "acquireCount" : { "w" : NumberLong(1) } } } }, { "desc" : "conn6893", "threadId" : "23732", "connectionId" : 6893, "client" : "12.34.56.78:57772", "active" : true, "opid" : 17756481, "secs_running" : 106683, "microsecs_running" : NumberLong("106683958128"), "op" : "remove", "ns" : "Recovery.WorkflowInstanceHeaderData", "query" : { "$msg" : "query not recording (too large)" }, "numYields" : 0, "locks" : { "Global" : "w", "Database" : "w", "Collection" : "w" }, "waitingForLock" : false, "lockStats" : { "Global" : { "acquireCount" : { "r" : NumberLong(1), "w" : NumberLong(1) } }, "Database" : { "acquireCount" : { "w" : NumberLong(1) } }, "Collection" : { "acquireCount" : { "w" : NumberLong(1) } } } }, { "desc" : "conn6899", "threadId" : "21736", "connectionId" : 6899, "client" : "12.34.56.78:59030", "active" : true, "opid" : 17756478, "secs_running" : 106684, "microsecs_running" : NumberLong("106684033208"), "op" : "remove", "ns" : "Recovery.XmlMaterialFlowContext", "query" : { "$msg" : "query not recording (too large)" }, "numYields" : 0, "locks" : { "Global" : "w", "Database" : "w", "Collection" : "w" }, "waitingForLock" : false, "lockStats" : { "Global" : { "acquireCount" : { "r" : NumberLong(1), "w" : NumberLong(1) } }, "Database" : { "acquireCount" : { "w" : NumberLong(1) } }, "Collection" : { "acquireCount" : { "w" : NumberLong(1) } } } }, { "desc" : "conn6900", "threadId" : "26656", "connectionId" : 6900, "client" : "12.34.56.78:59368", "active" : true, "opid" : 17756502, "secs_running" : 106680, "microsecs_running" : NumberLong("106680377639"), "op" : "insert", "ns" : "Recovery.BookmarkRegistrationData", "query" : { "$msg" : "query not recording (too large)" }, "numYields" : 0, "locks" : { "Global" : "w", "Database" : "w", "Collection" : "w" }, "waitingForLock" : false, "lockStats" : { "Global" : { "acquireCount" : { "r" : NumberLong(1), "w" : NumberLong(1) } }, "Database" : { "acquireCount" : { "w" : NumberLong(1) } }, "Collection" : { "acquireCount" : { "w" : NumberLong(1) } } } }, { "desc" : "conn6909", "threadId" : "316", "connectionId" : 6909, "client" : "12.34.56.78:60817", "active" : true, "opid" : 17756477, "secs_running" : 106684, "microsecs_running" : NumberLong("106684277826"), "op" : "insert", "ns" : "Recovery.WorkflowInstanceValueData", "query" : { "$msg" : "query not recording (too large)" }, "numYields" : 0, "locks" : { "Global" : "w", "Database" : "w", "Collection" : "w" }, "waitingForLock" : false, "lockStats" : { "Global" : { "acquireCount" : { "r" : NumberLong(3), "w" : NumberLong(3) } }, "Database" : { "acquireCount" : { "w" : NumberLong(3) } }, "Collection" : { "acquireCount" : { "w" : NumberLong(3) } } } } ], "ok" : 1 }