Details
Type: Bug
Resolution: Gone away
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.4.16
Component/s: None
Labels: None
Team: Server Triage
Operating System: ALL
Description
MongoDB server version: 3.4.16
The replica set has a primary and two secondaries. A few times a week, read requests become unresponsive on one of the secondaries. In addition, the secondary stops responding to operations such as "/etc/init.d/mongod restart"; mongod never comes to a complete stop:
2018-08-06T07:07:08.621-0700 I COMMAND [conn1040] terminating, shutdown command received
2018-08-06T07:07:08.621-0700 I NETWORK [conn1040] shutdown: going to close listening sockets...
2018-08-06T07:07:08.621-0700 I NETWORK [conn1040] closing listening socket: 7
2018-08-06T07:07:08.621-0700 I NETWORK [conn1040] closing listening socket: 8
2018-08-06T07:07:08.621-0700 I NETWORK [conn1040] removing socket file: /tmp/mongodb-27018.sock
2018-08-06T07:07:08.622-0700 I NETWORK [conn1040] shutdown: going to flush diaglog...
2018-08-06T07:07:08.622-0700 I REPL [conn1040] shutting down replication subsystems
2018-08-06T07:07:08.622-0700 I REPL [conn1040] Stopping replication reporter thread
2018-08-06T07:07:08.622-0700 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo-b-shard3-1:27018: CallbackCanceled: Reporter no longer valid
2018-08-06T07:07:08.622-0700 D REPL [SyncSourceFeedback] The replication progress command (replSetUpdatePosition) failed and will be retried: CallbackCanceled: Reporter no longer valid
2018-08-06T07:07:08.622-0700 I REPL [conn1040] Stopping replication fetcher thread
2018-08-06T07:07:08.624-0700 I REPL [conn1040] Stopping replication applier thread
2018-08-06T07:07:08.624-0700 D REPL [replication-3] oplog query cancelled
2018-08-06T07:07:08.624-0700 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host mongo-b-shard3-1.melodis.com:27018 due to bad connection status; 1 connections to that host remain open
2018-08-06T07:07:08.624-0700 D REPL [rsBackgroundSync] fetcher stopped reading remote oplog on mongo-b-shard3-1.melodis.com:27018
2018-08-06T07:07:08.624-0700 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2018-08-06T07:07:08.820-0700 I COMMAND [conn3] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-1.melodis.com:27018", fromId: 0, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:08.822-0700 I COMMAND [conn4] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-2.melodis.com:27018", fromId: 1, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:08.861-0700 I COMMAND [conn980] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:09.126-0700 D REPL [ReplicationExecutor] setUpValues: heartbeat response good for member _id:1, msg:
2018-08-06T07:07:09.126-0700 D REPL [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-2.melodis.com:27018 at 2018-08-06T14:07:11.126Z
2018-08-06T07:07:09.362-0700 I COMMAND [conn976] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:09.863-0700 I COMMAND [conn972] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:10.364-0700 I COMMAND [conn966] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:10.517-0700 D REPL [ReplicationExecutor] Canceling election timeout callback at 2018-08-06T07:07:19.824-0700
2018-08-06T07:07:10.518-0700 D REPL [ReplicationExecutor] setUpValues: heartbeat response good for member _id:0, msg:
2018-08-06T07:07:10.518-0700 D REPL [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-1.melodis.com:27018 at 2018-08-06T14:07:12.517Z
2018-08-06T07:07:10.827-0700 I COMMAND [conn3] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-1.melodis.com:27018", fromId: 0, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:10.829-0700 I COMMAND [conn4] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-2.melodis.com:27018", fromId: 1, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:10.900-0700 I COMMAND [conn961] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:11.132-0700 D REPL [ReplicationExecutor] setUpValues: heartbeat response good for member _id:1, msg:
2018-08-06T07:07:11.132-0700 D REPL [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-2.melodis.com:27018 at 2018-08-06T14:07:13.132Z
2018-08-06T07:07:11.366-0700 I COMMAND [conn956] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
<more log content>
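The restart above goes through the init script; the log shows the shutdown command arriving on conn1040. For reference, a minimal sketch of sending the same shutdown request from a mongo shell connected to the secondary (the timeoutSecs value here is an arbitrary illustration):

// Ask the mongod to shut down cleanly, waiting up to 60 seconds
db.getSiblingDB("admin").shutdownServer({ timeoutSecs: 60 })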
The output of "db.currentOp(true)", captured while mongod was stuck, is posted below.
The suspect operations are "rsSync" (threadId 140303695083264) and "repl writer worker 11" (threadId 140303548225280); see the two snippets below.
rsSync holds the global lock in exclusive (W) mode while repl writer worker 11 holds it in intent exclusive (w) mode. The documentation states that "an exclusive (X) lock cannot coexist with any other modes, and a shared (S) lock can only coexist with intent shared (IS) locks".
https://docs.mongodb.com/manual/faq/concurrency/#faq-concurrency-locking
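For anyone digging through the full output, a filter along these lines (a rough sketch against the 3.4 currentOp document shape, run in the mongo shell on the stuck node) pulls out every operation that holds or is waiting on the Global lock:

// List in-progress operations that hold or wait for the Global lock
db.currentOp(true).inprog
    .filter(function (op) {
        return (op.locks && op.locks.Global) || op.waitingForLock;
    })
    .forEach(function (op) {
        print(op.desc + " (opid " + op.opid + "): Global=" +
              ((op.locks && op.locks.Global) || "-") +
              " waitingForLock=" + op.waitingForLock);
    });

The two suspect entries from the captured output are: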
{
    "desc" : "rsSync",
    "threadId" : "140303695083264",
    "active" : true,
    "opid" : 204493,
    "secs_running" : 554,
    "microsecs_running" : NumberLong(554584803),
    "op" : "none",
    "ns" : "local.replset.minvalid",
    "query" : {

    },
    "numYields" : 0,
    "locks" : {
        "Global" : "W"
    },
    "waitingForLock" : false,
    "lockStats" : {
        "Global" : {
            "acquireCount" : {
                "r" : NumberLong(1),
                "w" : NumberLong(1),
                "W" : NumberLong(1)
            }
        },
        "Database" : {
            "acquireCount" : {
                "r" : NumberLong(1),
                "W" : NumberLong(1)
            }
        }
    }
},
{
    "desc" : "repl writer worker 11",
    "threadId" : "140303548225280",
    "active" : true,
    "opid" : 204510,
    "op" : "none",
    "ns" : "",
    "query" : {

    },
    "numYields" : 0,
    "locks" : {
        "Global" : "w",
        "local" : "w",
        "Collection" : "w"
    },
    "waitingForLock" : false,
    "lockStats" : {
        "Global" : {
            "acquireCount" : {
                "w" : NumberLong(1)
            }
        },
        "Database" : {
            "acquireCount" : {
                "w" : NumberLong(1)
            }
        },
        "oplog" : {
            "acquireCount" : {
                "w" : NumberLong(1)
            }
        }
    }
},
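An additional data point that could be captured the next time a node gets stuck (a sketch only, not output from this incident): the globalLock section of serverStatus shows how many operations are queued behind the locks held above.

// Snapshot of global lock holders and waiters on the stuck secondary
var gl = db.serverStatus().globalLock;
printjson({
    currentQueue: gl.currentQueue,   // operations queued waiting for a lock
    activeClients: gl.activeClients  // clients currently active / holding locks
});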