Type: Bug
Resolution: Gone away
Priority: Major - P3
Affects Version/s: 3.4.16
Component/s: None
MongoDB server version: 3.4.16
The replica set has a primary and 2 secondaries. A few times a week, read requests become unresponsive on one of the secondaries. In addition, the secondary becomes unresponsive to administrative operations such as "/etc/init.d/mongod restart":
mongod never comes to a complete stop:
2018-08-06T07:07:08.621-0700 I COMMAND [conn1040] terminating, shutdown command received
2018-08-06T07:07:08.621-0700 I NETWORK [conn1040] shutdown: going to close listening sockets...
2018-08-06T07:07:08.621-0700 I NETWORK [conn1040] closing listening socket: 7
2018-08-06T07:07:08.621-0700 I NETWORK [conn1040] closing listening socket: 8
2018-08-06T07:07:08.621-0700 I NETWORK [conn1040] removing socket file: /tmp/mongodb-27018.sock
2018-08-06T07:07:08.622-0700 I NETWORK [conn1040] shutdown: going to flush diaglog...
2018-08-06T07:07:08.622-0700 I REPL [conn1040] shutting down replication subsystems
2018-08-06T07:07:08.622-0700 I REPL [conn1040] Stopping replication reporter thread
2018-08-06T07:07:08.622-0700 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo-b-shard3-1:27018: CallbackCanceled: Reporter no longer valid
2018-08-06T07:07:08.622-0700 D REPL [SyncSourceFeedback] The replication progress command (replSetUpdatePosition) failed and will be retried: CallbackCanceled: Reporter no longer valid
2018-08-06T07:07:08.622-0700 I REPL [conn1040] Stopping replication fetcher thread
2018-08-06T07:07:08.624-0700 I REPL [conn1040] Stopping replication applier thread
2018-08-06T07:07:08.624-0700 D REPL [replication-3] oplog query cancelled
2018-08-06T07:07:08.624-0700 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host mongo-b-shard3-1.melodis.com:27018 due to bad connection status; 1 connections to that host remain open
2018-08-06T07:07:08.624-0700 D REPL [rsBackgroundSync] fetcher stopped reading remote oplog on mongo-b-shard3-1.melodis.com:27018
2018-08-06T07:07:08.624-0700 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2018-08-06T07:07:08.820-0700 I COMMAND [conn3] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-1.melodis.com:27018", fromId: 0, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:08.822-0700 I COMMAND [conn4] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-2.melodis.com:27018", fromId: 1, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:08.861-0700 I COMMAND [conn980] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:09.126-0700 D REPL [ReplicationExecutor] setUpValues: heartbeat response good for member _id:1, msg:
2018-08-06T07:07:09.126-0700 D REPL [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-2.melodis.com:27018 at 2018-08-06T14:07:11.126Z
2018-08-06T07:07:09.362-0700 I COMMAND [conn976] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:09.863-0700 I COMMAND [conn972] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:10.364-0700 I COMMAND [conn966] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:10.517-0700 D REPL [ReplicationExecutor] Canceling election timeout callback at 2018-08-06T07:07:19.824-0700
2018-08-06T07:07:10.518-0700 D REPL [ReplicationExecutor] setUpValues: heartbeat response good for member _id:0, msg:
2018-08-06T07:07:10.518-0700 D REPL [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-1.melodis.com:27018 at 2018-08-06T14:07:12.517Z
2018-08-06T07:07:10.827-0700 I COMMAND [conn3] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-1.melodis.com:27018", fromId: 0, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:10.829-0700 I COMMAND [conn4] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-2.melodis.com:27018", fromId: 1, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:10.900-0700 I COMMAND [conn961] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:11.132-0700 D REPL [ReplicationExecutor] setUpValues: heartbeat response good for member _id:1, msg:
2018-08-06T07:07:11.132-0700 D REPL [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-2.melodis.com:27018 at 2018-08-06T14:07:13.132Z
2018-08-06T07:07:11.366-0700 I COMMAND [conn956] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
<more log content>
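For reference, the "terminating, shutdown command received" line above corresponds to the shutdown that the init script requests. A minimal way to issue the same clean shutdown by hand from the mongo shell (an illustrative sketch, not part of the original report) is:

// Connect to the stuck secondary and request a clean shutdown; this is the
// same admin command whose handling never completes in the log above.
db.getSiblingDB("admin").shutdownServer()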
The output of "db.currentOp(true)", captured while mongod was stuck, is posted below.
The suspect operations are "rsSync" (threadId 140303695083264) and "repl writer worker 11" (threadId 140303548225280); see the two snippets below.
rsSync holds the global lock in exclusive mode (W) while repl writer worker 11 holds it in intent exclusive mode (w). The documentation
states that "an exclusive (X) lock cannot coexist with any other modes, and a shared (S) lock can only coexist with intent shared (IS) locks", so these two operations should not be holding the global lock at the same time.
https://docs.mongodb.com/manual/faq/concurrency/#faq-concurrency-locking
{ "desc" : "rsSync", "threadId" : "140303695083264", "active" : true, "opid" : 204493, "secs_running" : 554, "microsecs_running" : NumberLong(554584803), "op" : "none", "ns" : "local.replset.minvalid", "query" : { }, "numYields" : 0, "locks" : { "Global" : "W" }, "waitingForLock" : false, "lockStats" : { "Global" : { "acquireCount" : { "r" : NumberLong(1), "w" : NumberLong(1), "W" : NumberLong(1) } }, "Database" : { "acquireCount" : { "r" : NumberLong(1), "W" : NumberLong(1) } } } },
{ "desc" : "repl writer worker 11", "threadId" : "140303548225280", "active" : true, "opid" : 204510, "op" : "none", "ns" : "", "query" : { }, "numYields" : 0, "locks" : { "Global" : "w", "local" : "w", "Collection" : "w" }, "waitingForLock" : false, "lockStats" : { "Global" : { "acquireCount" : { "w" : NumberLong(1) } }, "Database" : { "acquireCount" : { "w" : NumberLong(1) } }, "oplog" : { "acquireCount" : { "w" : NumberLong(1) } } } },