Core Server / SERVER-36516

Secondary mongod becomes unresponsive occasionally

    • Type: Bug
    • Resolution: Gone away
    • Priority: Major - P3
    • Affects Version/s: 3.4.16
    • Component/s: None
    • Labels: None
    • Server Triage
    • ALL

      MongoDB server version: 3.4.16

      The replica set has a primary and two secondaries. A few times a week, read requests become unresponsive on one of the secondaries. In addition, the affected secondary stops responding to operations such as "/etc/init.d/mongod restart".
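      One quick probe of the read hang from the mongo shell (a sketch; the database and collection names are illustrative, the port is the one this replica set uses) is a read against the affected secondary with a server-side time limit, so a healthy node returns promptly while a stuck node either hangs or fails with an ExceededTimeLimit error:

      // mongo --host <affected secondary> --port 27018
      rs.slaveOk()  // allow reads on a secondary (3.4-era shell helper)
      db.getSiblingDB("test").someCollection.find().maxTimeMS(5000).itcount()  // "test.someCollection" is illustrative only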

      mongod never comes to a complete stop:

      2018-08-06T07:07:08.621-0700 I COMMAND  [conn1040] terminating, shutdown command received
      2018-08-06T07:07:08.621-0700 I NETWORK  [conn1040] shutdown: going to close listening sockets...
      2018-08-06T07:07:08.621-0700 I NETWORK  [conn1040] closing listening socket: 7
      2018-08-06T07:07:08.621-0700 I NETWORK  [conn1040] closing listening socket: 8
      2018-08-06T07:07:08.621-0700 I NETWORK  [conn1040] removing socket file: /tmp/mongodb-27018.sock
      2018-08-06T07:07:08.622-0700 I NETWORK  [conn1040] shutdown: going to flush diaglog...
      2018-08-06T07:07:08.622-0700 I REPL     [conn1040] shutting down replication subsystems
      2018-08-06T07:07:08.622-0700 I REPL     [conn1040] Stopping replication reporter thread
      2018-08-06T07:07:08.622-0700 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo-b-shard3-1:27018: CallbackCanceled: Reporter no longer valid
      2018-08-06T07:07:08.622-0700 D REPL     [SyncSourceFeedback] The replication progress command (replSetUpdatePosition) failed and will be retried: CallbackCanceled: Reporter no longer valid
      2018-08-06T07:07:08.622-0700 I REPL     [conn1040] Stopping replication fetcher thread
      2018-08-06T07:07:08.624-0700 I REPL     [conn1040] Stopping replication applier thread
      2018-08-06T07:07:08.624-0700 D REPL     [replication-3] oplog query cancelled
      2018-08-06T07:07:08.624-0700 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host mongo-b-shard3-1.melodis.com:27018 due to bad connection status; 1 connections to that host remain open
      2018-08-06T07:07:08.624-0700 D REPL     [rsBackgroundSync] fetcher stopped reading remote oplog on mongo-b-shard3-1.melodis.com:27018
      2018-08-06T07:07:08.624-0700 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.
      2018-08-06T07:07:08.820-0700 I COMMAND  [conn3] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-1.melodis.com:27018", fromId: 0, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
      2018-08-06T07:07:08.822-0700 I COMMAND  [conn4] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-2.melodis.com:27018", fromId: 1, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
      2018-08-06T07:07:08.861-0700 I COMMAND  [conn980] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
      2018-08-06T07:07:09.126-0700 D REPL     [ReplicationExecutor] setUpValues: heartbeat response good for member _id:1, msg: 
      2018-08-06T07:07:09.126-0700 D REPL     [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-2.melodis.com:27018 at 2018-08-06T14:07:11.126Z
      2018-08-06T07:07:09.362-0700 I COMMAND  [conn976] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
      2018-08-06T07:07:09.863-0700 I COMMAND  [conn972] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
      2018-08-06T07:07:10.364-0700 I COMMAND  [conn966] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
      2018-08-06T07:07:10.517-0700 D REPL     [ReplicationExecutor] Canceling election timeout callback at 2018-08-06T07:07:19.824-0700
      2018-08-06T07:07:10.518-0700 D REPL     [ReplicationExecutor] setUpValues: heartbeat response good for member _id:0, msg: 
      2018-08-06T07:07:10.518-0700 D REPL     [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-1.melodis.com:27018 at 2018-08-06T14:07:12.517Z
      2018-08-06T07:07:10.827-0700 I COMMAND  [conn3] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-1.melodis.com:27018", fromId: 0, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
      2018-08-06T07:07:10.829-0700 I COMMAND  [conn4] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-2.melodis.com:27018", fromId: 1, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
      2018-08-06T07:07:10.900-0700 I COMMAND  [conn961] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
      2018-08-06T07:07:11.132-0700 D REPL     [ReplicationExecutor] setUpValues: heartbeat response good for member _id:1, msg: 
      2018-08-06T07:07:11.132-0700 D REPL     [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-2.melodis.com:27018 at 2018-08-06T14:07:13.132Z
      2018-08-06T07:07:11.366-0700 I COMMAND  [conn956] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
      <more log content>
       

      The output of "db.currentOp(true)", captured while mongod was stuck, is posted below.
      The suspect operations are "rsSync" (threadId 140303695083264) and "repl writer worker 11" (threadId 140303548225280); see the two snippets below.
      rsSync holds the global lock in exclusive mode (W), repl writer worker 11 holds it in intent exclusive mode (w), and neither reports waitingForLock. The documentation
      states that "an exclusive (X) lock cannot coexist with any other modes, and a shared (S) lock can only coexist with intent shared (IS) locks":
      https://docs.mongodb.com/manual/faq/concurrency/#faq-concurrency-locking
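
      One way to isolate such entries from the full db.currentOp(true) output (a mongo shell sketch; not necessarily how these snippets were produced) is to keep only the operations that hold a Global lock:

      // Print every in-progress operation that currently holds the Global lock,
      // along with its lock mode and whether it is waiting for a lock.
      db.currentOp(true).inprog
        .filter(function (op) { return op.locks && op.locks.Global; })
        .forEach(function (op) {
          print(op.desc + " (threadId " + op.threadId + "): Global=" + op.locks.Global +
                ", waitingForLock=" + op.waitingForLock);
        });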

          {
                              "desc" : "rsSync",
                              "threadId" : "140303695083264",
                              "active" : true,
                              "opid" : 204493,
                              "secs_running" : 554,
                              "microsecs_running" : NumberLong(554584803),
                              "op" : "none",
                              "ns" : "local.replset.minvalid",
                              "query" : {
                                      
                              },
                              "numYields" : 0,
                              "locks" : {
                                      "Global" : "W"
                              },
                              "waitingForLock" : false,
                              "lockStats" : {
                                      "Global" : {
                                              "acquireCount" : {
                                                      "r" : NumberLong(1),
                                                      "w" : NumberLong(1),
                                                      "W" : NumberLong(1)
                                              }
                                      },
                                      "Database" : {
                                              "acquireCount" : {
                                                      "r" : NumberLong(1),
                                                      "W" : NumberLong(1)
                                              }
                                      }
                              }
                      },
      
                 {
                              "desc" : "repl writer worker 11",
                              "threadId" : "140303548225280",
                              "active" : true,
                              "opid" : 204510,
                              "op" : "none",
                              "ns" : "",
                              "query" : {
                                      
                              },
                              "numYields" : 0,
                              "locks" : {
                                      "Global" : "w",
                                      "local" : "w",
                                      "Collection" : "w"
                              },
                              "waitingForLock" : false,
                              "lockStats" : {
                                      "Global" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      },
                                      "Database" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      },
                                      "oplog" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      }
                              }
                      },
      

            Assignee: backlog-server-triage [HELP ONLY] Backlog - Triage Team
            Reporter: matthias.eichstaedt (Matthias Eichstaedt)
            Votes: 0
            Watchers: 9
