Core Server / SERVER-36516

Secondary mongod becomes unresponsive occasionally

    • Type: Bug
    • Resolution: Gone away
    • Priority: Major - P3
    • Affects Version/s: 3.4.16
    • Component/s: None
    • Labels: None
    • Server Triage
    • ALL

      MongoDB server version: 3.4.16

      The replica set has a primary and two secondaries. A few times a week, read requests become unresponsive on one of the secondaries. In addition, the affected secondary stops responding to operations such as "/etc/init.d/mongod restart".
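      One quick probe of the read hang from the mongo shell (a sketch; the database and collection names are illustrative, the port is the one this replica set uses) is a read against the affected secondary with a server-side time limit, so a healthy node returns promptly while a stuck node either hangs or fails with an ExceededTimeLimit error:

      // mongo --host <affected secondary> --port 27018
      rs.slaveOk()  // allow reads on a secondary (3.4-era shell helper)
      db.getSiblingDB("test").someCollection.find().maxTimeMS(5000).itcount()  // "test.someCollection" is illustrative only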

      mongod never comes to a complete stop:

      2018-08-06T07:07:08.621-0700 I COMMAND  [conn1040] terminating, shutdown command received
      2018-08-06T07:07:08.621-0700 I NETWORK  [conn1040] shutdown: going to close listening sockets...
      2018-08-06T07:07:08.621-0700 I NETWORK  [conn1040] closing listening socket: 7
      2018-08-06T07:07:08.621-0700 I NETWORK  [conn1040] closing listening socket: 8
      2018-08-06T07:07:08.621-0700 I NETWORK  [conn1040] removing socket file: /tmp/mongodb-27018.sock
      2018-08-06T07:07:08.622-0700 I NETWORK  [conn1040] shutdown: going to flush diaglog...
      2018-08-06T07:07:08.622-0700 I REPL     [conn1040] shutting down replication subsystems
      2018-08-06T07:07:08.622-0700 I REPL     [conn1040] Stopping replication reporter thread
      2018-08-06T07:07:08.622-0700 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo-b-shard3-1:27018: CallbackCanceled: Reporter no longer valid
      2018-08-06T07:07:08.622-0700 D REPL     [SyncSourceFeedback] The replication progress command (replSetUpdatePosition) failed and will be retried: CallbackCanceled: Reporter no longer valid
      2018-08-06T07:07:08.622-0700 I REPL     [conn1040] Stopping replication fetcher thread
      2018-08-06T07:07:08.624-0700 I REPL     [conn1040] Stopping replication applier thread
      2018-08-06T07:07:08.624-0700 D REPL     [replication-3] oplog query cancelled
      2018-08-06T07:07:08.624-0700 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host mongo-b-shard3-1.melodis.com:27018 due to bad connection status; 1 connections to that host remain open
      2018-08-06T07:07:08.624-0700 D REPL     [rsBackgroundSync] fetcher stopped reading remote oplog on mongo-b-shard3-1.melodis.com:27018
      2018-08-06T07:07:08.624-0700 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.
      2018-08-06T07:07:08.820-0700 I COMMAND  [conn3] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-1.melodis.com:27018", fromId: 0, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
      2018-08-06T07:07:08.822-0700 I COMMAND  [conn4] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-2.melodis.com:27018", fromId: 1, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
      2018-08-06T07:07:08.861-0700 I COMMAND  [conn980] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
      2018-08-06T07:07:09.126-0700 D REPL     [ReplicationExecutor] setUpValues: heartbeat response good for member _id:1, msg: 
      2018-08-06T07:07:09.126-0700 D REPL     [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-2.melodis.com:27018 at 2018-08-06T14:07:11.126Z
      2018-08-06T07:07:09.362-0700 I COMMAND  [conn976] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
      2018-08-06T07:07:09.863-0700 I COMMAND  [conn972] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
      2018-08-06T07:07:10.364-0700 I COMMAND  [conn966] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
      2018-08-06T07:07:10.517-0700 D REPL     [ReplicationExecutor] Canceling election timeout callback at 2018-08-06T07:07:19.824-0700
      2018-08-06T07:07:10.518-0700 D REPL     [ReplicationExecutor] setUpValues: heartbeat response good for member _id:0, msg: 
      2018-08-06T07:07:10.518-0700 D REPL     [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-1.melodis.com:27018 at 2018-08-06T14:07:12.517Z
      2018-08-06T07:07:10.827-0700 I COMMAND  [conn3] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-1.melodis.com:27018", fromId: 0, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
      2018-08-06T07:07:10.829-0700 I COMMAND  [conn4] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-2.melodis.com:27018", fromId: 1, term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
      2018-08-06T07:07:10.900-0700 I COMMAND  [conn961] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
      2018-08-06T07:07:11.132-0700 D REPL     [ReplicationExecutor] setUpValues: heartbeat response good for member _id:1, msg: 
      2018-08-06T07:07:11.132-0700 D REPL     [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-2.melodis.com:27018 at 2018-08-06T14:07:13.132Z
      2018-08-06T07:07:11.366-0700 I COMMAND  [conn956] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
      <more log content>
       

      The output of "db.currentOp(true)", captured while mongod was stuck, is posted below.
      The suspect operations are "rsSync" (threadId 140303695083264) and "repl writer worker 11" (threadId 140303548225280); see the two snippets below.
      rsSync holds the global lock in exclusive mode (W), repl writer worker 11 holds it in intent exclusive mode (w), and neither reports waitingForLock. The documentation
      states that "an exclusive (X) lock cannot coexist with any other modes, and a shared (S) lock can only coexist with intent shared (IS) locks":
      https://docs.mongodb.com/manual/faq/concurrency/#faq-concurrency-locking
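
      One way to isolate such entries from the full db.currentOp(true) output (a mongo shell sketch; not necessarily how these snippets were produced) is to keep only the operations that hold a Global lock:

      // Print every in-progress operation that currently holds the Global lock,
      // along with its lock mode and whether it is waiting for a lock.
      db.currentOp(true).inprog
        .filter(function (op) { return op.locks && op.locks.Global; })
        .forEach(function (op) {
          print(op.desc + " (threadId " + op.threadId + "): Global=" + op.locks.Global +
                ", waitingForLock=" + op.waitingForLock);
        });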

          {
                              "desc" : "rsSync",
                              "threadId" : "140303695083264",
                              "active" : true,
                              "opid" : 204493,
                              "secs_running" : 554,
                              "microsecs_running" : NumberLong(554584803),
                              "op" : "none",
                              "ns" : "local.replset.minvalid",
                              "query" : {
                                      
                              },
                              "numYields" : 0,
                              "locks" : {
                                      "Global" : "W"
                              },
                              "waitingForLock" : false,
                              "lockStats" : {
                                      "Global" : {
                                              "acquireCount" : {
                                                      "r" : NumberLong(1),
                                                      "w" : NumberLong(1),
                                                      "W" : NumberLong(1)
                                              }
                                      },
                                      "Database" : {
                                              "acquireCount" : {
                                                      "r" : NumberLong(1),
                                                      "W" : NumberLong(1)
                                              }
                                      }
                              }
                      },
      
                 {
                              "desc" : "repl writer worker 11",
                              "threadId" : "140303548225280",
                              "active" : true,
                              "opid" : 204510,
                              "op" : "none",
                              "ns" : "",
                              "query" : {
                                      
                              },
                              "numYields" : 0,
                              "locks" : {
                                      "Global" : "w",
                                      "local" : "w",
                                      "Collection" : "w"
                              },
                              "waitingForLock" : false,
                              "lockStats" : {
                                      "Global" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      },
                                      "Database" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      },
                                      "oplog" : {
                                              "acquireCount" : {
                                                      "w" : NumberLong(1)
                                              }
                                      }
                              }
                      },
      

            Assignee: backlog-server-triage [HELP ONLY] Backlog - Triage Team
            Reporter: matthias.eichstaedt (Matthias Eichstaedt)
            Votes: 0
            Watchers: 9
