Core Server / SERVER-26735

Slow response to queries and no way to investigate it

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Minor - P4
    • Affects Version/s: 3.2.10
    • Component/s: Querying, Sharding
    • Labels: None
    • Operating System: ALL

      I have an 8-shard cluster, with a 3-member replica set for each shard, and I'm running a query which affects each shard equally. The query targets an update for almost all elements in a collection, but it has to be a full COLLSCAN. Most of the shards finish the query very fast (<5 seconds), but randomly, one or two of the shards take a very long time to execute, even as much as 1-2 minutes. Whenever this issue appears, it hits a different shard than the last time. I've run db.currentOp() on the primary of the slow shard's replica set, and the only query running is the update. I've also tried mongotop, mongostat and even setProfilingLevel, but to no avail; I'm unable to find what is slowing me down. Running this command twice in succession, I've noticed that the number of yields grows very fast:

      rs6:PRIMARY> ct = 0; db.currentOp().inprog.forEach(function(op) { if (op.secs_running > 0 && op.ns != "local.oplog.rs" && op.ns != "local.replset.minvalid") { ct += 1; printjson(op); } }); ct
              ...
              "secs_running" : 26,
              ...
              "numYields" : 1461,
              "locks" : {
                      "Global" : "w",
                      "local" : "w",
                      "Database" : "w",
                      "Collection" : "w",
                      "Metadata" : "w"
              },
              "waitingForLock" : false,
              "lockStats" : {
                      "Global" : {
                              "acquireCount" : {
                                      "r" : NumberLong(7942),
                                      "w" : NumberLong(7942)
                              }
                      },
                      "Database" : {
                              "acquireCount" : {
                                      "w" : NumberLong(7942)
                              }
                      },
                      "Collection" : {
                              "acquireCount" : {
                                      "w" : NumberLong(1462)
                              }
                      },
                      "Metadata" : {
                              "acquireCount" : {
                                      "w" : NumberLong(6480)
                              }
                      },
                      "oplog" : {
                              "acquireCount" : {
                                      "w" : NumberLong(6480)
                              }
                      }
              }
      }
      1
      
      rs6:PRIMARY> ct = 0; db.currentOp().inprog.forEach(function(op) { if (op.secs_running > 0 && op.ns != "local.oplog.rs" && op.ns != "local.replset.minvalid") { ct += 1; printjson(op); } }); ct
              ...
              "secs_running" : 27,
              ...
              "numYields" : 1352,
              "locks" : {
                      "Global" : "w",
                      "local" : "w",
                      "Database" : "w",
                      "Collection" : "w",
                      "Metadata" : "w"
              },
              "waitingForLock" : false,
              "lockStats" : {
                      "Global" : {
                              "acquireCount" : {
                                      "r" : NumberLong(7513),
                                      "w" : NumberLong(7513)
                              }
                      },
                      "Database" : {
                              "acquireCount" : {
                                      "w" : NumberLong(7513)
                              }
                      },
                      "Collection" : {
                              "acquireCount" : {
                                      "w" : NumberLong(1353)
                              }
                      },
                      "Metadata" : {
                              "acquireCount" : {
                                      "w" : NumberLong(6160)
                              }
                      },
                      "oplog" : {
                              "acquireCount" : {
                                      "w" : NumberLong(6160)
                              }
                      }
              }
      }
      1
      
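      The filter used in the shell one-liner above can be pulled into a small reusable helper. The sketch below is plain JavaScript (it also runs outside the mongo shell); `slowOps` and the sample array are illustrative names, not part of this report:

```javascript
// Return the subset of a db.currentOp().inprog array that represents
// user operations running for more than `minSecs`, skipping the
// internal replication namespaces that are always active on a primary.
function slowOps(inprog, minSecs) {
    var skip = { "local.oplog.rs": true, "local.replset.minvalid": true };
    return inprog.filter(function (op) {
        return op.secs_running > minSecs && !skip[op.ns];
    });
}

// Illustrative sample shaped like the output above:
var sample = [
    { ns: "local.oplog.rs", secs_running: 120 },
    { ns: "mydb.mycoll", secs_running: 26, numYields: 1461 },
    { ns: "local.replset.minvalid", secs_running: 5 }
];
var result = slowOps(sample, 0);  // keeps only the mydb.mycoll update
```

      Passing `db.currentOp().inprog` and a threshold of 0 reproduces what the one-liner prints, while letting you vary the threshold or the skip list in one place.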

      I have very good hardware and more than enough free RAM. I've noticed that the primaries of the slow shards tend to have much less free RAM than the others, so I decided to restart all the mongod processes in the cluster, and I had no problems for 2-3 days. After that, the issue appeared again. How can I investigate this issue further? What is causing the operation to yield? I've been restarting the servers for almost 10 days now. Thanks
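      One way to put a number on the yielding described above is to turn the raw counters into a rate. This is a hypothetical helper, not an existing shell command; note that in 3.2 a long write yields periodically by design, and additionally whenever it has to wait (e.g. for data to be read from disk), which would fit the low-free-RAM observation:

```javascript
// Yields per second for one entry from db.currentOp().inprog.
// A steady rate mostly reflects MongoDB's periodic cooperative
// yielding; a rate that climbs as free RAM falls suggests the scan
// is increasingly waiting on disk reads instead.
function yieldRate(op) {
    if (!op.secs_running) return 0;           // avoid divide-by-zero
    return op.numYields / op.secs_running;
}

// Values taken from the first sample above:
var rate = yieldRate({ secs_running: 26, numYields: 1461 });
// rate is roughly 56 yields/second
```

      Comparing this rate across the fast and slow shards, and against their free-RAM figures, would show whether the slow members are yielding disproportionately often.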

            Assignee:
            Unassigned
            Reporter:
            Tudor Aursulesei (thestick613)
            Votes:
            0
            Watchers:
            7

              Created:
              Updated:
              Resolved: