Core Server / SERVER-38074

Abnormal performance degradation and heavy traffic between primary shard and config servers

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.2.20
    • Component/s: None
    • Labels: None
    • ALL

      MongoDB version 3.2.20, cluster deployed as 15 or more shard replica sets + 1 config server replica set.

      Each shard has three nodes (1 primary + 2 secondaries); the storage engine is MMAPv1.

      Abnormal performance degradation occurs irregularly:

      1. Outbound traffic from all config nodes is very large.
      2. The traffic occurs only between the primary node of the primary shard and the config servers.
      3. When the situation occurs, we see a large number of requests from the primary shard to the config servers; logs are shown below:

      log a:

      2018-11-07T18:59:07.289959+08:00 [conn51755] remotely refreshing metadata for ${db}.${collection} with requested shard version 0|0||000000000000000000000000, current shard version is 23075|5302||5bd9add9f47203867a329afd, current metadata version is 23081|428||5bd9add9f47203867a329afd

      log b:

      2018-11-07T22:03:44.271591+08:00 [conn63943] updating metadata for ${db}.${collection} from shard version 23108|23||5bd9add9f47203867a329afd to shard version 23108|23||5bd9add9f47203867a329afd, took 850 ms

      4. At this time, many slow-query log entries were generated on the primary node and operation times kept growing; db.currentOp() on the primary shard showed that many queries were stuck with `"msg" : "waiting for write concern"`.

      5. After we raised the log level, we found a large number of find queries for the chunks collection issued from the primary node to the config servers:

      2018-11-09T19:03:00.370867+08:00 [NetworkInterfaceASIO-ShardRegistry-0] Initiating asynchronous command: RemoteCommand 608607 – target:${config svr ip}:{port} db:config expDate:2018-11-09T19:03:30.370+0800 cmd:{ find: "chunks", filter: { ns: "${db}.${collection}", lastmod: { $gte: Timestamp 0|0 } }, sort: { lastmod: 1 }, readConcern: { level: "majority", afterOpTime: { ts: ... } }, maxTimeMS: 30000 }
      2018-11-09T19:03:00.370883+08:00 [NetworkInterfaceASIO-ShardRegistry-0] Starting asynchronous command 608607 on host ${config server ip}:${port}
      2018-11-09T19:03:00.372395+08:00 [NetworkInterfaceASIO-ShardRegistry-0] Request 608607 finished with response: { waitedMS: 0, cursor: { firstBatch: ... }, ok: 1.0 }
      2018-11-09T19:03:00.373470+08:00 [NetworkInterfaceASIO-ShardRegistry-0] Failed to time operation 608607 out: Operation aborted.

      6. The situation can be resolved temporarily by stepping down the config server primary or the primary shard's primary, but it recurs at an unpredictable time.
      7. After stepping down the primary shard, the slow logs (UPDATE) on the old primary node kept growing; we had to restart the mongod on the old primary node to stop it.
      8. We have no way to reproduce the issue on demand, but it keeps occurring.
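For reference, the metadata-refresh find in step 5 can be sketched as plain JavaScript. This is a minimal model, not MongoDB internals: the namespace and chunk documents are made-up placeholders, and timestamps are modeled as `{ t, i }` pairs rather than MongoDB's real Timestamp type. It only illustrates the shape of the query the shard sends to the config server: filter by namespace and `lastmod >= Timestamp 0|0`, sorted ascending by `lastmod`.

```javascript
// Compare two { t, i } timestamp pairs: a >= b (major term first, then increment).
// NOTE: a simplified stand-in for MongoDB's Timestamp comparison, for illustration only.
const tsGte = (a, b) => a.t > b.t || (a.t === b.t && a.i >= b.i);

// Query shape reconstructed from the log line above.
// "testdb.testcoll" is a placeholder for ${db}.${collection}.
const query = {
  ns: "testdb.testcoll",
  lastmodGte: { t: 0, i: 0 }, // Timestamp 0|0 => asks for every chunk (full refresh)
};

// Hypothetical chunk metadata, as it might sit in config.chunks.
const chunks = [
  { ns: "testdb.testcoll", lastmod: { t: 23108, i: 23 } },
  { ns: "testdb.testcoll", lastmod: { t: 23075, i: 5302 } },
  { ns: "otherdb.other",   lastmod: { t: 1, i: 0 } },
];

// Apply the filter and sort, mimicking what the config server returns.
const result = chunks
  .filter(c => c.ns === query.ns && tsGte(c.lastmod, query.lastmodGte))
  .sort((a, b) => (a.lastmod.t - b.lastmod.t) || (a.lastmod.i - b.lastmod.i));

console.log(result.map(c => `${c.lastmod.t}|${c.lastmod.i}`));
// → [ '23075|5302', '23108|23' ]
```

Note that log a above shows a requested shard version of 0|0||000000000000000000000000, i.e. the `lastmod: { $gte: Timestamp 0|0 }` case: each such refresh pulls the full chunk list for the collection, which would be consistent with the heavy config-server traffic observed.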

            Assignee:
            daniel.hatcher@mongodb.com Danny Hatcher (Inactive)
            Reporter:
            stutiredboy@gmail.com Adun
            Votes:
            0
            Watchers:
            7
