Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-40065

spike latency issue on shard cluster

    • Type: Icon: Bug Bug
    • Resolution: Duplicate
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Performance
    • None
    • ALL

      on MongoDB 3.4.19, shard cluster with 2 shards.

      collection `foo` is sharded with  hashed key "UID"

      > sh.enableSharding("foo")
      > db.adminCommand({"shardCollection": "foo.foo", key: {"UID": "hashed"}})
      > use foo
      > db.foo.createIndex({"name": 1})

      we write a test case in mongo-cxx-driver to test the latency.

      auto nStart = std::chrono::high_resolution_clock::now();
      oDB[DB_TABLE_NAME].update_one(document{}<<"UID"<< nUID <<finalize, oValue.view(), opt);
      auto nEnd = std::chrono::high_resolution_clock::now();
      std::chrono::duration<double, std::milli> diff = nEnd - nStart;
      printf("uid %d timediff %f\n", nUID, diff.count());

       

      output as below:

      uid 10030116 timediff 1.249126
      uid 10000124 timediff 1.021864
      uid 10050127 timediff 1.172118
      uid 10020116 timediff 1.223791
      uid 10040115 timediff 1.408828
      uid 10070114 timediff 1.526046

       

      What we did:

      1. The test case run with 8 threads, each thread runs 10000 upserts loop to the collection.
      2. Each upsert operation is on different UID.
      3. Except UID/ID, all docs are always the same.

      What we found:

      1. the latency has many spike, such as 4000+ ms in some update_one operation, but not any slow log in mongod server(we set slowms to 10ms).
      2. as 1, the total timediff is unstable(FROM: ./testmongo | awk '{s+=$NF}END{print s}')

      149726
      129175
      124993
      219767
      137422
      156674
      162410
      119684
      117086
      116885

      3. after we run `rs.stepDown()` on config Primary, the spike gone away(FROM: ./testmongo | awk '{s+=$NF}END{print s}')

      106808
      107225
      107228
      108055
      106660
      108690
      105993
      107037
      106226
      104789
      106494
      105178
      107428
      108650
      106535

       

      What we tried:

      1. initial the test db without sharding but create index UID as hashed, name as 1, the latency is normal and stable;
      2. run the benchmark to each replicaSet, teh latency is normal and stable;
      3. try to stepDown replicaSet, the latency may return normal.(not always)
      4. try to stepDown config again, the latecy may abnormal/normal again.(not always)
      5. replace the wiredTiger storage to MMAPv1, have the same sistuation.(can return normal after stepDown something)
      6. try to stop balance, it has not help;
      7. downgrade to MongoDB 3.2.22, the latency is normal and stable
      8. not any abnormal log, we only found split point lookup and network connection logs.

        1. 4.0_primary.shard2.diagnostic.data.tgz
          292 kB
        2. 4.0_primary.shard1.diagnostic.data.tgz
          378 kB
        3. 4.0_primary.config.diagnostic.data.tgz
          305 kB
        4. 3.4_primary.shard2.diganostic.data.tgz
          2.37 MB
        5. 3.4_primary.shard1.diganostic.data.tgz
          3.00 MB
        6. 3.4_primary.config.diganostic.data.tgz
          2.49 MB

            Assignee:
            eric.sedor@mongodb.com Eric Sedor
            Reporter:
            stutiredboy@gmail.com Adun
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated:
              Resolved: