Core Server / SERVER-16842

moveChunk with replication is slow with 2.8


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Works as Designed
    • Affects Version/s: 2.8.0-rc4
    • Fix Version/s: None
    • Component/s: Replication
    • Labels:
    • Backwards Compatibility:
      Minor Change
    • Operating System:
      ALL

      Description

      moveChunk on a replicated collection is slow compared to 2.6.6, possibly due to tailing of local.oplog.rs.

      Test

      • sharded cluster with replication
      • time moveChunk for a chunk with around 16913 docs
      • _waitForDelete: true
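For context, the timing step above amounts to running the moveChunk admin command from a mongos shell; the namespace and shard name below are placeholders, not taken from the ticket, and _waitForDelete makes the command block until the donor shard has deleted the migrated range.

```
// Run against a mongos; "test.coll" and "rs1" are hypothetical names.
db.adminCommand({
    moveChunk: "test.coll",   // sharded namespace
    find: { _id: 3 },         // any document inside the chunk to move
    to: "rs1",                // destination shard
    _waitForDelete: true      // wait for range deletion on the donor
})
```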

      moveChunk test results

      It takes around 80-90 seconds to move one chunk on 2.8.0-rc4, versus around 4 seconds on 2.6.6. The results are similar for both mmapv1 and WiredTiger.

      Raw data (each line holds the start and end timestamps of one migration)

      rc4:  mmapv1
       
          { "_id" : 3 } -->> { "_id" : 16915 } on : rs1 Timestamp(14, 4)    <-- chunk details
       
          Wed Jan 14 2015 00:27:16 GMT+0000 (UTC),Wed Jan 14 2015 00:28:40 GMT+0000 (UTC)  <-- two timestamps: the first is the start of moveChunk, the second is when the operation stopped.
          Wed Jan 14 2015 00:28:40 GMT+0000 (UTC),Wed Jan 14 2015 00:30:09 GMT+0000 (UTC)
       
       
      2.6.6
          { "_id" : 1141 } -->> { "_id" : 18054 } on : rs1 Timestamp(8, 0)
       
          Wed Jan 14 2015 00:56:49 GMT+0000 (UTC),Wed Jan 14 2015 00:56:53 GMT+0000 (UTC)
          Wed Jan 14 2015 00:56:53 GMT+0000 (UTC),Wed Jan 14 2015 00:56:57 GMT+0000 (UTC)
          Wed Jan 14 2015 00:56:57 GMT+0000 (UTC),Wed Jan 14 2015 00:57:01 GMT+0000 (UTC)

      Without _waitForDelete, the time is roughly halved, but still slow.

      # tested with rc5 
      Wed Jan 14 2015 15:13:12 GMT+0000 (UTC),Wed Jan 14 2015 15:13:56 GMT+0000 (UTC)
      Wed Jan 14 2015 15:13:56 GMT+0000 (UTC),Wed Jan 14 2015 15:14:40 GMT+0000 (UTC)
      Wed Jan 14 2015 15:14:40 GMT+0000 (UTC),Wed Jan 14 2015 15:15:24 GMT+0000 (UTC)
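As a sanity check on the durations above, each migration time can be computed from one start,end pair. This sketch uses the first rc4 (mmapv1) pair, rewritten in ISO form for GNU date:

```shell
# Duration of the first rc4 migration: 00:27:16 -> 00:28:40 UTC on 2015-01-14.
to_epoch() { date -ud "$1" +%s; }
start=$(to_epoch "2015-01-14 00:27:16")
end=$(to_epoch "2015-01-14 00:28:40")
echo "$(( end - start ))s"   # -> 84s, within the reported 80-90s range
```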

      getMore for local.oplog.rs

      While debugging, I found that rc4 issues far more getMore commands than 2.6.6. Checking the logs shows that nearly all getMore operations on local.oplog.rs have nreturned:1, which is not the case for 2.6.6.

      for example

      # rc4
      2015-01-14T14:17:12.529+0000 I QUERY    [conn10] getmore local.oplog.rs cursorid:23091492740 ntoreturn:0 keyUpdates:0  nreturned:1 reslen:123 0ms
       
      # 2.6.6
      2015-01-14T14:29:00.329+0000 [conn7] getmore local.oplog.rs cursorid:18596876821 ntoreturn:0 keyUpdates:0 numYields:1 locks(micros) r:89 nreturned:108 reslen:11144 0ms

      Count of getMore operations (with logLevel=1)

       
      # rc4  -- total getMore
      [SHARDS-2 wt-move-chunk-1]$ cat mongod-wt.log | grep getmore | wc
        16926  203112 2471218
       
      # rc4 -- total getMore with nreturned != 1, and they are all nreturned:0
      [SHARDS-2 wt-move-chunk-1]$ cat mongod-wt.log | grep getmore | grep -v "nreturned:1 " | wc
           11     132    1628
       
      [SHARDS-2 wt-move-chunk-1]$ cat mongod-wt.log | grep getmore | grep -v "nreturned:1 "
      2015-01-14T14:15:45.668+0000 I QUERY    [conn10] getmore local.oplog.rs cursorid:23091492740 ntoreturn:0 keyUpdates:0  nreturned:0 reslen:20 5011ms
      ...
       
      # 2.6.6
      [SHARDS-2 wt-move-chunk-1]$ cat mongod-266.log | grep getmore | wc
          179    2327   29928
      [SHARDS-2 wt-move-chunk-1]$ cat mongod-266.log | grep getmore | grep -v "nreturned:1 " | wc
          173    2249   28941
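The grep counts above can be reproduced on the two sample log lines quoted earlier; the log file name here is made up for illustration.

```shell
# Write the two sample getmore lines (one nreturned:1, one nreturned:0) to a file.
cat > sample-oplog.log <<'EOF'
2015-01-14T14:17:12.529+0000 I QUERY    [conn10] getmore local.oplog.rs cursorid:23091492740 ntoreturn:0 keyUpdates:0  nreturned:1 reslen:123 0ms
2015-01-14T14:15:45.668+0000 I QUERY    [conn10] getmore local.oplog.rs cursorid:23091492740 ntoreturn:0 keyUpdates:0  nreturned:0 reslen:20 5011ms
EOF
total=$(grep -c getmore sample-oplog.log)                        # all getmores
single=$(grep getmore sample-oplog.log | grep -c 'nreturned:1 ') # single-doc batches
echo "total=$total single=$single"   # -> total=2 single=1
```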


              People

              Assignee:
              spencer Spencer Brody (Inactive)
              Reporter:
              rui.zhang Rui Zhang
              Participants:
              Votes: 0
              Watchers: 13
