[SERVER-16842] moveChunk with replication is slow with 2.8 Created: 14/Jan/15  Updated: 09/Jul/16  Resolved: 21/Jan/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.8.0-rc4
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Rui Zhang (Inactive) Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: 28qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File moveChunkPerf.tar.gz    
Issue Links:
Related
related to DOCS-4318 Document secondaryThrottle improvements Closed
related to SERVER-14041 enhance secondaryThrottle parameter Closed
is related to DOCS-4716 moveChunk _secondaryThrottle argument... Closed
Backwards Compatibility: Minor Change
Operating System: ALL
Participants:

 Description   

moveChunk when the collection is replicated is slow compared to 2.6.6, possibly due to tailing local.oplog.rs.

Test

  • sharded cluster with replication
  • time moveChunk for a chunk containing around 16913 documents
  • _waitForDelete: true (a command sketch follows this list)
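
A minimal sketch of the kind of moveChunk invocation timed here, assuming a mongos connection; the namespace, find value, and destination shard name are illustrative placeholders, not taken from the original test:

    // run against a mongos; namespace, find value and destination shard
    // below are placeholders for illustration only
    db.adminCommand({
        moveChunk: "testdb.coll",
        find: { _id: 3 },          // any document inside the chunk to move
        to: "rs1",                 // destination shard (a replica set)
        _waitForDelete: true       // block until the donor deletes the moved range
    })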

moveChunk test results

It takes around 80-90 seconds to move one chunk on 2.8 rc4, versus around 4 seconds on 2.6.6. The results are similar for both mmapv1 and WiredTiger.

Raw data (each line shows the start and end timestamp of one migration):

rc4:  mmapv1
 
    { "_id" : 3 } -->> { "_id" : 16915 } on : rs1 Timestamp(14, 4)    <-- chunk details
 
    Wed Jan 14 2015 00:27:16 GMT+0000 (UTC),Wed Jan 14 2015 00:28:40 GMT+0000 (UTC)  <-- two timestamps: the first is the start of the moveChunk, the second is when the operation finished.
    Wed Jan 14 2015 00:28:40 GMT+0000 (UTC),Wed Jan 14 2015 00:30:09 GMT+0000 (UTC)
 
 
2.6.6
    { "_id" : 1141 } -->> { "_id" : 18054 } on : rs1 Timestamp(8, 0)
 
    Wed Jan 14 2015 00:56:49 GMT+0000 (UTC),Wed Jan 14 2015 00:56:53 GMT+0000 (UTC)
    Wed Jan 14 2015 00:56:53 GMT+0000 (UTC),Wed Jan 14 2015 00:56:57 GMT+0000 (UTC)
    Wed Jan 14 2015 00:56:57 GMT+0000 (UTC),Wed Jan 14 2015 00:57:01 GMT+0000 (UTC)

Without _waitForDelete the time is roughly halved, but still slow:

# tested with rc5 
Wed Jan 14 2015 15:13:12 GMT+0000 (UTC),Wed Jan 14 2015 15:13:56 GMT+0000 (UTC)
Wed Jan 14 2015 15:13:56 GMT+0000 (UTC),Wed Jan 14 2015 15:14:40 GMT+0000 (UTC)
Wed Jan 14 2015 15:14:40 GMT+0000 (UTC),Wed Jan 14 2015 15:15:24 GMT+0000 (UTC)

getMore for local.oplog.rs

While debugging, RC4 was found to issue far more getMore commands than 2.6.6. Checking the log shows that nearly every getmore on local.oplog.rs returns a single document (nreturned:1), which is not the case for 2.6.6.

For example:

# rc4
2015-01-14T14:17:12.529+0000 I QUERY    [conn10] getmore local.oplog.rs cursorid:23091492740 ntoreturn:0 keyUpdates:0  nreturned:1 reslen:123 0ms
 
# 2.6.6
2015-01-14T14:29:00.329+0000 [conn7] getmore local.oplog.rs cursorid:18596876821 ntoreturn:0 keyUpdates:0 numYields:1 locks(micros) r:89 nreturned:108 reslen:11144 0ms

Count of getMore operations (with logLevel=1)

 
# rc4  -- total getMore
[SHARDS-2 wt-move-chunk-1]$ cat mongod-wt.log | grep getmore | wc
  16926  203112 2471218
 
# rc4 -- total getMore with nreturned != 1, and they are all nreturned:0
[SHARDS-2 wt-move-chunk-1]$ cat mongod-wt.log | grep getmore | grep -v "nreturned:1 " | wc
     11     132    1628
 
[SHARDS-2 wt-move-chunk-1]$ cat mongod-wt.log | grep getmore | grep -v "nreturned:1 "
2015-01-14T14:15:45.668+0000 I QUERY    [conn10] getmore local.oplog.rs cursorid:23091492740 ntoreturn:0 keyUpdates:0  nreturned:0 reslen:20 5011ms
...
 
# 2.6.6
[SHARDS-2 wt-move-chunk-1]$ cat mongod-266.log | grep getmore | wc
    179    2327   29928
[SHARDS-2 wt-move-chunk-1]$ cat mongod-266.log | grep getmore | grep -v "nreturned:1 " | wc
    173    2249   28941
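
For reference, the counts above depend on the elevated verbosity mentioned in the heading; a minimal sketch of toggling it from the mongo shell on the relevant mongod (not specific to this test):

    // log all operations, including fast (sub-100ms) getmore calls
    db.adminCommand({ setParameter: 1, logLevel: 1 })

    // restore the default verbosity afterwards
    db.adminCommand({ setParameter: 1, logLevel: 0 })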



 Comments   
Comment by Asya Kamsky [ 11/Feb/15 ]

Note that secondaryThrottle was true by default in 2.4 (see SERVER-7779).

Comment by Spencer Brody (Inactive) [ 21/Jan/15 ]

This is actually just due to the change from SERVER-14041 that made _secondaryThrottle default to true.
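
For anyone who wants the 2.6-like migration speed back, a minimal sketch of explicitly disabling the throttle, either per migration or for the balancer as a whole; the namespace and shard name are placeholders:

    // per migration: pass _secondaryThrottle: false to moveChunk (run on a mongos)
    db.adminCommand({
        moveChunk: "testdb.coll",
        find: { _id: 3 },
        to: "rs1",
        _secondaryThrottle: false
    })

    // balancer-wide: persist the setting in config.settings (run on a mongos)
    db.getSiblingDB("config").settings.update(
        { _id: "balancer" },
        { $set: { _secondaryThrottle: false } },
        { upsert: true }
    )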
