-
Type: Bug
-
Resolution: Done
-
Priority: Critical - P2
-
None
-
Affects Version/s: 2.8.0-rc4
-
Component/s: Replication
-
Minor Change
-
ALL
moveChunk when the collection is replicated is slow when comparing to 2.6.6, possible due to tail local.oplog.rs
Test
- sharded cluster with replication
- time moveChunk for a chunk with around 16913 doc
- _waitForDelete: true
moveChunk test results
it takes around 80-90 second to move one chunk for 2.8 rc4, and it takes around 4 second for 2.6.6. The results are similar for either mmapv1 or wired tiger
raw data (the timestamp is start and end migration ts)
rc4: mmapv1 { "_id" : 3 } -->> { "_id" : 16915 } on : rs1 Timestamp(14, 4) <-- chunk details Wed Jan 14 2015 00:27:16 GMT+0000 (UTC),Wed Jan 14 2015 00:28:40 GMT+0000 (UTC) <-- two timestamps, first is for start of moveChunk, the second is when the operation stop. Wed Jan 14 2015 00:28:40 GMT+0000 (UTC),Wed Jan 14 2015 00:30:09 GMT+0000 (UTC) 2.6.6 { "_id" : 1141 } -->> { "_id" : 18054 } on : rs1 Timestamp(8, 0) Wed Jan 14 2015 00:56:49 GMT+0000 (UTC),Wed Jan 14 2015 00:56:53 GMT+0000 (UTC) Wed Jan 14 2015 00:56:53 GMT+0000 (UTC),Wed Jan 14 2015 00:56:57 GMT+0000 (UTC) Wed Jan 14 2015 00:56:57 GMT+0000 (UTC),Wed Jan 14 2015 00:57:01 GMT+0000 (UTC)
without _waitForDelete, time is around half, still slow
# tested with rc5 Wed Jan 14 2015 15:13:12 GMT+0000 (UTC),Wed Jan 14 2015 15:13:56 GMT+0000 (UTC) Wed Jan 14 2015 15:13:56 GMT+0000 (UTC),Wed Jan 14 2015 15:14:40 GMT+0000 (UTC) Wed Jan 14 2015 15:14:40 GMT+0000 (UTC),Wed Jan 14 2015 15:15:24 GMT+0000 (UTC)
getMore for local.oplog.rs
when debug, found that RC4 have far more getMore commands than that of 2.6.6, after check log, found all of the getMore local.oplog.rs are nreturned:1, which it is not the case for 2.6.6
for example
# rc4 2015-01-14T14:17:12.529+0000 I QUERY [conn10] getmore local.oplog.rs cursorid:23091492740 ntoreturn:0 keyUpdates:0 nreturned:1 reslen:123 0ms # 2.6.6 2015-01-14T14:29:00.329+0000 [conn7] getmore local.oplog.rs cursorid:18596876821 ntoreturn:0 keyUpdates:0 numYields:1 locks(micros) r:89 nreturned:108 reslen:11144 0ms
count of getMore (with logLevel=1)
# rc4 -- total getMore [SHARDS-2 wt-move-chunk-1]$ cat mongod-wt.log | grep getmore | wc 16926 203112 2471218 # rc4 -- total getMore with nreturned != 1, and they are all nreturned:0 [SHARDS-2 wt-move-chunk-1]$ cat mongod-wt.log | grep getmore | grep -v "nreturned:1 " | wc 11 132 1628 [SHARDS-2 wt-move-chunk-1]$ cat mongod-wt.log | grep getmore | grep -v "nreturned:1 " 2015-01-14T14:15:45.668+0000 I QUERY [conn10] getmore local.oplog.rs cursorid:23091492740 ntoreturn:0 keyUpdates:0 nreturned:0 reslen:20 5011ms ... # 2.6.6 [SHARDS-2 wt-move-chunk-1]$ cat mongod-266.log | grep getmore | wc 179 2327 29928 [SHARDS-2 wt-move-chunk-1]$ cat mongod-266.log | grep getmore | grep -v "nreturned:1 " | wc 173 2249 28941
- related to
-
SERVER-14041 enhance secondaryThrottle parameter
- Closed