[SERVER-18200] Long running queries on secondary causes replication to fall behind Created: 24/Apr/15  Updated: 24/Apr/15  Resolved: 24/Apr/15

Status: Closed
Project: Core Server
Component/s: Concurrency, Replication
Affects Version/s: 3.0.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Michael Kania Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive large-test-collection.zip    
Issue Links:
Duplicate
duplicates SERVER-18190 Secondary reads may block replication Closed
Operating System: ALL
Steps To Reproduce:
  • Create replicaset with a 2.6.9 PRIMARY and a 3.0.2 SECONDARY
  • Import attached collection which contains ~500k documents

    mongoimport  -d test -c test_collection --jsonArray large-test-collection.json --host $PRIMARY --port $PRIMARY_PORT
    

  • In a mongo shell on the PRIMARY run a loop to insert a bunch of test data

    for (var i = 1; i <= 1000000; i++) {    db.testData.insert( { x : i } ) }
    

  • At the same time run a mongostat on the SECONDARY and you should see inserts flowing in at around 2k inserts/sec replicating from the PRIMARY

    insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn      set repl       time 
     *2142     *0     *0     *0       0     2|0       0   800m  4.03g   337m    112 oplog:0.0%          0       0|0     0|0   205b    10k     3 dynatest  SEC   23:11:24 
     *2154     *0     *0     *0       0     1|0       0   800m  4.03g   337m    115 oplog:0.0%          0       0|0     0|0    62b    10k     3 dynatest  SEC   23:11:25 
     *2218     *0     *0     *0       0     2|0       0   800m  4.03g   338m    116 oplog:0.0%          0       0|0     0|0   205b    10k     3 dynatest  SEC   23:11:26 
     *2166     *0     *0     *0       0     1|0       0   800m  4.03g   339m    116 oplog:0.0%          0       0|0     0|0    62b    10k     3 dynatest  SEC   23:11:27 
     *2217     *0     *0     *0       0     2|0       0   800m  4.03g   339m    117 oplog:0.0%          0       0|0     0|1   205b    10k     3 dynatest  SEC   23:11:28 
     *2172     *0     *0     *0       0     1|0       0   800m  4.03g   340m    113 oplog:0.0%          0       0|0     0|0    62b    10k     3 dynatest  SEC   23:11:29
    

  • Kick off a 4-6 second long running query on the SECONDARY

     db.test_collection.find({latitude:  { $gt: 0 }}).limit(200000).sort({_created_at: -1})
    

  • Notice that the mongostat shows replication has halted and the SECONDARY is no longer applying inserts from the PRIMARY until the long running query returns

    insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn      set repl       time 
     *2196     *0     *0     *0       0     1|0       0   800m  4.13g   644m    115 oplog:0.0%          0       0|0     0|0    62b    10k     3 dynatest  SEC   23:14:49 
     *2124     *0     *0     *0       0     2|0       0   800m  4.13g   645m    113 oplog:0.0%          0       0|0     0|0   205b    10k     3 dynatest  SEC   23:14:50 
     *2180     *0     *0     *0       0     1|0       0   800m  4.13g   645m    114 oplog:0.0%          0       0|0     0|0    62b    10k     3 dynatest  SEC   23:14:51 
     *1191      1     *0     *0       0     2|0       0   800m  4.13g   648m     63 oplog:0.0%          0       0|0     1|0   205b    10k     3 dynatest  SEC   23:14:52 
        *0     *0     *0     *0       0     1|0       0   800m  4.15g   669m      0 oplog:0.0%          0       0|0     1|0    62b    10k     3 dynatest  SEC   23:14:53 
        *0     *0     *0     *0       0     2|0       0   800m  4.17g   683m      0 oplog:0.0%          0       0|0     1|0   205b    10k     3 dynatest  SEC   23:14:54 
        *0     *0     *0     *0       0     1|0       0   800m  4.17g   683m      0 oplog:0.0%          0       0|0     1|0    62b    10k     3 dynatest  SEC   23:14:55 
        *0     *0     *0     *0       0     2|0       0   800m  4.17g   684m      0 oplog:0.0%          0       0|0     1|0   205b    10k     3 dynatest  SEC   23:14:56 
        *0     *0     *0     *0       0     1|0       0   800m  4.17g   703m      0 oplog:0.0%          0       0|0     1|0    62b    10k     3 dynatest  SEC   23:14:57 
    *10006     *0     *0     *0       0     3|0       0   800m  4.18g   717m    528 oplog:0.0%          0       0|0     1|0   413b     4m     3 dynatest  SEC   23:14:58 
     *5278     *0     *0     *0       0     1|0       0   800m  4.18g   692m    279 oplog:0.0%          0       0|0     0|0    62b    10k     3 dynatest  SEC   23:14:59
    

Now, you can perform the same setup with a 2.6.9 PRIMARY and 2.6.9 SECONDARY. When running the same long running query on the SECONDARY, mongostat shows inserts do slow down when the long running query is executed, but does not completely block all replication like it did running 3.0.2.

insert  query update delete getmore command flushes mapped  vsize    res faults  locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn      set repl       time 
 *1982     *0     *0     *0       0     2|0       0   800m  4.08g   223m      0  test:8.4%          0       0|0     0|0   205b     3k     3 dynatest  SEC   23:21:32 
 *1969      1     *0     *0       0     1|0       0   800m  4.08g   224m      0  test:6.5%          0       0|0     1|0    62b     3k     3 dynatest  SEC   23:21:33 
   *83     *0     *0     *0       0     2|0       0   800m  4.09g   244m      0  test:1.3%          0       0|0     1|0   205b     3k     3 dynatest  SEC   23:21:34 
  *640     *0     *0     *0       0     1|0       0   800m  4.11g   269m      0  test:1.6%          0       0|0     1|0    62b     3k     3 dynatest  SEC   23:21:35 
 *1525     *0     *0     *0       0     2|0       0   800m  4.11g   270m      0  test:3.7%          0       0|0     1|0   205b     3k     3 dynatest  SEC   23:21:36 
 *2237     *0     *0     *0       0     1|0       0   800m  4.12g   271m      0  test:5.2%          0       0|0     1|0    62b     3k     3 dynatest  SEC   23:21:37 
 *4406     *0     *0     *0       0     2|0       0   800m  4.12g   288m      0 test:10.3%          0       0|0     1|0   205b     3k     3 dynatest  SEC   23:21:38 
 *4010     *0     *0     *0       0     2|0       0   800m  4.12g   297m      0 test:14.1%          0       0|0     0|0   279b     4m     3 dynatest  SEC   23:21:39 
 *2181     *0     *0     *0       0     2|0       0   800m  4.12g   297m      0  test:8.7%          0       0|0     0|0   205b     3k     3 dynatest  SEC   23:21:40 
 *2146     *0     *0     *0       0     1|0       0   800m  4.12g   277m      0  test:7.9%          0       0|0     0|0    62b     3k     3 dynatest  SEC   23:21:41

Participants:

 Description   

After upgrading a secondary node from 2.6.9 to 3.0.2, I noticed that long running queries appear to block replication and causes replication lag to increase. Downgrading the secondary back to 2.6.9 and the same long running query does not block replication.



 Comments   
Comment by Daniel Pasette (Inactive) [ 24/Apr/15 ]

Thanks for the detailed report and repro case. I believe this is a duplicate of SERVER-18190, which was just discovered yesterday afternoon. Please follow up on SERVER-18190 for updates.

Comment by Michael Kania [ 24/Apr/15 ]

3.0.2 is yielding 4,837 times.

2015-04-23T23:20:04.902-0700 I QUERY    [conn1] query test.test_collection query: { query: { latitude: { $gt: 0.0 } }, orderby: { _created_at: -1.0 } } planSummary: COLLSCAN, COLLSCAN cursorid:48532351730 ntoreturn:200000 ntoskip:0 nscanned:0 nscannedObjects:583213 scanAndOrder:1 keyUpdates:0 writeConflicts:0 numYields:4837 nreturned:35849 reslen:4194353 locks:{ Global: { acquireCount: { r: 4838 } }, MMAPV1Journal: { acquireCount: { r: 4838 } }, Database: { acquireCount: { r: 4838 } }, Collection: { acquireCount: { R: 4838 } } } 4173ms

Whereas 2.6.9 is only yielding 54 times.

2015-04-23T23:21:38.945-0700 [conn4] query test.test_collection query: { query: { latitude: { $gt: 0.0 } }, orderby: { _created_at: -1.0 } } planSummary: COLLSCAN, COLLSCAN cursorid:65478027908 ntoreturn:200000 ntoskip:0 nscanned:583213 nscannedObjects:583213 keyUpdates:0 numYields:54 locks(micros) r:10540843 nreturned:35849 reslen:4194353 5802ms

Generated at Thu Feb 08 03:46:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.