[SERVER-21858] A high throughput update workload in a replica set can cause starvation of secondary reads Created: 11/Dec/15  Updated: 06/Dec/22  Resolved: 21/Dec/17

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 3.0.7, 3.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Hows Assignee: Backlog - Storage Execution Team
Resolution: Duplicate Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File 307.html     HTML File 320.html     PNG File lag.png     File repro.rb     HTML File ss-singlethread.html    
Issue Links:
Duplicate
duplicates SERVER-20328 Allow secondary reads while applying ... Closed
Related
is related to SERVER-18908 Secondaries unable to keep up with pr... Closed
is related to SERVER-18983 Process oplog inserts, and applying, ... Closed
is related to SERVER-31359 when large inserts into mongo, lots ... Closed
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Launch two mongod processes:

./mongod --replSet foo --storageEngine wiredTiger --dbpath /mnt/work/data/rs0 --oplogSize 1000 --wiredTigerCacheSizeGB 4 --logpath /mnt/work/data/rs0/mongod.log --fork
./mongod --replSet foo --storageEngine wiredTiger --dbpath /mnt/work/data/rs1 --oplogSize 1000 --wiredTigerCacheSizeGB 4 --port 27018 --logpath /mnt/work/data/rs1/mongod.log --fork

Set up the replica set:

rs.initiate()
rs.add("localhost:27018")

Run the following JavaScript against the primary to seed documents:

var INT_MAX = 10000000;
// Insert 10 million small documents in batches of 1000.
for (var x = 0; x < INT_MAX / 1000; x++) {
	var bulk = db.test.initializeUnorderedBulkOp();
	for (var i = 0; i < 1000; i++) {
		var num = (x * 1000) + i;
		bulk.insert({ _id: num, x: num, y: num });
	}
	bulk.execute();
}

Execute the attached repro.rb

Run the following JavaScript on the secondary to monitor read performance. Query times should initially be stable, then blow out to seconds and even minutes:

rs.slaveOk();
var INT_MAX = 10000000;
while (true) {
	var start = new Date();
	for (var i = 0; i < 10000; i++) {
		var docNum = Math.floor(Math.random() * INT_MAX);
		db.test.findOne({ _id: docNum });
	}
	print("Took: " + (new Date() - start) + "ms");
	sleep(1000);
}

Participants:

 Description   

Under an update-only workload on the primary it is possible to starve readers on a secondary and cause large replication delays.

Workload:
On Primary:

  1. 1000 small updates (integer sets, increments, and an unset)
  2. 10 massive (~500KB) updates (approximated in the shell sketch below)

On Secondary:

  1. 10000 findOne() calls
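
For reference, here is a minimal mongo-shell approximation of the primary-side workload (the attached repro.rb, driven by the Ruby driver, is the actual reproducer; the field names, padding size, and exact update mix below are assumptions based on the description above):

var INT_MAX = 10000000;
var bigString = new Array(500 * 1024).join("a");   // ~500KB payload (assumed shape)

// Runs until interrupted; mirrors "1000 small updates then 10 massive updates".
while (true) {
    // 1000 small updates: an integer set, an increment, and an unset
    for (var i = 0; i < 1000; i++) {
        var id = Math.floor(Math.random() * INT_MAX);
        db.test.update({ _id: id }, { $set: { x: i }, $inc: { y: 1 }, $unset: { z: "" } });
    }
    // 10 massive (~500KB) updates
    for (var j = 0; j < 10; j++) {
        var id2 = Math.floor(Math.random() * INT_MAX);
        db.test.update({ _id: id2 }, { $set: { payload: bigString } });
    }
}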


 Comments   
Comment by Asya Kamsky [ 21/Dec/17 ]

Duplicate of SERVER-20328

Comment by Martin Bligh [ 30/Dec/15 ]

The existing design only allows either reads or writes on the secondary at any one time. Under normal circumstances these are interleaved, alternating between applying oplog batches on the secondary and allowing reads.

The current plan to address this involves using a snapshot per application batch on the secondary: we can allow reads from that snapshot while the writes progress, without breaking the order in which data appears. However, it is not a small change to make.
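
As a purely illustrative sketch of that idea (this is not server code; the data structures below are made up), readers would be served from the snapshot published at the last batch boundary, so a half-applied batch is never visible and reads would not need to block while the next batch applies:

// Conceptual model only -- not MongoDB internals.
function copyOf(obj) {
    var out = {};
    for (var k in obj) { out[k] = obj[k]; }
    return out;
}

var live = { a: 1, b: 1 };        // view the apply thread mutates
var snapshot = copyOf(live);      // last consistent (batch-boundary) view

function applyBatch(ops) {
    // Today, readers are blocked for the duration of this function.
    ops.forEach(function (op) { live[op.k] = op.v; });
    // Under the proposed scheme, the readable view only advances here.
    snapshot = copyOf(live);
}

function readKey(k) {
    return snapshot[k];           // readers never observe a half-applied batch
}

applyBatch([{ k: "a", v: 2 }, { k: "b", v: 2 }]);
print(readKey("a"));              // 2 -- whole batches only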

Comment by Feng Qu [ 29/Dec/15 ]

What's the plan for fixing this issue? This caused the issue we are seeing in CS-23893.

Comment by Bruce Lucas (Inactive) [ 15/Dec/15 ]

zhifan, I meant the latter. It is expected that applying the oplog blocks reads, and this will especially be the case when the secondary is working hard to catch up to the primary and is therefore applying the oplog more or less continuously. The question is why the secondary struggles to keep up with the primary. There are known issues in this area, but the question is whether the particular issue hit in this repro, and in your use case, is one of those known issues.

Comment by a zhifan [ 15/Dec/15 ]

Bruce.Lucas, I didn't quite follow. Do you mean why applying the oplog blocks reads? Or do you mean why the write traffic on the primary is fine but the secondary can't catch up with the primary?

Comment by David Hows [ 15/Dec/15 ]

I have just run another test with nojournal. In both the 20-thread and 1-thread cases this made no difference to the issue as reported; everything still eventually begins to lag and the read throughput drops off.

Comment by Bruce Lucas (Inactive) [ 14/Dec/15 ]

david.hows, there was a bug (SERVER-21305) in the computation of timeAcquiringMicros that makes the actual number incorrect; it is fixed in 3.0.8. However, you can still use it to determine whether there is queuing on the global lock; you just won't be able to easily determine the length of the queue. SERVER-21859 means that you can't use the globalLock.currentQueue stat to determine that, because time spent queueing on every lock except the global lock shows up in globalLock.currentQueue.

zhifan, you are correct that a continuous replication workload is expected to starve secondary reads. The goal is to determine why the secondary sometimes struggles to keep up with the primary, as indicated by growing replication lag, even under conditions where there appears to be excess capacity for secondary reads; when that happens, secondary reads become starved.
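
For anyone reproducing this, a small shell loop along the following lines, run against the secondary, can be used to watch the global lock wait counters described above (field layout as reported by serverStatus in 3.0/3.2; the exact names may differ in other versions):

// Print the per-second delta of global lock waits and time spent acquiring,
// for intent-shared ("r") acquisitions, i.e. the reads being starved.
var prevWaits = 0, prevMicros = 0;
while (true) {
    var g = db.serverStatus().locks.Global;
    var waits = (g.acquireWaitCount && g.acquireWaitCount.r) || 0;
    var micros = (g.timeAcquiringMicros && g.timeAcquiringMicros.r) || 0;
    print("global lock r waits/s: " + (waits - prevWaits) +
          ", timeAcquiring ms/s: " + ((micros - prevMicros) / 1000).toFixed(1));
    prevWaits = waits;
    prevMicros = micros;
    sleep(1000);
}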

Comment by a zhifan [ 14/Dec/15 ]

Hi Hows:
Why don't we consider this expected behaviour? Whenever the secondary is applying an oplog batch it blocks all reads on the secondary. In your test case there are continuous updates on the primary, so there are probably continuous updates on the secondary as well, and reads are therefore very likely to be blocked (waiting for the global lock).

Comment by David Hows [ 14/Dec/15 ]

Just to untangle things and provide a checkpoint on what I've found:
There are currently two symptoms that I have been using to track the issue at hand. I will focus on the single-threaded case that I have been working with today, running MongoDB 3.2:

  1. When load is being applied to the primary, replication lag of 30+ seconds is seen.
  2. When load is being applied to the primary, secondary reads will sporadically take very long times to execute.
    Long operations on the secondary under 3.2:

    2015-12-14T17:11:49.567+1100 I COMMAND  [conn8] command test.test command: find { find: "test", filter: { _id: 5865940.0 }, limit: 1.0, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 idhack:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:143 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 814498 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 814ms
    2015-12-14T17:11:50.205+1100 I COMMAND  [conn8] command test.test command: find { find: "test", filter: { _id: 5348929.0 }, limit: 1.0, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 idhack:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:143 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 1003 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_command 569ms
    2015-12-14T17:11:51.086+1100 I COMMAND  [conn8] command test.test command: find { find: "test", filter: { _id: 9562842.0 }, limit: 1.0, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 idhack:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:143 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 1399 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_command 785ms
    2015-12-14T17:11:52.624+1100 I COMMAND  [conn8] command test.test command: find { find: "test", filter: { _id: 6098631.0 }, limit: 1.0, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 idhack:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:143 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 1530898 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 1531ms
    2015-12-14T17:11:53.366+1100 I COMMAND  [conn8] command test.test command: find { find: "test", filter: { _id: 8450642.0 }, limit: 1.0, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 idhack:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:143 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 741045 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 741ms
    2015-12-14T17:11:54.843+1100 I COMMAND  [conn8] command test.test command: find { find: "test", filter: { _id: 3164850.0 }, limit: 1.0, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 idhack:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:143 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 1472757 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 1472ms

The most interesting thing about these operations is that four of the six show almost all of the execution time being spent in timeAcquiringMicros (for example, the first entry spent 814,498 µs acquiring the global lock out of an 814 ms total). I'm uncertain whether the other two are due to another bug in how timeAcquiringMicros is calculated, whether those operations are legitimately taking hundreds of milliseconds for an _id lookup, or whether this is a variation on SERVER-21859 or something similar.

I also tested with the latest MongoDB master and develop branches. These showed some initial improvement, but over several runs there turned out to be little difference from 3.2. The only noticeable difference was that after a longer run time the repl lag seems to stabilise under the master+develop version.

Under the single-threaded version, the slowdown on the secondary is very strongly associated with checkpoints in WT (as per the attached ss-singlethread.html); this may also explain the lack of high timeAcquiringMicros on some of the above queries.

As it stands I'm not certain whether this is SERVER-18908/SERVER-18983, but I increasingly suspect bruce.lucas is correct, as the high timeAcquiringMicros is on the global lock and is consistent with the explanation above.
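
To eyeball the correlation with checkpoints from the shell, something like the following can run alongside the read loop (the WiredTiger statistic names are as reported by serverStatus in 3.2 and may vary between versions):

// Report whether a WiredTiger checkpoint is in progress, plus checkpoint timings.
while (true) {
    var txn = db.serverStatus().wiredTiger.transaction;
    print(new Date().toISOString() +
          "  checkpoint running: " + txn["transaction checkpoint currently running"] +
          "  checkpoints: " + txn["transaction checkpoints"] +
          "  most recent (ms): " + txn["transaction checkpoint most recent time (msecs)"]);
    sleep(1000);
}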

Comment by David Hows [ 14/Dec/15 ]

alexander.gorrod, I have amended my comment to make my thoughts clearer. I suspect it is less likely that SERVER-18983 is involved, as that ticket's current focus (as I read it) is on adding a way to apply the oplog on secondaries in parallel.

Other than the size of the batches I am working with, there should be no other parallelisation within the workload I tested.

Comment by Alexander Gorrod [ 14/Dec/15 ]

Thanks for digging deeper, david.hows. Does the above information make it more or less likely that this is the same problem being tracked by SERVER-18983?

Comment by David Hows [ 14/Dec/15 ]

I modified the script today to do some further digging. Even with only one thread inserting, I can see times where it takes over 60 seconds to finish the small burst of 10K reads (the baseline is between 1-2 seconds), and I can see repl lag spiking greatly during that time.

The system does seem able to catch up eventually, with at most somewhere near a minute's delay. But this load is one primary receiving a single thread's worth of update operations, and the secondary still has trouble keeping pace, which suggests this may not exclusively be an issue of parallel application on primaries versus serial application on secondaries.

Comment by Bruce Lucas (Inactive) [ 11/Dec/15 ]

  • Replication lag builds under the insert workload, as indicated by the repl buffer stats. (The 3.2 ftdc data will have a more direct indication of this.) This means replication is struggling, and that is probably what is starving the reads, because we "stop the world" using the global lock during replication.
  • This does not show up in the read queues because of SERVER-21859, but the global lock time-acquiring stat supports it, as you point out. The WT concurrentTransactions metric is also consistent with this: the reads are not queuing in WT.
  • The main operations done by replication are applying the ops and inserting them into the oplog. Time spent applying the ops is reported in "repl apply batches totalMillis", but this is not large, so the replication lag is most likely related to inserting ops into the oplog, and that is also consistent with the customer's observation.

If the above is correct, this is a known issue, and there are a couple of tickets tracking work in this area: SERVER-18908 and SERVER-18983. In this case, for example, you have 20 threads on the primary inserting into the oplog in parallel, whereas that work is serialized on the secondary, so that may account for the lag. In any case, stack traces would of course help diagnose this.
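
For reference, the counters referred to above can be pulled from serverStatus on the secondary with a sketch like the following (field presence varies slightly between versions):

// Snapshot the replication buffer, apply-batch, and WiredTiger ticket metrics.
var ss = db.serverStatus();
printjson({
    replBuffer: ss.metrics.repl.buffer,              // count / sizeBytes / maxSizeBytes
    applyBatches: ss.metrics.repl.apply.batches,     // num / totalMillis ("repl apply batches totalMillis")
    wtTickets: ss.wiredTiger.concurrentTransactions  // read/write tickets out vs. available
});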

Comment by David Hows [ 11/Dec/15 ]

The current working hypothesis is that some delay in the processing of replication operations is causing the slowdown.

The aim of the reproduction workload was to generate a large number of deletes at once when the oplog is truncated to make room for the insertion of a new very large document, while keeping the workload update-only.

Initially, I suspected this may have been related to tombstones within the oplog collection in WiredTiger, hence the mix of large and small updates in a replica set. However, as SERVER-21027 is fixed in 3.2.0, I am uncertain whether this is the cause.

Comment by David Hows [ 11/Dec/15 ]

Attaching the results of timeseries on the serverStatus captures on 2 repro runs.

The 3.0.7 run is far longer, as it was the initial reproduction and some changes were made to the reproducer to get things going. You can also see a point where I stopped the reproduction to confirm that execution eventually returned to normal (it did).

The 3.2.0 run is far more indicative, as it was just load and reproduce.

Probably the most telling symptom at this stage is the Global timeAcquiringMicros r in the 3.2.0 timeseries attachment; it begins spiking as soon as the ability to read falls behind.

Comment by David Hows [ 11/Dec/15 ]

The repro script requires the MongoDB ruby driver v2.2.0.
