[SERVER-21858] A high throughput update workload in a replica set can cause starvation of secondary reads Created: 11/Dec/15 Updated: 06/Dec/22 Resolved: 21/Dec/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, WiredTiger |
| Affects Version/s: | 3.0.7, 3.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Hows | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Duplicate | Votes: | 3 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Storage Execution |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | Launch the mongods.
Set up the replica set.
Run javascript against the primary to seed documents.
Execute the attached repro.rb.
Run javascript on the secondary to monitor performance. You should see an initially stable set of query times, which will then blow out to seconds and minutes. (A hedged sketch of the seed and monitoring scripts follows below.)
|
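The seed and monitoring scripts referenced above are attachments; the following is only a rough sketch of their shape, and the collection name (repro), document count, and sleep interval are illustrative assumptions rather than values from the attached scripts.

```javascript
// Hypothetical stand-in for the attached seed script: run against the PRIMARY.
// Inserts a set of small documents for the update workload to rewrite later.
for (var i = 0; i < 100000; i++) {
    db.repro.insert({ _id: i, payload: "x" });
}

// Hypothetical stand-in for the attached monitoring script: run against the SECONDARY.
// Prints per-query latency; under the reported problem these times blow out from a
// few milliseconds to seconds or minutes.
rs.slaveOk();                                       // permit reads on a secondary (3.x shell helper)
while (true) {
    var start = Date.now();
    db.repro.findOne({ _id: Math.floor(Math.random() * 100000) });
    print("findOne took " + (Date.now() - start) + " ms");
    sleep(100);                                     // shell sleep, in milliseconds
}
```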
| Participants: | |
| Description |
|
Under an update-only workload on the primary it is possible to starve out readers on a secondary and cause large replication delays. Workload (see the illustrative sketch after this description):
On Secondary:
|
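The update workload itself is driven by the attached repro.rb. Purely as an illustration of the shape of the load (collection name, payload sizes, and document count are assumptions, not values taken from the attachment), a mongo shell loop along the following lines produces a comparable update-only stream against the primary, alternating large and small documents as described in the comments below.

```javascript
// Hypothetical illustration of an update-only workload; NOT the attached repro.rb.
// Each iteration rewrites a random seeded document, alternating between a very
// large and a tiny payload so the primary emits a continuous stream of update
// oplog entries of widely varying size.
var bigPayload = new Array(1024 * 1024).join("x");   // ~1 MB string (assumed size)
var i = 0;
while (true) {
    var id = Math.floor(Math.random() * 100000);     // assumes 100,000 seeded documents
    var payload = (i++ % 2 === 0) ? bigPayload : "x";
    db.repro.update({ _id: id }, { $set: { payload: payload } });
}
```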
| Comments |
| Comment by Asya Kamsky [ 21/Dec/17 ] | ||||||
|
Duplicate of | ||||||
| Comment by Martin Bligh [ 30/Dec/15 ] | ||||||
|
The existing design only allows either reads or writes on the secondary at any one time. Under normal circumstances the two are interleaved: the secondary alternates between applying a batch of oplog entries and allowing reads. The current plan to address this involves using a snapshot per application batch on the secondary - we can allow reads from that snapshot whilst the writes progress, without breaking the order in which data becomes visible. However, it is not a small change to make. | ||||||
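A conceptual sketch of the interleaving described above, in illustrative shell-style JavaScript (this is not MongoDB source code, and every function name is hypothetical):

```javascript
// Illustrative only: readers are locked out for the duration of each oplog batch
// and allowed through between batches. If batches arrive back to back, reads on
// the secondary can queue for long stretches.
function secondaryApplyLoop(node) {
    while (true) {
        var batch = node.nextOplogBatch();           // hypothetical helper
        node.blockReaders();                         // queued reads must wait...
        batch.forEach(function (op) { node.applyOp(op); });
        node.unblockReaders();                       // ...until the whole batch has been applied
    }
}
```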
| Comment by Feng Qu [ 29/Dec/15 ] | ||||||
|
What's the plan for fixing this issue? This caused the issue we have in CS-23893. | ||||||
| Comment by Bruce Lucas (Inactive) [ 15/Dec/15 ] | ||||||
|
zhifan, I meant the latter. It is expected that applying the oplog blocks reads, and this will especially be the case when the secondary is working hard to catch up to the primary and is therefore applying the oplog more or less continuously. The question is why the secondary struggles to keep up with the primary. There are known issues in this area, but the question is whether the particular issue that seems to be hit in this repro, and in your use case, is one of the known issues. | ||||||
| Comment by a zhifan [ 15/Dec/15 ] | ||||||
|
Bruce.Lucas, I didn't quite understand you. Do you mean why applying the oplog blocks reads, or why the write traffic on the primary is fine but the secondary can't catch up with the primary? | ||||||
| Comment by David Hows [ 15/Dec/15 ] | ||||||
|
I have just run another test with nojournal. In both the 20-thread and 1-thread cases this made no difference to the issue as reported; everything still eventually begins to lag and the read throughput drops off. | ||||||
| Comment by Bruce Lucas (Inactive) [ 14/Dec/15 ] | ||||||
|
david.hows, there was a bug ( zhifan, you are correct that it is expected that a continuous replication workload can starve secondary reads. I think the goal is to determine why, under conditions where there appears to be excess capacity on the secondary for secondary reads, that is not the case: as it appears, the secondary sometimes struggles to keep up with the primary, as indicated by growing replication lag, and at those times secondary reads become starved. | ||||||
| Comment by a zhifan [ 14/Dec/15 ] | ||||||
|
Hi Hows: | ||||||
| Comment by David Hows [ 14/Dec/15 ] | ||||||
|
Just to untangle things and provide a checkpoint as to what I've found:
The most interesting thing about these operations is that 3/5 show almost all of their execution time spent in timeAcquiringMicros. I'm uncertain whether the other two are due to another bug in how timeAcquiringMicros is calculated, whether those operations are legitimately taking hundreds of milliseconds for an _id lookup, or whether this is a variation on I also tested with the latest MongoDB master and develop branches; this showed some initial improvements, but over several runs it turns out there was little difference from 3.2. The only noticeable difference was that after a longer run time the repl lag seems to stabilise under the master+develop version. Under the single-threaded version, the slowdown on the secondaries is very strongly associated with checkpoints in WT (as per the attached ss-singlethread.html); this may also explain the lack of high timeAcquiringMicros on some of the above queries. As it stands I'm not certain if this is | ||||||
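For reference, operations stuck waiting like this can be observed from the shell while a stall is in progress; the sketch below uses standard helpers, with the profiler threshold being an assumed value rather than one used in this investigation.

```javascript
// Run against the affected secondary while reads are stalled: list in-flight
// queries that are currently waiting to acquire a lock.
db.currentOp({ op: "query", waitingForLock: true });

// Alternatively, with the profiler enabled beforehand (the 100 ms threshold is
// an assumption), completed slow reads and their lock wait times can be pulled
// from system.profile after the fact.
db.setProfilingLevel(1, 100);
db.system.profile.find({ op: "query", millis: { $gt: 100 } }).sort({ ts: -1 }).limit(5);
```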
| Comment by David Hows [ 14/Dec/15 ] | ||||||
|
alexander.gorrod, I have amended my comment to make my thoughts clearer. I suspect it is less likely that Other than the size of the batch jobs I am working with, there should be no other parallelisation within the workload I tested. | ||||||
| Comment by Alexander Gorrod [ 14/Dec/15 ] | ||||||
|
Thanks for digging deeper, david.hows. Does the above information make it more or less likely that this is the same problem being tracked by | ||||||
| Comment by David Hows [ 14/Dec/15 ] | ||||||
|
Modified the script today to do some further digging. Even with only one thread inserting, I can see times where it takes over 60 seconds to finish the small burst of 10K reads (the baseline is between 1-2 seconds), and I can see repl lag spiking greatly during that time. The system does seem to catch up eventually, with at most somewhere near a minute's delay, but this load is a single primary receiving one thread's worth of update operations and the secondary still has trouble keeping pace, which means this may not exclusively be an issue of parallel application on primaries vs serial application on secondaries. | ||||||
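The replication lag mentioned here can be watched from a second shell while the run is in progress; a minimal sketch using a standard 3.x shell helper (the sampling interval is an assumption):

```javascript
// Run against any member: print how far each secondary is behind the primary,
// sampled every five seconds, to correlate lag spikes with the slow read bursts.
while (true) {
    rs.printSlaveReplicationInfo();
    sleep(5000);
}
```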
| Comment by Bruce Lucas (Inactive) [ 11/Dec/15 ] | ||||||
|
If the above is correct, this is a known issue, and there are a couple of tickets tracking work in this area - | ||||||
| Comment by David Hows [ 11/Dec/15 ] | ||||||
|
The current working hypothesis is that some delay in the processing of replication operations is causing the stalls. The aim of the reproduction workload was to generate a large number of deletes at once when truncating the oplog, so as to allow the insertion of a new very large document in an update-only workload. Initially, I suspected this may have been related to tombstones within the oplog collection in WiredTiger, hence the large/small update workload in a replica set. However, as | ||||||
| Comment by David Hows [ 11/Dec/15 ] | ||||||
|
Attaching the results of running timeseries over the serverStatus captures from 2 repro runs. The 3.0.7 run is far longer, as it was the initial reproduction and some changes were made to the reproducer to get things going. You can also see a point where I stopped the reproduction to confirm that execution eventually returned to normal (it did). The 3.2.0 run is far more indicative, as it was just load and reproduce. Probably the most telling symptom at this stage is the Global timeAcquiringMicros r in the 3.2.0 timeseries attachment; it begins spiking as soon as the ability to read falls behind. | ||||||
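The attachments referred to here are timeseries renderings of periodic serverStatus captures. For reference, the specific counter called out (Global timeAcquiringMicros r) can be sampled from the shell with a loop like the sketch below; the field path follows the serverStatus locks section reported by 3.0/3.2, and the one-second interval is an assumption.

```javascript
// Sample the cumulative time spent waiting to acquire the global lock in read
// mode (microseconds) and print the per-second delta.
function globalReadWaitMicros() {
    var ss = db.serverStatus();
    // Guard each level: the counters may be absent while they are still zero.
    return (((ss.locks || {}).Global || {}).timeAcquiringMicros || {}).r || 0;
}
var prev = globalReadWaitMicros();
while (true) {
    sleep(1000);
    var cur = globalReadWaitMicros();
    print(new Date() + "  +" + (cur - prev) + " us waiting on Global r");
    prev = cur;
}
```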
| Comment by David Hows [ 11/Dec/15 ] | ||||||
|
The repro script requires the MongoDB ruby driver v2.2.0. |