[SERVER-21107] Improve protocol version 1 replication throughput Created: 23/Oct/15 Updated: 10/Mar/16 Resolved: 19/Nov/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Critical - P2 |
| Reporter: | Martin Bligh | Assignee: | Matt Dannenberg |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Repl B (10/30/15), Repl C (11/20/15), Repl D (12/11/15) |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
|
On secondaries in PV1, replicated writes must wait for journaling before the oplog position is updated, for correctness. This change caused secondary performance and throughput to drop relative to the old PV0 replication process when PV1 became the default for replica sets, as shown below.
Version, Primary inserts/s, Secondary inserts/s, Ratio
|
| Comments |
| Comment by Scott Hernandez (Inactive) [ 19/Nov/15 ] | |
|
Work was done in the linked issues so closing this umbrella. | |
| Comment by Martin Bligh [ 26/Oct/15 ] | |
|
Not suggesting it as a fix, just isolating where the issue is | |
| Comment by Eric Milkie [ 26/Oct/15 ] | |
|
We're going to move the waiting to a different codepath to allow for better pipelining in | |
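As a rough illustration of why moving the journal wait off the apply path could help, here is a toy timing model in Python. It is not MongoDB code: the function names, the per-batch costs, and the assumption of one journal wait per batch are all hypothetical. In both variants the oplog position only advances after the journal wait, so the correctness requirement is kept; the difference is whether applying the next batch has to wait as well.

```python
def serial_time(batches, apply_ms, journal_ms):
    """Toy model: apply a batch, wait for the journal, advance the oplog
    position, and only then start applying the next batch."""
    return batches * (apply_ms + journal_ms)


def pipelined_time(batches, apply_ms, journal_ms):
    """Toy model: a separate stage waits on the journal for batch N while
    the applier is already working on batch N+1."""
    apply_done = 0    # time at which the applier finished the latest batch
    durable_done = 0  # time at which the latest batch was journaled
    for _ in range(batches):
        apply_done += apply_ms
        # The journal wait for a batch starts once the batch is applied and
        # the previous journal wait has finished, whichever is later.
        durable_done = max(durable_done, apply_done) + journal_ms
    return durable_done


if __name__ == "__main__":
    # Hypothetical costs in milliseconds, chosen only for illustration.
    print(serial_time(100, 2, 1))
    print(pipelined_time(100, 2, 1))
```

In this model the serial variant costs the sum of the two stages per batch, while the pipelined variant approaches the cost of the slower stage per batch. The real gain depends on how the server batches journal flushes, which this sketch does not attempt to model.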
| Comment by Eric Milkie [ 26/Oct/15 ] | |
|
While that action does make things faster, it's not something we can viably do. It would be like commenting out the part of the code that does the write. It could be way faster if we didn't have to write things... | |
| Comment by Martin Bligh [ 26/Oct/15 ] | |
|
Flipping this to false fixes the regression
| |
| Comment by Matt Dannenberg [ 26/Oct/15 ] | |
|
This indicates that there is a 35% regression on secondary perf in PV1 compared to PV0. That commit activated a slower path, but it did not create the slowness in the path. | |
| Comment by Martin Bligh [ 23/Oct/15 ] | |
|
Repro - see attached scripts:
| |
| Comment by Martin Bligh [ 23/Oct/15 ] | |
|
matt.dannenberg, scotthernandez, schwerin: looks like the main culprit is commit d789bca4c9fe76cd4d5375e66e281ed5a349e8fd (matt.dannenberg@10gen.com): 35% regression on secondary perf. With parallel oplog fixed it's actually 50%, I think. |