[SERVER-18908] Secondaries unable to keep up with primary under WiredTiger Created: 10/Jun/15  Updated: 31/Oct/16  Resolved: 28/Oct/16

Status: Closed
Project: Core Server
Component/s: Performance, WiredTiger
Affects Version/s: 3.0.3, 3.0.4
Fix Version/s: 3.4.0-rc2

Type: Bug Priority: Critical - P2
Reporter: Bruce Lucas (Inactive) Assignee: Mathias Stearn
Resolution: Done Votes: 11
Labels: WTcap
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File insert-3.0.4.patch     Text File insert-3.1.4.patch     HTML File lag-02.html     HTML File lag-03.html     HTML File lag-04.html     Text File lagC-3.0.4.patch     Text File lagC-3.1.4.patch     Text File lagD-3.0.4.patch     PNG File log.png     PNG File pri.png     PNG File search_near.png     PNG File sec.png    
Issue Links:
Depends
depends on SERVER-19122 Out-of-order _id index inserts on sec... Closed
depends on WT-1989 Improvements to log slot freeing to i... Closed
depends on SERVER-18983 Process oplog inserts, and applying, ... Closed
depends on SERVER-15344 Make replWriter thread pool size tunable Closed
depends on SERVER-18982 Apply replicated inserts as inserts Closed
Duplicate
is duplicated by SERVER-21444 WiredTiger replica set member is gett... Closed
Related
related to SERVER-19668 Waiting time for write concern increa... Closed
related to SERVER-21858 A high throughput update workload in ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Quint Iteration 7, QuInt A (10/12/15), QuInt C (11/23/15), Repl 2016-08-29, Repl 2016-09-19, Repl 2016-10-10, Repl 2016-10-31
Participants:

 Description   
  • hardware: 24 CPUs, 64 GB memory, SSD (all mongods and clients on same machine)
  • start a 3-member replica set with the following options:

    mongod --oplogSize 50 --storageEngine wiredTiger --nojournal --replSet ...
    

    Note: this also reproduces with the journal enabled; it was run without the journal to rule that out as a cause.

  • Simple small-document insert workload: 5-16 client threads (the exact number doesn't matter to the repro), each inserting small documents {_id:..., x:0} in batches of 10k (see the shell sketch below)
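
A minimal mongo shell sketch of one such client thread (the original driver isn't attached, so the collection name and the use of ObjectId _ids here are illustrative):

    // one of several concurrent shell clients; each loops forever,
    // accumulating small docs and inserting them in 10k batches
    var batch = [];
    while (true) {
        batch.push({_id: ObjectId(), x: 0});  // small document, as in the report
        if (batch.length === 10000) {
            db.c.insert(batch);               // bulk insert of one batch
            batch = [];
        }
    }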

Replication lag grows without bound, as the secondaries process ops at roughly 50-80% of the primary's rate.
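
For example, one simple way to watch the lag from the shell is the built-in helper, which reports each secondary's distance behind the primary:

    rs.printSlaveReplicationInfo()  // per-secondary lag behind the primary, in seconds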

Some stats of note:

(primary and secondary stats are in the attached screenshots pri.png and sec.png)
  • op rate on the secondary is roughly half that on the primary
  • the number of ops in flight (i.e. active queues) is much less even (burstier) on the secondary, although that isn't reflected in the reported op rates
  • the secondary is executing far more search_near calls: about one per document, versus what appears to be about one per 100 documents on the primary (see the serverStatus check below)
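
For example, the aggregated WiredTiger cursor statistics in serverStatus expose this counter; sampling it twice on each node and diffing gives a per-node search_near rate:

    // cumulative count of WT search_near calls on this node;
    // sample, wait, sample again, and diff to get a rate
    db.serverStatus().wiredTiger.cursor["cursor search near calls"]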

Will get stack traces.



 Comments   
Comment by Mathias Stearn [ 28/Oct/16 ]

Significant work on secondary performance was done under SERVER-24242. That concludes the work we intend to do for 3.4. If this continues to be an issue, please open a new ticket.

Comment by Jinwen Zou [ 29/Dec/15 ]

It would be much appreciated if you could share how 3.2.0 was improved in pipelining and parallelism (some implementation details). To be honest, these would make very good showcases or new features for marketing.

Besides, serious applications/shops always run 1-2 major releases behind the bleeding edge, so backporting these improvements to mongo 3.0.x is in high demand at many shops.

Comment by Martin Bligh [ 15/Dec/15 ]

The code has been changed significantly since that comment was written.
Though there is still other work we will look at, the implementation in 3.2.0 is much improved, with much more pipelining and parallelism.

Comment by Jinwen Zou [ 15/Dec/15 ]

(Quoting Bruce Lucas's 12/Jun analysis below:)

On the primary, each connection thread writes its own ops to the oplog, which proceed in parallel under WT with document-level concurrency. On the secondary, writing to the oplog is done sequentially by the sync thread, making it a serial bottleneck that doesn't exist on the primary. Under mmapv1 with collection-level locking, writing to the oplog is a serial bottleneck on both the primary and the secondary.

In the following 24 samples, runSyncThread spent 67% of its time in applyOps waiting for the n worker threads to finish applying the ops, and 29% of its time writing those ops to the oplog sequentially. This is about the right ratio to account for the performance difference we see between primary and secondary.

This is a very inefficient implementation of replication. How about splitting the replication work across multiple threads with minimal sequential operations? For example:
1. Make runSyncThread a dedicated thread (or pool of threads) that only takes care of oplog shipping and does not wait for the apply threads; all it has to do is copy the oplog from the source (compression could be added if needed to save bandwidth).
2. Apply threads: an oplog-apply main thread plus multiple apply worker threads. The main thread just scans through the oplog and distributes entries to concurrent apply workers, splitting the work by collection + _id (see the sketch below).
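
A sketch of the dispatch rule proposed in point 2, assuming a hypothetical pool of nWorkers apply threads (illustration only, not server code): hashing each entry's (namespace, _id) keeps all ops for a given document on one worker, preserving their order, while unrelated documents apply in parallel.

    // map an oplog entry to an apply-worker index by (namespace, _id)
    function workerFor(entry, nWorkers) {
        // for updates the _id lives in entry.o2; for inserts it is in entry.o
        var key = entry.ns + "|" + tojson((entry.o2 || entry.o)._id);
        var h = 0;
        for (var i = 0; i < key.length; i++) {
            h = (h * 31 + key.charCodeAt(i)) % 2147483647;  // simple string hash
        }
        return h % nWorkers;  // same doc -> same worker -> ordered apply
    }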

Comment by Githook User [ 14/Sep/15 ]

Author: Martin Bligh <mbligh@mongodb.com> (martinbligh)

Message: SERVER-18908: Write secondary oplog as a vector
Branch: master
https://github.com/mongodb/mongo/commit/21c6a574d2cd497089d41d1b418fdfdabf7fc953

Comment by Bruce Lucas (Inactive) [ 15/Jun/15 ]

I've opened SERVER-18982 to track processing inserts as inserts, and SERVER-18983 to track parallel processing of oplog inserts.

Comment by Bruce Lucas (Inactive) [ 15/Jun/15 ]

The number of replication worker threads on the secondary is hard-coded at 16. This can become a bottleneck that creates replication lag if the primary has more concurrency than that. For example, with 12 application threads, the following replication lags (in seconds) were measured at the end of a 110-second run for varying numbers of replication worker threads:

worker threads    repl lag (s)
8                 30
16                17
24                11

To address this issue we can implement SERVER-15344 to make the number of worker threads tunable.
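
For reference, SERVER-15344 exposes the pool size as a startup parameter named replWriterThreadCount; to reproduce the 24-thread row above, the invocation would look something like:

    mongod --setParameter replWriterThreadCount=24 --replSet ...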

Comment by Bruce Lucas (Inactive) [ 15/Jun/15 ]

Did some more careful measurements evaluating two changes:

  1. process collection inserts from the primary as inserts on the secondary, instead of as upserts (the current behavior)
  2. disable insertion into the oplog on the secondary, to approximate the speedup expected from doing oplog insertion in parallel on the worker threads instead of sequentially on the sync thread

In an insert workload taking about 90 seconds:

  • baseline, with neither patch, replication lag grew steadily to 36 seconds
  • with first patch processing collection inserts as inserts: replication lag grew steadily to 21 seconds
  • with second patch disabling insertion into oplog: replication lag grew steadily to 22 seconds
  • with both patches: replication lag hovered around 0-4 seconds and did not appear to grow.

So it appears that a combination of both patches, if fully implemented, would address this issue.

(Note that these results were obtained with 8 application threads. Since the number of replication worker threads is hard-coded at 16, in this situation the time spent inserting documents into the collection may actually be less on the secondary than on the primary. So although the two patches eliminated the growing replication lag in this case, the inserts running faster on the secondary than on the primary may be hiding other inefficiencies on the secondary.)

Comment by Bruce Lucas (Inactive) [ 12/Jun/15 ]

On the primary, each connection thread writes its own ops to the oplog, which proceed in parallel under WT with document-level concurrency. On the secondary, writing to the oplog is done sequentially by the sync thread, making it a serial bottleneck that doesn't exist on the primary. Under mmapv1 with collection-level locking, writing to the oplog is a serial bottleneck on both the primary and the secondary.

In the following 24 samples, runSyncThread spent 67% of its time in applyOps waiting for the n worker threads to finish applying the ops, and 29% of its time writing those ops to the oplog sequentially. This is about the right ratio to account for the performance difference we see between primary and secondary.

Comment by Eric Milkie [ 10/Jun/15 ]

One idea we had is that instead of doing upserts for inserts, we could simply do real inserts and ignore duplicate key errors. The idempotency logic should still work.
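
A shell-level sketch of that idea (the real change would live in the server's apply path; this just illustrates the error-handling semantics):

    // replaying an insert as a plain insert: a duplicate _id means the op
    // was already applied, so error code 11000 (DuplicateKey) is ignored
    var res = db.c.insert({_id: 1, x: 0});
    if (res.hasWriteError() && res.getWriteError().code !== 11000) {
        throw new Error(tojson(res.getWriteError()));  // any other error is real
    }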

Comment by Bruce Lucas (Inactive) [ 10/Jun/15 ]

Adapted the test to mmapv1: 8 threads inserting into 8 collections.
Rough numbers on this test from eyeballing mongostat:

       pri    sec
wt     150k/s 100k/s
mmapv1  60k/s  60k/s

The mmapv1 secondary did appear slightly slower, 5% at most from eyeballing mongostat, and replication lag did grow, albeit much more slowly than with WT.

Comment by Bruce Lucas (Inactive) [ 10/Jun/15 ]

The search_near calls appear to be because on the secondary we do inserts as upserts to make oplog replay idempotent, and that requires a lookup of the _id. If this is the cause (TBD whether it is), it could be related to SERVER-18823, where we also see __wt_btcur_search_near playing a role.
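
In shell terms, the two replay strategies look like this (illustrative; the server applies ops internally, not through these helpers):

    // upsert replay: must first search the _id index for an existing doc
    // (the __wt_btcur_search_near calls seen above), then insert or replace
    db.c.update({_id: 1}, {_id: 1, x: 0}, {upsert: true});

    // plain-insert replay: goes straight to a btree insert, detecting a
    // duplicate as a side effect rather than via an up-front lookup
    db.c.insert({_id: 1, x: 0});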

Comment by Eric Milkie [ 10/Jun/15 ]

What does the lag look like for the same parameters except using mmapv1?
I wonder if this is simply a replication batching algorithm problem.
