[SERVER-18983] Process oplog inserts, and applying, on the secondary in parallel Created: 15/Jun/15  Updated: 02/Oct/17  Resolved: 02/Oct/17

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Backlog - Tuning Team
Resolution: Done Votes: 4
Labels: mms-s
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-20655 During recovery replicas should trunc... Closed
depends on WT-1979 Lower performance of out-of-order ins... Closed
depends on SERVER-20326 Record "apply" batch boundaries durin... Closed
is depended on by SERVER-18908 Secondaries unable to keep up with pr... Closed
Related
related to SERVER-21858 A high throughput update workload in ... Closed
Backwards Compatibility: Fully Compatible
Sprint: QuInt A (10/12/15)
Participants:
Linked BF Score: 0

 Description   

Do not wait for the oplog entries to be applied before recording them in the oplog. Both operations can now be done concurrently instead of serially, since we record the boundaries of each batch and recover correctly from failures by removing the oplog entries recorded for the incomplete batch.
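For illustration only, here is a minimal sketch of that concurrency using hypothetical names (OplogEntry, writeToOplog, applyOps); this is the shape of the change, not the mongod implementation:

    #include <future>
    #include <string>
    #include <vector>

    struct OplogEntry {
        unsigned long long ts;  // timestamp that determines oplog order
        std::string op;         // serialized operation
    };

    // Hypothetical stand-ins for the real oplog write and apply paths.
    void writeToOplog(const std::vector<OplogEntry>& batch) { /* insert entries into the local oplog */ }
    void applyOps(const std::vector<OplogEntry>& batch)     { /* apply entries to the data collections */ }

    void applyBatch(const std::vector<OplogEntry>& batch) {
        // Batch boundaries are assumed to have been recorded already (SERVER-20326),
        // so a crash mid-batch can be recovered by truncating back to the batch start.
        auto oplogWrite = std::async(std::launch::async, writeToOplog, std::cref(batch));
        applyOps(batch);      // apply concurrently with the oplog write instead of after it
        oplogWrite.wait();    // both must complete before the batch is considered done
    }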

old description
Inserts into the oplog on the primary are done in parallel by each connection thread, whereas they are done on the secondary serially by the sync thread. This means that the oplog inserts are considerably slower on the secondary, which can create replication lag. See this comment for more information.



 Comments   
Comment by Mathias Stearn [ 02/Oct/17 ]

This work was done a while ago under SERVER-24242 and SERVER-7200.

Comment by Githook User [ 12/Oct/15 ]

Author: Scott Hernandez <scotthernandez@gmail.com> (username: scotthernandez)

Message: SERVER-18983: Apply oplog and record in oplog concurrently
Branch: master
https://github.com/mongodb/mongo/commit/cc1f48bce42728f3af21e8c6d3a9766f3675ac8a

Comment by Githook User [ 02/Oct/15 ]

Author: Scott Hernandez <scotthernandez@gmail.com> (username: scotthernandez)

Message: Revert "SERVER-18983: Apply oplog and record in oplog concurrently"

This reverts commit 3937e8a5a855aebc4c8e16206fd69c863f567e15.
Branch: master
https://github.com/mongodb/mongo/commit/f25e8acf1a160bbfa39035888bb026049b10ae22

Comment by Githook User [ 01/Oct/15 ]

Author: Scott Hernandez <scotthernandez@gmail.com> (username: scotthernandez)

Message: SERVER-18983: just check for timing field, not validation of value
Branch: master
https://github.com/mongodb/mongo/commit/ca4481c3269768e196ad8d7594c0d84dfe4f4593

Comment by Martin Bligh [ 01/Oct/15 ]

Re-opening this ticket and assigning it back to me, because we still have to investigate Bruce's original point of doing this in parallel.

Most of the work Scott has been doing is a prerequisite for this, as it frees us to order the oplog writes and the collection writes much less strictly.

Comment by Githook User [ 01/Oct/15 ]

Author: Scott Hernandez <scotthernandez@gmail.com> (username: scotthernandez)

Message: SERVER-18983: Apply oplog and record in oplog concurrently
Branch: master
https://github.com/mongodb/mongo/commit/3937e8a5a855aebc4c8e16206fd69c863f567e15

Comment by Githook User [ 01/Oct/15 ]

Author: Scott Hernandez <scotthernandez@gmail.com> (username: scotthernandez)

Message: SERVER-18983: enforce batch write durability in setMinvalid function
Branch: master
https://github.com/mongodb/mongo/commit/c4c3722e288bd13f40a5404cc20d44d077d469ca

Comment by Scott Hernandez (Inactive) [ 09/Sep/15 ]

We are moving forward with this work to improve performance during replication.

Comment by Scott Hernandez (Inactive) [ 03/Aug/15 ]

This work is on hold while other work is being done on inserting an array (vector) of documents. The multi-insert, at the integration+storage layer, is showing great performance improvements during testing and would remove the need for the work defined in this issue.

Comment by Scott Hernandez (Inactive) [ 29/Jun/15 ]

bruce.lucas@10gen.com, I think we are fine as long as they commit in a group. The issue will be if they are committed out of order such that some are missing from the oplog, and there are entries with larger ts values than the missing entries, since that will be used as the high water mark to start applying on recovery.

Comment by Bruce Lucas (Inactive) [ 29/Jun/15 ]

Thanks Scott. So it seems that within the current design parallelizing oplog inserts would need to be a separate step as it is now, with a separate set of worker threads. TBD how that would perform - my gut feeling is it would be ok, will give it a try. Any functional issues with that approach?

Comment by Scott Hernandez (Inactive) [ 29/Jun/15 ]

Bruce, we currently don't insert the oplog entries until after all database operations have completed successfully, since we use those oplog entries as markers for where to apply from on failure/recovery if we encounter an error while applying a batch. Doing them in the worker threads would, I believe, break recovery from a failed batch apply, which would result in missing/skipped oplog entries. We would need a new "batchStartOptime" to match the "minValid" (really "batchEndOptime") to ensure correct recovery. This work is a bit more involved, but it follows the future designs needed for writing to the oplog before/during application of those entries (SERVER-7200), so it will not be wasted; it is not minor, though, nor without (high) risk.
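A conceptual sketch of the marker scheme described above, with hypothetical types and names (the real setMinValid lives in the replication subsystem and persists to a collection); the point is only that both ends of the batch are made durable before application begins:

    #include <cstdint>

    struct OpTime { uint64_t ts = 0; };  // simplified optime, timestamp only

    struct BatchMarkers {
        OpTime batchStart;  // "batchStartOptime": first entry of the batch being applied
        OpTime minValid;    // "batchEndOptime": data is consistent only once applied through here
    };

    // Persist the markers durably (journaled in the real system) before any batch work.
    void setMinValid(const BatchMarkers& m) { /* durable write of m */ }

    // On restart, if the last applied optime is behind minValid, the previous batch
    // did not finish: re-apply (or truncate and re-fetch) from batchStart onward
    // rather than trusting whatever partial state the crash left behind.
    bool needsBatchRecovery(const BatchMarkers& m, OpTime lastApplied) {
        return lastApplied.ts < m.minValid.ts;
    }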

Comment by Bruce Lucas (Inactive) [ 24/Jun/15 ]

Created WT-1979 to investigate whether the issue described in the preceding comment can be addressed in WT.

Comment by Bruce Lucas (Inactive) [ 24/Jun/15 ]

Identical oplog order on the primary and secondary is maintained by computing an identical RecordId from the oplog entry timestamp, rather than using the normal monotonic RecordId. On the primary these RecordIds are inserted in nearly monotonically increasing order, a case that WT optimizes for. However, if oplog entry insertion is parallelized on the secondary, with each worker thread independently inserting entries whose RecordId is dictated by the oplog entry, the insertions will generally not be in RecordId order. That is a less optimal path in WT, and it prevents parallelized oplog inserts on the secondary from achieving performance parity with the parallel oplog inserts on the primary.
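As a rough sketch of the ordering problem (the exact bit layout below is an assumption for illustration, not a quote of the mongod code): the oplog RecordId is derived from the entry's timestamp, so a single writer naturally produces ascending keys, while independent worker threads produce interleaved keys.

    #include <cstdint>

    struct Timestamp {
        uint32_t secs;  // wall-clock seconds
        uint32_t inc;   // ordinal within that second
    };

    // Assumed packing: seconds in the high 32 bits, increment in the low 32 bits.
    // Entries with increasing timestamps get increasing RecordIds, so the primary's
    // appends arrive nearly in key order; a parallelized secondary interleaves keys
    // and hits the slower out-of-order insert path in WiredTiger (see WT-1979).
    uint64_t recordIdFromTimestamp(Timestamp ts) {
        return (static_cast<uint64_t>(ts.secs) << 32) | ts.inc;
    }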

To measure the impact of this, oplog entry insertion on the secondary was parallelized, and insertion using the computed RecordId was compared with insertion using the standard monotonic RecordId. (Note that using the standard monotonic RecordId on the secondary creates an invalid oplog because it is no longer in timestamp order, so this change is for performance evaluation only.) The secondary's processing rate relative to the primary was measured by comparing the number of entries inserted into the oplog on the secondary with the number on the primary at a particular point in time during the run:

                                               secondary/primary   relative performance
sequential insert, normal computed RecordId    17123525/20187676 = 0.848
parallel insertion, normal computed RecordId   17238547/20046073 = 0.860
parallel insertion, monotonic RecordId         20056367/20060385 = 0.999

  • parallelizing oplog insertion on the secondary by itself does little, if anything, to achieve parity with the primary
  • however, eliminating the out-of-order inserts into the oplog on the secondary by using a monotonic RecordId, combined with parallelizing the oplog inserts, achieves parity between the primary and secondary

It's unclear whether this issue can be addressed in the mongod layer, or whether an improvement in WT for the out-of-order insertion case is needed.

Comment by Andy Schwerin [ 16/Jun/15 ]

On WiredTiger, the current implementation of the oplog is a btree, so doing the inserts out of order is OK so long as we're also prohibiting oplog reads. I'm pretty sure that we do prohibit oplog reads during the oplog application process, so it should be easy to try moving the oplog writes into the threads that do the per-document update work. Should be easy to test, anyhow.

Comment by Eric Milkie [ 16/Jun/15 ]

We're going to experiment to see if this is an actual solution or if this is a problem that can be solved in a different way.

Comment by Eric Milkie [ 16/Jun/15 ]

The secondary's oplog needs to match the order of the primary's oplog. We will need help from the storage system in order to parallelize this.
One idea specifically for WiredTiger would be the ability to dictate the RecordId index order.
Also, using the bulk loader feature to insert each sorted batch of oplog entries may help.
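
A hedged sketch of the "sorted batch" idea (names hypothetical; insertOplogEntry stands in for the storage-layer insert): sorting each batch by timestamp before inserting restores the nearly monotonic key order that is the fast path in WiredTiger's btree.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct OplogEntry { uint64_t ts; /* ... payload ... */ };

    void insertOplogEntry(const OplogEntry&) { /* storage-layer insert keyed by ts */ }

    void insertSortedBatch(std::vector<OplogEntry> batch) {
        // Sort by timestamp so the derived RecordIds are inserted in ascending order,
        // mimicking the primary's append pattern even when the batch was assembled
        // by multiple threads.
        std::sort(batch.begin(), batch.end(),
                  [](const OplogEntry& a, const OplogEntry& b) { return a.ts < b.ts; });
        for (const auto& entry : batch) {
            insertOplogEntry(entry);
        }
    }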
