[SERVER-18983] Process oplog inserts, and applying, on the secondary in parallel Created: 15/Jun/15 Updated: 02/Oct/17 Resolved: 02/Oct/17
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Backlog - Tuning Team |
| Resolution: | Done | Votes: | 4 |
| Labels: | mms-s |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | QuInt A (10/12/15) |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
|
Do not wait for the oplog entries to be applied before recording them in the oplog. Both of these operations can now be done concurrently instead of serially, because we record the boundaries of the batch and recover correctly by removing the recorded oplog entries on failure.
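A minimal sketch of that arrangement, assuming hypothetical helpers (recordBatchBoundaries, writeBatchToOplog, and applyBatchToCollections are illustrative names, not the server's actual functions):

```cpp
#include <functional>
#include <string>
#include <thread>
#include <vector>

struct OplogEntry { std::string raw; };

// Hypothetical helpers; the real server routes these through the storage
// engine and the oplog applier thread pools.
void recordBatchBoundaries(const std::vector<OplogEntry>&) {}  // durable begin/end markers
void writeBatchToOplog(const std::vector<OplogEntry>&) {}
void applyBatchToCollections(const std::vector<OplogEntry>&) {}

// Instead of writing the whole batch to the oplog and only then applying it,
// record the batch boundaries first, then run both steps concurrently. On a
// failure, recovery uses the recorded boundaries to remove the partially
// written oplog entries and retries from the batch start.
void processBatch(const std::vector<OplogEntry>& batch) {
    recordBatchBoundaries(batch);
    std::thread oplogWriter(writeBatchToOplog, std::cref(batch));
    std::thread applier(applyBatchToCollections, std::cref(batch));
    oplogWriter.join();
    applier.join();
}
```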
| Comments |
| Comment by Mathias Stearn [ 02/Oct/17 ] | |||||
|
This work was done a while ago under | |||||
| Comment by Githook User [ 12/Oct/15 ] | |||||
|
Author: Scott Hernandez &lt;scotthernandez@gmail.com&gt; (username: scotthernandez) Message:
| Comment by Githook User [ 02/Oct/15 ] | |||||
|
Author: Scott Hernandez &lt;scotthernandez@gmail.com&gt; (username: scotthernandez) Message: Revert " This reverts commit 3937e8a5a855aebc4c8e16206fd69c863f567e15.
| Comment by Githook User [ 01/Oct/15 ] | |||||
|
Author: Scott Hernandez &lt;scotthernandez@gmail.com&gt; (username: scotthernandez) Message:
| Comment by Martin Bligh [ 01/Oct/15 ] | |||||
|
Re-opening this ticket and assigning it back to me, because we still have to investigate Bruce's original point of doing this in parallel. Most of the work Scott has been doing is a prerequisite for this, as it frees us to order the oplog and collection writes much less strictly.
| Comment by Githook User [ 01/Oct/15 ] | |||||
|
Author: Scott Hernandez &lt;scotthernandez@gmail.com&gt; (username: scotthernandez) Message:
| Comment by Githook User [ 01/Oct/15 ] | |||||
|
Author: Scott Hernandez &lt;scotthernandez@gmail.com&gt; (username: scotthernandez) Message:
| Comment by Scott Hernandez (Inactive) [ 09/Sep/15 ] | |||||
|
We are moving forward with this work to improve performance during replication. | |||||
| Comment by Scott Hernandez (Inactive) [ 03/Aug/15 ] | |||||
|
This work is on hold while other work is being done on inserting an array (vector) of documents. The multi-insert, at the integration+storage layer, is showing great performance improvements during testing and would remove the need for the work defined in this issue. | |||||
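For context, a hedged sketch of what such a vectored insert at the integration/storage boundary could look like; the class and method names here are illustrative, not the actual server interface:

```cpp
#include <string>
#include <vector>

// Illustrative types only; not MongoDB's real classes.
struct Document { std::string bson; };

class RecordStore {
public:
    // One storage-engine unit of work covers the whole vector, so per-insert
    // fixed costs (locking, journaling, commit) are paid once per batch
    // rather than once per document.
    void insertDocuments(const std::vector<Document>& docs) {
        beginUnitOfWork();
        for (const auto& d : docs) {
            insertOne(d);
        }
        commitUnitOfWork();
    }

private:
    void beginUnitOfWork() {}
    void insertOne(const Document&) {}
    void commitUnitOfWork() {}
};
```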
| Comment by Scott Hernandez (Inactive) [ 29/Jun/15 ] | |||||
|
bruce.lucas@10gen.com, I think we are fine as long as they commit as a group. The issue arises if they are committed out of order such that some are missing from the oplog while entries with larger ts values are present, since the largest ts is used as the high-water mark from which to start applying on recovery.
| Comment by Bruce Lucas (Inactive) [ 29/Jun/15 ] | |||||
|
Thanks Scott. So it seems that within the current design parallelizing oplog inserts would need to be a separate step as it is now, with a separate set of worker threads. TBD how that would perform - my gut feeling is it would be ok, will give it a try. Any functional issues with that approach? | |||||
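A rough sketch of that approach, assuming the fetched batch is simply range-partitioned across a fixed pool of writer threads (all names hypothetical):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

struct OplogEntry { std::string raw; };

// Hypothetical per-entry insert; in the server this would go through the
// storage engine using the entry's precomputed RecordId.
void insertIntoOplog(const OplogEntry&) {}

// Split the batch into contiguous chunks and insert each chunk on its own
// worker thread, as a separate step alongside (or before) apply.
void parallelOplogInsert(const std::vector<OplogEntry>& batch, std::size_t nThreads) {
    if (batch.empty() || nThreads == 0) return;
    std::vector<std::thread> workers;
    const std::size_t chunk = (batch.size() + nThreads - 1) / nThreads;
    for (std::size_t t = 0; t < nThreads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(batch.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&batch, begin, end] {
            for (std::size_t i = begin; i < end; ++i) insertIntoOplog(batch[i]);
        });
    }
    for (auto& w : workers) w.join();
}
```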
| Comment by Scott Hernandez (Inactive) [ 29/Jun/15 ] | |||||
|
Bruce, we currently don't insert the oplog entries until after all database operations have completed successfully, since we use those oplog entries as markers for where to apply from on failure/recovery if we encounter an error while applying a batch. Doing the inserts in the worker threads would, I believe, break recovery after a failed batch apply, which would result in missing/skipped oplog entries. We would need a new "batchStartOptime" to match the "minValid" (really "batchEndOptime") to ensure correct recovery. This work will be a bit more involved but follows the future designs needed for writing to the oplog before/during application (
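A minimal sketch of the recovery rule described above, with illustrative names: batchStart plays the role of the proposed "batchStartOptime", and minValid the existing batch-end marker; the layout is not the server's actual schema.

```cpp
#include <cstdint>
#include <optional>

// Illustrative optime: ordered by (seconds, increment).
struct OpTime {
    uint32_t secs = 0;
    uint32_t inc = 0;
};
inline bool operator<(const OpTime& a, const OpTime& b) {
    return a.secs != b.secs ? a.secs < b.secs : a.inc < b.inc;
}

// Durable markers written around each batch (illustrative).
struct RecoveryMarkers {
    OpTime minValid;                   // batch end: consistent once applied through here
    std::optional<OpTime> batchStart;  // batch begin: set while a batch is in flight
};

// Decide where to resume applying after a restart. Without batchStart, a
// crash between out-of-order oplog writes could leave gaps below the highest
// recorded timestamp, and resuming from that high-water mark would silently
// skip the missing entries.
OpTime resumePoint(const RecoveryMarkers& m, const OpTime& lastApplied) {
    if (m.batchStart && lastApplied < m.minValid) {
        return *m.batchStart;  // re-fetch and re-apply the interrupted batch
    }
    return lastApplied;        // clean state: continue from where we left off
}
```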
| Comment by Bruce Lucas (Inactive) [ 24/Jun/15 ] | |||||
|
Create | |||||
| Comment by Bruce Lucas (Inactive) [ 24/Jun/15 ] | |||||
|
Identical oplog order on the primary and secondary is maintained by computing an identical RecordId from the oplog entry timestamp, rather than using the normal monotonic RecordId. On the primary these RecordIds are inserted in nearly monotonically increasing order, a case that WT optimizes for. However, if oplog entry insertion is parallelized on the secondary and each worker thread independently inserts entries with a RecordId dictated by the oplog entry, the insertions will in general not be in RecordId order, which is a less optimal path in WT; this prevents parallel oplog inserts on the secondary from achieving performance parity with the parallel oplog inserts on the primary.

To measure the impact, oplog entry insertion on the secondary was parallelized, and insertion using the computed RecordId was compared with insertion using the standard monotonic RecordId. (Note that using the standard monotonic RecordId on the secondary creates an invalid oplog because it is no longer in timestamp order, so this change is for performance evaluation only.) The secondary processing rate relative to the primary was measured by comparing the number of entries inserted into the oplog on the secondary with the number on the primary at a particular point in time during the run:
It's unclear whether this issue can be addressed in the mongod layer, or whether an improvement in WT for the out-of-order insertion case is needed. | |||||
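To make the RecordId point concrete, here is a hedged sketch of deriving a replica-wide stable key from the oplog entry's timestamp rather than from a per-node monotonic counter; the types and bit packing are illustrative, not the server's actual code.

```cpp
#include <cstdint>

// Illustrative timestamp: seconds since epoch plus an ordinal within that second.
struct Timestamp {
    uint32_t secs = 0;
    uint32_t inc = 0;
};

using RecordId = uint64_t;

// Packing (secs, inc) into one 64-bit key preserves timestamp order, so the
// primary and every secondary compute the same RecordId for the same entry.
// On the primary these keys arrive nearly in increasing order (WiredTiger's
// fast path); parallel secondary writers present them out of order, which is
// the slower path measured above.
RecordId recordIdForOplogEntry(const Timestamp& ts) {
    return (static_cast<RecordId>(ts.secs) << 32) | ts.inc;
}
```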
| Comment by Andy Schwerin [ 16/Jun/15 ] | |||||
|
On WiredTiger, the current implementation of the oplog is a btree, so doing the inserts out of order is OK as long as we also prohibit oplog reads. I'm pretty sure that we do prohibit oplog reads during the oplog application process, so it should be easy to try moving the oplog writes into the threads that do the per-document update work. Should be easy to test, anyhow.
| Comment by Eric Milkie [ 16/Jun/15 ] | |||||
|
We're going to experiment to see if this is an actual solution, or if this is a problem that can be solved in a different way.
| Comment by Eric Milkie [ 16/Jun/15 ] | |||||
|
The secondary's oplog needs to match the order of the primary's oplog. We will need help from the storage system in order to parallelize this. |