[SERVER-37418] Background index builds should batch collection scan reads and inserts into the index

| Created: | 01/Oct/18 | Updated: | 27/Oct/23 | Resolved: | 17/Jan/19 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Louis Williams | Assignee: | Louis Williams |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | nyc |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Sprint: | Storage NYC 2019-01-14, Storage NYC 2019-01-28 |
| Participants: | |
| Story Points: | 0 |
| Description |
For every document in a collection, background index builds retrieve the document, insert it into the index, and then call saveState() (which resets the read cursor) and restoreState() (which repositions it). It would be more efficient to batch the collection reads and the index inserts so that the read cursor is reset less often. If we want to take advantage of read_once cursors, batching would also avoid reading the same page into cache repeatedly when a page holds multiple documents.
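For illustration, a rough C++ sketch of the batched loop; the types and names here (Cursor, Record, insertIntoIndex, kBatchSize) are stand-ins for this ticket, not the server's actual API:

```cpp
#include <cstddef>
#include <vector>

// Stand-in types for illustration; the real server uses PlanExecutor,
// record cursors, index access methods, etc.
struct Record {};
struct Cursor {
    bool hasNext();
    Record next();
    void save();     // resets the read cursor
    void restore();  // repositions it
};
void insertIntoIndex(const Record& record);

void batchedScanAndInsert(Cursor& cursor) {
    constexpr std::size_t kBatchSize = 1000;  // batch size tried in the comments below

    std::vector<Record> batch;
    batch.reserve(kBatchSize);
    while (cursor.hasNext()) {
        // Read a batch of documents from the collection scan.
        batch.clear();
        while (batch.size() < kBatchSize && cursor.hasNext()) {
            batch.push_back(cursor.next());
        }
        // Insert the whole batch into the index, then save/restore the
        // cursor once per batch instead of once per document.
        for (const Record& record : batch) {
            insertIntoIndex(record);
        }
        cursor.save();
        cursor.restore();
    }
}
```

Saving and restoring once per batch is what amortizes the cursor reset; the batch size itself is a tuning knob.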
| Comments |
| Comment by Louis Williams [ 17/Jan/19 ] |
This is no longer necessary, as the hybrid index build work supersedes it.
| Comment by Louis Williams [ 11/Jan/19 ] |
Will be implemented as part of the hybrid index builds work.
| Comment by Eric Milkie [ 31/Oct/18 ] |
Since this project is going to stop doing index table writes during the collection scan phase (instead, the data will be written to the external sorter), I don't think there is that much work to be done here. We can do the batching without worrying about handling write conflict exceptions any differently than we already do today. However, we cannot do this work until later in the project.
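For context, a hedged sketch of the scan phase under that design; all names here (sorter, generateIndexKeys, indexTable) are stand-ins rather than the server's actual interfaces:

```cpp
// Stand-in sketch of the hybrid scan phase: index keys generated during the
// collection scan go to an external sorter, not the index table, so there
// are no index writes (and no index write conflicts) to handle mid-scan.
while (cursor.hasNext()) {
    Record record = cursor.next();
    for (const Key& key : generateIndexKeys(record)) {
        sorter.add(key);  // buffers in memory, spilling to disk as needed
    }
}

// Only after the scan completes are the sorted keys bulk-loaded into the
// index in a separate phase.
for (auto it = sorter.sortedIterator(); it.hasNext();) {
    indexTable.insert(it.next());
}
```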
| Comment by Louis Williams [ 30/Oct/18 ] |
After some testing locally, removing the save/restore code will speed up background builds by about 20%. Additionally, batching scans and inserts into groups of 1000 brings that figure to about 100%. This is with an incomplete implementation.

To support write conflicts in the middle of batches, the collection scan needs to yield and be resumed at the first Record of the failed batch, and CollectionScans don't currently support that behavior. Currently I see a few solutions:

1. Buffer all intermediate collection scan results in memory until they are committed (see the sketch after this comment). We would also want to expose an "isGoingToYield()" method on the PlanExecutor that hints about an upcoming yield. In this way we can proactively commit an outstanding WriteUnitOfWork before a yield takes place and adhere to the existing collection scan yielding rules. Write conflicts would then simply restart inserting from the beginning of the buffer.

milkie What do you think about these options? Is the performance gain here worth the work?
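A control-flow sketch of option 1 follows; isGoingToYield() is the hint proposed above, WriteUnitOfWork is the server's write-unit class, and the remaining names (retryOnWriteConflict, exec, insertIntoIndex, kBatchSize) are illustrative stand-ins:

```cpp
// Stand-in sketch of option 1: buffer scan results, flush in batches, and
// restart a failed batch from the front of the in-memory buffer.
std::vector<Record> buffer;

auto flushBuffer = [&] {
    // On a write conflict, restart inserts from the beginning of the
    // buffer; the collection scan never needs to be repositioned
    // mid-batch because the documents are already in memory.
    retryOnWriteConflict([&] {
        WriteUnitOfWork wuow;
        for (const Record& record : buffer) {
            insertIntoIndex(record);
        }
        wuow.commit();
    });
    buffer.clear();
};

while (exec.hasNext()) {
    buffer.push_back(exec.next());
    // The proposed isGoingToYield() hint lets us commit the outstanding
    // WriteUnitOfWork before the PlanExecutor yields, so the existing
    // collection scan yielding rules still hold.
    if (buffer.size() >= kBatchSize || exec.isGoingToYield()) {
        flushBuffer();
    }
}
flushBuffer();  // commit any trailing partial batch
```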