[SERVER-57221] Inconsistent delete performance on collections with multiple secondary indexes on same key Created: 26/May/21 Updated: 29/Oct/23 Resolved: 03/Oct/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.2.14, 4.4.6, 5.0.0-rc7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Louis Williams | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Storage Execution |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | Run the following test:
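A minimal sketch of such a test (collection name, document count, and field values are illustrative; the hint-on-delete form requires 4.4+):

```js
// Two secondary indexes share the leading field "a"; a multi-delete on that
// field is slow without a hint and fast when one of the indexes is hinted.
const coll = db.server57221_repro;
coll.drop();
coll.createIndex({a: 1, b: 1});
coll.createIndex({a: 1, c: 1});

function load() {
    const docs = [];
    for (let i = 0; i < 50000; i++) {
        docs.push({a: "value", b: i, c: i});
    }
    coll.insertMany(docs, {ordered: false});
}

function timedDelete(hint) {
    const deleteStmt = {q: {a: "value"}, limit: 0};
    if (hint) {
        deleteStmt.hint = hint;
    }
    const start = Date.now();
    assert.commandWorked(db.runCommand({delete: coll.getName(), deletes: [deleteStmt]}));
    return Date.now() - start;
}

load();
print("no hint (multi-planned): " + timedDelete() + " ms");
load();
print("hint {a: 1, b: 1}: " + timedDelete({a: 1, b: 1}) + " ms");
```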
Output locally:
|
| Sprint: | Execution Team 2021-06-28, QE 2021-11-01, QE 2021-11-15, QE 2021-11-29, QE 2021-12-13, QE 2021-12-27, QE 2022-01-10, QE 2022-02-07, QE 2022-02-21, QE 2022-01-24 |
| Participants: | |
| Description |
|
Delete performance appears to suffer greatly if there are multiple secondary indexes that share the same leading key field. There is no obvious reason for this, since simply hinting one of the indexes improves performance significantly. For example:
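A sketch of the comparison (index key patterns and the matched value are illustrative; deleteMany's hint option requires 4.4+):

```js
// Collection has two secondary indexes, {a: 1, b: 1} and {a: 1, c: 1}.
db.coll.deleteMany({a: "value"});                        // slow: both indexes are multi-planned
db.coll.deleteMany({a: "value"}, {hint: {a: 1, b: 1}});  // much faster: no plan race
```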
Investigate the reason for this discrepancy. Note: I tested this and the regression exists at least as far back as 4.2. |
| Comments |
| Comment by Louis Williams [ 03/Oct/22 ] |
|
Thanks for the update, ivan.fefer@mongodb.com! I'm going to close this ticket. In summary, there were three different changes which, combined, improved performance significantly:
|
| Comment by Ivan Fefer [ 03/Oct/22 ] |
|
louis.williams@mongodb.com
|
| Comment by Ivan Fefer [ 27/Sep/22 ] |
|
I am working on https://jira.mongodb.org/browse/SERVER-58712 and it looks like the same issue. I have also run the jstest from this ticket on my branch: with two indexes, the timings with and without a hint are now similar. |
| Comment by Luke Pearson [ 15/Sep/22 ] |
|
Adding an update here: the storage bounded-cursor work and the associated server change have been implemented, so I don't think there is a remaining dependency on Storage. |
| Comment by Daniel Gottlieb (Inactive) [ 25/Jul/22 ] |
|
TL;DR: There are significant perf improvements for this problem ready for release (but not part of 6.0). We haven't closed this ticket as we expect other work will land soon that further improves the behavior. At its essence, this perf problem is due to quadratic cost in walking over data the workload is deleting. As an example, if we have 4 documents with `{a: "value"}`, there are 4 index entries:
When we want to delete all {a: "value"} documents, we search for "value" and position our cursor on the first (not deleted) record:
Then we delete that in a single transaction:
Now the query planner wants to re-evaluate the best index to use to continue the multi-delete. This results in performing another search for "value". Internally, WT binary searches for the first instance of "value" (now deleted) and then walks forward one-by-one until it reaches a non-deleted entry in its table:
So for the second document to be deleted, we walk over one "tombstone". For the third document to be deleted, we walk over two "tombstones". The total number of tombstones walked over to complete a single multi-delete is quadratic. The efficient multi-deletes project had a large impact on this multi-delete behavior because we now batch up more deletes into a single transaction, and therefore perform fewer query-planning operations that walk over tombstones. The behavior is still, strictly speaking, quadratic, but for a multi-delete of 300 documents grouped into batches of 10, we would now only skip over roughly (300/10)^2 tombstones instead of 300^2 (a rough model is sketched below). Applications doing batch deletes (without a hint) on the order of hundreds to thousands of documents will see a dramatic improvement in 6.1+. Larger batch deletes may continue to struggle more than necessary.
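As a rough illustration of why batching helps (a toy model only, not measured server behavior: it assumes each re-plan walks over every tombstone left by earlier deletes, so the exact constants differ from the rough figures above, but the quadratic-versus-batched shape is the same):

```js
// Count tombstones walked over during one multi-delete of nDocs documents,
// assuming the planner re-seeks after every batch of batchSize deletes and
// each re-seek skips every tombstone left so far.
function tombstonesWalked(nDocs, batchSize) {
    let walked = 0;
    for (let deletedSoFar = 0; deletedSoFar < nDocs; deletedSoFar += batchSize) {
        walked += deletedSoFar;  // tombstones skipped by this re-seek
    }
    return walked;
}

print(tombstonesWalked(300, 1));   // 44850: one delete per batch, roughly n^2/2
print(tombstonesWalked(300, 10));  // 4350: batching cuts the tombstone walks ~10x
```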
|
| Comment by Jen X [ 29/Nov/21 ] |
|
Hi @Daniel Gottlieb Thanks for your reply, I'd like to add a bit more information: When restoring the cursor for the winning plan, it first positions the cursor at the deleted key, then advances the cursor to the next record, which is found right away. On the other hand, a lot of extra (and unnecessary) work is done for the rejected plan:
|
| Comment by Daniel Gottlieb (Inactive) [ 23/Nov/21 ] |
|
Hi jenny.ren@mail.utoronto.ca, Thanks for taking a look at this. I've only been watching this ticket, but I believe what you're suggesting here ties all the pieces together into a nice explanation: having some unnecessary cursors simply open on failed plans isn't strictly the problem; it's that we're paying the (also unnecessary) cost to reposition them after every delete + restore. james.wahlin, is this hypothesis enough to turn into a work item for an upcoming sprint? I think you expressed some concern over whether it's OK to delete a losing plan, but with the current findings it does seem like cleaning up the execution path could be a big performance win. Thanks again jenny.ren@mail.utoronto.ca for your observation. This was a very welcome contribution! |
| Comment by Jen X [ 21/Nov/21 ] |
|
Hi, I did some investigation on this case and would like to share my findings: It looks like restoreState is called after every record is deleted, and this restores the cursor for all the stages, including those from the rejected query plan. Every time mongo restores the cursor for one of the stages in the rejected query plan in "_wt_btcur_next", it skips a large number of deleted records, which causes "_wt_page_dirty_and_evict_soon" to be invoked to mark the page for eviction later. As a result, eviction is triggered during saveState, during which "_wt_hs_delete_key_from_ts" is called to locate keys and delete them, but the search spends a massive amount of time in the "_wt_btcur_next" function, triggering "_wt_tree_walk" more than 50k times in the slow reproduction. It appears that it was not able to find what it was looking for and just kept looping. Please let me know if I can offer more clarification. |
| Comment by Luke Pearson [ 06/Jul/21 ] |
|
yuhong.zhang I spent a bit longer looking at this today with the repro running locally. I'm still unsure of exactly what is happening here, but hopefully I can offer some direction. I found that in the slow case we would perform twice the number of seekWTCursor calls, and every second seek would end up skipping a large number of globally visible tombstones and regular tombstones. Additionally, it wouldn't find the exact record it was looking for, presumably because whatever it was looking for was deleted. It's hard to explain why only every second seek has the negative performance: is it searching for a different record, or using a different cursor / transaction? I'm less confident that eviction is the reason we skip fewer records in the fast case. |
| Comment by Yuhong Zhang [ 06/Jul/21 ] |
|
I am trying out preliminary implementations that call saveState() on the losing plans' query stages to reset the WiredTiger cursors, so that they do not prevent the deletes from becoming globally visible and having the records removed completely. However, starting from James' quick fix, I found some really strange behaviors. With the following changes:
Although the resize() is supposed to destroy the losing plan's resources, it seems that the query stages created in the planner were not destroyed until the deletes finished. Moreover, since in our case the multi-planner would always pick the first index as the winning plan, commenting out the last two lines makes it nothing but a self-assignment compared to the original slow case. But it ran as fast as the other two with just the move and assignment (and of course the losing plan is not destroyed until the very end in this case too). I would like to pass this on to query execution to:
|
| Comment by Yuhong Zhang [ 25/Jun/21 ] |
|
Many thanks to luke.pearson for pointing me to a lot of the important code and explaining what's going on! I would also like to add to the "bytesWritten" question that both conditions, exceeding WT_BTREE_DELETE_THRESHOLD and seeing a page full of deleted records, have been met when doing the forced eviction in our slow case. If the eviction heuristics are changed, the WiredTiger team might want to take both scenarios into account.
It's also worth mentioning that after increasing the number of documents we're dealing with in the tests, the other two faster cases also report the metric "bytesWritten", but because of forced-eviction conditions other than "too many deleted items count" ("dirty count" and "dirty time"). Since the significant performance discrepancy still exists when all three tests write to disk, I agree that "bytesWritten" is just a side discussion for the whole issue. |
| Comment by Luke Pearson [ 25/Jun/21 ] |
|
I had a chat with yuhong.zhang today and will try to clarify some of the open questions.
The slowness seen appears to stem from skipping records that aren't visible to the session performing the search. The cursors that are left open are holding the "pinned_id" back in WiredTiger, which prevents the deletions from becoming globally visible. Once they are globally visible they will actually remove content; until then they are just another update on the chain which has to be inspected by the search, only to find that it can't see them. So I think we're skipping a bunch of deleted records, and this leads to a fairly solid explanation as to why "bytesWritten" is only reported in the slow case: a large number of deleted records on a page triggers forced eviction, and with non-globally-visible deletes those get written to disk as a stop time on the on-disk value, so we'd see "bytesWritten" increase. Since durable history, it's arguable how useful it is to evict pages with a large number of non-globally-visible deletes, but that isn't the cause of the slowdown here and is probably a side discussion. yuhong.zhang and I spent some time removing the forced-eviction path for deletes and still saw the same slowdown but no associated "bytesWritten" stat. I can think of two potential solutions here:
|
| Comment by Louis Williams [ 11/Jun/21 ] |
|
We will take over to answer the following questions:
|
| Comment by Louis Williams [ 11/Jun/21 ] |
|
james.wahlin, thanks so much for the investigation. I think this gives us (Execution) something to look at. We will try to understand whether it makes sense that holding open cursors for multi-planning causes this problem. |
| Comment by James Wahlin [ 11/Jun/21 ] |
|
daniel.gottlieb I believe that is the case. The cursors are not destructed and we do not take any explicit action in multi_plan.cpp that would trigger a "save" or "saveUnpositioned" on the underlying cursor. |
| Comment by Daniel Gottlieb (Inactive) [ 10/Jun/21 ] |
Are those cursors still "positioned"? I think in terms of the SortedDataInterface::Cursor that would mean:
|
| Comment by James Wahlin [ 10/Jun/21 ] |
|
This regression appears to be tied to the index scan cursors opened when we evaluate multiple eligible query plans. When there is more than one eligible indexed plan we will build an execution plan for each, and will perform a trial execution to determine which plan to use. In the case of a delete command the trial is read-only, using the query cost to determine a winner. Once a winner is chosen we then proceed with the actual delete under the winning plan. What we don't do is delete the losing plans or the associated IndexScan stages and resources. This means we maintain a WiredTiger cursor for each losing plan. My strong suspicion at this point is that maintaining cursors on indexes while performing deletes on the underlying collection incurs cursor maintenance costs. Deleting the losing plans after choosing a winning plan allows us to recoup most of the additional cost. Here is the result of the reproduction above when we do this on my local machine:
Here is the patch/hack used as applied to current master:
louis.williams / luke.pearson let me know how you would like to proceed. The next steps I see would be to:
louis.williams on a related note, I found that in the case where we are multi-planning, the slow operation log line will include a "storage.bytesWritten" value for the delete. When we delete the additional plans, this metric does not appear. It may be worth exploring? It is suspicious that there are cases where we don't include it for a delete operation.
|
| Comment by James Wahlin [ 07/Jun/21 ] |
|
I am still investigating this issue but have a few findings to share: Also worth mentioning, luke.pearson, is that when we do multi-plan, we record a value for the "storage.bytesWritten" storage engine metric in the slow query logging. No value is produced for this when the index is hinted. |
| Comment by Louis Williams [ 01/Jun/21 ] |
|
Note: there is a similar unexplained regression with TTL deletes on descending indexes in |
| Comment by Asya Kamsky [ 27/May/21 ] |
|
Sounds like it's slow when query planning is involved? (A single index candidate or a hint is fast; neither involves planning.)
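For example (collection, index, and value names assumed), reducing the candidates to a single index avoids the plan race entirely:

```js
// With only one index whose leading field is "a", there is a single candidate
// plan, no multi-planning takes place, and the multi-delete is fast again.
db.coll.dropIndex({a: 1, c: 1});
db.coll.deleteMany({a: "value"});  // fast, comparable to hinting {a: 1, b: 1}
```
|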
| Comment by Luke Pearson [ 26/May/21 ] |
|
Some things I found interesting:
I've pasted the FTDC that I am getting this information from. I agree with louis.williams that I would like to understand the difference between the two queries here so I'll re-assign back to the query team. In the meantime I'll raise this with the Storage Engines team. |
| Comment by Louis Williams [ 26/May/21 ] |
|
I did some minimal profiling. This is what I see for the slow case:
And the fast case:
While this could be a storage engine issue, it would be useful to first understand what is different about these two types of queries. |
| Comment by Louis Williams [ 26/May/21 ] |
|
I agree that this appears to be a query-related problem. Reassigning to Query Optimization to take a look. Please let me know if this is not the right team or if you feel this is not a query issue and Storage Execution should instead investigate further. |
| Comment by Eric Milkie [ 26/May/21 ] |
|
Even more mysterious is that I get the same performance no matter what I hint: "a_1_b_1", "a_1_c_1", "_id_", and { $natural: 1 } as hints all perform the same. It only gets slower when we actually use no hint and perform planning by racing the plans. In my attempt, the query system races a_1_b_1 and a_1_c_1 against each other, and a_1_b_1 wins. I wonder if something in the act of query planning is causing the slowness here.
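For reference, the race can be seen in the delete's explain output (a sketch; the collection name and exact plan shapes are assumed and vary by version):

```js
// At the default "queryPlanner" verbosity, explain does not apply the deletes.
const exp = db.coll.explain().remove({a: "value"});
printjson(exp.queryPlanner.winningPlan);    // DELETE over an IXSCAN on {a: 1, b: 1}
printjson(exp.queryPlanner.rejectedPlans);  // the losing IXSCAN-based plan on {a: 1, c: 1}
```
|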
| Comment by Louis Williams [ 26/May/21 ] |
|
I suspect this might be related to |