[SERVER-57221] Inconsistent delete performance on collections with multiple secondary indexes on same key Created: 26/May/21  Updated: 29/Oct/23  Resolved: 03/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.14, 4.4.6, 5.0.0-rc7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Louis Williams Assignee: Backlog - Storage Execution Team
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2021-05-25 at 5.55.45 PM.png     PNG File Screen Shot 2021-05-25 at 5.56.37 PM.png     PNG File image-2021-05-27-09-45-29-911.png    
Issue Links:
Depends
depends on SERVER-58712 Inconsistent update performance on co... Closed
depends on SERVER-55750 Enable Feature flag for PM-2227 Closed
depends on SERVER-65528 Use range bounded cursors for restori... Closed
Related
related to SERVER-58712 Inconsistent update performance on co... Closed
related to SERVER-70184 Complete TODO listed in SERVER-57221 Blocked
related to WT-7619 Add a new optimization to skip pages ... Closed
is related to SERVER-56274 TTL deletes are much slower on descen... Closed
is related to SERVER-57307 Improve deletion of config.chunks ent... Closed
is related to SERVER-56509 Wrap unique index insertion _keyExist... Closed
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Run the following test:

/**
 * Reproduces inconsistent delete performance with secondary indexes.
 * Notably, deletes on collections with multiple secondary indexes have significantly
 * worse performance than deletes with one index or deletes that hint a specific index.
 */
(function() {
 
const time = (fn, desc) => {
    print("starting " + desc);
    let start = new Date();
    fn();
    let end = new Date();
    print(desc + ": " + (end - start) + "ms");
};
 
const key = 'value';
const bulkLoad = (coll) => {
    coll.drop();
    const DOCS = 10 * 1000;
    for (let i = 0; i < DOCS / 1000; i++) {
        let bulk = coll.initializeUnorderedBulkOp();
        for (let j = 0; j < 1000; j++) {
            let val = i * 1000 + j;
            // a is not unique, b is
            bulk.insert({a: key, b: val, c: val});
        }
        assert.commandWorked(bulk.execute());
    }
    print("inserted " + DOCS);
};
 
const testColl = db.coll;
(function remove1() {
    bulkLoad(testColl);
    assert.commandWorked(testColl.createIndex({a: 1, b: 1}));
 
    time(() => {
        assert.commandWorked(
            db.runCommand({delete: testColl.getName(), deletes: [{q: {a: key}, limit: 0}]}));
    }, "remove with 1 index");
})();
 
(function remove2() {
    bulkLoad(testColl);
    assert.commandWorked(testColl.createIndex({a: 1, b: 1}));
    assert.commandWorked(testColl.createIndex({a: 1, c: 1}));
 
    time(() => {
        assert.commandWorked(
            db.runCommand({delete: testColl.getName(), deletes: [{q: {a: key}, limit: 0}]}));
    }, "remove with 2 indexes");
})();
 
(function remove2Hint() {
    bulkLoad(testColl);
    assert.commandWorked(testColl.createIndex({a: 1, b: 1}));
    assert.commandWorked(testColl.createIndex({a: 1, c: 1}));
 
    time(() => {
        assert.commandWorked(db.runCommand(
            {delete: testColl.getName(), deletes: [{q: {a: key}, limit: 0, hint: 'a_1_b_1'}]}));
    }, "remove with 2 indexes and a hint");
})();
})();

Output locally:

inserted 10000
starting remove with 1 index
remove with 1 index: 176ms
inserted 10000
starting remove with 2 indexes
remove with 2 indexes: 3386ms
inserted 10000
starting remove with 2 indexes and a hint
remove with 2 indexes and a hint: 189ms

Sprint: Execution Team 2021-06-28, QE 2021-11-01, QE 2021-11-15, QE 2021-11-29, QE 2021-12-13, QE 2021-12-27, QE 2022-01-10, QE 2022-02-07, QE 2022-02-21, QE 2022-01-24
Participants:

 Description   

Delete performance suffers greatly when there are multiple secondary indexes that share the same leading key field. There is no obvious reason for this, especially since using a hint improves performance significantly.

For example:

  • Say a collection has two secondary indexes, {a: 1, b: 1} and {a: 1, c: 1}, and 10K documents
  • Perform a delete on {a: <value>}
  • This query takes 5 seconds. If an index is dropped or a hint is used, the same delete completes in around 200ms.
  • As additional indexes are added, performance degrades linearly

Investigate the reason for this discrepancy.

Note: I tested this regression and it exists at least as far back as 4.2
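A quick way to confirm that the multi-planner is involved (illustrative only, not part of the original report; assumes the repro collection and indexes created by the script above):

// With two indexes sharing the {a: 1} prefix there is more than one candidate
// plan, so the delete's explain output lists a non-empty rejectedPlans array.
// With a hint, or with only one index, rejectedPlans is empty. Note that
// explain() does not actually apply the delete.
const deleteExplain = db.coll.explain("queryPlanner").remove({a: "value"});
printjson(deleteExplain.queryPlanner.rejectedPlans);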



 Comments   
Comment by Louis Williams [ 03/Oct/22 ]

Thanks for the update, ivan.fefer@mongodb.com! I'm going to close this ticket.

In summary, there were three different changes that combined to improve performance significantly:

  • SERVER-65528: Using bounded cursors in WiredTiger to skip over deleted records after yielding
  • SERVER-55750: Enabling batched multi-delete behavior which efficiently batches deletes and reduces the amount of yielding. Dan's previous comment explains this pretty well.
  • SERVER-58712: Improving the multi-planner behavior to avoid performing work on rejected query plans
Comment by Ivan Fefer [ 03/Oct/22 ]

louis.williams@mongodb.com 
I believe this is now fixed after SERVER-58712.
When I run the suggested jstest on current master I get the same time with and without a hint. The problem was in the multi_plan_stage implementation.

[js_test:test] inserted 10000
[js_test:test] starting remove with 1 index
[js_test:test] remove with 1 index: 67ms
[js_test:test] inserted 10000
[js_test:test] starting remove with 2 indexes
[js_test:test] remove with 2 indexes: 85ms
[js_test:test] inserted 10000
[js_test:test] starting remove with 2 indexes and a hint
[js_test:test] remove with 2 indexes and a hint: 85ms 

Comment by Ivan Fefer [ 27/Sep/22 ]

I am working on https://jira.mongodb.org/browse/SERVER-58712 and it looks like the same issue.

I have also run the jstest from this ticket on my branch: the 2-index timings with and without a hint are now similar.

Comment by Luke Pearson [ 15/Sep/22 ]

Adding an update here: The storage bounded cursor work and the associated server change have been implemented, so I don't think there is a remaining dependency on Storage.

Comment by Daniel Gottlieb (Inactive) [ 25/Jul/22 ]

TL;DR: There are significant perf improvements for this problem ready for release (but not part of 6.0). We haven't closed this ticket as we expect other work will land soon that further improves the behavior.

At its essence, this perf problem is due to quadratic cost in walking over data the workload is deleting. As an example, if we have 4 documents with `{a: "value"}`, there are 4 index entries:

| Indexed Value | RecordId | Index Key |
|---------------+----------+-----------|
| "value"       |        1 | "value1"  |
| "value"       |        2 | "value2"  |
| "value"       |        3 | "value3"  |
| "value"       |        4 | "value4"  |

When we want to delete all {a: "value"} documents, we search for "value" and position our cursor on the first (not deleted) record:

| Indexed Value | RecordId | Index Key |
|---------------+----------+-----------|
| -> "value"    |        1 | "value1"  |
| "value"       |        2 | "value2"  |
| "value"       |        3 | "value3"  |
| "value"       |        4 | "value4"  |

Then we delete that in a single transaction:

| Indexed Value        | RecordId | Index Key |
|----------------------+----------+-----------|
| -> "value" (deleted) |        1 | "value1"  |
| "value"              |        2 | "value2"  |
| "value"              |        3 | "value3"  |
| "value"              |        4 | "value4"  |

Now the query planner wants to re-evaluate the best index to use to continue the multi-delete. This results in performing another search for "value".

Internally, WT binary searches for the first instance of "value" (now deleted) and then walks forward one-by-one until it reaches a non-deleted entry in its table:

| Indexed Value        | RecordId | Index Key |
|----------------------+----------+-----------|
| "value" (deleted)    |        1 | "value1"  |
| -> "value"           |        2 | "value2"  |
| "value"              |        3 | "value3"  |
| "value"              |        4 | "value4"  |

So for the second document to be deleted, we walk over one "tombstone". For the third document to be deleted, we walk over two "tombstones". The total number of tombstones walked over to complete a single multi-delete is quadratic.
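A rough model of that cost (illustrative only; the exact constants depend on how deletes are batched): if the cursor is re-positioned after every batch of deletes, each re-seek walks over every tombstone created so far.

// Rough, illustrative model: one re-seek per batch of `batch` deletes out of
// `n` matching documents; each re-seek skips every tombstone created so far.
const tombstonesSkipped = (n, batch) => {
    let total = 0;
    for (let deleted = batch; deleted < n; deleted += batch) {
        total += deleted;
    }
    return total;
};
print(tombstonesSkipped(10000, 1));    // 49,995,000 -- roughly n^2 / 2
print(tombstonesSkipped(10000, 100));  // 495,000 -- roughly n^2 / (2 * batch)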

The efficient multi-deletes project had a large impact on this multi-delete behavior because we now batch up more deletes into a single transaction, and therefore perform fewer query-planning operations that walk over tombstones. The behavior is still, strictly speaking, quadratic, but for a multi-delete of 300 documents grouped into batches of 10, we would now only skip over roughly (300/10)^2 tombstones instead of 300^2. Applications doing batch deletes (without a hint) on the order of hundreds to thousands of documents will see a dramatic improvement in 6.1+. Larger batch deletes may continue to struggle more than necessary.

SERVER-65528 is an additional improvement to remove the intrinsically quadratic behavior by better expressing search bounds to avoid needing to walk over tombstones.

Comment by Jen X [ 29/Nov/21 ]

Hi @Daniel Gottlieb

Thanks for your reply, I'd like to add a bit more information:
The overhead of restoring the cursor is much higher for the rejected plan than for the winning plan.

When restoring the cursor for the winning plan, it first positions the cursor at the deleted key, then advances to the next record, which is found right away. On the other hand, a lot of extra (and unnecessary) work is done for the rejected plan:
Because the two execution plans use different indexes ("a_1_b_1", "a_1_c_1"), I'm guessing the orders in which they traverse the records are not exactly the same?
So for the rejected plan, many of the records next to the cursor already appear deleted in the page (page#1). This results in extra work in the "__cursor_row_next" function, which keeps finding "WT_UPDATE_TOMBSTONE" updates and keeps skipping to the next record (increasing the "total_skipped" variable and eventually triggering eviction).
Even worse, "__cursor_row_next" often traverses the entire page only to find deleted records, so "__wt_btcur_next" invokes "__wt_tree_walk" to fetch the next page in the tree (page#2).
Eventually it finds some valid record in the new page and returns; but because page#2 is regarded as a new page, the search in "__cursor_row_next" always starts from the beginning, which wastes extra iterations.
When the next "restoreState" call happens, this ping-ponging may repeat, because the cursor is set back into page#1.

Comment by Daniel Gottlieb (Inactive) [ 23/Nov/21 ]

Hi jenny.ren@mail.utoronto.ca,

Thanks for taking a look at this. I've only been watching this ticket, but I believe what you're suggesting here ties all the pieces together into a nice explanation. That explanation being: having some unnecessary cursors open on failed plans isn't, by itself, the problem; it's that we're paying the (also unnecessary) cost to reposition them after every delete + restore.

james.wahlin, is this hypothesis enough to turn into a work item for an upcoming sprint? I think you expressed some concern over whether it's OK to delete a losing plan, but with the current findings it does seem like cleaning up the execution path could be a big performance win.

Thanks again, jenny.ren@mail.utoronto.ca, for your observation. This was a very welcome contribution!

Comment by Jen X [ 21/Nov/21 ]

Hi, I did some investigation on this case, and would like to share my findings:

It looks like restoreState is called after every record is deleted, and this restores the cursor for all the stages, including those from the rejected query plan.

Every time mongo restores the cursor for one of the stages in the rejected query plan, "__wt_btcur_next" skips a large number of deleted records, which causes "__wt_page_dirty_and_evict_soon" to be invoked to mark the page for eviction later.

As a result, eviction is triggered during saveState, during which "__wt_hs_delete_key_from_ts" is called to locate keys and delete them, but the search spends a massive amount of time in the "__wt_btcur_next" function, triggering "__wt_tree_walk" more than 50k times in the slow reproduction. It appears that it was not able to find what it was looking for and just kept on looping.

Please let me know if I can offer more clarification.

Comment by Luke Pearson [ 06/Jul/21 ]

yuhong.zhang I spent a bit longer looking at this today with the reproduction running locally. I'm still unsure of exactly what is happening here, but hopefully I can offer some direction.

I found that in the slow case we would perform twice the number of seekWTCursor calls, and every second seek would end up skipping a large number of globally visible tombstones and regular tombstones. Additionally, it wouldn't find the exact record it was looking for, presumably because whatever it was looking for was deleted. It's hard to explain why only every second seek has the negative performance: is it searching for a different record? Or using a different cursor / transaction? I'm less confident that eviction is the reason we skip fewer records in the fast case.

Comment by Yuhong Zhang [ 06/Jul/21 ]

I am trying out preliminary implementations that call saveState() on the losing plans' query stages to reset the WiredTiger cursors, so that they will not prevent the deletes from becoming globally visible and the records from being removed completely. However, starting from James' quick fix, I found some really strange behaviors.

With the following changes:

+    _candidates[0] = std::move(_candidates[_bestPlanIdx]);
+    _bestPlanIdx = 0;
+    _candidates.resize(1);

Although the resize() is supposed to destroy the losing plan's resources, it seems that the query stages created in the planner were not destructed until the deletes finished. Moreover, since in our case the multi-planner always picks the first index as the winning plan, commenting out the last two lines makes this nothing but a self-assignment compared to the original slow case. Yet it ran as fast as the other two with just the move and assignment (and of course the losing plan is not destroyed until the very end in this case either).

I would like to pass this on to query execution to:

  1. Confirm that the resize() somehow is not destroying the query stages in the losing plan and the move() itself is somehow bringing performance gains.
  2. If the above behavior is reproducible, share if there could be another reason why the second test runs much faster with the move().
  3. Confirm the possibility of using saveState() to reset the WT cursors instead of destroying the losing plan, in case we fall back to it. Our theory so far makes sense and the resources of the losing plans should not be left there anyway. So no matter what the real issue behind this ticket is, it's worthwhile to take care of the resources in the losing plans.
Comment by Yuhong Zhang [ 25/Jun/21 ]

Many thanks to luke.pearson for pointing me to a lot of important code and explaining what's going on! I would also like to add, on the "bytesWritten" question, that both conditions (exceeding WT_BTREE_DELETE_THRESHOLD and seeing a page full of deleted records) had been met when doing the forced eviction in our slow case. If the eviction heuristics are changed, the WiredTiger team might want to take both scenarios into account.

 Yuhong Zhang and I spent some time removing the forced eviction path for deletes and still saw the same slowdown but no associated "bytesWritten" stat.

It's also worth mentioning that after increasing the number of documents we're dealing with in the tests, the other two faster cases also report the "bytesWritten" metric, but because of forced eviction conditions other than "too many deleted items count" ("dirty count" and "dirty time"). Since the significant performance discrepancy still exists when all three tests write to disk, I agree that "bytesWritten" is just a side discussion for the whole issue.

Comment by Luke Pearson [ 25/Jun/21 ]

I had a chat with yuhong.zhang today and will try and clarify some of the open questions.

Can we explain the poor performance when multiple cursors are open while deleting documents?

The slowness appears to stem from skipping records that aren't visible to the session performing the search. The cursors that are left open keep the "pinned_id" back in WiredTiger, which prevents the deletions from becoming globally visible. Once they are globally visible they will actually remove content; until then they are just another update on the chain which has to be inspected by the search, only for it to find that it can't see them.
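One illustrative way to watch this from the shell while the slow multi-delete runs (stat names vary by version; this assumes the 4.4/5.0-era layout of serverStatus):

// If open cursors on rejected plans hold the pinned snapshot back, this range
// stays large for the duration of the multi-delete, so tombstones cannot be
// removed. Illustrative check only.
const txnStats = db.serverStatus().wiredTiger.transaction;
print(txnStats["transaction range of IDs currently pinned"]);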

So I think we're skipping a bunch of deleted records. This also leads to a fairly solid explanation as to why "bytesWritten" is only reported in the slow case: a large number of deleted records on a page triggers forced eviction, and with non-globally-visible deletes those get written to disk as a stop time on the on-disk value, so we'd see "bytesWritten" increase.

Since durable history, it's arguable how useful it is to evict pages with a large number of non-globally-visible deletes, but that isn't the cause of the slowdown here and is probably a side discussion.

yuhong.zhang and I spent some time removing the forced eviction path for deletes and still saw the same slowdown but no associated "bytesWritten" stat.

I can think of two potential solutions here:

  • Reset the cursors that are pinning the content. I don't know how achievable that is in this situation.
  • Range limit the search that gets caught in the deleted content, using a similar feature to the "prefix_search" functionality.
Comment by Louis Williams [ 11/Jun/21 ]

We will take this over to answer the following questions:

  • What explains why the "bytesWritten" statistic is only reported in the slow case?
  • Can we explain the poor performance when multiple cursors are open while deleting documents? Ultimately, I think we should come up with a WT ticket that summarizes this behavior, if necessary.
  • Does save()ing the unused cursors while deleting documents eliminate the regression?
  • When complete, assign back to QE with an implementation recommendation.
Comment by Louis Williams [ 11/Jun/21 ]

james.wahlin, thanks so much for the investigation. I think this gives us (Execution) something to look at. We will try to understand whether it makes sense that holding open cursors for multi-planning causes this problem.

Comment by James Wahlin [ 11/Jun/21 ]

daniel.gottlieb I believe that is the case. The cursors are not destructed and we do not take any explicit action in multi_plan.cpp that would trigger a "save" or "saveUnpositioned" on the underlying cursor.

Comment by Daniel Gottlieb (Inactive) [ 10/Jun/21 ]

(Storage/Storage Execution) Confirm that the cost of maintaining the additional index cursors during delete makes sense. On the surface it seems maybe excessive?

Are those cursors still "positioned"? I think in terms of the SortedDataInterface::Cursor that would mean:

  • the cursors on the losing plans are not yet destructed and
  • the cursors on the losing plans have not called "save" or "saveUnpositioned".
Comment by James Wahlin [ 10/Jun/21 ]

This regression appears to be tied to the index scan cursors opened when we evaluate multiple eligible query plans.

When there is more than one eligible indexed plan we will build an execution plan for each, and will perform a trial execution to determine which plan to use. In the case of a delete command the trial is read-only, using the query cost to determine a winner. Once a winner is chosen we then proceed with the actual delete under the winning plan.
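That trial execution can be observed from the shell (illustrative, against the repro collection above) with explain's "allPlansExecution" verbosity:

// Each candidate plan's trial run is reported under allPlansExecution; the
// explain itself does not apply the delete.
const deleteExplain = db.coll.explain("allPlansExecution").remove({a: "value"});
printjson(deleteExplain.executionStats.allPlansExecution);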

What we don't do is delete the losing plans or the associated IndexScan stages and resources. This means we maintain a WiredTiger cursor for each losing plan. My strong suspicion at this point is that maintaining cursors on indexes while performing deletes on the underlying collection incurs cursor maintenance costs.

Deleting the losing plans after choosing a winning plan allows us to recoup most of the additional cost. Here is the result of the reproduction above when we do this on my local machine:

[js_test:repro] starting remove with 1 index
[js_test:repro] remove with 1 index: 439ms
[js_test:repro] starting remove with 2 indexes
[js_test:repro] remove with 2 indexes: 526ms
[js_test:repro] starting remove with 2 indexes and a hint
[js_test:repro] remove with 2 indexes and a hint: 485ms 

Here is the patch/hack used as applied to current master:

diff --git a/src/mongo/db/exec/multi_plan.cpp b/src/mongo/db/exec/multi_plan.cpp
index 19b4b64387..b8f5941282 100644
--- a/src/mongo/db/exec/multi_plan.cpp
+++ b/src/mongo/db/exec/multi_plan.cpp
@@ -194,6 +194,10 @@ Status MultiPlanStage::pickBestPlan(PlanYieldPolicy* yieldPolicy) {
 
     verify(_bestPlanIdx >= 0 && _bestPlanIdx < static_cast<int>(_candidates.size()));
 
+    _candidates[0] = std::move(_candidates[_bestPlanIdx]);
+    _bestPlanIdx = 0;
+    _candidates.resize(1);
+
     auto& bestCandidate = _candidates[_bestPlanIdx];
     const auto& alreadyProduced = bestCandidate.results;
     const auto& bestSolution = bestCandidate.solution; 

louis.williams / luke.pearson let me know how you would like to proceed. The next steps I see would be to:

  • (Storage/Storage Execution) Confirm that the cost of maintaining the additional index cursors during delete makes sense. On the surface it seems maybe excessive?
  • (Query Execution) Explore whether we can free resources tied to losing plans after choosing a winning plan. It won't be as simple as my patch above, as we can potentially fall back to a losing plan in some scenarios.

louis.williams on a related note, I found that in the case where we are multi-planning, the slow operation log line includes a "storage.bytesWritten" value for the delete. When we delete the additional plans, this metric does not appear. It may be worth exploring? It is suspicious that there are cases where we don't include it for a delete operation.

Comment by James Wahlin [ 07/Jun/21 ]

I am still investigating this issue but have a few findings to share:
1) The high level difference in plan generation does appear to be that we are using the multiplanner when there are 2 potential indexes. The multiplanner will perform a trial of both plans before deciding which index to use to perform the delete. For the fast/hinted test, we do not generate a multiplan stage or perform a trial.
2) The time spent in multiplanning, however, is negligible. We both plan and perform deletes on the documents discovered during multiplanning very early on.
3) The rate at which we delete documents is uniformly lower across the entire command run for the slower plan. This includes the post-multiplanning phase.

Also worth mentioning, luke.pearson, is that when we do multiplan, we record a value for the "storage.bytesWritten" storage engine metric in the slow query logging. No value is produced for this when the index is hinted.
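For reference, one illustrative way to surface those slow-query log lines when running the repro (not part of the original investigation):

// With slowms at 0 every operation is logged, so the multi-delete's slow-op
// line (including any "storage" section with bytesWritten) shows up in the
// mongod log.
db.setProfilingLevel(0, {slowms: 0});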

Comment by Louis Williams [ 01/Jun/21 ]

Note: there is a similar unexplained regression with TTL deletes on descending indexes in SERVER-56274

Comment by Asya Kamsky [ 27/May/21 ]

Sounds like it's slow when query planning is involved? (A single index candidate or a hint is fast - neither involves planning.)

Comment by Luke Pearson [ 26/May/21 ]

Some things I found interesting:

  • The hang corresponds with a significant number of cursor->next calls originating from the search_near call. This feels similar to SERVER-56509 but doesn't involve any unique index insertions.
  • There are a lot of pages being written with txnids and timestamps, suggesting we're evicting a lot of non-globally visible content. Interestingly, we're not seeing the history store search calls we saw in WT-7264.

I've pasted the FTDC that I am getting this information from.

I agree with louis.williams that I would like to understand the difference between the two queries here so I'll re-assign back to the query team. In the meantime I'll raise this with the Storage Engines team.

Comment by Louis Williams [ 26/May/21 ]

I did some minimal profiling. This is what I see for the slow case (see attached screenshot):

And the fast case (see attached screenshot):

While this could be a storage engine issue, it would be useful to first understand what is different about these two types of queries.

Comment by Louis Williams [ 26/May/21 ]

I agree that this appears to be a query-related problem. Reassigning to Query Optimization to take a look. Please let me know if this is not the right team or if you feel this is not a query issue and Storage Execution should instead investigate further.

Comment by Eric Milkie [ 26/May/21 ]

Even more mysterious is that I get the same performance no matter what I hint: "a_1_b_1", "a_1_c_1", "_id_", and { $natural: 1 } as hints all perform the same. It only gets slower when we actually use no hint and perform planning by racing the plans. In my attempt, the query system races a_1_b_1 and a_1_c_1 against each other, and a_1_b_1 wins. I wonder if something in the act of query planning is causing the slowness here.
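For reference, a hinted collection scan against the repro collection looks like this (illustrative only):

// Hinting {$natural: 1} forces a collection scan; per the observation above it
// still avoids the multi-planner and performs like the other hinted variants.
db.runCommand({delete: "coll", deletes: [{q: {a: "value"}, limit: 0, hint: {$natural: 1}}]});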

Comment by Louis Williams [ 26/May/21 ]

I suspect this might be related to SERVER-56274
