[SERVER-48522] Regression after mass deletion Created: 01/Jun/20  Updated: 28/Apr/23

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: 4.4.0-rc7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Sid Mahajan
Resolution: Unresolved Votes: 0
Labels: KP44, dmd-escapes, perf-escapes, refinement
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2023-02-01 at 3.31.00 pm.png     PNG File Screen Shot 2023-02-01 at 3.31.12 pm.png     PNG File Screenshot 2021-03-29 at 15.18.34.png     PNG File Screenshot 2021-03-29 at 15.19.39.png     PNG File Screenshot 2021-03-30 at 16.25.27.png     PNG File Stats.png     PNG File comp-3.png     PNG File compare.png     PNG File focused.png     File metrics.2021-03-26T10-09-27Z-00000     File metrics.2021-03-26T21-18-07Z-00000     PNG File new_oplog_fetcher_exhaust_cursor.png     PNG File old_oplog_fetcher.png     PNG File query-mixed.png     PNG File regression.png     File regression.tgz     PNG File removals-mixed.png     PNG File remove.png     PNG File removes.png    
Issue Links:
Related
related to SERVER-59776 50% regression in single multi-update Closed
related to WT-10424 cursor::search_near slow performance ... Closed
related to WT-5769 Search history store can potentially ... Closed
related to WT-6349 Don't truncate history store updates ... Closed
related to SERVER-54939 Investigate secondary batching behavi... Closed
Operating System: ALL
Sprint: Storage - Ra 2022-01-24, Storage - Ra 2022-02-07, Storage - Ra 2022-02-21, Storage - Ra 2022-03-07, Storage - Ra 2022-03-21, Storage - Ra 2022-06-13
Story Points: 0

 Description   

Test inserts 100 k documents, removes them all, then looks for (the now non-existent) documents via the _id index.

In the attached charts, left is 4.2.6 and right is 4.4.0-rc7:

  • A-B, D-E: removing 100 k documents. There is a regression of about 30% in rate of deletion ("cursor remove calls")
  • starting B, E: doing a find via the _id index for a non-existent document
  • B-C: in 4.2.6 the queries are slow for a period of time, then at C the rate picks up, apparently related to pages being evicted due to too many deleted items
  • E-F: in 4.4.0-rc7 the rate is also initially slow (and slower than in 4.2.6 at the same point), then picks up somewhat at F, but remains much slower than in 4.2.6. The behavior of pages evicted due to too many deleted items is different - it looks like we're trying but failing to evict them?

PMP (poor man's profiler) shows most of our time spent here:

#0  0x00005652ac555e4c in __wt_value_return_buf ()
#1  0x00005652ac5248b7 in __wt_txn_read ()
#2  0x00005652ac525bdf in __wt_btcur_next ()
#3  0x00005652ac53294a in __wt_btcur_search_near ()
#4  0x00005652ac484528 in __curfile_search_near ()
#5  0x00005652ac41695f in int mongo::wiredTigerPrepareConflictRetry<mongo::(anonymous namespace)::WiredTigerIndexCursorBase::seekWTCursor(mongo::KeyString::Value const&)::{lambda()#1}>(mongo::OperationContext*, mongo::(anonymous namespace)::WiredTigerIndexCursorBase::seekWTCursor(mongo::KeyString::Value const&)::{lambda()#1}&&) ()
#6  0x00005652ac417070 in mongo::(anonymous namespace)::WiredTigerIndexCursorBase::seekWTCursor(mongo::KeyString::Value const&) ()
#7  0x00005652ac41d212 in mongo::(anonymous namespace)::WiredTigerIndexCursorBase::seek(mongo::KeyString::Value const&, mongo::SortedDataInterface::Cursor::RequestedInfo) ()
#8  0x00005652ace06b22 in mongo::IndexScan::initIndexScan() ()
#9  0x00005652ace07817 in mongo::IndexScan::doWork(unsigned long*) ()

This stack also showed up once:

#0  0x00005652ac52331e in __wt_txn_upd_value_visible_all ()
#1  0x00005652ac526917 in __wt_btcur_next ()
#2  0x00005652ac53294a in __wt_btcur_search_near ()
#3  0x00005652ac484528 in __curfile_search_near ()
#4  0x00005652ac41695f in int mongo::wiredTigerPrepareConflictRetry<mongo::(anonymous namespace)::WiredTigerIndexCursorBase::seekWTCursor(mongo::KeyString::Value const&)::{lambda()#1}>(mongo::OperationContext*, mongo::(anonymous namespace)::WiredTigerIndexCursorBase::seekWTCursor(mongo::KeyString::Value const&)::{lambda()#1}&&) ()
#5  0x00005652ac417070 in mongo::(anonymous namespace)::WiredTigerIndexCursorBase::seekWTCursor(mongo::KeyString::Value const&) ()
#6  0x00005652ac41d212 in mongo::(anonymous namespace)::WiredTigerIndexCursorBase::seek(mongo::KeyString::Value const&, mongo::SortedDataInterface::Cursor::RequestedInfo) ()
#7  0x00005652ace06b22 in mongo::IndexScan::initIndexScan() ()
#8  0x00005652ace07817 in mongo::IndexScan::doWork(unsigned long*) ()

Repro: 3-node replica set

function repro() {
 
    // insert 100 k documents
    nthreads = 10
    threads = []
    for (var t=0; t<nthreads; t++) {
        thread = new ScopedThread(function(t) {
            size = 10
            count = 10*1000;
            every = 1000
            const x = 'x'.repeat(size)
            for (var i=0; i<count; i+=every) {
                many = []
                for (var j=0; j<every; j++)
                    many.push({_id: t*count+i+j, x})
                db.c.insertMany(many)
                if (i%10000==0) print(t, i)
            }
        }, t)
        threads.push(thread)
        thread.start()
    }
    for (var t = 0; t < nthreads; t++)
        threads[t].join()
 
 
    // remove them
    print("removing")
    db.c.remove({_id: {$lt: 100000}})
 
    // loop trying to find non-existent document via _id index
    print("finding")
    while (true)
        db.c.find({_id: {$lt: 100000}}).itcount()
}
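
For context on the metrics discussed in the comments below, here is a minimal monitoring sketch (not part of the original repro; the helper name is made up) that samples the serverStatus counters behind "opcounters query" and "cursor remove calls" while repro() runs:

// Hypothetical helper: print the query opcounter and the WT "cursor remove calls"
// counter once per interval, so rates can be eyeballed across server versions.
function sampleMetrics(intervalMillis) {
    while (true) {
        const ss = db.serverStatus()
        print(new Date().toISOString(),
              "opcounters.query=" + ss.opcounters.query,
              "cursor remove calls=" + ss.wiredTiger.cursor["cursor remove calls"])
        sleep(intervalMillis)
    }
}
// e.g. run sampleMetrics(1000) in a second shell while repro() is running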



 Comments   
Comment by Sid Mahajan [ 01/Feb/23 ]

Fixing the root cause identified in WT-10424 also fixes this issue.

Summary of WT-10424 (detailed info on the ticket):

The ticket concerns slow search_near performance caused by a high number of deleted items and a large cache size. Lowering the cache size greatly improved performance, but further investigation uncovered that the pages with many deleted items were not being evicted: they were not being added to the urgent eviction queue because they were already in the ordinary queues, yet the eviction threads found the ordinary queues empty. As a result the deleted pages persisted in the cache, hindering search_near performance.

The fix is to remove the relevant (non-candidate) pages from the queue, which improves the search_near performance.

FTDC analysis before and after the WT-10424 changes (see the attached before/after screenshots).

Thanks, etienne.petrel@mongodb.com, for running the tests. The opcounters query rate has noticeably increased after the changes in WT-10424. As the root cause is similar, this ticket will be marked as a duplicate of WT-10424. However, if any future regressions arise, feel free to re-open this ticket for further investigation.

Comment by Will Korteland [ 15/Mar/22 ]

> Why are you making the assumption that WT is not the cause?

I only assumed this for the find regression, since your earlier bisecting showed that it came from MDB commit `06a03bc6ec`.

> Did you do this?

Yep, hence the comment that there's still a large regression here - the same 32% (roughly) that was originally observed, so it's definitely something to do with exhaust cursors.

> Back in Oct-2021 Alexander Gorrod moved this from replication to storage engines because:

I think there's some misunderstanding around what "this" is. My comment on March 8 refers solely to the find regression, but Alex moved the delete regression since that's definitely a WT issue (due to the addition of durable history).

Comment by Will Korteland [ 08/Mar/22 ]

An update on the read performance issue. Even with very recent MDB + WT, there's still a large regression here. Based on the above discussion, I started investigating with the assumption that WT was not the cause, so I went looking through MDB server code for what actually changes when exhaust cursors are turned on.

I don't know the server code at all, so I limited it to looking at all the places that mention exhaust cursors, but there were still a lot of these - I found three places where the performance hit might be happening.

The first is in bgsync.cpp, where there's an optimisation to avoid writing ops one at a time. I'm not sure if this applies to the regression we're looking at, but it's interesting that exhaust cursors are excluded. This might be fine; I'm just pointing it out since the motivation there isn't obvious to me:

 

    // Check some things periodically (whenever we run out of items in the current cursor batch).
    if (!oplogFetcherUsesExhaust && info.networkDocumentBytes > 0 &&
        info.networkDocumentBytes < kSmallBatchLimitBytes) {
        // On a very low latency network, if we don't wait a little, we'll be
        // getting ops to write almost one at a time.  This will both be expensive
        // for the upstream server as well as potentially defeating our parallel
        // application of batches on the secondary.
        //
        // The inference here is basically if the batch is really small, we are "caught up".
        sleepmillis(kSleepToAllowBatchingMillis);
    }

 

The second is in strategy.cpp, where we have some code that parses a reply and creates a response. Again, this might be really cheap but it's extra work that isn't being done for non-exhaust cursors. If I knew the server code better I might be able to exclude this.

 

if (OpMsg::isFlagSet(m, OpMsg::kExhaustSupported)) {
    auto responseObj = reply->getBodyBuilder().asTempObj();
    if (responseObj.getField("ok").trueValue()) {
        dbResponse.shouldRunAgainForExhaust = reply->shouldRunAgainForExhaust();
        dbResponse.nextInvocation = reply->getNextInvocation();
    }

Finally, service_state_machine.cpp has a `makeExhaustMessage` function that seems to do quite a lot of work, including `OpMsg::appendChecksum(&exhaustMessage)`, which is probably expensive for longer messages. It's unclear how often this is called though.

Comment by Will Korteland [ 28/Feb/22 ]

I'm picking this up after Monica and Alexey had to move on to something else, so here's my understanding of the situation to make sure I'm up to speed.

There are two perf regressions here, and both happened at around the same time. One turned out to be MDB (the "find" regression), the other turned out to be WT (the "remove" regression). The "find" regression went away when testing WT 4.4 with MDB 4.2, but the remove regression didn't.

One thing that I don't think has been explained yet is that, for this workload, we kind of expect to see the "remove" regression. The WT version that MDB 4.4 uses introduced durable history, and we're hitting something of a pathological case here with the "insert then immediately remove" pattern. After durable history, we keep these records in the data store for a few seconds, so now when we search for the non-existent entry, we actually have to do some work - we can't "short circuit" the search like we did previously. (I think this "few seconds" might be changing to "30 seconds" in an upcoming WT version, but don't quote me on that).

That said, this is just an explanation of what's happening; it might have changed since WT-7757 where we did some general optimisations in this area. This is a somewhat strange workload and not something we expect to hit in the real world, but the oplog is a little bit special and is hitting it.

Ger bisected the "find" regression to the MDB commit that started using exhaust cursors - I suspect that one's out of WT's hands since it's asking us to do more stuff. Confirming that is the next step for me - I plan to take a look at exactly what changes when using the exhaust cursor.

Comment by Bruce Lucas (Inactive) [ 17/Feb/22 ]

monica.ng, if I'm not mistaken, ScopedThread was renamed to Thread somewhere along the way.
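
For anyone re-running the repro on a newer shell, a minimal sketch of what that rename means in practice (assuming only the class name changed and the thread body from repro() above stays the same):

// Assumed modern equivalent of the original ScopedThread usage, per the rename above.
let thread = new Thread(function(t) {
    // ... same per-thread insert loop as in repro() ...
}, t)
thread.start()
thread.join()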

Comment by Monica Ng [ 17/Feb/22 ]

bruce.lucas dmitry.agranat The original reproducer uses the ScopedThread class, but this is not supported in the MongoDB shell anymore; is there an equivalent class we can use instead? I have a reproducer script in Python, but I'm not sure it does exactly the same thing.

Comment by Lingzhi Deng [ 09/Feb/22 ]

monica.ng I haven't conducted the same investigation here. But yes, all you need to do is to turn oplogFetcherUsesExhaust off.
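
For completeness, a sketch of how that might look (an assumption on my part: oplogFetcherUsesExhaust is a startup parameter, so it would be set when starting each node rather than at runtime; the shell can at least confirm the current value):

// Assumed startup form, set on every node of the replica set
// ("rs0" is a placeholder replica set name):
//   mongod --replSet rs0 --setParameter oplogFetcherUsesExhaust=false
// Confirm the current value from the shell:
db.adminCommand({getParameter: 1, oplogFetcherUsesExhaust: 1})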

Comment by Monica Ng [ 09/Feb/22 ]

lingzhi.deng I saw your comment here: https://jira.mongodb.org/browse/SERVER-48522?focusedCommentId=3700654&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-3700654 and wasn't sure if any investigation was conducted into this.

If not, I would like to go down the path of examining the performance trends without the use of exhaust cursors for oplog fetching so that I can potentially isolate any regressions related to WiredTiger specifically. Is there a way to disable the use of exhaust cursors for oplog fetching? Can it be turned off by setting the "oplogFetcherUsesExhaust" parameter? Is there anything else I should be aware of?

Comment by Monica Ng [ 07/Feb/22 ]

alexander.gorrod I've tried running the reproducer against a recent version of MongoDB that includes WiredTiger changes like WT-7757 and WT-6707. We thought these might help improve the query rate, but they don't seem to. From what I read above, it also looks like the regression could be coming from the MongoDB level, so I am thinking of passing the ticket back unless you have thoughts on other things we could look into.

Comment by Monica Ng [ 28/Jan/22 ]

I had a look into this and wanted to check whether we see similar trends in the metrics when running the reproducer against a newer version of MongoDB. I am still seeing a drop in the `ss opcounters query` metric, although the value of `ss wt cursor cursor remove calls` remains in the same range.

We noticed that the `ss wt cache checkpoint blocked page eviction` metric is considerably higher in v5.1.0. I will try to investigate whether reducing the concurrency of checkpoint and eviction will ease at least some of the performance regression (a sketch of one way to try this is below).
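
A sketch of one way to run that experiment (an assumption, not something that has been tried on this ticket yet): WiredTiger's eviction thread count can be adjusted at runtime through the wiredTigerEngineRuntimeConfig server parameter, e.g.:

// Assumed experiment: cap eviction at a single worker thread and re-run the repro
// to see whether the checkpoint-blocked-eviction metric and the query rate change.
db.adminCommand({
    setParameter: 1,
    wiredTigerEngineRuntimeConfig: "eviction=(threads_min=1,threads_max=1)"
})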

Comment by Alexander Gorrod [ 13/Oct/21 ]

Storage Engines team is going to pick this up and run the reproducer against the latest version of WiredTiger. We did some work in WT-7757 to skip tombstones more affordably, which will hopefully significantly reduce the performance difference between releases.

If there is still an unreasonable performance regression, we will do some analysis to find the right owner.

Comment by Kelsey Schubert [ 14/Dec/20 ]

After discussing with Bruce, I think this would be a good ticket for the performance team to pick up, to write a regression test and track down the remaining regression.

Comment by Bruce Lucas (Inactive) [ 11/Jun/20 ]

Thanks sulabh.mahajan. I've confirmed that, using the same shell in both cases, rc9 and 4.2.6 now show the same performance on the finds.

However, I'm still seeing a regression in the removes, and it looks like WT is being asked to do more work for the same number of documents removed. Left is 4.2.6 and right is 4.4.0-rc9:

"cursor remove calls" is the actual work being done, and we're doing it at a slower rate in 4.4.

But in addition to the actual work, we are doing a lot more cursor searches, wt transactions, wt set timestamps, etc., even though we are doing the actual work at a slower rate. As you suggest, this means something has changed in the upper layers, but I'm not sure at this point whether it is a change in the integration layer or in the query layer, so I'll investigate some more before reassigning.

Comment by Sulabh Mahajan [ 10/Jun/20 ]

bruce.lucas, my conclusion is that the latest 4.4 (rc8) shows a much smaller regression than the values initially reported. It seems that WT-6306 (and other work) has helped.
The remaining regression seems to be coming from layers above WiredTiger. I am passing the ticket back to you for the next steps.

Comment by Sulabh Mahajan [ 10/Jun/20 ]

Update:

  • I spent a lot of time looking at various t2 stats and could not find anything that made sense with respect to the slower query rates
  • I collected perf traces and compared 4.2.6 and 4.4.0, and there did not seem to be a smoking gun. WiredTiger also accounted for a very small portion of the CPU time.
  • I thought maybe the bottleneck is above WiredTiger, so I decided to use WiredTiger from 4.2 with MongoDB v4.4 and got interesting results

426 = MongoDB 4.2.6
440 = MongoDB v4.4 (2a20c87a57)
440WT42 = MongoDB v4.4 (2a20c87a57) with WiredTiger MongoDB-4.2 (8f7554c852)

All three versions show similar remove calls, with a slightly higher peak for 4.4.0. This could be because of the short burst nature of the removals, and possibly

  • 4.2.6 gets a consistent query rate of 2800 op/s
  • 4.4.0 gets an initial query rate of 2800 op/s that falls to a consistent rate of 1900 op/s
  • 4.4.0 with WT from 4.2 gets a consistent query rate of 1900 op/s

I think there is something in MongoDB outside WiredTiger that might be restricting queries at 1900 op/s. I tried to look for reasons inside WiredTiger that could explain a drop from 2800 op/s to 1900 op/s for 4.4.0 but couldn't locate any.

Comment by Sulabh Mahajan [ 09/Jun/20 ]

dmitry.agranat, I have not identified any specific ticket. WT-6306 is already in rc8, and it is the one I think got us a large chunk of the performance back. I am still looking into why we have slow queries in the latest 4.4.0 code.

Comment by Bruce Lucas (Inactive) [ 06/Jun/20 ]

I'm not sure - sounds promising, but I don't think the fixes you mentioned are in an rc yet. Here's what I'm seeing on (left to right) 4.2.6, rc7, rc8:

So on rc8 the query rate is much improved but still considerably lower than 4.2.6, and the remove rate still shows the same regression. I'll check again on rc9.

Comment by Sulabh Mahajan [ 05/Jun/20 ]

bruce.lucas, this ticket reduces to focusing on why the query rate is lower in 4.4.0. Do you agree? Other parameters seem comparable to me.

Comment by Sulabh Mahajan [ 04/Jun/20 ]

I ran the reproducer script on a 3-node replica set, on:
  • 4.2.6
  • 4.4.0 (v4.4 @ b7554bc8dc)
  • 4.4.0-lat (v4.4 @ b7554bc8dc + develop @ 912c4844f)

The eviction state is much better, with reduced forced eviction because of the deleted count. This is likely due to the fix in WT-6306.

The cursor remove calls appear to be comparable to 4.2.6

On zooming closer, here are some observations:

  • In 4.2.6 the query rate quickly ramps up to 2880 ops
  • In 4.4.0 it ramps up and then stays at roughly 1980 ops
  • With the latest develop, the query rate runs at 2500+ ops and then falls to 2000 ops

Comment by Bruce Lucas (Inactive) [ 02/Jun/20 ]

A couple of observations:

  • The regression does not manifest for standalone nodes, only for replica sets
  • Setting maxTargetSnapshotHistoryWindowInSeconds=0 eliminates the regression (see the sketch below)

So it appears this has something to do with the handling of timestamps.
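
For reference, a minimal sketch of that experiment (assuming the parameter can be changed at runtime via setParameter; otherwise it can be passed at mongod startup):

// Assumed runtime form of disabling the snapshot history window described above:
db.adminCommand({setParameter: 1, maxTargetSnapshotHistoryWindowInSeconds: 0})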

Comment by Brian Lane [ 02/Jun/20 ]

Hey alexander.gorrod - yeah, I set it to 4.4.0 and we discussed it in planning. We linked a couple of related issues we want in first, then we'll bring this one in. I put it at the top of the backlog.

Comment by Alexander Gorrod [ 02/Jun/20 ]

brian.lane did you see this ticket? We should get it prioritized when we can.

Comment by Bruce Lucas (Inactive) [ 01/Jun/20 ]

Thanks for the confirmation.

Comment by Louis Williams [ 01/Jun/20 ]

bruce.lucas, SERVER-40167 only applies to reads concurrent with prepared transactions. Every WT read operation is wrapped in a wiredTigerPrepareConflictRetry to retry in the face of prepare conflicts, so this function will show up in any stack trace involving a read.

Comment by Bruce Lucas (Inactive) [ 01/Jun/20 ]

louis.williams I noticed that SERVER-40167 is in the same general vicinity, as it involves removal and index keys, and wiredTigerPrepareConflictRetry appears on the PMP stacks above. I think this is probably something different, but can you take a look to rule out that the integration layer is generating extra work for the storage layer in 4.4?
