[SERVER-50749] Re-loading is slow with py-tpcc Created: 03/Sep/20  Updated: 06/Dec/22  Resolved: 12/Nov/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mark Callaghan (Inactive) Assignee: Backlog - Storage Engines Team
Resolution: Duplicate Votes: 0
Labels: KP44, dmd-escapes, perf-escapes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2020-09-09 at 6.16.19 PM.png     PNG File Screen Shot 2020-09-10 at 9.35.27 AM.png     PNG File example.png     File flow0.mo429.tar.gz     File flow0.mo440.tar.gz     File flow1.mo429.tar.gz     File flow1.mo440.tar.gz     File ftdc.tpcc.440.hang.tar     File metrics.2020-09-09T20-02-23Z-00000     File mongod.log.440.hang.gz    
Issue Links:
Backports
Duplicate
duplicates SERVER-61185 Use prefix_search for unique index lo... Closed
Related
related to WT-6664 Cache eviction causes high latency du... Closed
Assigned Teams:
Storage Engines
Operating System: ALL
Backport Requested:
v5.1
Sprint: Storage - Ra 2021-09-20, Storage - Ra 2021-10-04
Participants:

 Description   

While loading data for py-tpcc, flow control engages, the insert rate drops, and a few inserts take 200 to 300 seconds.

This report is from Percona; they previously gave us the repro for WT-6444 via py-tpcc. In their report, the first load into database tpcc1 takes ~20 minutes with a new mongod instance. After sleeping a few minutes and then repeating the load into database tpcc3, the second load takes ~500 minutes. They used a single-node replica set, and my repro attempts do the same.

Part of this is a duplicate of SERVER-46114, which was closed as works as designed. However, as the updates below show, there is a chance that mongod gets stuck with flow control engaged, an insert statement that never finishes, and mongod unable to shut down. So I don't think works as designed is appropriate.

Summarizing what I see below in my repro attempts:

  • this problem is new in 4.4.0. I tried but could not reproduce this with 4.2.9.
  • many inserts take more than 5 seconds with 4.4.0 (up to 390 seconds, ignoring the hang). No inserts take more than 5 seconds with 4.2.9.
  • in one test mongod got stuck. An insert statement was saturating a CPU core but making no progress for 1+ hour. It did not stop after killOp(). Shutting down mongod via "killall mongod" did not stop it, and eventually I had to kill -9.
  • with flow control enabled and 4.4.0 there are stalls (inserts that take 10 to 390 seconds)
  • with flow control disabled and 4.4.0 there are still stalls, but they are not as bad (10 to 60 seconds) as above

I have ftdc and mongod error logs for most of the results listed below. I can provide them if requested. There are many, so I prefer to do that on demand.
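
For reference, a minimal pymongo sketch of how flow control can be toggled between runs and inspected while the load runs; the connection string is an assumption for a local single-node replica set:

# Hedged sketch: disable flow control at runtime and inspect its serverStatus
# metrics. The connection string is an assumption.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

# Disable (or re-enable with True) flow control for the "flow control disabled" runs.
client.admin.command({"setParameter": 1, "enableFlowControl": False})

# Check flow control state while the load is running.
fc = client.admin.command("serverStatus").get("flowControl", {})
print("enabled:", fc.get("enabled"),
      "isLagged:", fc.get("isLagged"),
      "targetRateLimit:", fc.get("targetRateLimit"))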



 Comments   
Comment by Sulabh Mahajan [ 12/Nov/21 ]

I confirmed that the problem goes away if we do not use the unique index. In fact, current master performs better than 4.2.17 on the tpcc load and reload. Please see this comment on WT-6664 for more details.

Given there is nothing new for this server ticket, I am going to mark it as a duplicate of WT-6664. WT-6664 will likely in turn be duped to another WiredTiger ticket where we are resolving these unique-index issues.
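
For anyone checking their own py-tpcc schema, a minimal pymongo sketch that lists the unique indexes in the load database; the database name tpcc1 is taken from the description and may differ:

# List unique indexes in the tpcc load database. "tpcc1" comes from the description
# above; adjust the database name and connection string for your setup.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["tpcc1"]

for coll_name in db.list_collection_names():
    for idx in db[coll_name].list_indexes():
        if idx.get("unique"):
            print(coll_name, idx["name"], dict(idx["key"]))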

Comment by Sulabh Mahajan [ 12/Oct/21 ]

Adding the comment describing the current state of the issue from WT-6664:

Here is a summary of the current state of things:

Version                      FlowControl   Load 1              Load 2
4.2.17                       ON            79m 33s             81m 41s
4.2.17                       OFF           86m 21s             86m 35s
Latest Master (79370f1ea8)   ON            Killed after 120m   N/A
Latest Master (79370f1ea8)   OFF           97m 59s             Crash after 1539m 54s
Comment by Mark Callaghan (Inactive) [ 17/Sep/20 ]

Repeating a comment from WT-6664. Percona, the reporter of this problem, now has a blog post. As reported above, the problem doesn't happen for them, or is much less likely, with a smaller WiredTiger cache.

Comment by Keith Smith [ 11/Sep/20 ]

Based on mark.callaghan's comment that he's using a system with 60 GB of RAM, I did a few more tests on a similar system using different WT cache sizes. It looks like this problem gets worse with larger cache sizes. See WT-6664 for details.
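
For anyone repeating the cache-size comparison, a hedged sketch of resizing the WiredTiger cache on a running mongod via the wiredTigerEngineRuntimeConfig parameter; the 8G value and connection string are assumptions, and the cache can also be set at startup with --wiredTigerCacheSizeGB:

# Hedged sketch: shrink the WiredTiger cache at runtime and read back the cache stats.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
client.admin.command({
    "setParameter": 1,
    "wiredTigerEngineRuntimeConfig": "cache_size=8G",
})

cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
print("max bytes configured:", cache["maximum bytes configured"])
print("bytes currently in cache:", cache["bytes currently in the cache"])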

Comment by Geert Bosch [ 10/Sep/20 ]

Note that all of this worked before. The stuck-cache case was not stuck due to flow control, but due to an issue with the history store. All time was spent in the history store; see the backtraces. That thread was already past the flow-control stage (it had acquired the global lock) and should have run until the next interruption point, but never did. That's a WT issue. Now, I can see how flow control compounded the problem in this case, and we can fix that, but it doesn't fix the underlying WT issue. Similarly, inserting 1 doc at a time wastes enough CPU in mongod that the load on the storage engine will drop, also avoiding the cache-stuck issue. That doesn't seem a good solution in the long run. Why does mongod behave so much worse with 4.4 than with 4.2?

Comment by Keith Smith [ 10/Sep/20 ]

bruce.lucas: Your assumption is correct. "flowControl isLagged" was zero throughout my run with a small batch size. Likewise, the WT cache was much happier. It never entered aggressive eviction mode, never used application threads for eviction, and rarely had to do in-memory restoration on evicted pages (1-2/sec).  

 

Comment by Bruce Lucas (Inactive) [ 10/Sep/20 ]

keith.smith, the batch size is expected to affect both cache pinning and the operation of flow control if it engages. I imagine the beneficial effect in your test comes from the former: there is no longer any lag, so flow control no longer engages. It may be worth verifying that assumption in the ftdc by checking that the flow control isLagged metric remains 0.
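
For a live check of the same thing outside of ftdc, a small pymongo sketch that samples serverStatus during the load and reports any sample where flow control is lagged (connection string assumed):

# Sample flow control state once a second and flag any lagged interval.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
for _ in range(600):  # ~10 minutes of samples
    fc = client.admin.command("serverStatus").get("flowControl", {})
    if fc.get("isLagged"):
        print(time.strftime("%H:%M:%S"),
              "isLagged=1, timeAcquiringMicros:", fc.get("timeAcquiringMicros"))
    time.sleep(1)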

Comment by Mark Callaghan (Inactive) [ 10/Sep/20 ]

bruce.lucas - I use a c3.8xlarge host set up by DSI (16 cores, 60 GB of RAM, 20k EBS IOPS). The benchmark client shares the host and uses 10 connections (and 8 threads or processes).

Comment by Mark Callaghan (Inactive) [ 10/Sep/20 ]

Added:

  • flow0.mo440.tar.gz, flow0.mo429.tar.gz - flow control disabled for 4.4.0 and 4.2.9
  • flow1.mo440.tar.gz, flow1.mo429.tar.gz - flow control enabled for 4.4.0 and 4.2.9

These include ftdc, mongod error log and test output. The directory structure is like:
flow0.mo440/v1, flow0.mo440/v2 -> v1 and v2 because I repeated tests

from "ls flow0.mo440/v1"

  • w100 -> w400 is the number of warehouses (from 100 up to 400 in some cases)
  • pypyyes -> the python script was run by pypy, not python
  • l10 -> 10 loader clients (connections)

r.w100.l10.pypyyes  r.w150.l10.pypyyes  r.w200.l10.pypyyes  r.w250.l10.pypyyes  r.w300.l10.pypyyes  r.w350.l10.pypyyes  r.w400.l10.pypyyes

from "ls flow1.mo440/v2/r.w100.l10.pypyyes"

  • h.* and f.* files are PMP output
  • you can ignore o.* files
  • mongod.log is mongod error log from test end

diagnostic.data  f.200903_210423  h.200903_204618  h.200903_211703  o.asz1  o.grep  o.lsh   o.sz1  op.1   op.12  op.15  op.18  op.20  op.5  op.8
f.200903_204618  f.200903_211703  h.200903_204811  h.200903_211944  o.asz2  o.io    o.lshr  o.sz2  op.10  op.13  op.16  op.19  op.3   op.6  op.9
f.200903_204811  f.200903_211944  h.200903_210423  mongod.log       o.dbs   o.ls    o.ps    o.vm   op.11  op.14  op.17  op.2   op.4   op.7

Comment by Keith Smith [ 10/Sep/20 ]

I reran my test with internalInsertMaxBatchSize=2.

The maximum latency during the test dropped from 33.5 to 1.4 seconds, so I'll go out on a limb and assert that this parameter affects performance.

I can experiment with other values, but I worry that the results would be dependent on workload and system: we'd find a value that works on a few tests here, but some customer will discover a workload where it blows up. Is there a principled way we can reason about the tradeoffs here? Does this have to be a constant, or would it make sense to dynamically adjust the batch size based on the state of the system?

mark.callaghan: I'm already using pypy.
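
To make the dynamic-adjustment idea concrete, a purely illustrative client-side sketch; this is not how the server behaves today, and the 80% threshold and 50/500 batch sizes are made up:

# Illustrative only: shrink internalInsertMaxBatchSize when the WT cache is under
# pressure, restore the default (500) otherwise.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
for _ in range(720):  # poll for about an hour
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    used = cache["bytes currently in the cache"]
    limit = cache["maximum bytes configured"]
    batch = 50 if used > 0.8 * limit else 500
    client.admin.command({"setParameter": 1, "internalInsertMaxBatchSize": batch})
    time.sleep(5)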

Comment by Bruce Lucas (Inactive) [ 10/Sep/20 ]

Thanks Mark. That does look quite a bit different from the data Keith posted (although it may well be simply a more extreme manifestation of the same underlying problems).

  • lag continues growing throughout the test and does not recover after the test, consistent with a hang
  • flow control is constantly engaged
  • flow control is actually doing significant throttling in this case because the rate of global lock acquisition before flow control was ~500/s, so throttling to the floor of 100/s is significant

But given the hang that appears to create unrecoverable cache pressure, I'm not sure the details of flow control behavior or the latencies aside from the hang in this particular test are too interesting: the hang is by far the worst problem and can be expected to cause the other problems.

In addition to the python difference you noted, your data shows a 4x larger server running mongod than Keith's data.

Comment by Mark Callaghan (Inactive) [ 10/Sep/20 ]

bruce.lucas - Added the missing ftdc and error log files I mentioned here

keith.smith - please use pypy rather than python to run py-tpcc, as it has much lower CPU overhead, leading to more pressure on MongoDB

Next up is ftdc with and without flow control enabled

Comment by Bruce Lucas (Inactive) [ 10/Sep/20 ]

This example seems representative:

  • D: flow control engages and begins doing some throttling ("flowControl timeAcquiringMicros"), but doesn't really have much effect on insert rates ("opcounters insert", "cursor insert calls"). I think this is because of another known issue with flow control: throttling is based on global lock acquisition, but because of batching, global locks are being acquired at a low rate, averaging 135/s ("locks Global acquireCount w"); and in order to avoid potential impact to internal operations, flow control has a throttling floor of 100 lock acquisitions/s, so it's not going to do much throttling if the average without throttling is only 135/s.
  • E: cache pressure takes over as the throttling mechanism ("application thread time waiting for cache"). Note that even though we've started throttling quite heavily (based on insert rates), lag is still increasing. I think this might be the effect I mentioned above. However, at this point it is cache pressure doing the throttling, not flow control, so it doesn't really matter that flow control remains engaged.

In addition to helping in the way that Eric describes, decreasing the batch size will increase the rate of lock acquisition, and so should make flow control more effective if it does engage; tbd whether this is a good thing or not.

Another question: if flow control is not doing a lot, why is it having such a large impact on maximum latency? I seem to recall some concern about whether the OS mechanism underlying flow control tickets is fair, or may even be intentionally unfair in order to favor throughput over latency. If it is unfair, maybe a thread could repeatedly get starved out trying to obtain a ticket?
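
As an aside, a small pymongo sketch of measuring that global lock acquisition rate directly from serverStatus (connection string assumed; locks.Global.acquireCount.w is the serverStatus counterpart of the "locks Global acquireCount w" metric):

# Sample the global 'w' lock acquisition counter over 10 seconds to see how close
# the rate is to flow control's ~100 acquisitions/s floor.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

def global_w_acquires():
    return client.admin.command("serverStatus")["locks"]["Global"]["acquireCount"]["w"]

before = global_w_acquires()
time.sleep(10)
after = global_w_acquires()
print("Global 'w' acquisitions/s:", (after - before) / 10.0)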

Comment by Keith Smith [ 10/Sep/20 ]

milkie: It should be easy enough to re-run.  I'll try first with internalInsertMaxBatchSize set very low (1?), so we can test the hypothesis that this is causing the stalls.  If it is, then we can experiment with a better size.  

geert.bosch: Flow control and history store also seem implicated in my test. I was just focused on the WT cache when I selected the stats to look at.  See below:

 

Comment by Eric Milkie [ 10/Sep/20 ]

Alex, for your number 1 above, the current batch size for insert is "500" and this is run-time settable with the "internalInsertMaxBatchSize" server parameter, so we can experiment with different values to see their effect.
Can someone rerun the test with internalInsertMaxBatchSize set to something like 100? Running with five times as many individual transactions as before shouldn't have a material effect on overall runtime (I surmise), but with open transactions five times smaller, WiredTiger should be able to evict pages more efficiently as new data is inserted. Once we arrive at a better max batch size, we can consider changing the default in released MongoDB.
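
For concreteness, that rerun sketched with pymongo (connection string assumed; set the parameter before starting the py-tpcc load):

# Set internalInsertMaxBatchSize to 100 at runtime and read it back to confirm.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
client.admin.command({"setParameter": 1, "internalInsertMaxBatchSize": 100})
print(client.admin.command({"getParameter": 1, "internalInsertMaxBatchSize": 1}))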

Comment by Alexander Gorrod [ 10/Sep/20 ]

I believe keith.smith reproduced this. We walked through the FTDC in detail today.

The issue that introduces latencies with this test is that it batches operations into large groups and executes multiple of those batches in parallel. That results in MongoDB creating multiple large transactions running against WiredTiger at the same time. Sometimes one of those transactions falls behind, which leads to the oldest timestamp and transaction ID being pinned. Since a lot of content is pinned by active transactions (content WiredTiger still can't evict even with durable history), WiredTiger ends up choosing inefficient pages to evict when trying to find enough space in cache to complete the lagging transaction. Our team did work coming into the 4.4.0 release to alleviate the long latencies seen in this workload, but that tuning either didn't catch enough cases or was undone by another change (which is possible - the heuristics we added to handle this case were tricky to get right).

The reason we split off a separate WiredTiger ticket is that the server could change its behavior to alleviate the symptom here as well. I can see two avenues that would be worth exploring:

1) The server could implement a smarter version of insertBatchAndHandleErrors. At the moment that function tries to do all operations in a batch within a single WiredTiger transaction; if that fails, it falls back to doing each operation in a separate transaction - and WiredTiger tries very hard not to fail a transaction due to the amount of resources it holds. Reducing the batch size in that code would help ensure transactions don't get big enough to put this sort of pressure on the storage engine.
2) Fix the issue that causes flow control to exacerbate the poor behavior, since running with flow control enabled extends the maximum latency for an operation from 80 seconds to 390 seconds (if I'm reading the results correctly).
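
To make the shape of option 1 concrete, here is a rough Python sketch of the batch-then-fallback pattern described above with a smaller cap; this is not the server's code, and the storage/transaction interface is hypothetical:

# Illustrative only: try a capped batch in one storage-engine transaction, and on
# failure retry each document in its own transaction.
def insert_batch_with_fallback(storage, docs, max_txn_docs=100):
    for start in range(0, len(docs), max_txn_docs):
        chunk = docs[start:start + max_txn_docs]
        try:
            with storage.transaction() as txn:   # one transaction per capped chunk
                for doc in chunk:
                    txn.insert(doc)
        except Exception:
            # Fallback described above: one document per transaction, so no single
            # transaction pins a large amount of cache content.
            for doc in chunk:
                with storage.transaction() as txn:
                    txn.insert(doc)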

Comment by Geert Bosch [ 10/Sep/20 ]

Both the activation of flow control (supposedly due to a lagging majority-committed point) and, in the hanging case, the stalling threads having backtraces in the history store point to a scenario in which specific operations that need access to the history store stop making progress, resulting in a negative feedback loop. It doesn't look like your experiment reproduced that scenario.

Comment by Keith Smith [ 09/Sep/20 ]

I've been reproducing this with partial success using mark.callaghan's repro recipe. He performs two runs of py-tpcc and observes that the second takes much longer. I do not see this, but like him I see a number of high-latency insert operations. It looks like WT cache eviction is a (the?) culprit here.

Looking at the FTDC data from my test, I added a pin at the time of every request that took longer than 16s. That was 15 requests in about 25 minutes of run time. These requests align perfectly with spikes in various (mostly cache-related) WT stats:

FWIW, I've been doing this on a VM with 16 GB of RAM and four cores.

I've opened WT-6664 for this, but since I haven't reproduced the original issue of the py-tpcc reload being so much slower, I'm hesitant to say this is the entire cause of the problem.

Comment by Bruce Lucas (Inactive) [ 08/Sep/20 ]

mark.callaghan, I'm not seeing these two files on the ticket:

ftdc.tpcc.440.hang.tar - diagnostic.data directory
mongod.log.440.hang.gz - the mongod log file

Comment by Mark Callaghan (Inactive) [ 04/Sep/20 ]

Results similar to the previous comment, but with flow control disabled. Stalls were not as bad, but there were still many inserts that took more than 5 seconds with 4.4.0, and the worst case was ~75 seconds. No inserts took >= 5 seconds with 4.2.9.

Legend:

  • L1, L2 - number of seconds for the load
  • slow - number of slow queries with slowms=5000

flow control disabled

        -------- 4.4.0 -------- -------- 4.2.9 --------
w       L1      L2      slow    L1      L2      slow
100     203     193     12      186     189     0
150     306     339     33      278     274     0
200     413     377     37      365     373     0
250     486     555     49      461     458     0
300     615     626     60      560     560     0
350     734     673     47      652     663     0
401     799     808     74      749     727     0

Response time histogram for slow queries from the test with 4.4.0 and flow
control disabled. These are counted from the slow query log with slowms=5000.
Even with flow control disabled, stalls are still a problem for WiredTiger.

w       <=10s   <=20s   <=30s   <=40s   <=50s   <=60s   <=70s   <=80s
100     8       2       1       1
150     19      8       3       1               2
200     30      2       1       3       1
250     32      8       5       3                               1
300     42      8       3       4       1       2
350     30      6       5       6
400     54      11      5       1       3
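
For anyone repeating this, a small sketch of how the per-bucket counts can be extracted from the mongod log, assuming the 4.4 structured JSON log format and slowms=5000 (4.2 uses a plain-text log and would need different parsing):

# Count slow queries from mongod.log into 10-second buckets, matching the tables above.
import json

buckets = {}
with open("mongod.log") as f:
    for line in f:
        try:
            entry = json.loads(line)
        except ValueError:
            continue
        if entry.get("msg") != "Slow query":
            continue
        ms = entry.get("attr", {}).get("durationMillis", 0)
        if ms >= 5000:
            bucket = ((ms - 1) // 10000 + 1) * 10   # e.g. 33.5s falls in the <=40s bucket
            buckets[bucket] = buckets.get(bucket, 0) + 1

for limit in sorted(buckets):
    print("<=%ds: %d" % (limit, buckets[limit]))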

Comment by Mark Callaghan (Inactive) [ 04/Sep/20 ]

I repeated tests for 100, 150, 200, 250, 300, 350 and 400 warehouses. For all of these the database was in memory – no reads from storage. Tests were done for 4.4.0 and 4.2.9 with and without flow control enabled.

First up is results with flow control enabled.

Legend:

  • L1, L2 - number of seconds for the load
  • slow - number of slow queries with slowms=5000
  • inf == mongod was stuck, killed after 1 hour

flow control enabled, v1. The X indicates that I was unable to count slow queries.

        -------- 4.4.0 -------- -------- 4.2.9 --------
w       L1      L2      slow    L1      L2      slow
100     184     444     X       180     187     0
150     297     621     X       275     269     0
200     540     591     X       368     365     0
250     775     653     X       459     466     0
300     711     711     X

flow control enabled, v2. The inf indicates that an insert statement and mongod got stuck as described above; I ended up having to use kill -9 to stop mongod.

        -------- 4.4.0 -------- -------- 4.2.9 --------
w       L1      L2      slow    L1      L2      slow
100     175     inf     X

flow control enabled, v3. There were no inserts that took >= 5 seconds with 4.2.9, but many with 4.4.0.

        -------- 4.4.0 -------- -------- 4.2.9 --------
w       L1      L2      slow    L1      L2      slow
100     180     248      9      183     190     0
150     273     306      6      269     273     0
200     444     387      9      362     371     0
250     720     601     24      449     466     0
300     761     992     34      547     573     0
350     710     671      7      664     658     0
400     924     915     36      754     761     0

Response time histogram for slow queries from the test with 4.4.0 and flow
control enabled. These are counted from the slow query log with slowms=5000.
The problem of stalls from flow control is intermittent and in this run
w=250 and w=300 were the worst.

w       <=10s   <=20s   <=30s   <=40s   <=50s   <=60s   <=70s   <=80s   > 80s
100      8                                                      1
150      1       4                      1
200      4       1      2               1               1
250     16       2      3       1                                       1 @ 150s, 1 @ 270s
300     19       9      3       1       1               1               1 @ 160s, 1 @ 180s, 1 @ 390s
350      2       2      3                                               1 @ 110s
400     12      12      4       3       1               7               2 @ 110s

Comment by Mark Callaghan (Inactive) [ 03/Sep/20 ]

Have yet to reproduce this in 4.2.9. It is intermittent with 4.4.0.
