[SERVER-26055] Server fills cache rapidly, even under replication only, then operates very slowly, causing an oplog backlog; when serving clients, queries take ages to return once the cache is 'full' Created: 10/Sep/16  Updated: 19/Nov/16  Resolved: 19/Nov/16

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Paul Ridgway Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File application-eviction-threads.png     PNG File graphite.thefloow.net.png    
Issue Links:
Duplicate
duplicates SERVER-25974 Application threads stall for extende... Closed
Operating System: ALL
Participants:

 Description   

We upgraded a secondary of a 3 node cluster to 3.2.9.

By default when we upgrade we use iptables to allow replication but block clients.

Upon allowing clients, the cache went up to ~96% and failed to drop. Only 1 (of 16) cores appeared to be in use.

Blocking clients, restarting, and allowing replication caused the oplog to catch up, but over time the cache still fills and performance hits rock bottom.

mongostat

insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn   set repl                 time
    *6    *0    *14     *1       0    14|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16 1.23k   127k   16 floow  SEC 2016-09-10T20:18:27Z
    *1    *0    *21     *2       0    13|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16 1.07k   127k   16 floow  SEC 2016-09-10T20:18:28Z
    *0    *0     *0     *0       0     9|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16  917b  93.2k   16 floow  SEC 2016-09-10T20:18:29Z
    *9    *0    *29     *1       0    12|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16 1.01k   126k   16 floow  SEC 2016-09-10T20:18:30Z
    *2    *0     *4     *1       0    13|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16 1.17k   126k   16 floow  SEC 2016-09-10T20:18:31Z
   *24    *0   *161    *10       0    14|0     1.3   96.0       0 21.9G 21.1G   0|0  0|15 1.13k   127k   16 floow  SEC 2016-09-10T20:18:32Z

iostat:

09/10/2016 08:18:51 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.51    0.00    0.06    0.13    0.00   93.30
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00   22.00    0.00    92.00     0.00     8.36     0.04    2.00    2.00    0.00   0.18   0.40
xvdh              0.00    99.00   21.00   17.00   260.00  1292.00    81.68     0.06    1.47    0.57    2.59   0.63   2.40
xvdz              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

replication status (command took 10-15min to return):

db-node2(mongod-3.2.9)[SECONDARY:floow] test>rs.printReplicationInfo()
configured oplog size:   614400MB
log length start to end: 958414secs (266.23hrs)
oplog first event time:  Tue Aug 30 2016 14:06:44 GMT+0000 (UTC)
oplog last event time:   Sat Sep 10 2016 16:20:18 GMT+0000 (UTC)
now:                     Sat Sep 10 2016 20:25:01 GMT+0000 (UTC)

Upon restart (which often takes ages), replication catches up, but then the cache fills and the scenario repeats.

Note: Other nodes are running 3.0 still.

I also experimented with changing WT parameters with no joy.

We will downgrade, but we are leaving this node at 3.2.9 with low priority for now to allow for diagnostics and logs if required.

With 3.0 we still have cache-filling issues, but they occur only once or twice a month. With our workload, mmap was pretty much maintenance-free (very stable, minimal issues except the disk usage), 3.0 WT causes some pain but is manageable, and 3.2 WT is unusable.



 Comments   
Comment by Alexander Gorrod [ 28/Sep/16 ]

bigbourin@gmail.com you said:

Our secondary - which is just used for replication, dump every night and some slow queries (stats) - started responding slowly and got stuck at 100% cpu, after 5 days in production. So this time it took much longer, our dumps didn't kill it, but it still ended-up in the same situation.

That sounds like a different issue to the one discussed in this ticket. I recommend opening a new JIRA ticket with a description of your setup and workload and uploading MongoDB logs and diagnostic data from the machine where the issue occurs.

Comment by Adrien Jarthon [ 27/Sep/16 ]

Thanks for the detailed explanation Thomas,

I do understand why all of these could cause a big slowdown on the server, but I have a hard time understanding how this kind of situation can be unrecoverable. In our case, our server stayed at 100% CPU for hours with no requests and no IO usage (replication stuck at the same point in time). We left it like this to see if it would recover at some point as the load decreased (by stopping our queries), but it never did and we had to restart. Could any of the issues you mentioned cause this, an infinite circle of death for the server?

After reading all these posts, we downgraded to 3.2.5 to avoid the huge regression we had in 3.2.9. It was much better, BUT we still got the same problem today. Our secondary (which is just used for replication, a dump every night, and some slow queries for stats) started responding slowly and got stuck at 100% CPU after 5 days in production. So this time it took much longer and our dumps didn't kill it, but it still ended up in the same situation.

We'll try 3.2.10 of course, but we're not sure we'll be able to build a script that reproduces the takedown of 3.2.5 so we can compare it with 3.2.10. I'll have a look at the fixed bugs to better understand how to replicate the issue more quickly, and how to make sure we won't hit it again with 3.2.10.

Thanks,
Adrien

Comment by Kelsey Schubert [ 27/Sep/16 ]

Hi bigbourin@gmail.com,

As you know, by default, when the WiredTiger cache usage goes over 95%, application threads (those performing operations) are involved in the eviction process. Some or all of the application threads begin looking for candidate pages to evict so that cache usage can be reduced to its 80% target.

To go into a bit more detail about this process, there is a single eviction server thread that walks the cache identifying potential candidate pages for eviction and places them on a queue. At the same time, there are multiple eviction worker threads that take candidate pages from the queue and attempt to evict them, although for various reasons this may or may not succeed. When cache utilization exceeds 95%, application threads also function as eviction worker threads and attempt to evict pages. This can be a source of performance problems if the eviction process is not efficient, since the application threads are then busy and cannot handle requests from clients.
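
One rough way to observe this state is to compare cache usage against the configured maximum in serverStatus() from the mongo shell. The sketch below is illustrative only; the statistic names are taken from 3.2-era WiredTiger output and may differ between versions.

// Rough cache-pressure check from the mongo shell (statistic names
// assume a 3.2-era serverStatus() and may vary between versions).
var c = db.serverStatus().wiredTiger.cache;
var usedPct  = 100 * c["bytes currently in the cache"] / c["maximum bytes configured"];
var dirtyPct = 100 * c["tracked dirty bytes in the cache"] / c["maximum bytes configured"];
print("cache used %:  " + usedPct.toFixed(1));   // application threads join eviction above ~95%
print("cache dirty %: " + dirtyPct.toFixed(1));
// Pages that could not be evicted because a reader still held a reference:
print("hazard pointer blocked page eviction: " +
      c["hazard pointer blocked page eviction"]);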

There are three general improvements to eviction in WiredTiger that fix distinct issues in the eviction process. These improvements correct issues that are all observed when the cache reaches 95% and application threads start evicting pages. As a result, these issues have very similar symptoms even though the underlying causes are different.

The first improvement was to make sure the eviction server was working as efficiently as possible to identify eviction candidates. Through MongoDB 3.2.9, this had been our focus and as a result of this effort, we resolved issues such as SERVER-24580.

The second improvement was to improve the likelihood that candidates found by the eviction server can be successfully evicted by eviction workers (including application threads doing eviction). MongoDB 3.2.10 includes fixes that improve some cases where hazard pointers block page eviction and we are confident that eviction is working more efficiently in general as a result of these code changes.

The third improvement requires a bit of background knowledge. The WiredTiger eviction server makes space available in cache by walking the set of pages currently in cache and selecting good candidates for eviction. The server concentrates those walks on parts of the cache where it has found good candidates in the recent past. Therefore, it is possible that when a new read load is introduced, the portion of the cache that holds good candidates will change. Subsequently, for a period of time after the change in cache content, the eviction server will spend more time looking for candidates in unproductive portions of the cache. Specifically, we believe that it is spending a lot of time walking trees with a lot of internal pages and very few leaf pages, since internal pages would be excluded from eviction at this stage. MongoDB 3.2.10 includes WT-2902 which causes the eviction server to spend less time looking for candidates in a section of the cache if it isn't finding enough.

I hope that this explanation clarifies the work that we have been putting into resolving this issue and explains some of the diagnostic challenges we have faced, since many of the symptoms of these issues overlap and each workload stresses the system differently. To improve our diagnostic capabilities, we plan to include additional metrics (WT-2920), which will help to identify these types of issues more quickly.

In summary, MongoDB 3.2.10 will include a number of fixes, which I think will benefit your workload. Please feel free to review the relevant WiredTiger tickets for additional details: WT-2902, WT-2816/WT-2866, WT-2928, and WT-2924.

Kind regards,
Thomas

Comment by Adrien Jarthon [ 16/Sep/16 ]

Hi all, I just wanted to add that we recently upgraded our 3.2.9 replica from MMAP to WT and discovered this bug, which almost took our platform down. Fortunately we managed to catch the problem on our secondary during a nightly dump, which slowed the server to a crawl to the point where it couldn't catch up with the primary or execute a single command (with 100% CPU usage and 0% IO) until we restarted the process. We searched for an explanation for hours before suspecting a bug in MongoDB itself, and we finally looked at the bug tracker to find this was actually a regression in WT introduced after 3.2.5.

I'm happy to see that this issue is being worked on and will probably be fixed in 3.2.10, but also sad that this regression could stay unfixed in a production release for months; we'll never be confident upgrading MongoDB ☹.

We temporarily downgraded to 3.2.5 and it fixed the issue as expected. We expect to test 3.2.10 when it's out, because we can't really leave our standard package management process. Is there any clear explanation of what the regression is, what has been done to fix it (is it just like 3.2.5?), and what has been done to prevent this in the future?

Comment by Bartosz Debski [ 16/Sep/16 ]

Hi, so we have run 3.2.10-rc0 in production for more than a day. As the WiredTiger cache was our main concern, below is a graph of WT cache behaviour since 5 PM yesterday. As Paul mentioned earlier, this release looks much better than 3.2.9, but there are still some fluctuations in cache size on 3.2.10-rc0. This is manageable, so it's not a worry in its current state.

I hope this helps to confirm or analyse further any fixes.

Comment by Kelsey Schubert [ 14/Sep/16 ]

Hi paul.ridgway,

I'm glad to hear your testing is going well. The production release of MongoDB 3.2.10 should be out within a week.

Kind regards,
Thomas

Comment by Paul Ridgway [ 14/Sep/16 ]

Hi all,

So far this is looking good. We let replication alone run for a number of hours, which worked well. During that time I put pressure on WT by dropping the eviction threshold all the way down to 30% in 10% increments, and it all kept up fine.
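
For reference, one way such a threshold change can be applied at runtime is via the wiredTigerEngineRuntimeConfig server parameter; the option names and values below are a sketch, not necessarily the exact commands used here.

// Illustrative only: lower WiredTiger's eviction target/trigger at runtime.
// The accepted option string depends on the WiredTiger build shipped with
// your mongod version; verify against the documentation before use.
db.adminCommand({
  setParameter: 1,
  wiredTigerEngineRuntimeConfig: "eviction_target=70,eviction_trigger=95"
});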

Now we're back to defaults (80%) and allowing it to serve some reads too; still looking good.

We'll open it up to all traffic tomorrow and test further.

Do you know the intended release date of 3.2.10 stable?

Cheers

Comment by Bartosz Debski [ 14/Sep/16 ]

Hi Thomas, thanks for this, I will apply RC release today and will update later on.

Comment by Kelsey Schubert [ 13/Sep/16 ]

Hi paul@thefloow.com and bartosz.debski,

I have completed my analysis of the diagnostic.data you have provided, and I believe that we are looking at the same issue as described in SERVER-26001.

When the cache hits 96%, we see that application threads begin evicting pages. At the same time there is a significant increase in the number of eviction calls finding an empty eviction queue. This indicates that the application threads are starved for work to do.
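
As a minimal sketch, that counter can be watched from the mongo shell while the problem is occurring; the statistic name below is taken from 3.2-era serverStatus() output and may differ in other versions.

// Print the per-second increase in the empty-eviction-queue counter.
var prev = null;
for (var i = 0; i < 30; i++) {
  var cur = db.serverStatus().wiredTiger.cache["eviction calls to get a page found queue empty"];
  if (prev !== null) {
    print(new Date().toISOString() + "  empty-queue eviction calls/sec: " + (cur - prev));
  }
  prev = cur;
  sleep(1000);  // mongo shell sleep, in milliseconds
}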

We have seen significant improvements to this behavior when using MongoDB 3.2.9 with a newer version of WiredTiger. MongoDB 3.2.10-rc0 includes these fixes to WiredTiger, and we expect that it will correct this behavior.

You can download MongoDB 3.2.10-rc0 to confirm that this issue is resolved. Please let me know if you have any questions.

Thank you,
Thomas

Comment by Daniel Pasette (Inactive) [ 13/Sep/16 ]

I understand and that makes perfect sense. We are cutting the RC today (hopefully ready this afternoon) and we plan to analyze your data today as well. I certainly wouldn't expect you to try out a release candidate in this case without good reason to expect it will address your issue.

Thanks again for your help.

Comment by Bartosz Debski [ 13/Sep/16 ]

Hi Dan, how sure are you guys that the RC will address the issues we are seeing? I think there is a mutual benefit in us testing for you, but as we can't replicate this issue on any other database we have, we are forced to test it on our live environment, as Paul mentioned. This makes us worried.

I will test this RC for you with only oplog replication allowed and hidden set to true. Provided that all runs OK for a day or two, we can then enable some client traffic, but ideally we would like some information from the analysis of the diagnostic data before we test this RC.

Comment by Paul Ridgway [ 13/Sep/16 ]

Thanks Dan. This is a production environment so I'm not sure (unfortunately our non-prod environments do not have the same traffic pattern); in any case I will discuss it with the ops team.

Comment by Daniel Pasette (Inactive) [ 13/Sep/16 ]

Thanks Paul. You can downgrade. We are preparing a release candidate tomorrow which has a handful of fixes for some of the symptoms you describe. We will analyze the diagnostic data to check that it fits the pattern. If so, would you be willing to test the release candidate with your workload?
Thanks again.

Comment by Paul Ridgway [ 12/Sep/16 ]

Done (logs-diag.7z)

If this is enough, please let us know so we can downgrade and restore client use.

Comment by Kelsey Schubert [ 12/Sep/16 ]

Hi paul.ridgway

Please use this secure upload portal to upload the files.

Thank you for your help,
Thomas

Comment by Paul Ridgway [ 12/Sep/16 ]

Sure, can you provide an upload link?

Comment by Daniel Pasette (Inactive) [ 12/Sep/16 ]

Hi Paul,
Thanks for submitting this issue.
Could you provide us with the diagnostic data files and run logs from the node running 3.2?
