[SERVER-23001] Occasional 100% cache use cripples server Created: 08/Mar/16  Updated: 18/Jun/16  Resolved: 28/Mar/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Paul Ridgway Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Distributor ID: Ubuntu
Description: Ubuntu 14.04.4 LTS
Release: 14.04
Codename: trusty

mongod -version
db version v3.0.9
git version: 20d60d3491908f1ae252fe452300de3978a040c7

AWS c4.4xlarge

1 x 6TB gp2 volume for mongo exclusively

No other significant processes or services running/consuming resource.


Attachments: PNG File cache.png     PNG File cache1.png     PNG File cache2.png     PNG File ram-and-cache.png    
Issue Links:
Duplicate
duplicates SERVER-22831 Low query rate with heavy cache press... Closed
duplicates SERVER-24580 Improve performance when WiredTiger c... Closed
Operating System: Linux
Participants:

 Description   

Apologies, this may belong as a WT ticket?

When running WiredTiger in a 3-node replica set, after a period of time one or more of the nodes will consistently run above 80% cache used and keep growing until it hits 100%, at which point it struggles to serve reads (or writes, when master) in a timely fashion.

Usually a restart and/or stepDown helps.
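
For reference, a minimal sketch of that recovery path, assuming the affected node is primary and the service is managed as "mongod" (both assumptions; adjust to your setup):

# Ask the primary to step down and refuse re-election for 60 seconds
# (the shell connection may drop when the step-down takes effect)
mongo --eval "rs.stepDown(60)"

# Then restart the mongod process on the affected node
sudo service mongod restart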

So far, adding more eviction threads or reducing the eviction trigger and target has not helped, nor has providing more RAM/cache.
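
For context, those eviction settings can be changed at runtime through the wiredTigerEngineRuntimeConfig server parameter; a minimal sketch, with purely illustrative values:

# Raise the eviction worker thread count and lower the eviction target/trigger
# (values here are examples, not recommendations)
mongo --eval 'db.adminCommand({setParameter: 1, wiredTigerEngineRuntimeConfig: "eviction=(threads_min=4,threads_max=8),eviction_target=75,eviction_trigger=90"})'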

Before this happens mongod uses ~45-50% of system RAM; once the issue presents, RAM usage goes over 50% (anecdotally, sometimes up to 80%).

The issue seems to occur randomly after days or more of apparent stability.

There is no apparent stress on the VM. More RAM did not help; more CPU seems to have reduced the rate of occurrences (we were previously on m4.2xlarge instances). Before moving to the current instances there was a lot of system CPU usage.

Running on 3 x AWS c4.4xlarge: 16 cores, 30 GB RAM.

Mongostat output from the affected slave (during the issue):

insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn   set repl     time
   *50   446  *1720     *3       4   274|0     0.3  100.0       0 17.4G 15.8G   0|0  1|16   85k     5m  681 floow  SEC 09:13:32
   *52   323  *3099    *91       1   113|0     0.5  100.0       0 17.4G 15.8G   0|3  27|0   60k     7m  681 floow  SEC 09:13:33
   *50   220  *2825     *6       4    75|0     0.2   99.7       0 17.4G 15.8G   0|0   1|2   43k     8m  681 floow  SEC 09:13:34
   *35   221  *3417    *11       2   263|0     0.3  100.0       0 17.4G 15.8G   0|2  1|12   54k     2m  681 floow  SEC 09:13:35
   *18   432  *2013    *12       8   344|0     0.2  100.0       0 17.4G 15.8G   2|3  19|0   92k    16m  681 floow  SEC 09:13:36

IOStat:

03/08/2016 09:14:09 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.17    0.00    4.62    2.70    0.13   79.38
 
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdz              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00 1455.00    0.00 16260.00     0.00    22.35     1.06    0.74    0.74    0.00   0.16  22.80

iotop:

Total DISK READ :      13.69 M/s | Total DISK WRITE :      41.78 M/s
Actual DISK READ:      13.72 M/s | Actual DISK WRITE:      23.58 M/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 72806 be/4 mongodb     3.83 K/s    8.72 M/s  0.00 % 54.51 % mongod --config /etc/mongod.conf
   686 be/3 root        0.00 B/s  881.69 K/s  0.00 % 14.18 % [jbd2/xvdb-8]
129706 be/4 mongodb  1153.86 K/s   46.00 K/s  0.00 %  7.27 % mongod --config /etc/mongod.conf



 Comments   
Comment by Bruce Lucas (Inactive) [ 18/Jun/16 ]

Thanks Paul. I can confirm that this looks like an instance of the issue (or issues) we're investigating in SERVER-24580: when WT cache utilization reaches 95%, application threads get involved in evictions, and we see large gaps in the data collection, confirming that application threads in general are seeing very high latencies.
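
For anyone watching for that condition, a minimal sketch that computes the cache fill and dirty percentages from serverStatus (field names as reported under serverStatus().wiredTiger.cache):

# Print WiredTiger cache fill and dirty percentages
mongo --eval '
    var c = db.serverStatus().wiredTiger.cache;
    var max = c["maximum bytes configured"];
    print("cache used %:  " + (100 * c["bytes currently in the cache"] / max).toFixed(1));
    print("cache dirty %: " + (100 * c["tracked dirty bytes in the cache"] / max).toFixed(1));'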

Comment by Paul Ridgway [ 18/Jun/16 ]

I've got the diagnostics from the 3 nodes. The primary moved around a lot in the chaos and there was lots of restarting and resizing of instances (none of which helped). I uploaded them using the Support File Upload tool; hope it helps!

Thanks

Comment by Ramon Fernandez Marina [ 17/Jun/16 ]

Thanks for the update paul@thefloow.com, and sorry to hear 3.2 did not fix your use case. We're investigating performance issues when the WiredTiger cache fills up in SERVER-24580; feel free to follow that ticket for further updates.

MongoDB 3.2 automatically collects diagnostic data in the diagnostic.data directory inside your dbpath. If you still have those files and could upload them to this ticket we could investigate whether this indeed matches SERVER-24580 or is a different problem. You can open a new ticket as well.
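
A minimal sketch for gathering those files, assuming a dbpath of /var/lib/mongodb (substitute your actual dbpath):

# Archive the FTDC diagnostic data from this node for upload
tar czf diagnostic-data-$(hostname).tar.gz -C /var/lib/mongodb diagnostic.data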

Thanks,
Ramón.

Comment by Paul Ridgway [ 17/Jun/16 ]

In a high-volume prod env 3.2 was catastrophically worse and we had to roll back. We plan to update some slaves to get logs when we can. In testing it was fine, but it was disastrous under the production workload ...

EDIT: I should add that it was crippling every node, master and slave, within moments of a restart at times, but not predictably. On 3.0 a stepDown and restart would cure a stressed node.

Comment by Ramon Fernandez Marina [ 28/Mar/16 ]

paul.ridgway, SERVER-22831 will be part of 3.2.5. A 3.2.5-rc0 release candidate suitable for testing is currently scheduled for next week.

I am going to mark this ticket as a duplicate of SERVER-22831. If 3.2.5-rc0 does not address this issue in your testing, please post back here so we can reopen and investigate further.

Thanks,
Ramón.

Comment by Paul Ridgway [ 22/Mar/16 ]

Thanks. I guess that since a restart helps, there are pages suitable for eviction; it just can't find them fast enough?

We are trialling 3.2 in dev currently, so an upgrade is certainly an option. Do you know when the fix is likely to be backported?

Cheers

Comment by Bruce Lucas (Inactive) [ 22/Mar/16 ]

Thanks paul.ridgway. This looks similar to the previous incident:

  • Starting at about A, eviction rates aren't able to keep up, so cache usage grows above target levels.
  • "pages walked for eviction" is very high relative to pages evicted, and appears to have hit a ceiling at about A, so identifying pages to be evicted is the likely bottleneck (see the sketch after this list for how to read these counters). This is single-threaded, which explains why adding CPUs didn't address the problem.
  • As a result the cache fills during B-C and D-E, application threads get involved in eviction, and performance suffers: replication struggles ("repl apply batches totalMillis (/s)") and operations queue.
  • I couldn't find a specific trigger operation in the metrics, but I imagine this is the result of some (possibly small) change in cache pressure. Based on the eviction statistics, it appears to have possibly been building for some time prior to A, at which point it hit the ceiling on pages walked for eviction.
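
A minimal sketch for reading the counters behind that ratio (names as they appear under serverStatus().wiredTiger.cache; the counters are cumulative since startup, so compare two samples over an interval for a rate):

# Report pages walked for eviction relative to pages actually evicted
mongo --eval '
    var c = db.serverStatus().wiredTiger.cache;
    var evicted = c["unmodified pages evicted"] + c["modified pages evicted"];
    print("pages walked for eviction:     " + c["pages walked for eviction"]);
    print("pages walked per page evicted: " + (c["pages walked for eviction"] / evicted).toFixed(1));'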

We suspect this could be related to SERVER-22831, which is fixed in master and planned to be backported to 3.2. Would you be in a position to try 3.2, once that backport is available, to see if it addresses this issue?

Thanks,
Bruce

Comment by Paul Ridgway [ 22/Mar/16 ]

Hi,

It happened again at ~07:50 on the same node I had stepped down at ~6 AM (it is a slave now): 100% cache use, 'slow' queries as a result, and 59.8% system RAM usage.

Logs submitted (including ss.log) as node3-22mar2016.zip.

Appeared to recover following a restart.

EDIT:

We now log some mongo metrics with more granularity in our monitoring system, so I'm attaching plots of cache and RAM usage. Note node3 was primary, stepped down, and node1 took over at 8 AM. Interestingly, node2 is quite choppy on cache and is the only one using over 50% system RAM.

I've attached the WT cache usage during the incident, plus a plot for the 3 nodes with colour-matched WT cache (solid) and RAM (dotted). As a reminder, these VMs have 30 GB of RAM.

Thanks,

Paul

Comment by Paul Ridgway [ 22/Mar/16 ]

Hi,

It happened again this morning on the primary, possibly as early as 2 AM, and stopped at 6:08 AM once it was stepped down.

Mongo's system memory usage was up to 58.3% and cache use was solidly at 100%, causing slow queries (queries which are not normally slow).

It seems the ss.log writer crashed at some point beforehand, but I have the iostat output.

Logs submitted.

Thanks,

Paul

Comment by Paul Ridgway [ 19/Mar/16 ]

Hi Bruce,

I tried it both before and after, and tried a large range of min and max threads too; it all made zero impact.

Cheers,

Paul

Comment by Bruce Lucas (Inactive) [ 19/Mar/16 ]

Hi Paul,

The tool used to generate the graphs is an internal tool that we don't yet feel is in a condition to make public.

Thanks for the confirmation. One other question occurs to me: you mentioned that adding more eviction threads didn't help, but you also mentioned that adding CPU did help somewhat. Did you try increasing the eviction threads before or after you increased CPU? If it was before, would you be able to try more eviction threads now that you have more CPU?

Thanks,
Bruce

Comment by Paul Ridgway [ 19/Mar/16 ]

Hi Bruce,

Thanks for following up!

This incident was not exactly like those we've seen before; as you say, there are some larger (aggregate) queries running.

In previous incidents we've stopped all but the core consumer services and seen no recovery.

Additionally, despite letting the aggregate queries keep running, a restart was enough to recover the instance and have the cache sit around 80%.

Memory did peak over 50%, but not as badly as before.

As you note, the main issue still seems to be an eviction rate that cannot keep up despite plenty of resources being available (CPU, RAM, IOPS, disk throughput, etc.).

Cheers,

Paul

P.S. Is there a way for us to produce those plots, or is it a private tool?

Comment by Bruce Lucas (Inactive) [ 18/Mar/16 ]

paul.ridgway, thanks for uploading that snapshot. Here's a plot of some of the data from the end of the snapshot:

We can see that starting at C, at about 2016-03-17 09:13:38Z, the cache usage begins to rise, and around D, at about 2016-03-17 09:16:58.828Z, it reaches 100%. At this point we see application threads begin to do evictions because of the 100% cache full condition, and this would have impacted performance. Can you confirm that this corresponds to events as you observed them?

Regarding your opening comment that mongod RAM usage sometimes goes as high as 80%, it doesn't appear that we are seeing that in this instance as RAM usage appears about as expected, with virtual memory around 19 GB; can you confirm?

The trigger in this case appears, based on coincidence in timing, to be a number of aggregation queries that started at C, around 2016-03-17 09:13:38Z. I imagine these queries were probably accessing data not in cache so were creating increased cache pressure; does this seem plausible to you?
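
One way to spot such a trigger while it is happening is to list long-running operations with currentOp; a minimal sketch (the 60-second threshold is arbitrary):

# List operations running for more than 60 seconds (opid, type, namespace, duration)
mongo --eval '
    db.currentOp({"secs_running": {$gt: 60}}).inprog.forEach(function(op) {
        print(op.opid + "  " + op.op + "  " + op.ns + "  " + op.secs_running + "s");
    });'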

Here are some statistics about cache churn just before and just after C:

                                  before C      after C      change
unmodified pages evicted (/s)     5,584         5,423        -161
modified pages evicted (/s)       2,928         2,774        -154
pages read into cache (/s)        8,519         8,781        +262

We see an increase in pages read into the cache, presumably a result of the aggregations; but we also see a decrease in both modified and unmodified pages evicted per second, for a net increase of about 500-600 pages per second in the cache, consistent with the cache growth we see during this period. What is not clear so far, however, is why evictions weren't able to keep up in order to hold cache usage at target levels below 100%; we will continue to look into that.
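
(Working through the table above: before C the net inflow to the cache is roughly 8,519 - (5,584 + 2,928) = 7 pages per second, while after C it is 8,781 - (5,423 + 2,774) = 584 pages per second, which is where the 500-600 figure comes from.)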

Bruce

Comment by Paul Ridgway [ 17/Mar/16 ]

One slave did it this morning without a big query. I snapshotted the logs while it was happening and uploaded them (logs-17mar2016.zip).

Comment by Paul Ridgway [ 15/Mar/16 ]

I was running an unindexed query this morning which, the first time, completed without drama in ~1000 seconds. The second run (the same query) ran for an hour before I killed it, drove the cache to 100%, and had mongo using 60% of system RAM. After killing the query mongo eventually recovered. I'll submit the logs anyway, though I'm not certain this is the same issue: even though the symptoms are identical, we were not running large unindexed queries when it last happened...

I stopped the query no more than a minute or so before taking a copy of the ss.log and iostat.log files.

Comment by Paul Ridgway [ 08/Mar/16 ]

Thanks. I did wonder if 3.2 was better, but looking at the release notes it was hard to tell whether WT itself had improved or whether you had back-ported critical/performance/engine fixes?

Having said that, in another cluster we run 3.2 and can make it thrash the cache, but that is a different workload so it's not necessarily comparable.

Comment by Ramon Fernandez Marina [ 08/Mar/16 ]

Understood, we'll keep this ticket open for the time being.

Note that MongoDB 3.2 includes a lot of improvements over 3.0, especially in WiredTiger, so if you have the ability to test the same scenario in a 3.2 replica set it would be useful to see whether this particular issue has been addressed, and whether 3.2 will be a better fit for your use case.

Anyway, please let us know when you have data and have uploaded it, and we'll take a closer look.

Thanks,
Ramón.

Comment by Paul Ridgway [ 08/Mar/16 ]

Hi ramon.fernandez,

Thank you for getting back to me so quickly. I'll set the logging running and get back to you once the issue has presented again; it could take a week or so, it seems.

Cheers,

Paul

Comment by Ramon Fernandez Marina [ 08/Mar/16 ]

paul.ridgway, in order to troubleshoot this issue we'll need some diagnostic information. I'd like to ask you to run the following commands from the command line on each of your nodes:

# Log serverStatus (including tcmalloc detail) as JSON every 10 seconds to ss.log, in the background
mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(10000)}" >ss.log &
# Log extended per-device I/O statistics (kB units, timestamped) every 10 seconds to iostat.log, in the background
iostat -k -t -x 10 >iostat.log &

These commands will create two files on each node, ss.log and iostat.log. Please start data collection now; when the issue appears again, let the data collection continue for a bit (as much as you can afford without impacting your production) and then upload the resulting files here (these files won't contain any sensitive information, and they'll only be available to MongoDB staff).
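
Once the issue has been captured, a minimal sketch for bundling the resulting files for upload (the archive name is just an example):

# Bundle the collected logs from this node
tar czf logs-$(hostname)-$(date +%d%b%Y).tar.gz ss.log iostat.log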

Thanks,
Ramón.
