[SERVER-33191] Cache-full hangs on 3.6 Created: 08/Feb/18 Updated: 29/Oct/23 Resolved: 27/Jul/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | 3.6.2, 3.6.5 |
| Fix Version/s: | 3.6.7, 4.0.1, 4.1.2 |
| Type: | Bug |
| Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) |
| Assignee: | Benety Goh |
| Resolution: | Fixed |
| Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Storage NYC 2018-07-16, Storage NYC 2018-07-30 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
We have seen a couple of cases in the field where, under heavy cache pressure, a node can get stuck with the WT cache full. In the cases seen so far the cache pressure has been due to replica set lag, which causes the majority commit point to lag; WiredTiger must then retain the history of writes since that point in cache, creating cache pressure. So far we don't have a reproducer and don't have FTDC data covering the inception of the issue (
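For illustration, a minimal sketch of polling for this condition (assuming pymongo and a reachable mongod; the fields are the standard wiredTiger.cache counters in serverStatus, and the 95% threshold is illustrative):

```python
# Minimal sketch: poll serverStatus and flag a cache that stays pinned near
# 100% full. Assumes pymongo and a mongod on localhost:27017 (hypothetical
# deployment); the 95% threshold is illustrative, not a documented limit.
import time
from pymongo import MongoClient

client = MongoClient("localhost", 27017)

def cache_fill_ratio():
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    return (cache["bytes currently in the cache"] /
            cache["maximum bytes configured"])

# Under normal eviction pressure utilization hovers near the eviction
# target; a node exhibiting this bug sits at ~100% and never recovers.
for _ in range(60):
    ratio = cache_fill_ratio()
    print(f"cache fill: {ratio:.1%}")
    if ratio > 0.95:
        print("cache (nearly) full -- possible stuck-cache condition")
    time.sleep(1)
```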
| Comments |
| Comment by Benety Goh [ 27/Jul/18 ] |
Reproduction script shows that the stuck cache issue was resolved between 3.6.5 and 3.6.6. There is a performance regression reported in
| Comment by Githook User [ 20/Jul/18 ] |
Author: Benety Goh <benety@mongodb.com> (username: benety)
Message: (cherry picked from commit f5035fab356299fe6cffc783dbe9ccfab99fe069)
| Comment by Githook User [ 20/Jul/18 ] |
Author: Benety Goh <benety@mongodb.com> (username: benety)
Message: (cherry picked from commit 88cea24944cf277759b74b18878a0b4c0d89686e)
| Comment by Githook User [ 20/Jul/18 ] |
Author: Benety Goh <benety@mongodb.com> (username: benety)
Message: (cherry picked from commit eaf551c3aa01eb80b48bc7601aa288eed464532c)
| Comment by Githook User [ 19/Jul/18 ] |
Author: Benety Goh <benety@mongodb.com> (username: benety)
Message:
| Comment by Githook User [ 10/Jul/18 ] |
Author: Benety Goh <benety@mongodb.com> (username: benety)
Message:
| Comment by Githook User [ 10/Jul/18 ] |
Author: Benety Goh <benety@mongodb.com> (username: benety)
Message:
| Comment by Alexander Gorrod [ 31/May/18 ] |
Thanks for the reproducer, bruce.lucas. I've run it and can see the same behavior. We will investigate things we can do to mitigate the issue.
| Comment by Bruce Lucas (Inactive) [ 30/May/18 ] |
Attached script repro-10MBx2.sh sometimes gets stuck with the cache full. It

Attached file metrics.2018-05-30T17-31-46Z-00000 shows an instance of the above script running with the normal checkpoint interval.

We get stuck with the cache full at A, coinciding with a checkpoint. (We can tell we're stuck and haven't finished catching up because the lag keeps growing.) At A we also see some dhandle sweep activity, but I suspect it's the checkpoint that matters, because increasing the checkpoint interval made the repro more reliable. I don't know whether the checkpoint is integral to the issue, or whether its CPU activity simply opens a window for a race condition, since we only gave the process 2 CPUs. Stack traces attached. We see two threads applying oplog operations (because we are able to parallelize applying the updates to the two documents), both in __wt_txn_commit, both waiting on a full cache.
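The script itself is attached rather than inlined, but as a rough sketch of the shape of the workload it describes (assuming pymongo, a replica set named rs0, and made-up database, collection, and field names; the real repro-10MBx2.sh may differ):

```python
# Hypothetical sketch of the described workload: two ~10MB documents
# updated continuously, so the secondary's parallel oplog appliers each
# replay updates to a large document under cache pressure.
# Assumes pymongo and a replica set primary on localhost:27017; the
# replica set name, database, collection, and field names are made up.
from pymongo import MongoClient

client = MongoClient("localhost", 27017, replicaSet="rs0")
coll = client.test.repro

payload = "x" * (10 * 1024 * 1024)  # ~10MB per document
for doc_id in (0, 1):
    coll.replace_one({"_id": doc_id},
                     {"_id": doc_id, "payload": payload, "n": 0},
                     upsert=True)

# Each $set rewrites the whole ~10MB document in the storage engine, so
# the secondary accumulates dirty data and history much faster than the
# oplog byte rate alone would suggest.
for n in range(1_000_000):
    for doc_id in (0, 1):
        coll.update_one({"_id": doc_id}, {"$set": {"n": n}})
```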
| Comment by Bruce Lucas (Inactive) [ 11/May/18 ] |
| Comment by Michael Cahill (Inactive) [ 19/Feb/18 ] |
bruce.lucas, you are correct that those stacks indicate that eviction threads are sleeping. If all four threads are in the same place, then they have each taken a turn at being the server, then given up and gone to sleep before trying again.

I think we probably have the metrics we need to diagnose what's happening if we can get FTDC: a combination of the "WT Eviction Efficiency" view plus "eviction state" indicates (a) whether the eviction server is looking for pages at all, (b) what kind of pages the eviction server is trying to queue, (c) how many pages were queued and (d) how many of those pages could be evicted.

If the cache is full, we would look at what proportion of the cache is dirty – e.g., one possibility is that the eviction server is stuck looking for dirty pages in a clean cache. Assuming the issue is not that straightforward, another case to consider is one where the cache is clean (dirty % is very low) and the eviction state indicates that the eviction server is looking for clean pages but not queuing any. The only reason not to queue clean leaf pages is if they have history that cannot be evicted. There was a change in 3.6.0 that was supposed to address this (
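Until FTDC is available, something like the following can pull the same counters directly from serverStatus (a sketch assuming pymongo; the stat names are the 3.6-era wiredTiger.cache fields and may vary by version):

```python
# Sketch of a one-shot diagnostic pass over the eviction counters described
# above: is the cache mostly clean or dirty, what is the eviction server
# doing, and are queued pages actually being evicted? Assumes pymongo; the
# stat names are the 3.6-era wiredTiger.cache fields and may differ in
# other versions, hence the defensive .get() lookups.
from pymongo import MongoClient

cache = MongoClient("localhost", 27017).admin.command(
    "serverStatus")["wiredTiger"]["cache"]

used = cache.get("bytes currently in the cache", 0)
dirty = cache.get("tracked dirty bytes in the cache", 0)
limit = cache.get("maximum bytes configured", 1)

print(f"cache fill: {used / limit:.1%}")
print(f"dirty fill: {dirty / limit:.1%}")  # near zero => clean, full cache
print("eviction state:", cache.get("eviction state"))
print("pages queued for eviction:", cache.get("pages queued for eviction"))
print("modified pages evicted:", cache.get("modified pages evicted"))
print("unmodified pages evicted:", cache.get("unmodified pages evicted"))
# A full cache with a very low dirty %, an eviction state that says the
# server wants clean pages, and a queued count that isn't growing would
# point at clean leaf pages pinned by history that cannot be evicted.
```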
| Comment by Bruce Lucas (Inactive) [ 16/Feb/18 ] |
In all cases seen so far, 1-4 of the 4 eviction threads were waiting with this stack:

pthread_cond_timedwait@@GLIBC_2.3.2

which I believe is the eviction server sleeping because it didn't make progress. michael.cahill, agorrod, I wonder if we have the metrics in place that will help us understand why it looks like we're not finding pages to evict, and/or why the lookaside table isn't helping, once
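As a quick check of that theory, a sketch of watching the no-progress sleep counter from serverStatus (assuming pymongo; the stat name is the 3.6-era WiredTiger counter and may be absent in other versions, hence the defensive lookup):

```python
# Sketch: watch whether the eviction server keeps sleeping without making
# progress. Assumes pymongo; the stat name below is the 3.6-era WiredTiger
# counter and may be missing in other versions, hence the .get() lookup.
import time
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
STAT = ("eviction server slept, because we did not make progress "
        "with eviction")

last = None
for _ in range(30):
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    slept = cache.get(STAT)
    if last is not None and slept is not None and slept > last:
        print(f"eviction server slept {slept - last} more time(s) "
              "without making progress")
    last = slept
    time.sleep(1)
```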