[SERVER-33191] Cache-full hangs on 3.6 Created: 08/Feb/18  Updated: 29/Oct/23  Resolved: 27/Jul/18

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.6.2, 3.6.5
Fix Version/s: 3.6.7, 4.0.1, 4.1.2

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Benety Goh
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File metrics.2018-05-30T17-31-46Z-00000     File repro-10MBx2.sh     Text File stacks.txt     PNG File stuck.png    
Issue Links:
Backports
Depends
depends on SERVER-32876 Don't stall ftdc due to WT cache full Closed
depends on WT-4105 Optimize cache usage for update workl... Closed
Related
related to SERVER-37849 Poor replication performance and cach... Backlog
related to SERVER-35103 Checkpoint creates unevictable clean ... Closed
related to SERVER-36221 [3.6] Performance regression on small... Closed
related to SERVER-36238 replica set startup fails in wt_cache... Closed
related to SERVER-36373 create test to fill WT cache during s... Closed
related to WT-4106 Optimize lookaside implementation to ... Closed
related to WT-4107 Reduce lookaside usage in update heav... Closed
is related to SERVER-34938 Secondary slowdown or hang due to con... Closed
is related to SERVER-34941 Add testing to cover cases where time... Closed
is related to SERVER-34942 Stuck with cache full during oplog re... Closed
is related to SERVER-36495 Cache pressure issues during recovery... Closed
is related to SERVER-36496 Cache pressure issues during oplog re... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6
Sprint: Storage NYC 2018-07-16, Storage NYC 2018-07-30
Participants:
Case:

 Description   

We have seen a couple of cases in the field where, under heavy cache pressure, a node can get stuck with the WT cache full. In the cases seen so far the pressure has been due to replica set lag, which causes the majority commit point to fall behind and thereby creates the cache pressure.

So far we don't have a reproducer or FTDC data covering the inception of the issue (SERVER-32876), so for now this ticket is a placeholder for gathering information.
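
A minimal mongo-shell sketch for spotting this state on a suspect node (a triage aid only, not part of the original report; the serverStatus()/rs.status() field names are assumptions based on typical 3.6-era output and may vary by version):

// Sketch only: stat and optime field names are assumptions and may vary by version.
const cache = db.serverStatus().wiredTiger.cache;
const pctFull = 100 * cache['bytes currently in the cache'] /
    cache['maximum bytes configured'];
print('WT cache fill: ' + pctFull.toFixed(1) + '%');

const optimes = rs.status().optimes;
print('applied optime:         ' + tojson(optimes.appliedOpTime));
print('majority commit optime: ' + tojson(optimes.lastCommittedOpTime));
// A cache stuck near 100% full while the majority commit optime falls further
// behind the applied optime matches the state described above.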



 Comments   
Comment by Benety Goh [ 27/Jul/18 ]

The reproduction script shows that the stuck-cache issue was resolved between 3.6.5 and 3.6.6. There is a performance regression reported in SERVER-36221 that we will continue to investigate.

Comment by Githook User [ 20/Jul/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-33191 move wt_cache_full.js from jstests/replsets/ to jstests/noPassthrough/

(cherry picked from commit f5035fab356299fe6cffc783dbe9ccfab99fe069)
Branch: v3.6
https://github.com/mongodb/mongo/commit/13f9d42c1be08299c191eac13beb70b570843a48

Comment by Githook User [ 20/Jul/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-33191 add test to fill wiredtiger cache during steady state oplog replication

(cherry picked from commit 88cea24944cf277759b74b18878a0b4c0d89686e)
Branch: v3.6
https://github.com/mongodb/mongo/commit/258db0da725579e23ac41a9d6c3e0ac82a7d688f

Comment by Githook User [ 20/Jul/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-33191 guard mongo test program startup options against duplicate wiredtiger flags

(cherry picked from commit eaf551c3aa01eb80b48bc7601aa288eed464532c)
Branch: v3.6
https://github.com/mongodb/mongo/commit/125de5067cf8042247f5bb14f7db2f36d36aab14

Comment by Githook User [ 20/Jul/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-33191 move wt_cache_full.js from jstests/replsets/ to jstests/noPassthrough/

(cherry picked from commit f5035fab356299fe6cffc783dbe9ccfab99fe069)
Branch: v4.0
https://github.com/mongodb/mongo/commit/f6a183706a304e0260678e1e754347fbc0f15793

Comment by Githook User [ 20/Jul/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-33191 add test to fill wiredtiger cache during steady state oplog replication

(cherry picked from commit 88cea24944cf277759b74b18878a0b4c0d89686e)
Branch: v4.0
https://github.com/mongodb/mongo/commit/b1c9083d0e85f93d842aea6333622093ab0f4279

Comment by Githook User [ 20/Jul/18 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-33191 guard mongo test program startup options against duplicate wiredtiger flags

(cherry picked from commit eaf551c3aa01eb80b48bc7601aa288eed464532c)
Branch: v4.0
https://github.com/mongodb/mongo/commit/9bef62e3f0989e78c9e64c8484e50350522650e6

Comment by Githook User [ 19/Jul/18 ]

Author:

{'username': 'benety', 'name': 'Benety Goh', 'email': 'benety@mongodb.com'}

Message: SERVER-33191 move wt_cache_full.js from jstests/replsets/ to jstests/noPassthrough/
Branch: master
https://github.com/mongodb/mongo/commit/f5035fab356299fe6cffc783dbe9ccfab99fe069

Comment by Githook User [ 10/Jul/18 ]

Author:

{'username': 'benety', 'name': 'Benety Goh', 'email': 'benety@mongodb.com'}

Message: SERVER-33191 add test to fill wiredtiger cache during steady state oplog replication
Branch: master
https://github.com/mongodb/mongo/commit/88cea24944cf277759b74b18878a0b4c0d89686e

Comment by Githook User [ 10/Jul/18 ]

Author:

{'username': 'benety', 'name': 'Benety Goh', 'email': 'benety@mongodb.com'}

Message: SERVER-33191 guard mongo test program startup options against duplicate wiredtiger flags
Branch: master
https://github.com/mongodb/mongo/commit/eaf551c3aa01eb80b48bc7601aa288eed464532c

Comment by Alexander Gorrod [ 31/May/18 ]

Thanks for the reproducer, bruce.lucas. I've run it and can see the same behavior. We will investigate things we can do to mitigate the issue.

Comment by Bruce Lucas (Inactive) [ 30/May/18 ]

Attached script repro-10MBx2.sh sometimes gets stuck with the cache full (a hedged jstest sketch of the same steps follows the list). It

  • starts a replica set using numactl to give each node 2 CPUs; 1 GB cache
  • inserts two large (10 MB) documents
  • stops one of the nodes
  • does 1000 updates to each document
  • restarts the node so that it must apply a large batch of replicated updates to catch up; this generates two very large pages and a lot of cache pressure because, in 3.6, each replicated batch is pinned in cache in its entirety. The checkpoint interval is set to 1 second.
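
(Sketch, not the attached repro-10MBx2.sh: the same steps expressed as a ReplSetTest-based jstest. Option names such as wiredTigerCacheSizeGB and syncdelay are standard mongod options; the numactl CPU restriction from the shell script has no jstest equivalent and is omitted.)

// Hedged sketch of the repro steps above; not the attached shell script.
(function() {
    'use strict';

    const rst = new ReplSetTest({
        // Non-voting secondary so the primary stays primary while node 1 is down.
        nodes: [{}, {rsConfig: {priority: 0, votes: 0}}],
        nodeOptions: {
            wiredTigerCacheSizeGB: 1,  // 1 GB cache, as in the repro
            syncdelay: 1,              // checkpoint every second
        },
    });
    rst.startSet();
    rst.initiate();

    const coll = rst.getPrimary().getDB('test').getCollection('repro');
    const largeValue = 'x'.repeat(10 * 1024 * 1024);  // two 10 MB documents
    assert.writeOK(coll.insert({_id: 0, a: largeValue}));
    assert.writeOK(coll.insert({_id: 1, a: largeValue}));
    rst.awaitReplication();

    // Stop the secondary so the primary accumulates a backlog of updates.
    rst.stop(1);
    for (let i = 0; i < 1000; ++i) {
        assert.writeOK(coll.update({_id: 0}, {$inc: {n: 1}}));
        assert.writeOK(coll.update({_id: 1}, {$inc: {n: 1}}));
    }

    // Restart node 1 with its data intact; it must now apply a large batch of
    // replicated updates to two very large documents to catch up.
    rst.start(1, undefined, true /* restart */);
    rst.awaitReplication();
    rst.stopSet();
})();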

Attached file metrics.2018-05-30T17-31-46Z-00000 shows an instance of the above script running with the normal checkpoint interval.

We get stuck with the cache full at A, coinciding with a checkpoint. (We can tell we're stuck and haven't finished catching up because the lag keeps growing.) At A we also see some dhandle sweep activity, but I suspect it's the checkpoint that matters, because increasing the checkpoint interval made the repro more reliable. I don't know if the checkpoint is integral to the issue, or whether its CPU activity simply opens a window for a race condition, since we only gave the process 2 CPUs.

Stack traces attached. We see two threads applying oplog operations (because we are able to parallelize applying the updates to the two documents), both in __wt_txn_commit, both waiting on a full cache.

Comment by Bruce Lucas (Inactive) [ 11/May/18 ]

SERVER-34938, SERVER-34941, and SERVER-34942 describe some ways that extreme cache pressure can be generated and in some cases appear to lead to a complete hang. Those tickets deal with relieving the cause of those instances of cache pressure; I'd like to focus this ticket on why, in cases of extreme cache pressure, the lookaside mechanism doesn't prevent us from getting stuck.

Comment by Michael Cahill (Inactive) [ 19/Feb/18 ]

bruce.lucas, you are correct that those stacks indicate that eviction threads are sleeping. If all four threads are in the same place, then they have each taken a turn at being the server, then given up and gone to sleep before trying again.

I think we probably have the metrics we need to diagnose what's happening if we can get FTDC: a combination of the "WT Eviction Efficiency" view plus "eviction state" indicates (a) whether the eviction server is looking for pages at all, (b) what kind of pages the eviction server is trying to queue, (c) how many pages were queued and (d) how many of those pages could be evicted.

If the cache is full, we would look at what proportion of the cache is dirty; e.g., one possibility is that the eviction server is stuck looking for dirty pages in a clean cache. Assuming these issues are not that straightforward, another consideration is the case where the cache is clean (dirty % is very low) but the eviction state indicates that the eviction server is looking for clean pages and not queuing any. The only reason not to queue clean leaf pages is if they have history that cannot be evicted. There was a change in 3.6.0 that was supposed to address this (WT-3683), but perhaps it is not effective in these cases.
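
(A rough illustration of pulling those counters out of serverStatus() in the mongo shell; the stat names below are assumptions based on typical 3.6-era WiredTiger output rather than the FTDC "WT Eviction Efficiency" view itself, and may differ between versions.)

// Sketch only: stat names are assumptions and may vary by WiredTiger version.
const cache = db.serverStatus().wiredTiger.cache;
printjson({
    'bytes in cache': cache['bytes currently in the cache'],
    'maximum bytes configured': cache['maximum bytes configured'],
    'tracked dirty bytes': cache['tracked dirty bytes in the cache'],
    'eviction state': cache['eviction state'],                        // (a), (b)
    'pages queued for eviction': cache['pages queued for eviction'],  // (c)
    'modified pages evicted': cache['modified pages evicted'],        // (d)
    'unmodified pages evicted': cache['unmodified pages evicted'],    // (d)
});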

Comment by Bruce Lucas (Inactive) [ 16/Feb/18 ]

In all cases seen so far, 1-4 of the 4 eviction threads were waiting with this stack:

pthread_cond_timedwait@@GLIBC_2.3.2
__wt_cond_wait_signal
__wt_cond_auto_wait_signal
__wt_cond_auto_wait
__wt_evict_thread_run

which I believe is the eviction server sleeping because it didn't make progress.

michael.cahill, agorrod, I wonder whether we have the metrics in place that will help us understand why it looks like we're not finding pages to evict, and/or why the lookaside table isn't helping, once SERVER-32876 is fixed?
