[SERVER-52879] Periodic operation latency spikes every 5 minutes due to closing idle cached WT sessions Created: 16/Nov/20  Updated: 29/Oct/23  Resolved: 07/Jan/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.1, 4.2.10, 4.0.21, 3.6.21
Fix Version/s: 4.9.0, 4.2.12, 4.0.23, 4.4.4

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Gregory Wlodarek
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File after.png     PNG File before.png     File repro.js     PNG File spikes.png    
Issue Links:
Backports
Depends
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4, v4.2, v4.0
Sprint: Execution Team 2021-01-11
Participants:
Case:

 Description   

Test creates 1000 collections with 10 secondary indexes each (total 12 k tables) then does inserts at a rate of 1000 per second on 100 threads (10 per second for each thread) against randomly chosen collections. Latency is generally <1 ms but FTDC shows periodic latency spikes of tens of ms:

The spikes coincide with calls to WiredTigerSessionCache::closeExpiredIdleSessions, and the interval between spikes is wiredTigerSessionCloseIdleTimeSecs, which has been set to 30 in the above test. I believe this is what's happening:

  • closeExpiredIdleSessions takes a mutex on the WT session cache while closing idle sessions; this delays all operations.
  • Because of this up to 100 (number of client threads) operations queue up on the mutex and then proceed concurrently after closeExpiredIdleSessions completes, requiring up to 100 concurrent sessions to complete those operations; after completion those sessions are cached.
  • Between expiration intervals operations complete quickly so concurrency is low, so only a small number of the cached sessions are actually used.
  • So the next call to closeExpiredIdleSessions after the expiration interval finds that most of the cached sessions are idle, so it closes them while holding the session cache mutex, again delaying up to 100 operations and requiring them to open up to 100 new sessions after closeExpiredIdleSessions finishes closing the 100 old sessions, repeating the cycle.

This causes the pattern of cached cursor count and open sessions that we see in the FTDC above; with the 1-second sample interval of FTDC we only occasionally catch the number of sessions decreasing due to closeExpiredIdleSessions before immediately increasing again, e.g. at A in the chart above.

The magnitude of this effect increases with more tables; perf data shows that closeExpiredIdleSessions spends about 65% of its time in close_cached_cursors and about 25% in __session_discard_dhandle, both of which are proportional to the number of tables.



 Comments   
Comment by Githook User [ 08/Jan/21 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-52879 Close expired idle sessions outside of the cache mutex

(cherry picked from commit d517928d5b34e5988eea9fc15e1c2db4c3ff8d77)
Branch: v4.2
https://github.com/mongodb/mongo/commit/455bdbb8bcc2166a14fe009d28e71ee6bf4de59e

Comment by Githook User [ 08/Jan/21 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-52879 Close expired idle sessions outside of the cache mutex

(cherry picked from commit d517928d5b34e5988eea9fc15e1c2db4c3ff8d77)
Branch: v4.0
https://github.com/mongodb/mongo/commit/b01d1f1c14f1270e36adf06c78b6de9f63308936

Comment by Githook User [ 08/Jan/21 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-52879 Close expired idle sessions outside of the cache mutex

(cherry picked from commit d517928d5b34e5988eea9fc15e1c2db4c3ff8d77)
Branch: v4.4
https://github.com/mongodb/mongo/commit/d6a3d7b707df26dfdc64b1ffd7264fae2886ec50

Comment by Bruce Lucas (Inactive) [ 08/Jan/21 ]

For better discoverability, I've updated the title of this ticket to include "every 5 minutes", based on the default value of wiredTigerSessionCloseIdleTimeSecs, to better describe the visible symptoms.

Comment by Githook User [ 07/Jan/21 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-52879 Close expired idle sessions outside of the cache mutex
Branch: master
https://github.com/mongodb/mongo/commit/d517928d5b34e5988eea9fc15e1c2db4c3ff8d77

Comment by Bruce Lucas (Inactive) [ 16/Nov/20 ]

Also it would be helpful to add a metric reporting the cumulative number of sessions from the session cache that are closed - that would have made it much easier to diagnose the original customer issue.

Comment by Bruce Lucas (Inactive) [ 16/Nov/20 ]

Would it be possible to identify and remove the idle sessions from the cache under mutex, and then close those sessions outside of the mutex?

Generated at Thu Feb 08 05:29:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.