[SERVER-52879] Periodic operation latency spikes every 5 minutes due to closing idle cached WT sessions Created: 16/Nov/20 Updated: 29/Oct/23 Resolved: 07/Jan/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.1, 4.2.10, 4.0.21, 3.6.21 |
| Fix Version/s: | 4.9.0, 4.2.12, 4.0.23, 4.4.4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Gregory Wlodarek |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | ALL | ||||||||||||
| Backport Requested: |
v4.4, v4.2, v4.0
|
||||||||||||
| Sprint: | Execution Team 2021-01-11 | ||||||||||||
| Participants: | |||||||||||||
| Case: | (copied to CRM) | ||||||||||||
| Description |
|
Test creates 1000 collections with 10 secondary indexes each (total 12 k tables) then does inserts at a rate of 1000 per second on 100 threads (10 per second for each thread) against randomly chosen collections. Latency is generally <1 ms but FTDC shows periodic latency spikes of tens of ms:
The spikes coincide with calls to WiredTigerSessionCache::closeExpiredIdleSessions, and the interval between spikes is wiredTigerSessionCloseIdleTimeSecs, which has been set to 30 in the above test. I believe this is what's happening:
This causes the pattern of cached cursor count and open sessions that we see in the FTDC above; with the 1-second sample interval of FTDC we only occasionally catch the number of sessions decreasing due to closeExpiredIdleSessions before immediately increasing again, e.g. at A in the chart above. The magnitude of this effect increases with more tables; perf data shows that closeExpiredIdleSessions spends about 65% of its time in close_cached_cursors and about 25% in __session_discard_dhandle, both of which are proportional to the number of tables. |
| Comments |
| Comment by Githook User [ 08/Jan/21 ] |
|
Author: {'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}Message: (cherry picked from commit d517928d5b34e5988eea9fc15e1c2db4c3ff8d77) |
| Comment by Githook User [ 08/Jan/21 ] |
|
Author: {'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}Message: (cherry picked from commit d517928d5b34e5988eea9fc15e1c2db4c3ff8d77) |
| Comment by Githook User [ 08/Jan/21 ] |
|
Author: {'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}Message: (cherry picked from commit d517928d5b34e5988eea9fc15e1c2db4c3ff8d77) |
| Comment by Bruce Lucas (Inactive) [ 08/Jan/21 ] |
|
For better discoverability, I've updated the title of this ticket to include "every 5 minutes", based on the default value of wiredTigerSessionCloseIdleTimeSecs, to better describe the visible symptoms. |
| Comment by Githook User [ 07/Jan/21 ] |
|
Author: {'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}Message: |
| Comment by Bruce Lucas (Inactive) [ 16/Nov/20 ] |
|
Also it would be helpful to add a metric reporting the cumulative number of sessions from the session cache that are closed - that would have made it much easier to diagnose the original customer issue. |
| Comment by Bruce Lucas (Inactive) [ 16/Nov/20 ] |
|
Would it be possible to identify and remove the idle sessions from the cache under mutex, and then close those sessions outside of the mutex? |