[SERVER-16914] 15-second stall associated with "sched: RT throttling activated" under WiredTiger Created: 16/Jan/15  Updated: 21/Apr/15  Resolved: 01/Apr/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 2.8.0-rc5
Fix Version/s: 3.0.1

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Susan LoVerso
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screenshot 2015-02-16 12.45.11.png     PNG File fixed.png     HTML File repro.html     HTML File throttle-pre9-10-gdbmon.html     HTML File throttle-pre9-10.html     PNG File throttle-pre9-10.png     PNG File throttle-pre9-15.png     PNG File throttled-15s.png     PNG File throttled.png    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   
  • heavy mixed workload (a rough sketch of such a driver follows this list)
  • Ubuntu 14.04.1 LTS, 3.13.0-32-generic
  • VMware, 6 cores
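
The actual repro is the attached repro.html; purely as an illustration, a minimal sketch of what a mixed-op driver of this shape might look like, assuming a local mongod and pymongo. The thread count is taken from the resolution comment below (100 threads); the collection name and op mix are invented:

    # Hypothetical sketch only: N client threads issuing a mix of inserts,
    # queries, and updates against a local mongod. Thread count matches the
    # resolution comment (100 threads); collection name and op mix are
    # invented for illustration; the real repro is the attached repro.html.
    import random
    import threading
    from pymongo import MongoClient

    THREADS = 100

    def worker(tid):
        coll = MongoClient("localhost", 27017).test.mixed_workload
        while True:
            r = random.random()
            if r < 0.5:
                coll.insert_one({"tid": tid, "x": random.randint(0, 1000)})
            elif r < 0.8:
                coll.find_one({"x": random.randint(0, 1000)})
            else:
                coll.update_one({"tid": tid}, {"$inc": {"n": 1}}, upsert=True)

    for t in range(THREADS):
        threading.Thread(target=worker, args=(t,), daemon=True).start()
    threading.Event().wait()  # run until interrupted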

A 5-second pause in db ops was seen following this message in syslog:

Jan 16 14:39:15 ubuntu kernel: [20023.738805] [sched_delayed] sched: RT throttling activated
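
That message comes from the kernel's real-time scheduler bandwidth limit: threads in the SCHED_FIFO/SCHED_RR classes that consume more than sched_rt_runtime_us out of every sched_rt_period_us are throttled for the remainder of the period. A minimal sketch (not part of the original report) for checking those knobs on the affected host:

    # Sketch (not from the ticket): read the RT bandwidth knobs that control
    # when the kernel prints "sched: RT throttling activated". By default,
    # real-time threads may use 950000us out of every 1000000us period before
    # they are throttled for the rest of the period.
    def read_us(name):
        with open("/proc/sys/kernel/" + name) as f:
            return int(f.read())

    runtime = read_us("sched_rt_runtime_us")
    period = read_us("sched_rt_period_us")
    if runtime < 0:
        print("RT throttling disabled (sched_rt_runtime_us = -1)")
    else:
        print("RT budget: %d us of every %d us (%.0f%%)"
              % (runtime, period, 100.0 * runtime / period))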

  • For a second or so, from A to B, no samples were reported by the external monitoring processes (a mongo shell process calling serverStatus and a Python process monitoring system stats; a sketch of an equivalent sampler follows this list).
  • An extremely high context switch rate was reported at A.
  • At B the monitoring processes resumed.
  • At around B (to within the 1-second resolution of syslog timestamps) the above message appeared in syslog.
  • Starting at B, the db op rate dropped to 0 for about 5 seconds.
  • However, not all activity was blocked: evictions appeared to continue, and serverStatus was still being processed.
  • "slots selected for switching that were unavailable" was high from B to C.
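
The monitors referenced in the first bullet were a mongo shell calling serverStatus plus a Python process sampling system stats; a minimal sketch of an equivalent single-process sampler, assuming pymongo and a Linux /proc filesystem (sampling interval and field choices are illustrative):

    # Sketch of a 1-second sampler in the spirit of the monitors described
    # above: total opcounters from serverStatus plus the system-wide context
    # switch counter from /proc/stat. Field choices are illustrative.
    import time
    from pymongo import MongoClient

    def ctxt_switches():
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("ctxt "):
                    return int(line.split()[1])

    admin = MongoClient("localhost", 27017).admin
    prev = None
    while True:
        ops = sum(admin.command("serverStatus")["opcounters"].values())
        ctxt = ctxt_switches()
        if prev is not None:
            print("ops/s=%d ctxt/s=%d" % (ops - prev[0], ctxt - prev[1]))
        prev = (ops, ctxt)
        time.sleep(1)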

It appears that some behavior between A and B involving extreme CPU utilization at high priority, possibly related to the high context switch rate, caused the kernel to suspend those threads for 5 seconds. The threads involved with eviction apparently were not suspended.



 Comments   
Comment by Bruce Lucas (Inactive) [ 01/Apr/15 ]

A 20-minute run of this workload (100 threads of mixed ops on a 6-CPU virtual machine) showed no stalls under 3.0.1, so I'll declare victory and close this ticket.

Cache and heap were well-behaved. VM was maybe a bit bigger than might be expected. There was a checkpoint running for much of the 20-minute run.
