| [SERVER-65162] Improve FTDC performance to avoid stalls | Created: 01/Apr/22 | Updated: 07/Nov/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Jordi Olivares Provencio | Assignee: | Backlog - Security Team |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Server Security |
| Participants: | |
| Description |
|
While working on
During the test run we saw multiple entries in the mongod log suggesting that the FTDC thread was performing very poorly. We attempted to set its scheduling policy to SCHED_FIFO with the priority at the maximum value, but that did not seem to fix the issue.
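For reference, a minimal sketch of that experiment, assuming a Linux target (illustrative only, not the actual server code; it needs CAP_SYS_NICE or root to succeed):

```cpp
// Illustrative only: raise the calling thread to SCHED_FIFO at max priority.
// pthread_setschedparam returns an errno value rather than setting errno.
#include <pthread.h>
#include <sched.h>
#include <cstdio>

int main() {
    sched_param sp{};
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);

    if (int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp); rc != 0) {
        std::fprintf(stderr, "pthread_setschedparam failed: %d\n", rc);
        return 1;
    }
    std::printf("thread is now SCHED_FIFO at priority %d\n", sp.sched_priority);
    return 0;
}
```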
Note that the numbers in the log are milliseconds. The log seems to indicate that the following points in the serverStatus command are bottlenecks and could do with some performance improvements:
Additionally, after some preliminary investigation, it seems that the Linux system metrics collector is also a bottleneck at points due to the multiple syscalls needed to gather the necessary information from the /proc filesystem. Attached to the ticket is a flamegraph of a perf recording of the ftdc thread during the benchmark run; the read() syscall can be seen taking a sizeable chunk of the processing time. |
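A minimal sketch of one way to reduce that syscall overhead, assuming the collector can keep its /proc file descriptors open across samples and re-read them from offset 0 (the class and names here are hypothetical, not the actual FTDC collector code):

```cpp
// Hypothetical sketch: keep the /proc fd open for the life of the collector
// and issue a single pread() per sample, instead of open()/read()/close()
// on every collection cycle. procfs files can be re-read from offset 0.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <string>

class ProcFileSampler {
public:
    explicit ProcFileSampler(const char* path) : _fd(::open(path, O_RDONLY)) {}
    ~ProcFileSampler() {
        if (_fd >= 0)
            ::close(_fd);
    }

    // One pread() syscall per sample instead of three or more syscalls.
    std::string sample() const {
        char buf[1 << 16];
        ssize_t n = ::pread(_fd, buf, sizeof(buf), 0);
        return n > 0 ? std::string(buf, n) : std::string();
    }

private:
    int _fd;
};

int main() {
    ProcFileSampler stat("/proc/stat");
    std::printf("%s", stat.sample().c_str());
    return 0;
}
```

Under that assumption, each /proc file costs one syscall per collection cycle rather than an open/read/close sequence, which would roughly cut the per-sample syscall count for the system metrics collector by a factor of three.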
| Comments |
| Comment by Connie Chen [ 25/Oct/22 ] |
|
Backlog Grooming Notes: Move this into a project and re-evaluate as a project-sized effort |
| Comment by Jordi Olivares Provencio [ 06/Apr/22 ] |
|
At that scale it is indeed impractical to work with all threads. I did some manual gdb samples (like 5 or so) and they pointed to the /proc reads and tcmalloc. I'll try to automate this process with gdb in order to better gauge where the bottlenecks might be. |
| Comment by Bruce Lucas (Inactive) [ 06/Apr/22 ] |
|
As mentioned, I suspect that a lot of the delay in FTDC may be due to mutex contention rather than CPU contention, and therefore won't show up in the perf data; if we only tackle the hot spots seen in perf data we may not make any improvement. The observation that increasing thread priority didn't help somewhat supports that. Because there are a large number of threads, the normal gdb approach of "thread apply all bt" to get stack trace samples showing both on- and off-CPU time won't be practical. It's possible that SIGUSR2 would work, although that may also be problematic because of the number of threads. However, it may be possible to get useful stack trace samples by using gdb to sample just the ftdc thread, i.e. "thread apply [ftdcthread] bt". I think it would be worth a try to get samples this way, maybe 100 samples at intervals of a couple of seconds, to see what bottlenecks are impeding the ftdc thread and decide which ones need to be tackled. |
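A rough sketch of how that sampling loop could be automated (the pid, the gdb thread number of the ftdc thread, and the output file name are placeholders to fill in; this simply shells out to gdb in batch mode):

```cpp
// Hypothetical driver for the sampling approach described above: attach gdb
// in batch mode ~100 times, a couple of seconds apart, backtracing only the
// ftdc thread rather than all threads.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <thread>

int main(int argc, char** argv) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: %s <mongod-pid> <ftdc-gdb-thread-number>\n", argv[0]);
        return 1;
    }
    const std::string pid = argv[1];
    const std::string tid = argv[2];

    for (int i = 0; i < 100; ++i) {
        // "thread apply <N> bt" backtraces a single thread; "thread apply all
        // bt" would be impractical with thousands of threads.
        const std::string cmd = "gdb -p " + pid + " -batch -ex 'thread apply " +
                                tid + " bt' >> ftdc-samples.txt 2>/dev/null";
        std::system(cmd.c_str());
        std::this_thread::sleep_for(std::chrono::seconds(2));
    }
    return 0;
}
```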
| Comment by Bruce Lucas (Inactive) [ 01/Apr/22 ] |
|
I think much of this might be contention with other threads for access to data structures guarded by mutexes, and if so it would not show up in the perf data. Also note that while the charts show blank spaces for missing FTDC samples, the computed values still generally reflect correct averages over the intervals between the samples that are collected, since FTDC metrics are generally cumulative counters: for example, a counter that reads 0 at t=0s and 1,200 at t=12s implies an average rate of 100/s over that window even if intermediate samples were dropped. You can often get a better visual representation by reducing the number of display samples in the t2 options menu. |
| Comment by Jordi Olivares Provencio [ 01/Apr/22 ] |
|
tcmalloc in particular seems to call the same methods repeatedly with different arguments, and the underlying implementation seems to be incredibly inefficient about it. |
| Comment by Louis Williams [ 01/Apr/22 ] |
|
As far as the "globalLock" section goes, building it requires iterating through the Lockers of all active Clients, which would explain the poor performance at high thread counts. |
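As an illustration of why that pattern is O(active clients) per sample, here is a hypothetical sketch (invented names, not the actual server code) of a section that aggregates per-client lock state under a registry mutex:

```cpp
// Invented names, illustrative only: a serverStatus-style section that must
// visit every active client's Locker under a registry mutex. Each FTDC
// sample therefore does O(clients) work while holding a lock that workload
// threads also touch, which degrades as the thread count grows.
#include <mutex>
#include <vector>

struct LockerStats {
    long long acquisitions = 0;
};

struct Client {
    std::mutex mu;       // guards this client's lock state
    LockerStats locker;
};

class ClientRegistry {
public:
    // Called once per FTDC sample by the collector thread.
    LockerStats aggregateGlobalLock() {
        std::lock_guard<std::mutex> lk(_mu);  // blocks client creation/teardown
        LockerStats total;
        for (Client* c : _clients) {
            std::lock_guard<std::mutex> clk(c->mu);
            total.acquisitions += c->locker.acquisitions;  // O(n) scan
        }
        return total;
    }

private:
    std::mutex _mu;
    std::vector<Client*> _clients;
};
```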