[SERVER-45255] Capture Pressure Stall Information in FTDC for Linux hosts Created: 19/Dec/19  Updated: 29/Nov/23  Resolved: 28/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 7.0.0-rc6, 6.0.8

Type: Improvement Priority: Major - P3
Reporter: Kevin Arhelger Assignee: Adrian Gonzalez Montemayor
Resolution: Fixed Votes: 3
Labels: RDY, former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-73237 Collect PSI (Pressure Stall Informati... Closed
Problem/Incident
Related
related to SERVER-77459 Verify /proc/pressure/cpu is readable... Closed
related to SERVER-83261 Capture pressure stall information in... Closed
is related to SERVER-77998 Allow 'full' when reading from /proc/... Closed
Assigned Teams:
Server Security
Backwards Compatibility: Fully Compatible
Backport Requested:
v7.0, v6.3, v6.0
Sprint: Security 2023-05-01
Participants:
Case:

 Description   

In newer kernels (e.g. RHEL 8.1), system-wide Pressure Stall Information (PSI) is available under /proc/pressure.

On systems that support it, capturing this information could be a valuable way to spot system-level issues more quickly.

https://www.kernel.org/doc/html/latest/accounting/psi.html
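
For illustration only, here is a minimal Python sketch of reading the files under /proc/pressure, assuming the "some"/"full" line format described in the kernel documentation linked above. The helper names parse_psi and read_psi are hypothetical, and this is not the FTDC collector that was eventually committed.

```python
from pathlib import Path


def parse_psi(text):
    """Parse the contents of one /proc/pressure/<resource> file.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]  # kind is "some" or "full"
        metrics = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            # "total" is cumulative stall time in microseconds;
            # the avg* fields are percentages over 10s, 60s and 300s.
            metrics[key] = int(value) if key == "total" else float(value)
        result[kind] = metrics
    return result


def read_psi(resource, root="/proc/pressure"):
    """Read and parse one resource file: "cpu", "memory" or "io"."""
    return parse_psi(Path(root, resource).read_text())


if __name__ == "__main__":
    for resource in ("cpu", "memory", "io"):
        try:
            print(resource, read_psi(resource))
        except OSError as exc:
            # Missing file or PSI disabled at boot.
            print(resource, "unavailable:", exc)
```

The cumulative total field is probably the most useful value for a periodic collector, since the difference between two samples gives the stall time in that interval regardless of the sampling period.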



 Comments   
Comment by Githook User [ 28/Jun/23 ]

Author:

{'name': 'Adrian Gonzalez', 'email': 'adriangonzalezmontemayor@gmail.com', 'username': 'adriangzz'}

Message: SERVER-45255 Capture Pressure Stall Information in FTDC for Linux hosts
Branch: v7.0
https://github.com/mongodb/mongo/commit/216a67e8e0436d20df0c3a8aa404c7db0f7561fd

Comment by Githook User [ 22/Jun/23 ]

Author:

{'name': 'Adrian Gonzalez', 'email': 'adriangonzalezmontemayor@gmail.com', 'username': 'adriangzz'}

Message: SERVER-45255 Capture Pressure Stall Information in FTDC for Linux hosts

(cherry picked from commit 136235a05516e7f2d56dc4eefa3ffb1ee04dee5b)
Branch: v6.0
https://github.com/mongodb/mongo/commit/16892cef720e4ef0aa3ea719e7564e71ad27fd69

Comment by Rachelle Palmer [ 20/Jun/23 ]

Requesting backport for 6.0 series, thank you!

Comment by Githook User [ 28/Apr/23 ]

Author:

{'name': 'Adrian Gonzalez', 'email': 'adriangonzalezmontemayor@gmail.com', 'username': 'adriangzz'}

Message: SERVER-45255 Capture Pressure Stall Information in FTDC for Linux hosts
Branch: master
https://github.com/mongodb/mongo/commit/f3f504d1bfa734c91163ab2e072d4caa7e358412

Comment by Ger Hartnett [ 11/Jan/23 ]

Atlas Graviton is now running on AL2 with a kernel of 5.10+

Comment by Mark Benvenuto [ 21/May/21 ]

Pressure Stall Information is not available in Amazon Linux 2. AL2 uses kernel 4.14, but PSI was added in 4.20.

Comment by Mark Benvenuto [ 16/Jan/20 ]

While RHEL 8.1 has PSI, it is not on by default. There is a kernel config setting, CONFIG_PSI_DEFAULT_DISABLED. On RHEL 8.1, it is set to "y", which means PSI is disabled by default. To enable it, a customer has to edit their GRUB config.

References:
https://nanxiao.me/en/enable-pressure-stall-information-psi-on-void-linux/

Comment by Mark Benvenuto [ 15/Jan/20 ]

PSI support was added in Linux 4.20. The polling interface was added in 5.2. Red Hat backported PSI to their 4.18 kernel in RHEL 8.1 as part of RHBZ#1678388. The only OS that we commercially support that includes this is RHEL 8.1 (Ubuntu 18.04 is too old). The forthcoming Ubuntu 20.04 should support it, though (they are testing on 5.4 in Launchpad).

By using the poll() interface, we can get window sizes as small as 500ms. This is better than the 10-second granularity provided by default when a file is read.

If we decide to add support, we should use the poll() interface in a dedicated thread. I am not sure what thresholds to use (should we look for stalls as short as 50ms?). Our dedicated thread would then set counters to indicate that stalls occurred, the type (cpu, memory, io) and the effect (some vs. full).
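
As a rough sketch of the dedicated-thread idea, the following Python registers a PSI trigger and counts threshold crossings. It assumes the trigger format from the kernel documentation ("<some|full> <stall µs> <window µs>" written to the pressure file, then poll() for POLLPRI). The names stall_counts, format_trigger, watch and start_watcher are hypothetical, the 50ms/500ms defaults are only the values floated above, and registering triggers may need elevated privileges on some kernels.

```python
import os
import select
import threading
from collections import Counter

# Hypothetical counters keyed by (resource, kind), e.g. ("memory", "full").
stall_counts = Counter()
counts_lock = threading.Lock()


def format_trigger(kind, stall_us, window_us):
    """Build the trigger string written to a /proc/pressure file."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    # The kernel documentation gives 500ms..10s as the allowed window range.
    if not 500_000 <= window_us <= 10_000_000:
        raise ValueError("window must be between 500ms and 10s")
    if not 0 < stall_us <= window_us:
        raise ValueError("stall threshold must be positive and fit in the window")
    return f"{kind} {stall_us} {window_us}\0".encode()


def watch(resource, kind, stall_us, window_us, stop):
    """Count threshold crossings for one resource until stop is set."""
    fd = os.open(f"/proc/pressure/{resource}", os.O_RDWR | os.O_NONBLOCK)
    try:
        os.write(fd, format_trigger(kind, stall_us, window_us))
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        while not stop.is_set():
            # Wake up once a second so the stop flag is noticed.
            for _fd, events in poller.poll(1000):
                if events & select.POLLERR:
                    return  # the trigger source is gone
                if events & select.POLLPRI:
                    with counts_lock:
                        stall_counts[(resource, kind)] += 1
    finally:
        os.close(fd)


def start_watcher(resource, kind, stall_us=50_000, window_us=500_000):
    """Run watch() in a daemon thread; returns (thread, stop_event)."""
    stop = threading.Event()
    thread = threading.Thread(
        target=watch, args=(resource, kind, stall_us, window_us, stop), daemon=True
    )
    thread.start()
    return thread, stop
```

A caller would start one watcher per (resource, kind) pair it cares about, for example start_watcher("io", "some"), and sample stall_counts from the metrics collector. This sketch uses one thread per trigger for simplicity; a single thread polling several descriptors would match the "dedicated thread" wording more closely.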

In my ad-hoc testing, though, I could not get it working on a RHEL 8.1 machine in EC2 that I had upgraded from RHEL 8. I was getting "Operation not supported" on reads and writes to the files under /proc/pressure. I was able to test it successfully on Fedora 31 with kernel 5.3.7, though.
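
A collector would need to tell these cases apart before trying to read anything. Below is a small Python sketch of such a probe. It assumes that a missing file means the kernel has no PSI support, and that "Operation not supported" (EOPNOTSUPP) means PSI is compiled in but switched off, for example CONFIG_PSI_DEFAULT_DISABLED=y without psi=1 on the kernel command line. That is one plausible explanation for the error above, not a confirmed diagnosis. The function name psi_status is hypothetical.

```python
import errno


def psi_status(path="/proc/pressure/cpu"):
    """Return "enabled", "disabled" or "unsupported" for a PSI file."""
    try:
        with open(path) as f:
            f.read()
    except FileNotFoundError:
        # No /proc/pressure entry: kernel predates PSI or was built without it.
        return "unsupported"
    except OSError as exc:
        if exc.errno == errno.EOPNOTSUPP:
            # File exists but reads fail: PSI is likely disabled at boot.
            return "disabled"
        raise  # permission problems etc. are left to the caller
    return "enabled"


if __name__ == "__main__":
    print(psi_status())
```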

References:
https://facebookmicrosites.github.io/psi/docs/overview
https://unixism.net/2019/08/linux-pressure-stall-information-psi-by-example/

Generated at Thu Feb 08 05:08:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.