[SERVER-73237] Collect PSI (Pressure Stall Information) in FTDC Created: 24/Jan/23  Updated: 30/Jan/23  Resolved: 30/Jan/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Miguel Angel Nieto Assignee: Backlog - Security Team
Resolution: Duplicate Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-45255 Capture Pressure Stall Information in... Closed
Assigned Teams:
Server Security
Participants:

 Description   

Hello,

Recent Linux kernel versions (4.20 and later) include PSI (Pressure Stall Information), which is very useful for understanding the pressure on resources such as CPU, memory, and I/O.

The PSI feature identifies and quantifies the disruptions caused by resource crunches and the time impact they have on complex workloads or even entire systems.

We only need to read `/proc/pressure/resource_name`, where resource_name can be cpu, memory, or io.
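
As a rough illustration of how little is involved, here is a minimal Python sketch of reading and parsing those files. It is not the FTDC implementation; the names `parse_psi` and `read_all_psi` are hypothetical and exist only for this example. It assumes the kernel's standard PSI line format (`some avg10=... avg60=... avg300=... total=...`, plus a `full` line where available) and silently skips any resource whose file is missing or unreadable, for example on kernels built or booted without PSI.

```python
from pathlib import Path

# Standard PSI location on Linux; the three resources exposed by the kernel.
PSI_DIR = Path("/proc/pressure")
RESOURCES = ("cpu", "memory", "io")


def parse_psi(text):
    """Parse the contents of one /proc/pressure/<resource> file.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    Returns a dict such as {"some": {"avg10": 0.0, ..., "total": 0}, ...}.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        metrics = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            # "total" is a cumulative stall time in microseconds (integer);
            # the avg* fields are percentages over 10s/60s/300s windows.
            metrics[key] = int(value) if key == "total" else float(value)
        result[kind] = metrics
    return result


def read_all_psi(psi_dir=PSI_DIR):
    """Hypothetical helper: collect PSI for every resource that is readable."""
    samples = {}
    for resource in RESOURCES:
        try:
            samples[resource] = parse_psi((psi_dir / resource).read_text())
        except OSError:
            # File missing or PSI disabled: skip this resource.
            continue
    return samples


if __name__ == "__main__":
    for resource, data in read_all_psi().items():
        print(resource, data)
```

A periodic collector would only need to call something like `read_all_psi()` on each sampling interval and store the resulting numbers alongside the other system metrics.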

This information would be really helpful when doing analysis for our customers, since it would give us a good metric of resource pressure prior to an event that is being investigated.

In the future, once this is implemented, we could even graph it in our tools and in the Atlas interface to see whether clusters are close to the stall point, or use that information to alert our customers before the stall itself happens.

Let me know if you have any questions.

Regards.



 Comments   
Comment by Miguel Angel Nieto [ 27/Jan/23 ]

Thinking about this, I guess that if we want other tools (like the monitoring agent or the Atlas interface) to use this information, it would need to go into serverStatus.

Comment by Eric Sedor [ 26/Jan/23 ]

TBD whether this goes in serverStatus or just in one of the FTDC collectors.

Generated at Thu Feb 08 06:24:04 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.