[SERVER-54472] Collect additional FTDC metrics during shutdown for replica sets Created: 11/Feb/21  Updated: 29/Oct/23  Resolved: 31/Mar/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.0.0-rc0

Type: Improvement Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Gregory Wlodarek
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File FTDC-1.png     PNG File FTDC-2.png     PNG File FTDC-3.png     PNG File ftdc.png     File metrics.2021-03-30T17-55-08Z-00000    
Issue Links:
Related
is related to SERVER-53708 Excess memory usage during shutdown i... Closed
is related to SERVER-48221 Shut down ftdc after storage engine Closed
Backwards Compatibility: Fully Compatible
Sprint: Execution Team 2021-03-22, Execution Team 2021-04-05
Participants:

 Description   

In SERVER-48221 we enabled collection of FTDC metrics during WT shutdown; however, we limited the metrics collected to system information gathered outside of serverStatus.

There is some additional system-related information in serverStatus that would be useful to collect, such as the tcmalloc and extra_info sections. This would allow us to diagnose CPU and memory problems during shutdown, and to monitor memory usage, including during shutdown, in our performance tests.

It should be possible to collect this information since it is system-related and not (logically) dependent on MongoDB-specific data structures.
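
For context, both sections can be requested directly from serverStatus. Below is a minimal sketch (not the server's internal FTDC collector) using pymongo against a local mongod; note that the tcmalloc section only appears on builds that use tcmalloc as the allocator.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Explicitly request the system-related sections this ticket proposes to keep
# collecting during shutdown; tcmalloc verbosity 2 includes the detailed
# allocator breakdown.
status = client.admin.command({"serverStatus": 1, "tcmalloc": 2, "extra_info": 1})

print(status["tcmalloc"]["generic"])  # e.g. current_allocated_bytes, heap_size
print(status["extra_info"])           # e.g. page_faults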



 Comments   
Comment by Githook User [ 31/Mar/21 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-54472 Collect additional FTDC metrics during shutdown for replica sets
Branch: master
https://github.com/mongodb/mongo/commit/ba8681015ac79e0b2f3fa1facdf9ff07c0282322

Comment by Gregory Wlodarek [ 30/Mar/21 ]

I've found the issue. I tested my changes against a standalone node only, not against a replica set, where there's a minor difference in how shutdown proceeds. I'll use this ticket to fix that.

I attached sample metrics with the fix.

Comment by Bruce Lucas (Inactive) [ 25/Feb/21 ]

Maybe, but the fact that the ShutdownInProgress error is the (entire) content of serverStatus during this time suggests to me that the problem doesn't originate in WT per se.

Comment by Gregory Wlodarek [ 25/Feb/21 ]

bruce.lucas sorry for getting back to you late. This is an interesting case; I wonder if something during WT shutdown is preventing serverStatus from running properly. I think this is worth us investigating.

Comment by Bruce Lucas (Inactive) [ 19/Feb/21 ]

The content of serverStatus seen in FTDC during shutdown in the above case consists of just the following fields:

ok: 0
errmsg: "The server is in quiesce mode and will shut down"
code: 91
codeName: "ShutdownInProgress"
remainingQuiesceTimeMillis: 0
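
For illustration, a client polling serverStatus during the quiesce window observes the same thing: the command fails with code 91 and the error document above. A minimal pymongo sketch, assuming a node that is currently quiescing:

from pymongo import MongoClient
from pymongo.errors import OperationFailure

client = MongoClient("mongodb://localhost:27017")

try:
    client.admin.command("serverStatus")
except OperationFailure as exc:
    # During quiesce mode the command fails with ShutdownInProgress (91);
    # exc.details is the error document shown above.
    print(exc.code, exc.details.get("codeName"),
          exc.details.get("remainingQuiesceTimeMillis"))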

Comment by Bruce Lucas (Inactive) [ 18/Feb/21 ]

gregory.wlodarek if so, then I may have misinterpreted what I was seeing in SERVER-53708. From iterations-08.tgz on that ticket:

A marks the point where we start shutting down WT. The system memory metrics (from the systemMetrics section of FTDC) show the increasing memory that WT is using during shutdown, but the tcmalloc and resident memory stats (from serverStatus) are showing as 0.

This is a little different from your experiment - the delay here is during WT shutdown, whereas in your experiment you introduced the delay after shutdown. I wonder if it's the behavior during WT shutdown that's problematic.

In case it matters, the above is on build 4.9.0-alpha-1214-g54e8728.

Comment by Gregory Wlodarek [ 18/Feb/21 ]

bruce.lucas I attached screenshots from t2 showing metrics we continue to collect after storage engine shutdown. In the screenshots, the server was idle for a minute, then I hung shutdown for another minute after the storage engine had shut down. Both tcmalloc and extra_info continue to be collected. Am I missing something?
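
For reference, the shape of that experiment, sketched with pymongo. The failpoint name hangAfterStorageEngineShutdown is hypothetical and used purely for illustration; configureFailPoint itself is a real test-only command and requires starting mongod with --setParameter enableTestCommands=1.

from pymongo import MongoClient
from pymongo.errors import AutoReconnect

client = MongoClient("mongodb://localhost:27017")

# Arm a (hypothetical) failpoint so shutdown pauses after the storage engine
# has been shut down, holding the post-storage window open.
client.admin.command({"configureFailPoint": "hangAfterStorageEngineShutdown",
                      "mode": "alwaysOn"})

# Ask the server to shut down; the connection drops once shutdown begins.
try:
    client.admin.command({"shutdown": 1})
except AutoReconnect:
    pass

# While shutdown is held open, inspect <dbpath>/diagnostic.data to confirm
# that tcmalloc and extra_info samples continue to appear.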

Comment by Bruce Lucas (Inactive) [ 11/Feb/21 ]

SERVER-53708 is an example of the kind of issue we need to be able to detect in testing (without relying on actually hitting OOM as we did in that ticket).
