[SERVER-48221] Shut down ftdc after storage engine Created: 14/May/20  Updated: 29/Oct/23  Resolved: 13/Nov/20

Status: Closed
Project: Core Server
Component/s: Diagnostics
Affects Version/s: None
Fix Version/s: 4.9.0, 4.4.3

Type: Improvement Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Gregory Wlodarek
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File metrics.2020-11-12T15-03-08Z-00000    
Issue Links:
Backports
Related
related to SERVER-58026 Omitted FTDC sections cause frequent ... Backlog
related to SERVER-59065 CatalogStats uses unsafe CollectionCa... Closed
related to SERVER-25042 Start diagnostic data collection as e... Closed
related to SERVER-27692 Can diagnostic data capture be stoppe... Closed
related to SERVER-54472 Collect additional FTDC metrics durin... Closed
is related to SERVER-52815 Investigate calling 'rollback_to_stab... Backlog
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4, v4.2
Sprint: Execution Team 2020-06-29, Execution Team 2020-07-13, Execution Team 2020-07-27, Execution Team 2020-09-21, Execution Team 2020-10-05, Execution Team 2020-11-16
Participants:

 Description   

We've encountered cases of issues in WT shutdown like WT-6164 where diagnosis was difficult (and likely impossible in the field) because ftdc is shut down before the storage engine. We should move ftdc shutdown after storage engine shutdown.
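
The effect of the proposed reordering can be illustrated with a minimal sketch. It is not the server's shutdown code: `DiagnosticsRecorder` and `run_shutdown` are hypothetical names invented for illustration, and the storage engine shutdown is reduced to a single recorded event.

```python
class DiagnosticsRecorder:
    """Hypothetical stand-in for FTDC: keeps samples only while running."""

    def __init__(self):
        self.running = True
        self.samples = []

    def collect(self, event):
        if self.running:
            self.samples.append(event)

    def stop(self):
        self.running = False


def run_shutdown(recorder, stop_diagnostics_last):
    """Run a two-step shutdown and return what the recorder captured."""
    if not stop_diagnostics_last:
        recorder.stop()  # old order: diagnostics go first
    # Storage engine shutdown happens here; anything sampled during it
    # is only kept if the recorder is still running.
    recorder.collect("storage engine shutdown")
    if stop_diagnostics_last:
        recorder.stop()  # proposed order: diagnostics go last
    return recorder.samples
```

With `stop_diagnostics_last=False` the recorder captures nothing from the storage engine shutdown, which is the diagnostic gap described above.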



 Comments   
Comment by Githook User [ 14/Dec/20 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-48221 Shut down ftdc after storage engine

(cherry picked from commit 185000ad894d5cb95a3c946158712054db57cb7a)
Branch: v4.4
https://github.com/mongodb/mongo/commit/73267a2a82d164a3457805c80e2173a5a4f1db60

Comment by Githook User [ 14/Dec/20 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-48221 Shut down ftdc after storage engine

(cherry picked from commit f7274c2082f729c1715b2dfd7fd233016a2a58a6)
Branch: v4.4
https://github.com/10gen/mongo-enterprise-modules/commit/15a7aca8711be9f9009c17f26d65bd734e7f488c

Comment by Githook User [ 13/Nov/20 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-48221 Shut down ftdc after storage engine
Branch: master
https://github.com/mongodb/mongo/commit/185000ad894d5cb95a3c946158712054db57cb7a

Comment by Githook User [ 13/Nov/20 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-48221 Shut down ftdc after storage engine
Branch: master
https://github.com/10gen/mongo-enterprise-modules/commit/f7274c2082f729c1715b2dfd7fd233016a2a58a6

Comment by Gregory Wlodarek [ 12/Nov/20 ]

I've split off the rollback_to_stable before WT_CONNECTION::close work into SERVER-52815 to make it easier to review the two changes separately.

Comment by Ian Whalen (Inactive) [ 31/Aug/20 ]

gregory.wlodarek, after you finish this work, can you please talk with Mark so we can figure out what, if anything, we should do with SERVER-27692?

Comment by Haribabu Kommi [ 26/Aug/20 ]

I remember two issues that occurred while WT was shutting down:

  1. An increase in memory usage led to an OOM error
  2. The shutdown operation was very slow

The increase in memory usage mentioned in WT-6164 is related to how removed keys are cleared from the data store, which led to history store pages being loaded recursively into the cache. This happened during the final checkpoint that runs during WT shutdown.

The rollback to stable operation at shutdown read many unnecessary pages into the cache, which made the shutdown slow.

With FTDC statistics available during shutdown, such issues could be identified easily. If MDB performs rollback to stable itself before calling connection close, FTDC can capture the metrics related to rollback to stable. Still, we may also need to collect all the metrics produced during WT shutdown in order to diagnose the first problem.

Comment by Bruce Lucas (Inactive) [ 25/Aug/20 ]

Mostly that makes sense and it's no loss not to report metrics on subsystems that have shut down, but serverStatus contains some system metrics that would be useful and should be still valid, if it is possible to continue to collect, such as tcmalloc and (less importantly) extra_info.

Comment by Gregory Wlodarek [ 25/Aug/20 ]

bruce.lucas, it should be possible to continue gathering system metrics (external to MongoDB) but not any server status metrics (internal to MongoDB), as the state of the objects used to generate those metrics is unknown and they may be unsafe to access while shutdown cleans them up.
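
One way to picture this split is a registry from which a collector is removed before its subsystem is torn down, so later samples only touch what is still safe to read. This is a sketch under assumptions, not how FTDC is implemented: `CollectorRegistry` and the collector names and values are purely illustrative.

```python
import threading


class CollectorRegistry:
    """Hypothetical registry of metric collectors keyed by section name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._collectors = {}

    def register(self, name, fn):
        with self._lock:
            self._collectors[name] = fn

    def disable(self, name):
        # Taking the same lock as sample() means this call returns only
        # after any in-progress sample has finished using the collector.
        with self._lock:
            self._collectors.pop(name, None)

    def sample(self):
        # Collect from every collector that is still registered.
        with self._lock:
            return {name: fn() for name, fn in self._collectors.items()}
```

In this sketch, shutdown would call `disable` on the internal collector before destroying the objects it reads, while a collector for system-level metrics stays registered until the recorder itself stops.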

Comment by Bruce Lucas (Inactive) [ 25/Aug/20 ]

I think that sounds reasonable (if true).

If it's not too much additional work it would also be useful to move ftdc shutdown after wt shutdown (omitting wt metrics) in case there are additional surprisingly expensive operations, or will be in the future. Even though wt metrics would not be available, in such a case there might still be enough information in the remaining metrics, e.g. disk and cpu, to be useful. Also it would prepare for the possibility of improving wt so that its metrics are available during shutdown.

Comment by Daniel Gottlieb (Inactive) [ 25/Aug/20 ]

It's been a while, but I believe bruce.lucas and I discussed this offline, somewhere. IIRC, bruce.lucas was mostly interested in having FTDC for WT statistics while WT was shutting down. This specifically became of interest because one of the costs of WT shutdown in 4.2 and earlier was typically bounded by how much data had come into the system since the last checkpoint (or how far the stable timestamp moved since the last checkpoint). But with durable history, the cost is now bounded by how much data exists ahead of the stable timestamp (which is much easier to grow). This is because WT now calls rollback_to_stable inside of WT_CONNECTION::close.

Assuming that's the correct motivation/area of interest to target, one compromise solution is for MDB to explicitly call rollback_to_stable on shutdown while FTDC is still running. This would succeed if the following assumptions are true:

  • WT_CONNECTION::close would not need to duplicate much, if any, of the work accomplished by MDB calling rollback_to_stable.
  • WT produces meaningful metrics while rollback_to_stable is running.
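
The two assumptions can be made concrete with a toy model. This is only a sketch: `FakeConnection` is a hypothetical stand-in and not the WiredTiger API, and "pages ahead of stable" is an invented measure of the work that rollback has to undo.

```python
class FakeConnection:
    """Hypothetical connection whose close() performs its own rollback."""

    def __init__(self, pages_ahead_of_stable):
        self.pages_ahead_of_stable = pages_ahead_of_stable
        self.closed = False

    def rollback_to_stable(self):
        # Undo everything ahead of the stable point; report how much work it was.
        undone = self.pages_ahead_of_stable
        self.pages_ahead_of_stable = 0
        return undone

    def close(self):
        # Mirrors the behaviour described above: close rolls back internally.
        undone = self.rollback_to_stable()
        self.closed = True
        return undone


def shutdown_with_diagnostics(conn, samples):
    """Roll back explicitly while diagnostics can still record, then close."""
    samples.append(("rollback_to_stable", conn.rollback_to_stable()))
    # From here on, this sketch assumes no more metrics are sampled.
    return conn.close()  # work left for close() to repeat
```

If the first assumption holds, the value returned by `close()` is zero or close to it, so the expensive part of shutdown happens while samples are still being recorded.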

Comment by Eric Milkie [ 11/Jun/20 ]

For now FTDC will only be able to get metrics outside of WiredTiger, as it is not currently safe to call into WiredTiger to fetch metrics while another thread is in connection->close().

Comment by Daniel Gottlieb (Inactive) [ 27/May/20 ]

bruce.lucas, just to clarify: would you expect FTDC to attempt to get statistics from WT while it's in WT_CONNECTION::close? It seems like they would be useful, but there's obviously a safety problem. Or are you only interested in all of the other server metrics?

Comment by Bruce Lucas (Inactive) [ 14/May/20 ]

Possibly this could be coupled with SERVER-25042 (start ftdc as early as possible) as that has also caused diagnostic difficulties.
