[SERVER-48221] Shut down ftdc after storage engine Created: 14/May/20 Updated: 29/Oct/23 Resolved: 13/Nov/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Diagnostics |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0, 4.4.3 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Gregory Wlodarek |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.4, v4.2 |
| Sprint: | Execution Team 2020-06-29, Execution Team 2020-07-13, Execution Team 2020-07-27, Execution Team 2020-09-21, Execution Team 2020-10-05, Execution Team 2020-11-16 |
| Description |
|
We've encountered cases of issues in WT shutdown like |
| Comments |
| Comment by Githook User [ 14/Dec/20 ] |
|
Author: {'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'} Message: (cherry picked from commit 185000ad894d5cb95a3c946158712054db57cb7a) |
| Comment by Githook User [ 14/Dec/20 ] |
|
Author: {'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'} Message: (cherry picked from commit f7274c2082f729c1715b2dfd7fd233016a2a58a6) |
| Comment by Githook User [ 13/Nov/20 ] |
|
Author: {'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}Message: |
| Comment by Githook User [ 13/Nov/20 ] |
|
Author: {'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}Message: |
| Comment by Gregory Wlodarek [ 12/Nov/20 ] |
|
I've split off the rollback_to_stable before WT_CONNECTION::close work into SERVER-52815 to make it easier to review the two changes separately. |
| Comment by Ian Whalen (Inactive) [ 31/Aug/20 ] |
|
gregory.wlodarek after you finish this work can you please talk with Mark so we can figure out what, if anything, we should do with |
| Comment by Haribabu Kommi [ 26/Aug/20 ] |
|
I remember two issues that occurred while WT was shutting down.
The increase in memory usage that is mentioned in The shutdown rollback to stable operation read many unnecessary pages into cache, which slowed down the shutdown. With FTDC statistics available during shutdown, this can be identified easily. If MDB performs rollback to stable itself before calling connection close, FTDC can capture the metrics related to rollback to stable. Still, metrics covering the whole of WT shutdown may also be required to diagnose the first problem. |
| Comment by Bruce Lucas (Inactive) [ 25/Aug/20 ] |
|
Mostly that makes sense, and it's no loss not to report metrics on subsystems that have shut down. However, serverStatus contains some system metrics, such as tcmalloc and (less importantly) extra_info, that would be useful and should still be valid, if it is possible to continue collecting them. |
| Comment by Gregory Wlodarek [ 25/Aug/20 ] |
|
bruce.lucas, it should be possible to continue gathering system metrics (external to MongoDB) but not any server status metrics (internal to MongoDB), because the state of the objects used to generate those metrics is unknown and they may be unsafe to access while shutdown cleans them up. |
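
The split described above can be illustrated with a minimal sketch. It is not MongoDB code: `Collector`, `DiagnosticSampler`, the `internal` flag, and the sample values are all hypothetical names and data invented for this example. It only shows the idea of skipping collectors that read server-owned objects once shutdown has begun, while collectors that read system-level data keep running.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Collector:
    """Hypothetical metrics source; not a MongoDB class."""
    name: str
    collect: Callable[[], dict]
    internal: bool  # True if it reads objects owned by the server process


@dataclass
class DiagnosticSampler:
    """Hypothetical sampler that drops internal collectors during shutdown."""
    collectors: List[Collector] = field(default_factory=list)
    shutting_down: bool = False

    def begin_shutdown(self) -> None:
        # From here on, server-owned objects may be destroyed at any time.
        self.shutting_down = True

    def sample(self) -> Dict[str, dict]:
        out: Dict[str, dict] = {}
        for c in self.collectors:
            if self.shutting_down and c.internal:
                continue  # unsafe: backing objects may already be cleaned up
            out[c.name] = c.collect()
        return out


# Made-up collectors and values, for illustration only.
sampler = DiagnosticSampler([
    Collector("serverStatus", lambda: {"connections": 12}, internal=True),
    Collector("systemMetrics", lambda: {"cpu_user_ms": 1500}, internal=False),
])
before = sampler.sample()
sampler.begin_shutdown()
after = sampler.sample()
```

In this sketch `before` contains both sources and `after` contains only `systemMetrics`. Whether individual serverStatus sections such as tcmalloc could be kept is a separate question that the sketch does not address.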
| Comment by Bruce Lucas (Inactive) [ 25/Aug/20 ] |
|
I think that sounds reasonable (if true). If it's not too much additional work it would also be useful to move ftdc shutdown after wt shutdown (omitting wt metrics) in case there are additional surprisingly expensive operations, or will be in the future. Even though wt metrics would not be available, in such a case there might still be enough information in the remaining metrics, e.g. disk and cpu, to be useful. Also it would prepare for the possibility of improving wt so that its metrics are available during shutdown. |
| Comment by Daniel Gottlieb (Inactive) [ 25/Aug/20 ] |
|
It's been a while, but I believe bruce.lucas and I discussed this offline, somewhere. IIRC, bruce.lucas was mostly interested in having FTDC for WT statistics while WT was shutting down. This became of particular interest because one of the costs of WT shutdown in 4.2 and earlier was typically bounded by how much data had come into the system since the last checkpoint (or how far the stable timestamp moved since the last checkpoint). But with durable history, the cost is now bounded by how much data exists ahead of the stable timestamp (which is much easier to grow). This is because WT now calls rollback_to_stable inside of WT_CONNECTION::close. Assuming that's the correct motivation/area of interest to target, one compromise is for MDB to explicitly call rollback_to_stable on shutdown while FTDC is still running. This would succeed if the following assumptions are true:
|
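
The ordering proposed in this comment, combined with the ticket's goal of stopping FTDC after the storage engine, might look roughly like the sketch below. It is a single-threaded model with stand-in classes (`FakeStorageEngine`, `FakeFTDC`, `shutdown`, and the event names are all hypothetical). It does not use the WiredTiger or MongoDB APIs, and in the real server FTDC samples from a background thread rather than being called inline.

```python
from typing import List


class FakeStorageEngine:
    """Stand-in for the storage engine; not the WiredTiger API."""

    def __init__(self, events: List[str]) -> None:
        self.events = events

    def rollback_to_stable(self) -> None:
        self.events.append("rollback_to_stable")

    def close(self) -> None:
        self.events.append("storage_close")


class FakeFTDC:
    """Stand-in for the diagnostic data collector."""

    def __init__(self, events: List[str]) -> None:
        self.events = events
        self.storage_metrics_enabled = True

    def sample(self) -> None:
        if self.storage_metrics_enabled:
            self.events.append("sample_with_storage")
        else:
            self.events.append("sample_system_only")

    def disable_storage_metrics(self) -> None:
        self.storage_metrics_enabled = False

    def stop(self) -> None:
        self.events.append("ftdc_stop")


def shutdown(engine: FakeStorageEngine, ftdc: FakeFTDC) -> None:
    # 1. Do the potentially expensive rollback explicitly, before close,
    #    so storage metrics can still be sampled around it.
    engine.rollback_to_stable()
    ftdc.sample()
    # 2. Stop touching the engine before it starts closing.
    ftdc.disable_storage_metrics()
    engine.close()
    # 3. Only non-storage metrics are collected past this point.
    ftdc.sample()
    # 4. The collector goes down last.
    ftdc.stop()


events: List[str] = []
shutdown(FakeStorageEngine(events), FakeFTDC(events))
```

After the call, `events` records the rollback first, a sample that still includes storage metrics, the engine close, a system-only sample, and finally the collector stop.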
| Comment by Eric Milkie [ 11/Jun/20 ] |
|
For now FTDC will only be able to get metrics outside of WiredTiger, as it is not currently safe to call into WiredTiger to fetch metrics while another thread is in connection->close(). |
| Comment by Daniel Gottlieb (Inactive) [ 27/May/20 ] |
|
bruce.lucas, just to clarify: would you expect FTDC to attempt to get statistics from WT while it's in WT_CONNECTION::close? It seems like they would be useful, but there's obviously a safety problem. Or are you only interested in all of the other server metrics? |
| Comment by Bruce Lucas (Inactive) [ 14/May/20 ] |
|
Possibly this could be coupled with |