[SERVER-52815] Investigate calling 'rollback_to_stable' prior to shutting down WiredTiger
Created: 12/Nov/20  Updated: 24/Feb/23

Status: Backlog
Project: Core Server
Component/s: Diagnostics, Storage
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Gregory Wlodarek
Assignee: Backlog - Storage Execution Team
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-48221 Shut down ftdc after storage engine Closed
is related to SERVER-74331 rollback-to-stables cause mongod shut... Closed
Assigned Teams:
Storage Execution
Sprint: Execution Team 2020-11-16, Execution Team 2020-12-14
Participants:

 Description   

This became of interest because, in 4.2 and earlier, one of the costs of WT shutdown was typically bounded by how much data had come into the system since the last checkpoint (or by how far the stable timestamp had moved since the last checkpoint). With durable history, the cost is instead bounded by how much data exists ahead of the stable timestamp, an amount that can grow much more easily. This is because WT now calls rollback_to_stable inside WT_CONNECTION::close.

Assuming that is the correct motivation and area of interest to target, one compromise is for MDB to call rollback_to_stable explicitly on shutdown while FTDC is still running. This would succeed if the following assumptions hold:

  • WT_CONNECTION::close would need to repeat little, if any, of the work already done by MDB's call to rollback_to_stable.
  • WT produces meaningful metrics while rollback_to_stable is running.
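
The proposed ordering can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyEngine and ToySampler are hypothetical stand-ins, not the WiredTiger or FTDC APIs, and the model simply assumes both bullets above are true (close finds nothing left to roll back, and rollback progress is visible as a statistic).

```python
class ToyEngine:
    """Hypothetical stand-in for the storage engine; not the WiredTiger API."""

    def __init__(self, updates_ahead_of_stable):
        self.pending = updates_ahead_of_stable  # work that rollback must undo
        self.rolled_back = 0                    # statistic a sampler can read
        self.closed = False

    def rollback_to_stable(self, on_progress=None):
        """Discard updates ahead of the stable timestamp, one unit at a time."""
        steps = 0
        while self.pending > 0:
            self.pending -= 1
            self.rolled_back += 1
            steps += 1
            if on_progress is not None:
                on_progress(self)
        return steps

    def close(self):
        """Model close() as running rollback itself; return the work it did."""
        steps = self.rollback_to_stable()
        self.closed = True
        return steps


class ToySampler:
    """Hypothetical stand-in for FTDC: records statistics only while running."""

    def __init__(self):
        self.running = True
        self.samples = []

    def sample(self, engine):
        if self.running:
            self.samples.append(engine.rolled_back)

    def stop(self):
        self.running = False


def shutdown_close_only(engine, sampler):
    """Current ordering: the sampler stops first, so rollback runs unobserved."""
    sampler.stop()
    return engine.close()


def shutdown_explicit_rollback(engine, sampler):
    """Proposed ordering: roll back while the sampler can still observe it."""
    engine.rollback_to_stable(on_progress=sampler.sample)
    sampler.stop()
    return engine.close()
```

In this model the explicit-rollback ordering leaves close() with no rollback work and gives the sampler one observation per unit rolled back. Whether the real WT_CONNECTION::close behaves this way is exactly the first assumption above.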


 Comments   
Comment by Gregory Wlodarek [ 15/Dec/20 ]

I'm reassigning this to the backlog for now because the change is fragile and alternatives should be explored. The main problems are:

  1. On shutdown, a global kill flag puts newly created operation contexts into an interrupted state right away. FTDC creates a new operation context each time it runs, so each of its runs during shutdown would start out interrupted.
  2. rollback_to_stable requires that no transactions be open. The proposed change in the code review does not guarantee that; it is a best-effort attempt that relies on shutting down most subsystems. The current code guarantees it with a global exclusive lock, which would no longer be possible.
  3. Shutting down the SessionCache before running rollback_to_stable, to satisfy the "no transactions running" check, prevents FTDC from accessing WT statistics; FTDC would need a way to bypass the SessionCache.
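
Point 1 above can be sketched with a small toy model. All names here (ToyServiceContext, ToyOperationContext, collect_once) are hypothetical and only mimic the behaviour described in the comment; they are not the server's actual classes.

```python
class ToyInterrupted(Exception):
    """Raised when an operation context is already marked as killed."""


class ToyOperationContext:
    """Hypothetical operation context that may start out interrupted."""

    def __init__(self, interrupted):
        self.interrupted = interrupted

    def check_for_interrupt(self):
        if self.interrupted:
            raise ToyInterrupted("operation context created after kill flag was set")


class ToyServiceContext:
    """Hypothetical owner of the global kill flag."""

    def __init__(self):
        self.kill_all_operations = False

    def set_kill_all_operations(self):
        # Modeled as what happens at the start of shutdown.
        self.kill_all_operations = True

    def make_operation_context(self):
        # New contexts inherit the global flag at creation time.
        return ToyOperationContext(interrupted=self.kill_all_operations)


def collect_once(service, read_stats):
    """One collection run, with a fresh context per run as the comment describes."""
    opctx = service.make_operation_context()
    opctx.check_for_interrupt()
    return read_stats()
```

In this model a collection run succeeds before the kill flag is set and raises ToyInterrupted afterwards, which is why keeping FTDC alive during shutdown would need some exemption from the flag.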

One alternative would be for WT to explore allowing statistics to be gathered during WT_CONNECTION::close().

 
