[SERVER-37860] Only archive WT journal files after a successful MongoDB startup. Created: 01/Nov/18  Updated: 29/Oct/23  Resolved: 24/Jun/20

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 4.7.0

Type: New Feature Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Gregory Noma
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on WT-6428 Fixes for checkpoint retention Closed
Related
Backwards Compatibility: Fully Compatible
Sprint: Execution Team 2020-06-15, Execution Team 2020-06-29
Participants:
Case:
Linked BF Score: 15

 Description   

Once WT starts up, it can begin archiving journal files that are no longer needed. However, when MongoDB startup fails, the journal files left over from the previous run may still contain clues about how the system got into a bad state; archiving them too early destroys that evidence. WT journal files should therefore only be archived after a successful MongoDB startup.



 Comments   
Comment by Githook User [ 24/Jun/20 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-37860 Retain some WT log files on startup from previous runs when testing is enabled
Branch: master
https://github.com/mongodb/mongo/commit/4b7e5ea24f29e13d13a17c9fa2ba88b542045ed6

Comment by Susan LoVerso [ 01/Jun/20 ]

daniel.gottlieb this is a fortuitous time to resuscitate this discussion.

There are two debug_mode settings that might be useful to you. One was only added last week. In WT-6302, we added a debug_mode=(log_retention=#) API for debugging. Here's the description from dist/api_data.py:

            adjust log archiving to retain at least this number of log files, ignored if set to 0.
            (Warning: this option can remove log files required for recovery if no checkpoints
            have yet been done and the number of log files exceeds the configured value. As
            WiredTiger cannot detect the difference between a system that has not yet checkpointed
            and one that will never checkpoint, it might discard log files before any checkpoint is
            done.)

So if you know that "recent" logs would be helpful, particularly for testing/debugging, you can set this as an archiving lag.

Similarly, you can avoid worrying about how many log files are needed and simply keep the logs for the last N checkpoints with debug_mode=(checkpoint_retention=#).
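As a sketch, the two retention knobs described above would be passed in the wiredtiger_open configuration string roughly as follows. The option spellings follow the dist/api_data.py description quoted earlier; the surrounding call shape is illustrative only:

```
# keep at least 5 log files, regardless of archiving eligibility
# (warning above applies: may discard logs needed for recovery if
#  no checkpoint has run yet)
wiredtiger_open(home, NULL, "debug_mode=(log_retention=5)", &conn)

# or: keep the log files covering the last 3 checkpoints
wiredtiger_open(home, NULL, "debug_mode=(checkpoint_retention=3)", &conn)
```

Either setting gives MongoDB a window of post-checkpoint journal history for debugging without disabling archiving outright.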

Comment by Daniel Gottlieb (Inactive) [ 01/Jun/20 ]

We now have BFs where this would help. Particularly now that we capture operations as no-ops in the WT journal, the journal has regained its utility, and out-of-order timestamps (ghost timestamps) are no longer supported by the implementation.

sue.loverso mentioned a path to making this happen:

    If MongoDB rapidly checkpoints during its startup then yes, archiving will happen but MongoDB is in control of when archiving will happen based on when checkpoint is called the first time.

I'm certain that's not part of the public contract WT provides, but if it's still true, that should satisfy the BF problem. Ideally we'd prevent checkpointing/truncation until the server is listening on a port. I'm curious if that would cause problems. Replication recovery does do a bunch of writes, but the oldest/stable timestamp is not advanced at startup. I know that for WT, the process of doing a checkpoint is "good hygiene", but I don't know if that also requires oldest/stable moving.

Comment by Daniel Gottlieb (Inactive) [ 01/Nov/18 ]

    We could entertain some kind of configuration settings where the slot buffers don't use the file size in the calculation and force the default size or force reconfiguration to change the file size but not the buffers.

Our use case would be for when the application is relatively idle (i.e: before users can connect, but after some internal background threads get kicked off). Additionally, I don't think it'd be a problem to incur a reallocation cost at this stage in the process lifetime.

    If MongoDB rapidly checkpoints during its startup then yes, archiving will happen but MongoDB is in control of when archiving will happen based on when checkpoint is called the first time.

MongoDB does typically checkpoint right at startup. That's a clever way to avoid the need to reconfigure the archiving settings! We'd still have to be careful about accumulating too many large, empty WT journal files.

Comment by Susan LoVerso [ 01/Nov/18 ]

Based on comments in the code (thank you 3-years-ago me), the reason file_max is not reconfigurable is because the memory in the log slot buffers may be based on the log file size, and we don't want to reallocate that memory on a running system. Reallocating it would be a pretty large undertaking: the logging subsystem can be in use during a reconfig, so reallocating the buffers would be disruptive. Specifically, if a small log file size is given, then the buffers are smaller (whereas they're a constant/default with the 100Mb size).

That is the current state of the code. We could entertain some kind of configuration settings where the slot buffers don't use the file size in the calculation and force the default size or force reconfiguration to change the file size but not the buffers. We could consider a force option that does force reallocating the slot buffers and blocking out the logging subsystem during that time.

Also, based on your ticket description, WT already does what you describe. It does not archive after the checkpoint that is part of recovery. It starts archiving after the first post-wiredtiger_open checkpoint completes. If MongoDB rapidly checkpoints during its startup then yes, archiving will happen but MongoDB is in control of when archiving will happen based on when checkpoint is called the first time.
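The split between what is fixed at open time and what is reconfigurable on a live connection, as described in this thread, can be sketched with the log configuration group (illustrative call shapes; option names follow WiredTiger's log=(...) settings):

```
# file_max is only accepted at open time, because the log slot
# buffers are sized from it when the connection is created
wiredtiger_open(home, NULL, "log=(enabled,file_max=100MB,archive=true)", &conn)

# archiving and pre-allocation can be toggled on a running connection
conn->reconfigure(conn, "log=(archive=false,prealloc=false)")

# ...but passing file_max to reconfigure is not supported
```

This is why the discussion above turns to either decoupling the slot-buffer sizing from file_max or leaning on when MongoDB issues its first checkpoint, rather than resizing log files on a live system.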

Comment by Daniel Gottlieb (Inactive) [ 01/Nov/18 ]

I expect this ticket has a WT component that may be a bit challenging. I believe WT has to use a new journal file for every restart (and each file is currently always 100MB for MongoDB). When automation keeps restarting a node that's failing at startup, that would add up quickly for files that are mostly zeroes. WT does allow reconfiguring the connection to turn archiving and preallocating journal files on/off, but file_max can only be passed in on wiredtiger_open. The path I see to implementing this is making the file size dynamically adjustable.

sue.loverso would that be a feasible change? Alternatively, do you have any other ideas to help MongoDB control how the WT log files grow/get archived?

Generated at Thu Feb 08 04:47:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.