[SERVER-37860] Only archive WT journal files after a successful MongoDB startup. Created: 01/Nov/18 Updated: 29/Oct/23 Resolved: 24/Jun/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | 4.7.0 |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Daniel Gottlieb (Inactive) | Assignee: | Gregory Noma |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Execution Team 2020-06-15, Execution Team 2020-06-29 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 15 |
| Description |
|
Once WT starts up, it can start archiving journal files that are no longer needed. In cases where MongoDB startup fails, there may be clues from the previous run still left in the journal files that can provide some insight into how the system got into a bad state. |
| Comments |
| Comment by Githook User [ 24/Jun/20 ] | ||||||
|
Author: Gregory Noma <gregory.noma@gmail.com> (username: gregorynoma) Message: |
| Comment by Susan LoVerso [ 01/Jun/20 ] | ||||||
|
daniel.gottlieb this is a fortuitous time to resuscitate this discussion. There are two debug_mode settings that might be useful to you; one was only added last week.
So if you know that "recent" logs would be helpful, particularly for testing/debugging, you can set this as an archiving lag. Similarly, you need not worry about how many log files are needed and can simply keep the logs for the last N checkpoints with debug_mode=(checkpoint_retention=#). |
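As a sketch of the settings Sue describes (option names taken from the WiredTiger debug_mode configuration; log_retention as the name of the second setting is an assumption here and should be checked against the WT docs for the version in use), these would be passed in the wiredtiger_open configuration string:

```
# Hypothetical wiredtiger_open configuration fragments (sketch, not verified
# against a specific WiredTiger version):
debug_mode=(log_retention=5)          # archiving lag: retain at least the last 5 log files
debug_mode=(checkpoint_retention=5)   # retain the logs covering the last 5 checkpoints
```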
| Comment by Daniel Gottlieb (Inactive) [ 01/Jun/20 ] | ||||||
|
We now have BFs where this would help. Particularly now that we capture operations as no-ops in the WT journal, the journal has regained its utility; out-of-order timestamps (ghost timestamps) are also no longer supported by the implementation. sue.loverso mentioned a path to making this happen:
I'm certain that's not part of the public contract WT provides, but if it's still true, that should satisfy the BF problem. Ideally we'd prevent checkpointing/truncation until the server is listening on a port. I'm curious whether that would cause problems. Replication recovery does do a bunch of writes, but the oldest/stable timestamp is not advanced at startup. I know that for WT, the process of doing a checkpoint is "good hygiene", but I don't know if that also requires oldest/stable moving. |
| Comment by Daniel Gottlieb (Inactive) [ 01/Nov/18 ] | ||||||
Our use case would be for when the application is relatively idle (i.e., before users can connect, but after some internal background threads have been kicked off). Additionally, I don't think it'd be a problem to incur a reallocation cost at this stage in the process lifetime.
MongoDB does typically checkpoint right at startup. That's a clever way to avoid the need to reconfigure the archiving settings! We'd still have to be careful about accumulating too many large, empty WT journal files. |
| Comment by Susan LoVerso [ 01/Nov/18 ] | ||||||
|
Based on comments in the code (thank you, 3-years-ago me), the reason file_max is not reconfigurable is that the memory in the log slot buffers may be sized based on the log file size, and we don't want to reallocate that memory on a running system. Reallocating it would be a fairly large undertaking: the logging subsystem can be in use during a reconfig, so reallocating the buffers would be disruptive. Specifically, if a small log file size is given, the buffers are smaller (whereas they're a constant/default with the 100MB size). That is the current state of the code.
We could entertain a configuration setting where the slot buffers don't use the file size in the calculation and instead force the default size, or allow reconfiguration to change the file size but not the buffers. We could also consider a force option that does reallocate the slot buffers, blocking out the logging subsystem during that time.
Also, based on your ticket description, WT already does what you describe. It does not archive after the checkpoint that is part of recovery; it starts archiving after the first post-wiredtiger_open checkpoint completes. If MongoDB checkpoints rapidly during its startup then yes, archiving will happen, but MongoDB is in control of when archiving begins based on when checkpoint is first called. |
| Comment by Daniel Gottlieb (Inactive) [ 01/Nov/18 ] | ||||||
|
I expect this ticket has a WT component that may be a bit challenging. I believe WT has to use a new journal file for every restart (and each file is currently always 100MB for MongoDB). When automation keeps restarting a node that's failing at startup, that adds up quickly in files that are mostly zeroes. WT does allow reconfiguring the connection to turn archiving and preallocation of journal files on/off, but file_max can only be passed in on wiredtiger_open. The path I see to implementing this is to make the file size dynamically adjustable. sue.loverso, would that be a feasible change? Alternatively, do you have any other ideas to help MongoDB control how the WT log files grow/get archived? |
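As a rough illustration of the open-time vs. runtime split Daniel describes (configuration strings are a sketch based on the log settings named in this thread; exact syntax and which options are reconfigurable should be checked against the WiredTiger documentation for the version in use):

```
# Settable only at startup, in the wiredtiger_open configuration string:
log=(enabled=true,file_max=100MB,archive=true,prealloc=true)

# Settable at runtime, via WT_CONNECTION::reconfigure:
log=(archive=false,prealloc=false)    # file_max cannot be changed here
```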