[SERVER-42343] WiredTigerLAS.wt grows when lagged node is in maintenance mode Created: 23/Jul/19 Updated: 29/Oct/23 Resolved: 14/Aug/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.4.20 |
| Fix Version/s: | 3.4.23 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Bartley | Assignee: | Benety Goh |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible | ||
| Operating System: | ALL | ||
| Sprint: | Execution Team 2019-07-29, Execution Team 2019-08-12, Execution Team 2019-08-26 | ||
| Participants: | | ||
| Description |
|
As part of rolling index builds and similar operations, we'll often take nodes offline for O(hours). When a node comes back online, we put it into maintenance mode (using replSetMaintenance) until it catches back up. We've observed that such nodes often end up with a huge WiredTigerLAS.wt file. After some investigation, it appears this happens only when we put the node into maintenance mode; if we simply leave a lagged node in secondary mode, WiredTigerLAS.wt does not grow. We suspect this is related to the cache pressure created by majority read concern, since we only started seeing these issues after enabling it. |
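For reference, the maintenance mode described above is toggled with the replSetMaintenance admin command, run against the secondary (a sketch of the workflow; exact shell session details will vary):

```javascript
// Put the secondary into maintenance mode (it transitions to RECOVERING)
db.adminCommand({ replSetMaintenance: true })

// ... perform the offline work, let the node catch up ...

// Take the node back out of maintenance mode
db.adminCommand({ replSetMaintenance: false })
```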
| Comments |
| Comment by Danny Hatcher (Inactive) [ 22/Aug/19 ] |
|
I've confirmed that my repro does not grow the WiredTigerLAS.wt file while running with this commit. |
| Comment by David Bartley [ 14/Aug/19 ] |
|
Thanks for fixing this so quickly! |
| Comment by Githook User [ 14/Aug/19 ] |
|
Author: Benety Goh <benety@mongodb.com> (username: benety)
Message: This avoids accumulating unnecessary historical information in the storage engine while |
| Comment by Eric Milkie [ 24/Jul/19 ] |
|
My guess is that the bug is due to this code in the snapshot thread:
It's probably fine to apply that logic while the node is in the RECOVERING state (the maintenance mode command sets this state), but we also need to delete all active snapshots when this occurs, to unpin the pages involved in those snapshots. |
| Comment by David Bartley [ 23/Jul/19 ] |
|
Yup, we've been SIGKILLing the node (SIGTERM works too, but sometimes takes many minutes). |
| Comment by Danny Hatcher (Inactive) [ 23/Jul/19 ] |
|
Hello bartle, I've confirmed what you suspected in my own environment. On 3.4.20 with Read Concern "Majority" enabled, a node placed into RECOVERING via replSetMaintenance
will see a rise in the WiredTigerLAS.wt file size (given that writes are happening on the Primary). I have also confirmed that this does not happen on 3.6.0. I imagine this was one of the many items fixed by the changes we made to Read Concern "Majority" in 3.6. If you restart the node, the file size should reset. I understand that this isn't optimal, and I will see whether a backport is possible. |