[SERVER-58026] Omitted FTDC sections cause frequent schema changes that limit FTDC retention Created: 23/Jun/21 Updated: 07/Nov/23 |
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.3, 5.0.0-rc0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Backlog - Security Team |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Assigned Teams: | Server Security |
| Operating System: | ALL |
| Backport Requested: | v5.0, v4.4 |
| Sprint: | Execution Team 2021-10-04 |
| Participants: | |
| Description |
|
Omitted FTDC sections can cause frequent schema changes that reduce FTDC compression efficiency and limit retention. For example, in one deployment FTDC retention was reduced to less than 2 days, compared to a typical retention of closer to a week. The missing data can also cause us to miss important events in FTDC. It looks to me like the primary issue might be that we're using an extremely short timeout for acquiring the locks needed to collect these sections, so it might be sufficient to increase the timeout to a substantial fraction of a second, although that needs verification. |
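For context on why this hurts retention so much: FTDC delta-encodes each sample against a reference document, and that only works while the set of metric field names is unchanged. Below is a minimal sketch, with hypothetical types (not the actual mongod implementation), of that behavior; every omitted-then-restored section forces a new, poorly-compressing reference document.

```cpp
// Sketch of FTDC-style delta encoding (hypothetical types, not mongod code).
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct Sample {
    std::vector<std::string> fieldNames;  // the sample's "schema"
    std::vector<std::int64_t> values;     // one metric per field
};

class FtdcBlockWriter {
public:
    void append(const Sample& s) {
        if (!haveReference_ || s.fieldNames != reference_.fieldNames) {
            // Schema change: discard deltas and store a full reference sample.
            // Frequent schema changes mean mostly reference documents, little
            // delta data, and therefore much shorter retention on disk.
            reference_ = s;
            haveReference_ = true;
            deltas_.clear();
            return;
        }
        // Same schema: store only the (highly compressible) deltas.
        std::vector<std::int64_t> delta(s.values.size());
        for (std::size_t i = 0; i < s.values.size(); ++i)
            delta[i] = s.values[i] - reference_.values[i];
        deltas_.push_back(std::move(delta));
    }

private:
    bool haveReference_ = false;
    Sample reference_;
    std::vector<std::vector<std::int64_t>> deltas_;
};
```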
| Comments |
| Comment by Bruce Lucas (Inactive) [ 30/Nov/22 ] |
|
Thanks for the detail. Yes, a few or a few dozen schema changes at some major transition in system state (like startup) are not a problem; the problem comes from repeated, sustained schema changes throughout operation. By the way, there are generally already a few schema changes during startup as various subsystems come online and add their data to serverStatus. |
| Comment by Daniel Gottlieb (Inactive) [ 30/Nov/22 ] |
Yes.
I'm rather convinced it's impossible to do SERVER-70031 (and satisfy the desired outcome of gathering FTDC at startup/shutdown) while still having FTDC grab locks (the POC that I've misplaced demonstrated its viability). As a more formal argument: startup/shutdown/rollback will continue to take a global lock (such that the system cannot attempt to access WT for reads/writes). If SERVER-70031 succeeds in getting FTDC data during startup/shutdown/rollback, it must not take the global lock. That said, SERVER-70031 could be broken down/redefined into something like "use WT's new API" and stop short of funneling the data into FTDC. Thus I've marked this as "depends on" and not "duplicates".

Given that this ticket is about FTDC schema changes, one thing SERVER-70031 will likely introduce is the possibility of schema changes during startup. Because the storage engine starts up before other systems that have FTDC hooks (e.g. replication), we'll want to output storage stats before we've even constructed those objects that come into existence later in the startup procedure. The number of schema redefines introduced should be constant (at most one redefine per module that registers an FTDC handler, which is bounded for any given compile/set of startup options), and I don't expect any additional schema redefines after startup due to this change. I assumed that limited number of new potential schema redefines would be acceptable. |
| Comment by Bruce Lucas (Inactive) [ 30/Nov/22 ] |
|
daniel.gottlieb@mongodb.com, if I understand correctly, that is because SERVER-70031 will allow FTDC to no longer take the global lock, since the lock will no longer be needed for coordination with WT startup and shutdown, is that correct? If so, will that change (no longer taking the global lock for FTDC) happen as part of SERVER-70031, or is that an additional change that would need to happen via this ticket? |
| Comment by Daniel Gottlieb (Inactive) [ 30/Nov/22 ] |
|
I'm leaving this open for visibility, but I think with (the linked) SERVER-70031, the goal of this ticket will be satisfied. Or at least the goal that's applicable to the storage side of FTDC. |
| Comment by Louis Williams [ 29/Sep/21 ] |
|
bruce.lucas, yes that is my proposal. I think there is a longer-term solution here that would need to be investigated in-depth. I think that is out of the scope of this ticket since our priority is to increase the retention period for FTDC by avoiding schema changes. |
| Comment by Bruce Lucas (Inactive) [ 29/Sep/21 ] |
|
louis.williams, to make sure I understand, the options are:
|
| Comment by Louis Williams [ 28/Sep/21 ] |
|
It turns out that introducing a new synchronization mechanism between FTDC and shutdown is rather straightforward. This simple solution allows us to avoid schema changes except for the shutdown case. What's hard is allowing these problematic FTDC sections to avoid taking the Global lock: letting certain operations skip the global lock requires evaluating all of the places we take the Global X lock and determining whether each is safe or also needs to be synchronized. There's a larger discussion to be had here about the purpose of the global lock, and that work seems much riskier. bruce.lucas, would you be willing to accept the more straightforward change that blocks FTDC sections on the Global X lock instead of omitting these sections? |
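For illustration, a rough sketch (invented names, not the actual patch) of the kind of dedicated FTDC/shutdown handshake being described: a small mutex-protected gate that lets shutdown exclude FTDC collection without FTDC ever touching the Global lock.

```cpp
// Hypothetical FTDC/shutdown synchronization, independent of the Global lock.
#include <condition_variable>
#include <mutex>

class FtdcShutdownGate {
public:
    // Called by FTDC before collecting storage-engine sections. Returns false
    // once shutdown has begun, so the collector can skip the section instead
    // of racing storage-engine teardown.
    bool tryBeginCollection() {
        std::lock_guard<std::mutex> lk(mutex_);
        if (shuttingDown_)
            return false;
        ++activeCollectors_;
        return true;
    }

    // Called by FTDC after the section has been collected.
    void endCollection() {
        std::lock_guard<std::mutex> lk(mutex_);
        --activeCollectors_;
        cv_.notify_all();
    }

    // Called once by shutdown; waits for in-flight collections to drain
    // before the storage engine is torn down.
    void beginShutdown() {
        std::unique_lock<std::mutex> lk(mutex_);
        shuttingDown_ = true;
        cv_.wait(lk, [&] { return activeCollectors_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool shuttingDown_ = false;
    int activeCollectors_ = 0;
};
```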
| Comment by Louis Williams [ 27/Sep/21 ] |
|
I withdrew my code review because it effectively returns incorrect results for FTDC, which is not desirable behavior. I'm going to explore the alternative of introducing a synchronization mechanism just between FTDC and shutdown. |
| Comment by Bruce Lucas (Inactive) [ 23/Sep/21 ] |
|
Not sure, but SERVER-60168 might also be addressed by such a mechanism. |
| Comment by Bruce Lucas (Inactive) [ 23/Sep/21 ] |
I'm not following. There's no reason that FTDC (including sections that access the storage engine) can't be collected during applyOps, correct? If so, doesn't that mean that the problem can be solved by introducing an additional coordination mechanism that prohibits collecting certain FTDC sections only when that's not permissible, in particular during shutdown? |
| Comment by Louis Williams [ 09/Sep/21 ] |
|
bruce.lucas the problem is that the global lock is what is used to coordinate shutdown. Anything that FTDC needs to avoid conflicting with shutdown will also be needed by applyOps, so we don't have many options here. Would it be acceptable to revert the change that allows FTDC to run after storage engine shutdown? |
| Comment by Connie Chen [ 09/Sep/21 ] |
|
We should also consider inserting dummy fields to avoid schema changes. |
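A hedged sketch of that idea (all names hypothetical): when a section can't be collected in time, emit the previous sample's field names with placeholder values so the FTDC schema stays stable; an always-present marker field would let consumers distinguish real zeros from padding.

```cpp
// Hypothetical dummy-field padding to keep the FTDC schema stable.
#include <map>
#include <optional>
#include <string>

using Section = std::map<std::string, long long>;

Section collectSection(const std::optional<Section>& fresh,
                       const Section& lastGoodSection) {
    Section out;
    if (fresh) {
        out = *fresh;
        out["ftdcSectionOmitted"] = 0;  // hypothetical always-present marker
    } else {
        // Collection failed: reuse last sample's field names with zeros so
        // the field-name set (and thus the FTDC schema) does not change.
        for (const auto& [name, value] : lastGoodSection) {
            (void)value;
            out[name] = 0;
        }
        out["ftdcSectionOmitted"] = 1;
    }
    return out;
}
```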
| Comment by Bruce Lucas (Inactive) [ 09/Sep/21 ] |
|
louis.williams, that may help some, but if my comment ("It appears this is taking several seconds to acquire") is accurate, I'm not sure a timeout of 100ms to 1s will be that helpful. I wonder if a fix whereby FTDC uses something other than the global lock to coordinate with shutdown is possible. |
| Comment by Louis Williams [ 08/Sep/21 ] |
|
After talking with kelsey.schubert, the Execution team will try to do something to avoid frequent FTDC schema changes. For FTDC sections that acquire the Global lock, we will impose a higher timeout, maybe something on the order of 100ms to 1s. The goal is to give FTDC a better chance of collecting statistics and avoiding a schema change, even if that introduces a temporary stall. We need a timeout; otherwise, FTDC would block shutdown indefinitely. |
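A minimal illustration of that trade-off, using std::timed_mutex as a stand-in for the server's Global lock machinery (the real change would go through the server's locker with a lock-acquisition deadline, not this type): FTDC waits up to a bounded interval, and on timeout omits the section rather than blocking shutdown indefinitely.

```cpp
// Bounded-wait lock acquisition for an FTDC section (stand-in types).
#include <chrono>
#include <mutex>
#include <optional>

std::timed_mutex globalLockStandIn;  // stand-in for the Global lock

struct StorageStats { long long cacheBytes = 0; };

std::optional<StorageStats> collectStorageSection(
        std::chrono::milliseconds deadline) {
    std::unique_lock<std::timed_mutex> lk(globalLockStandIn, std::defer_lock);
    if (!lk.try_lock_for(deadline)) {
        // Timed out after the proposed 100ms-1s budget: omit the section.
        // This is the schema-change path this ticket wants to make rare.
        return std::nullopt;
    }
    return StorageStats{42};  // placeholder for real collection under the lock
}
```

A caller would pick a deadline in the proposed range, e.g. `collectStorageSection(std::chrono::milliseconds(500))`.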
| Comment by Kaloian Manassiev [ 07/Jul/21 ] |
|
kelsey.schubert, bruce.lucas: I noticed that this P2 ticket got assigned to the Sharding backlog, but it is not clear to me what the expectation is for fixing it. Taking the Global X lock as part of these commit operations is not something new in 4.4 or 5.0; it has been that way from the beginning. Removing applyOps is non-trivial work and is tracked separately. Barring removing applyOps, what other options are there to fix the immediate FTDC problem? |
| Comment by Kaloian Manassiev [ 01/Jul/21 ] |
|
The _configsvrCommitChunkSplit command eventually calls applyOps, which unfortunately takes the global X lock. All three of CommitChunkSplit/Merge/Move do that, and we have a ticket to rewrite them as transactions, but we haven't gotten to that yet. I didn't read the rest of the comments, so please let me know if there is something else I need to answer. |
| Comment by Louis Williams [ 30/Jun/21 ] |
|
I can't tell from the code where _configsvrCommitChunkSplit is actually taking a global X lock. I'm skeptical that it actually needs to because few things do. kaloian.manassiev, do you know why this operation is taking a global X lock? |
| Comment by Bruce Lucas (Inactive) [ 30/Jun/21 ] |
|
louis.williams, do you have an opinion on whether _configsvrCommitChunkSplit should be taking a global X lock? In any case I'll pass the ticket to the sharding team for their comment. |
| Comment by Louis Williams [ 29/Jun/21 ] |
|
bruce.lucas, I think the problem is that applyOps uses too big of a hammer with its global exclusive lock. Nowadays, we only use the global lock to make operations conflict with the storage engine shutting down. And since FTDC collects statistics from the storage engine, it needs to ensure the storage engine does not shut down while it is doing so. I wonder if we should instead focus on why we're using applyOps for what appear to be routine operations on the config server? Or at least stop using a version of applyOps that has to take a global exclusive lock? From what I can tell, the problem is the use of the "precondition" in the command. Without it (and assuming all operations are CRUD ops), the global lock does not need to be taken. What do you think? |
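For reference, a sketch of the command shape under discussion, parsed from extended JSON with bsoncxx::from_json (this assumes the mongocxx/bsoncxx driver is available; the field values are placeholders, not real chunk metadata). Per this thread, it is the preCondition clause that forces applyOps onto the exclusive-lock path; a CRUD-only applyOps without it would not need the Global X lock.

```cpp
// applyOps with a preCondition clause (placeholder values throughout).
#include <bsoncxx/json.hpp>

bsoncxx::document::value applyOpsWithPrecondition() {
    return bsoncxx::from_json(R"({
        "applyOps": [
            { "op": "u",
              "ns": "config.chunks",
              "o2": { "_id": "chunk-id-placeholder" },
              "o":  { "$set": { "lastmod":
                        { "$timestamp": { "t": 1, "i": 1 } } } } }
        ],
        "preCondition": [
            { "ns":  "config.chunks",
              "q":   { "_id": "chunk-id-placeholder" },
              "res": { "lastmod":
                        { "$timestamp": { "t": 0, "i": 1 } } } }
        ]
    })");
}
```

Dropping the "preCondition" array (and keeping only CRUD ops in "applyOps") is the variant that, per this discussion, would avoid the global exclusive lock.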
| Comment by Bruce Lucas (Inactive) [ 29/Jun/21 ] |
|
louis.williams, I see your point. On closer inspection, the case we saw was on a config server, and the culprits were applyOps and _configsvrCommitChunkSplit, both of which (according to the logs) take a global X lock. It appears this lock is taking several seconds to acquire, blocking FTDC for the duration. The net result is missing WT and some other sections for several seconds, and schema changes every few seconds, which decreases compression and limits retention. It does appear that increasing the lock acquisition timeout isn't the answer, but I'm not sure what is. Logically I think there should be no reason we can't collect wiredTiger metrics during applyOps and _configsvrCommitChunkSplit. Is the problem here that we're using too big a hammer (the global lock) to keep FTDC from accessing the storage engine when that's not safe? |
| Comment by Louis Williams [ 28/Jun/21 ] |
|
I'm a little confused about how the linked tickets are causing frequent schema changes.
bruce.lucas, do we have any idea what is conflicting with FTDC collection and causing it to fail to collect so often? And what do you think is an acceptable timeout so that FTDC does not block but also has a better chance of making progress in the event of a long-running blocking operation? 100ms? |
| Comment by Bruce Lucas (Inactive) [ 23/Jun/21 ] |
|
I've marked this as affecting 4.4.4 and requested backports, although I think that only applies to a subset of the issues reported above. |