[SERVER-74085] Ensure queries that spill to TemporaryRecordStores checkpoint their data Created: 16/Feb/23 Updated: 19/May/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Louis Williams | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Assigned Teams: | Storage Execution |
| Participants: | |
| Linked BF Score: | 120 |
| Description |
|
We should ensure that queries that spill to the storage engine help pay the cost of the spilling and that their data actually reaches disk. My proposal is that queries that spill to disk periodically checkpoint the data for their temporary table, which the WiredTiger checkpoint() API supports.

More detail: we hit a bug in WT (WT-10576) when we try to force drop a TemporaryRecordStore (TRS) that has uncommitted data. This can happen in query stages (e.g. hash agg) that spill using the storage engine inside multi-document transactions, where the lifetime of the storage transaction exceeds the lifetime of the table. We tried to fix this by not using "force" to drop these temporary tables, but because these temporary record stores are not yet included in any checkpoints, the drop can fail for up to a minute, until the next checkpoint completes and persists the data to disk.

This raised a different question: if these TRSs are not actually spilled to disk, what is the value of using them? We are essentially polluting the storage engine cache and creating more work for the next checkpoint, which could hurt performance across the system. If we make these queries pay the cost of their own spilling, we would probably see fewer performance issues for the rest of the system, and the tables would also get dropped faster. |
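A minimal sketch of the proposal, as a Python toy model rather than the actual server/WiredTiger C++ code. The class and method names (`TemporaryRecordStore`, `SpillingStage`, `checkpoint`) are hypothetical, the checkpoint is simulated as an in-memory flush, and the interval is an arbitrary placeholder; the point is only to illustrate "checkpoint periodically during spilling so the query pays its own I/O cost":

```python
# Toy model: a query stage that spills to a TemporaryRecordStore (TRS) and
# periodically checkpoints it, instead of leaving all spilled data in the
# storage engine cache for the next global checkpoint to flush.

CHECKPOINT_INTERVAL = 1000  # spilled records between checkpoints (hypothetical tunable)

class TemporaryRecordStore:
    """Stand-in for a WT-backed temporary table."""
    def __init__(self):
        self.in_cache = []  # records still buffered in the storage engine cache
        self.on_disk = []   # records made durable by a checkpoint

    def insert(self, record):
        self.in_cache.append(record)

    def checkpoint(self):
        # Models a targeted checkpoint of just this table: flush cached
        # records so a later drop need not wait for the next global checkpoint.
        self.on_disk.extend(self.in_cache)
        self.in_cache.clear()

class SpillingStage:
    """Stand-in for a spilling query stage (e.g. hash agg)."""
    def __init__(self, trs):
        self.trs = trs
        self.spilled_since_checkpoint = 0

    def spill(self, record):
        self.trs.insert(record)
        self.spilled_since_checkpoint += 1
        # Checkpoint periodically, not on every key, to amortize the cost.
        if self.spilled_since_checkpoint >= CHECKPOINT_INTERVAL:
            self.trs.checkpoint()
            self.spilled_since_checkpoint = 0

trs = TemporaryRecordStore()
stage = SpillingStage(trs)
for i in range(2500):
    stage.spill(i)
# Two checkpoints have run: 2000 records durable, 500 still in cache.
print(len(trs.on_disk), len(trs.in_cache))  # -> 2000 500
```

Note that, as the comments below point out, this only works once the enclosing storage transaction has committed or rolled back; writes inside an open multi-document transaction cannot be checkpointed at all.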
| Comments |
| Comment by Louis Williams [ 17/Feb/23 ] |
|
Thanks sulabh.mahajan@mongodb.com.
Correct. Before we allowed queries to spill to the storage engine, they just wrote directly to a file on disk, so this would be more in line with that behavior.
Also correct. This assumes the transaction has already rolled back. Although now I see a problem with my proposal as it relates to multi-document transactions: these transactions will "spill" to these temporary tables, but because the writes are part of a larger transaction, nothing can be checkpointed at all. So these transactions are just filling up the WT cache and guaranteeing that nothing gets written to disk. I filed SERVER-74133. |
| Comment by Sulabh Mahajan [ 17/Feb/23 ] |
We have been moving away from having checkpoints on individual tables outside the global one. The only case that recently came up was a checkpoint after an initial bulk load for indexes. Even that seems to have some issues: these individual checkpoints, even though they are on a non-timestamped table, have a global stable timestamp associated with them. I will have to look further to say whether it would be safe to checkpoint an individual table at this point. I am held up with other work at the moment.
If I read that right, the intention here is to throttle the use of temporary tables by forcing them to be written to disk.
I am assuming that by this time the transaction has been committed or rolled back; otherwise the table would still return EBUSY on dropping. |
| Comment by Jordi Olivares Provencio [ 16/Feb/23 ] |
|
One potentially optimal point for performing the targeted checkpoint is inside the spill method of the users of TRS. HashLookup is a bit more complex to solve, as it currently does 1-by-1 spilling, if I understand the code correctly. |
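The 1-by-1 spilling concern can be illustrated with a small toy comparison (purely hypothetical numbers and names, not the actual HashLookup code): a spiller that checkpoints on every spilled key would issue one checkpoint per record, whereas amortizing over an interval keeps the count small, which is why a per-spill-call checkpoint hook needs an internal counter for stages like HashLookup:

```python
# Toy comparison: number of (simulated) targeted checkpoints issued by a
# 1-by-1 spiller, checkpointing on every key vs. on an amortized interval.
# Checkpoint cost is modeled as a simple counter.

def run_spiller(n_records, checkpoint_every):
    checkpoints = 0
    pending = 0
    for _ in range(n_records):
        pending += 1                     # spill one record to the TRS
        if pending >= checkpoint_every:  # targeted checkpoint of the table
            checkpoints += 1
            pending = 0
    return checkpoints

naive = run_spiller(10_000, 1)        # checkpoint after every key
amortized = run_spiller(10_000, 500)  # checkpoint every 500 keys
print(naive, amortized)  # -> 10000 20
```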
| Comment by Louis Williams [ 16/Feb/23 ] |
|
FYI david.storch@mongodb.com, this would impact the performance of query stages that spill to the storage engine. My general concern is that spilling to these temporary tables doesn't really write anything to disk today, and it creates more work for the async checkpoint thread later on. I can't say how much slower this would be, but ideally we would at least checkpoint periodically (i.e. not on every key) to make sure that we're actually spilling something to disk. |
| Comment by Louis Williams [ 16/Feb/23 ] |
|
alexander.gorrod@mongodb.com and sulabh.mahajan@mongodb.com, does this proposal seem reasonable? Can you think of any system-wide implications of having individual operations checkpoint a table? These tables are owned by a single thread, untimestamped, and dropped at the end of the operation anyway. |