[SERVER-74085] Ensure queries that spill to TemporaryRecordStores checkpoint their data Created: 16/Feb/23  Updated: 19/May/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Louis Williams Assignee: Backlog - Storage Execution Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-73928 Defer lifetime drop of DeferredDropRe... Closed
related to SERVER-75714 Investigate refactor of TemporaryReco... Backlog
related to SERVER-74033 Remove ident force drop in favour of ... Closed
related to WT-10576 Return EBUSY on forced drop if there ... In Code Review
is related to SERVER-74133 Spilling to TemporaryRecordStores in ... Backlog
is related to SERVER-75367 Truncate TemporaryRecordStores for qu... Closed
Assigned Teams:
Storage Execution
Participants:
Linked BF Score: 120

 Description   

We should ensure that queries that spill to the storage engine help pay the cost of the spilling and that their data actually reaches disk. My proposal is that queries that spill to disk periodically checkpoint the data for their temporary table, which is supported by the targeted form of the WiredTiger checkpoint() API.
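As a rough illustration, here is a minimal sketch of such a targeted checkpoint, assuming a WT_SESSION* owned by the operation and a hypothetical table URI (WT_SESSION::checkpoint accepts a target list that restricts the checkpoint to the named objects):

{code:cpp}
// Minimal sketch: checkpoint a single temporary table by URI.
// The session pointer and URI (e.g. "table:temp-rs-123") are assumptions.
#include <string>
#include <wiredtiger.h>

int checkpointTempTable(WT_SESSION* session, const std::string& uri) {
    // Builds e.g. target=("table:temp-rs-123"), limiting the checkpoint
    // to this one table rather than the whole database.
    const std::string config = "target=(\"" + uri + "\")";
    return session->checkpoint(session, config.c_str());
}
{code}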

More detail:

We hit a bug in WT (WT-10576) when we try to force drop a TemporaryRecordStore (TRS) that has uncommitted data. This can happen in query stages (e.g. hash agg) that spill using the storage engine inside multi-document transactions, where the lifetime of the storage transaction exceeds the lifetime of the table.

We tried to fix this by not using "force" to drop these temporary tables, but because these temporary record stores are not yet included in any checkpoint, the drop fails (returning EBUSY) for up to a minute, until the next checkpoint completes and persists the data to disk.
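A hedged sketch of that non-forced drop path, assuming the caller retries until a checkpoint has persisted the table's data (the helper name is illustrative):

{code:cpp}
// Sketch: a non-forced drop fails with EBUSY while the table still has
// data that is not part of any checkpoint; the caller must retry later.
#include <cerrno>
#include <wiredtiger.h>

bool tryDropTempTable(WT_SESSION* session, const char* uri) {
    // "force=false" is the default; the drop fails rather than discarding
    // state that has not been checkpointed.
    int ret = session->drop(session, uri, "force=false");
    if (ret == EBUSY) {
        // Dirty data not yet in a checkpoint; with the default 60-second
        // checkpoint interval this can persist for up to a minute.
        return false;
    }
    return ret == 0;
}
{code}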

This raised a different question: if the data in these TRSs is not actually being spilled to disk, what is the value of using them? We are essentially polluting the storage engine cache and creating more work for the next checkpoint, which could have performance impacts on the system. If we make these queries pay the cost of spilling, we would probably have fewer performance issues for the rest of the system, and we would also ensure that the tables get dropped faster.



 Comments   
Comment by Louis Williams [ 17/Feb/23 ]

Thanks sulabh.mahajan@mongodb.com.

The intention here is to throttle the use of temporary tables by forcing them to be written to the disk.

Correct. Before we allowed queries to spill to the storage engine, they just wrote directly to a file on disk, so this would be more in line with that behavior.

I am assuming that by this time the transaction has been committed/rolled back; otherwise the drop would still return EBUSY.

Also correct. This is assuming that the transaction has rolled back already. Although now I see a problem with my proposal as it relates to multi-document transactions: these multi-document transactions will "spill" to these temporary tables, but because the writes are part of a larger transaction, nothing can be checkpointed at all. So these transactions are just filling up the WT cache and guaranteeing that nothing gets written to disk. I filed SERVER-74133.

Comment by Sulabh Mahajan [ 17/Feb/23 ]

Alexander Gorrod and Sulabh Mahajan, does this proposal seem reasonable? Can you think of any system-wide implications of having individual operations checkpoint a table? These tables are owned by a single thread, untimestamped, and get dropped at the end of the operation anyways.

We have been moving away from having checkpoints on individual tables outside the global one. The only case that recently came up was a checkpoint after an initial bulk load for indexes. Even that seems to have some issues: these individual checkpoints, even though they are on a non-timestamped table, have a global stable timestamp associated with them. I will have to look further to say whether it would be safe to checkpoint an individual table at this point; I am held up with other work at the moment.

We should ensure that queries that spill to the storage engine help pay the cost of the spilling and ensure their data is actually on disk.

If I read that right, the intention here is to throttle the use of temporary tables by forcing them to be written to the disk.

We tried to fix this by not using "force" to drop these temporary tables, but because these temporary record stores are not included in any checkpoints yet, we will fail to drop the table for up to 1 minute until the next checkpoint completes, persisting the data to disk.

I am assuming that by this time the transaction has been committed/rolled back; otherwise the drop would still return EBUSY.

Comment by Jordi Olivares Provencio [ 16/Feb/23 ]

One potentially optimal point for performing the targeted checkpoint is inside the spill method of the users of TRS.

HashLookup is a bit more complex to solve, as it currently does 1-by-1 spilling, if I understand the code correctly.

Comment by Louis Williams [ 16/Feb/23 ]

FYI david.storch@mongodb.com, this would impact the performance of query stages that spill to the storage engine. My general concern is that spilling to these temporary tables isn't really writing anything to disk today, and it's creating more work for the async checkpoint thread later on. I can't say how much slower this would be, but ideally we would at least checkpoint periodically (i.e. not after every key) to make sure that we're actually spilling something to disk.
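A hypothetical sketch of what such periodic checkpointing could look like inside a spill path; the class, method, and constant names are illustrative, not real server code:

{code:cpp}
// Hypothetical sketch: checkpoint the temporary table every N spilled
// batches instead of after every key, so the operation pays the I/O cost
// of its own spilling while amortizing the checkpoint overhead.
#include <string>
#include <utility>
#include <wiredtiger.h>

class SpillTable {
public:
    SpillTable(WT_SESSION* session, std::string uri)
        : _session(session), _uri(std::move(uri)) {}

    void spillBatch(/* records */) {
        // ... write the batch to the temporary table ...
        if (++_spillsSinceCheckpoint >= kSpillsPerCheckpoint) {
            // Targeted checkpoint of just this table.
            const std::string config = "target=(\"" + _uri + "\")";
            _session->checkpoint(_session, config.c_str());
            _spillsSinceCheckpoint = 0;
        }
    }

private:
    static constexpr int kSpillsPerCheckpoint = 64;  // illustrative tuning knob
    WT_SESSION* _session;
    std::string _uri;
    int _spillsSinceCheckpoint = 0;
};
{code}

Batching the checkpoints bounds how much unpersisted spill data sits in the WT cache without paying checkpoint overhead on every record.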

Comment by Louis Williams [ 16/Feb/23 ]

alexander.gorrod@mongodb.com and sulabh.mahajan@mongodb.com, does this proposal seem reasonable? Can you think of any system-wide implications of having individual operations checkpoint a table? These tables are owned by a single thread, untimestamped, and get dropped at the end of the operation anyways.
