[SERVER-53188] Consider implementing a cursor 'reserve' feature Created: 02/Dec/20 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Janna Golden | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | pavi-interest |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Storage Execution |
| Participants: |
| Description |
|
This came up as part of the resharding project, because resharding oplog application is subject to write skew. An initial thought was that if the server had a reserve feature that exposed WT's reserve method, we could use it to generate write conflicts and avoid write skew. For now, we have instead implemented functionality that first reads a doc and then does an unreplicated no-op update on that doc in order to get the desired behavior. After speaking with louis.williams, we think that it's still worthwhile to consider implementing a 'reserve' feature to be used instead because: 1. This code path will be executed for every oplog entry a given resharding recipient applies, and a 'reserve' method would have less of an impact on the WT cache. |
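To make the write-skew concern concrete, here is a minimal sketch in Python of a toy snapshot-isolation store. Everything in it (`ToyStore`, `ToyTxn`, `noop_update`, the keys `x` and `y`) is a hypothetical stand-in and not the server's or WiredTiger's API. As a simplification, the toy checks for conflicts at commit, whereas WiredTiger reports them when the update is attempted. Two transactions each read the other's doc and then write their own. Without the no-op update both commit and the invariant breaks; with it, the read doc joins the write set and the second transaction conflicts.

```python
class WriteConflict(Exception):
    """Raised when a written key has a committed version newer than the snapshot."""


class ToyStore:
    """Toy snapshot-isolation store; a stand-in, not WiredTiger's API."""

    def __init__(self, data):
        self.clock = 0
        # key -> (value, logical commit time of the newest version)
        self.versions = {k: (v, 0) for k, v in data.items()}

    def begin(self):
        return ToyTxn(self)


class ToyTxn:
    def __init__(self, store):
        self.store = store
        self.snapshot = dict(store.versions)  # frozen view taken at begin()
        self.start = store.clock
        self.writes = {}

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        return self.snapshot[key][0]

    def write(self, key, value):
        self.writes[key] = value

    def noop_update(self, key):
        # Rewrite the value just read so the key joins the write set.
        self.write(key, self.read(key))

    def commit(self):
        # Simplification: conflicts are detected here, at commit time.
        for key in self.writes:
            if self.store.versions.get(key, (None, 0))[1] > self.start:
                raise WriteConflict(key)
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.versions[key] = (value, self.store.clock)


def run(protect_reads):
    """Both transactions try to preserve x + y >= 1 while zeroing their own doc."""
    store = ToyStore({"x": 1, "y": 1})
    t1, t2 = store.begin(), store.begin()
    for txn, mine, other in ((t1, "x", "y"), (t2, "y", "x")):
        if txn.read(other) == 1:
            if protect_reads:
                txn.noop_update(other)
            txn.write(mine, 0)
    outcomes = []
    for txn in (t1, t2):
        try:
            txn.commit()
            outcomes.append("committed")
        except WriteConflict:
            outcomes.append("conflict")
    final = {k: v for k, (v, _) in store.versions.items()}
    return outcomes, final
```

Calling `run(False)` lets both transactions commit and leaves both docs at 0, which is the write skew. Calling `run(True)` makes the second transaction fail with a conflict, so it could be retried against fresh data.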
| Comments |
| Comment by Alexander Gorrod [ 29/Jan/21 ] |
|
Thanks for following up and for the detailed explanation. It's not simple to change the WiredTiger reserve operation to remain in place after a WiredTiger transaction is resolved, so it sounds like the no-op approach is the best solution. |
| Comment by Daniel Gottlieb (Inactive) [ 27/Jan/21 ] |
|
To follow up on Alex's questions:
Correct
That's an astute observation, but I don't think it's the kind of timestamps you're concerned with. These operations will replicate normally(-ish)[1] so they'll get legal timestamp values (meaning the update chain will always be in ascending/descending timestamp order, whichever order you prefer to think about it). Timestamps from the sharding perspective make this mechanism "fun". Consider an update that changes a shard key on a document. This turns into a delete on one shard and an insert into another. MDB does not guarantee that TS(delete) < TS(insert), so when resharding is applying these oplog entries from different oplogs, it has to resolve these sorts of conflicts. But the way to think about this from WT's perspective is that a primary does all the work of "cloning" some initial data (replicated) and applying the changes (which may have spanned across multiple shards as the document moves). It's not possible in all cases to tell the exact order in which the global set of oplog entries needs to be applied, so resharding has an algorithm where, effectively, each individual oplog is applied in order, but the operations across oplogs can be applied concurrently. [1] Some discussion brought up that there might be cases in development that result in untimestamped updates. We'll take a closer look and make sure any of those are squashed. |
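The ordering rule described above (in order within one oplog, concurrent across oplogs) can be sketched roughly as follows. This is only an illustration of the scheduling shape: `apply_oplog`, `apply_all`, `demo`, and the oplog names are hypothetical, and the sketch does not attempt the cross-oplog conflict resolution that resharding itself has to perform.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def apply_oplog(entries, apply_entry):
    # Entries from a single oplog are applied strictly in order.
    for entry in entries:
        apply_entry(entry)


def apply_all(oplogs, apply_entry):
    # One worker per oplog; no ordering is enforced across oplogs.
    with ThreadPoolExecutor(max_workers=max(1, len(oplogs))) as pool:
        futures = [
            pool.submit(apply_oplog, entries, apply_entry)
            for entries in oplogs.values()
        ]
        for future in futures:
            future.result()  # re-raise any error from a worker


def demo():
    applied, lock = [], threading.Lock()

    def record(entry):
        # Stand-in for real application: just log the entry.
        with lock:
            applied.append(entry)

    oplogs = {
        "oplog_a": [("oplog_a", i) for i in range(5)],
        "oplog_b": [("oplog_b", i) for i in range(5)],
    }
    apply_all(oplogs, record)
    return oplogs, applied
```

In the `applied` list returned by `demo()`, entries from the two oplogs may interleave arbitrarily, but the entries of each individual oplog always appear in their original order.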
| Comment by Janna Golden [ 27/Jan/21 ] |
|
louis.williams, thanks for following up. In that case, reserve does not have the desired behavior for the resharding use case. daniel.gottlieb and I walked through a number of possible concurrent transactions that could arise during resharding, and determined that it is important that a conflict is still generated even after a transaction commits in order to ensure correctness. |
| Comment by Alexander Gorrod [ 15/Jan/21 ] |
|
I think it's also relevant to know two further things: 1) What does a no-op operation translate to in terms of a WiredTiger operation? Does it translate into an insert operation of exactly the same document as previously existed? |
| Comment by Louis Williams [ 14/Jan/21 ] |
|
After talking with agorrod, I understand that reserve has a distinct behavioral difference from a no-op update:
daniel.gottlieb/janna.golden, given this difference, can you confirm whether reserve has the desired behavior for the resharding use case? That is, would resharding be able to temporarily reserve a record and then roll back its transaction when it no longer needs to generate conflicts, or does a conflict need to persist after a transaction completes? |
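The question above comes down to how long the conflict lasts. A minimal toy model in Python, based only on what this thread states (a reserve does not remain in place once its transaction resolves, while resharding needs the conflict to persist after commit), might look like the sketch below. `ToyEngine`, `ToyRecord`, `ToyTxn`, and `late_writer_conflicts` are hypothetical names, and the model is an assumption-laden illustration, not WiredTiger's implementation.

```python
class WriteConflict(Exception):
    pass


class ToyRecord:
    """One record in a toy model; not WiredTiger's real data structures."""

    def __init__(self, value):
        self.value = value
        self.committed_at = 0  # logical time of the newest committed version
        self.held_by = None    # transaction with an unresolved reserve/update


class ToyEngine:
    def __init__(self):
        self.clock = 0

    def begin(self):
        return ToyTxn(self)


class ToyTxn:
    def __init__(self, engine):
        self.engine = engine
        self.snapshot_time = engine.clock
        self.pending = []  # (record, kind, value)

    def _claim(self, record):
        if record.held_by not in (None, self):
            raise WriteConflict("record held by a concurrent transaction")
        if record.committed_at > self.snapshot_time:
            raise WriteConflict("record changed after this snapshot")
        record.held_by = self

    def reserve(self, record):
        self._claim(record)
        self.pending.append((record, "reserve", None))

    def update(self, record, value):
        self._claim(record)
        self.pending.append((record, "update", value))

    def commit(self):
        self.engine.clock += 1
        for record, kind, value in self.pending:
            record.held_by = None
            if kind == "update":  # a reserve leaves no version behind
                record.value = value
                record.committed_at = self.engine.clock


def late_writer_conflicts(use_reserve):
    engine = ToyEngine()
    doc = ToyRecord({"_id": 1})
    late = engine.begin()   # snapshot taken before `early` commits
    early = engine.begin()
    if use_reserve:
        early.reserve(doc)
    else:
        early.update(doc, doc.value)  # no-op update: same value rewritten
    early.commit()
    try:
        late.update(doc, {"_id": 1, "n": 1})
        return False
    except WriteConflict:
        return True
```

In this model `late_writer_conflicts(True)` returns `False`, because the reservation vanished when `early` committed, while `late_writer_conflicts(False)` returns `True`, because the no-op update left a committed version newer than `late`'s snapshot.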