[SERVER-56170] Investigate why some oplog entries generated during tenant migration for timeseries bucket collections require stricter than normal idempotency guarantees Created: 19/Apr/21  Updated: 17/May/21  Resolved: 17/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Dan Larkin-York Assignee: Dan Larkin-York
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-55501 Avoid element-wise iteration and copy... Closed
Sprint: Execution Team 2021-05-31
Participants:

 Description   

During the course of SERVER-55501, we added an optimization for oplog diff application for certain scenarios where we know about the structure of the pre-image and the diff, and can guarantee that fields which are inserted by the diff do not already exist in the pre-image.

In the case of updates that happen as a result of timeseries inserts through the normal BucketCatalog machinery, we know that the resulting oplog entry which is applied on the primary should satisfy these conditions. Additionally, we know that the corresponding entry when applied on a secondary in steady state should also qualify.

What we found is that tenant migrations throw a wrench in the works here. In particular, it looks like we need to disable the optimization on the primary, even when the write goes through the bucket catalog, if the write comes from a tenant migration replaying the oplog. After talking it through, lingzhi.deng and dan.larkin-york concluded that the secondary should in theory be able to apply any entries generated by the primary blindly with the optimization, without checking whether they resulted from a tenant migration. However, this didn't appear to be the case: some entries still resulted in field duplication, and thus required the check for a tenant migration source.

It remains unclear why we sometimes generate these entries which require the strict idempotency guarantees which normally are not required for writes coming through the BucketCatalog. It may be that something is going wrong at the BucketCatalog layer, or it may be that tenant migrations are doing something unexpected, or any number of other things. The goal of this ticket is simply to understand what's going on here.



 Comments   
Comment by Geert Bosch [ 17/May/21 ]

So, to confirm my understanding, tenant migration depends on oplog application being idempotent, even on the primary in normal operation? If so, it seems reasonable to include tenant migration among the conditions under which we do not apply our optimization.

Comment by Dan Larkin-York [ 17/May/21 ]

After digging through code and discussing the expected semantics of the v:2 doc_diff application, the current behavior seems to be expected. I'll summarize below.

During a tenant migration, the recipient primary performs an initial sync procedure with the donor primary. First it gets a dump of the collection, then it catches up on any changes since the dump by replaying a portion of the donor's oplog. The tricky bit here is that the portion of the oplog that it replays may contain some operations that were already reflected in the dump. That is to be expected, but it interacts in a funny way with the v:2 doc_diff format.

The v:2 doc diff format has a crucial property for idempotency: that when you reapply any suffix of the diff chain in order, you'll end up at the same result. That is, if you have two diffs x and y, you can end up getting a chain like:

apply(A, x) -> B
apply(B, y) -> C
apply(C, x) -> D
apply(D, y) -> C
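To make the chain above concrete, here is a minimal Python sketch using a toy model of an insert-only diff. The function `apply_diff` and the sample documents are hypothetical and only imitate the reinsertion behavior described in this ticket; this is not the server's doc_diff code.

```python
# Toy model (assumption): a diff is a dict of fields to insert. Inserting a
# field that already exists removes the old copy and appends the field at the
# end, mimicking the "reinsertion (move-to-end)" behavior described here.
def apply_diff(doc, diff):
    out = dict(doc)  # Python dicts preserve insertion order
    for field, value in diff.items():
        out.pop(field, None)  # drop any existing copy
        out[field] = value    # append at the end
    return out

A = {"a": 1}
x = {"b": 1}  # diff x inserts field b
y = {"c": 2}  # diff y inserts field c

B = apply_diff(A, x)        # a, b
C = apply_diff(B, y)        # a, b, c
D = apply_diff(C, x)        # a, c, b  (b moved to the end)
C_again = apply_diff(D, y)  # a, b, c  (c moved to the end)

print(list(B), list(C), list(D), list(C_again))
```

In this model D differs from C only in field order, and reapplying the whole suffix (x, then y) lands back on C, which is the property the chain illustrates.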

Now, since this oplog replay is happening on the recipient primary, any update that isn't a no-op will generate a new oplog entry. So in this chain, each application would result in a new oplog entry, even though we end up back at the same state (C) as we were at a previous step. And subsequently, each of these oplog entries will be applied on the recipient secondary.

This matters for the optimization introduced in SERVER-55501 in the case of field insertion. The v:2 doc_diff format treats the insertion of a field that already exists as a reinsertion (or move-to-end). Thus, when we insert a measurement that already exists in a timeseries bucket document, it's reinserted, and we generate a new oplog entry instead of treating it as a no-op. And thus we need to disable the optimization and use full idempotency guarantees for any oplog entries generated by tenant migrations.
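A small Python sketch of why the replayed insert is a problem, under stated assumptions: documents are modeled as ordered lists of (field, value) pairs so that duplicates are representable, and `insert_strict` and `insert_blind` are hypothetical stand-ins for the fully idempotent path and the optimized path respectively. Neither is the server's implementation.

```python
# Assumption: a bucket's measurements are modeled as ordered (field, value)
# pairs. A list, unlike a dict, can hold the same field name twice.

def insert_strict(doc, field, value):
    """Reinsertion semantics: remove any existing copy, then append."""
    return [(f, v) for f, v in doc if f != field] + [(field, value)]

def insert_blind(doc, field, value):
    """Stand-in for the optimization: append without checking the pre-image."""
    return doc + [(field, value)]

bucket = [("0", 20.5), ("1", 21.0)]  # state already reflected in the dump
replayed = ("0", 20.5)               # oplog entry replayed after the dump

strict = insert_strict(bucket, *replayed)
blind = insert_blind(bucket, *replayed)

# The strict path changes field order, so it is not a no-op and would
# produce a new oplog entry on the recipient primary.
print(strict != bucket)
# The blind path leaves two copies of field "0".
print([f for f, _ in blind].count("0"))
```

The blind path is only safe when the inserted field is guaranteed to be absent from the pre-image, which is exactly the guarantee that oplog replay during a tenant migration breaks.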

Importantly, we need to take note if any future projects introduce a similar mechanism to tenant migration where a primary replays a portion of an oplog that overlaps with operations it has already applied, and add exceptions to the optimization for these as well.

In doc diff v:3, we should be able to introduce a new type of insert operation (insert2 or something) which does not perform reinsertion in case the field already exists. That would render such replay operations for timeseries collections no-ops, and would not generate new oplog entries.
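As a rough illustration of the proposed operation, the sketch below models an insert that skips fields already present. The name `insert_if_absent` and the returned `changed` flag are hypothetical; no such operation exists in the v:2 format, and the eventual design could differ.

```python
# Hypothetical "insert2"-style operation: insert only when the field is
# absent, and report whether the document changed. An unchanged document
# corresponds to a no-op, which would not generate a new oplog entry.
def insert_if_absent(doc, field, value):
    if field in doc:
        return doc, False  # field exists: leave the document untouched
    out = dict(doc)
    out[field] = value
    return out, True

bucket = {"0": 20.5, "1": 21.0}

# Replaying an insert that the dump already reflected is a no-op.
replayed_doc, replay_changed = insert_if_absent(bucket, "0", 20.5)

# A genuinely new measurement is still appended at the end.
new_doc, new_changed = insert_if_absent(bucket, "2", 19.8)

print(replay_changed, list(replayed_doc))
print(new_changed, list(new_doc))
```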

Comment by Dan Larkin-York [ 19/Apr/21 ]

Assigning to storage execution. Replication can assist if needed.

Generated at Thu Feb 08 05:38:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.