[SERVER-61147] Ensure safe deletion of TenantMigrationRecipientAccessBlocker for Shard Merge & MTM. Created: 30/Oct/21  Updated: 16/Oct/23

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: Backlog - Service Architecture
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-61141 Create TenantMigrationRecipientAccess... Closed
depends on SERVER-67372 Make tenant migration recipient delet... Closed
is depended on by SERVER-59794 Make existing tenant migration js tes... Closed
Related
is related to SERVER-65236 Make tenant migration donor delete it... Closed
Assigned Teams:
Service Arch
Participants:

 Description   

See design doc for details. SERVER-61141 should've accomplished what we need. Check that the multitenant migrations tests already cover all the TenantMigrationRecipientAccessBlocker cases we need to test for Merge.

EDIT:

The in-memory RTAB is used to reject user donor snapshot reads earlier than rejectReadsBeforeTimestamp after shard merge commit, because the first version of shard merge doesn't preserve/copy donor history. The in-memory RTAB is deleted once the recipient state document is deleted. Currently, the TTL monitor deletes the state document after shard merge completion with a GC delay of tenantMigrationGarbageCollectionDelayMS (default is 15 mins). This is risky because there is no guarantee that the recipient's oldest timestamp >= rejectReadsBeforeTimestamp after the GC delay. So, after merge commit and the GC delay, we can have readers trying to do donor data reads at a snapshot earlier than rejectReadsBeforeTimestamp, and there will be nothing to prevent those readers from reading inconsistent data, violating snapshot read guarantees.

(The same problem exists in the MTM protocol as well.)
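
A minimal standalone sketch of the check at issue (the class name is from this ticket, but the types and members here are simplified assumptions, not the server's actual API): after shard merge commits, the in-memory blocker is the only thing rejecting donor snapshot reads below rejectReadsBeforeTimestamp.

{code:cpp}
#include <cstdint>

using Timestamp = std::uint64_t;

enum class ReadOutcome { kAllowed, kSnapshotTooOld };

// Simplified model: the recipient never copied donor history older than
// rejectReadsBeforeTimestamp, so snapshot reads below it must be rejected.
struct TenantMigrationRecipientAccessBlocker {
    Timestamp rejectReadsBeforeTimestamp;

    ReadOutcome checkSnapshotRead(Timestamp atClusterTime) const {
        if (atClusterTime < rejectReadsBeforeTimestamp)
            return ReadOutcome::kSnapshotTooOld;
        return ReadOutcome::kAllowed;
    }
};

// The bug: once the state document is TTL-deleted, this in-memory blocker is
// destroyed and the check above never runs, even though the recipient's
// oldest timestamp may still be < rejectReadsBeforeTimestamp.
{code}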



 Comments   
Comment by Suganthi Mani [ 12/Dec/22 ]

Andy, Esha, and I agreed on the good-enough solution, i.e., making the GC window >> minSnapshotHistoryWindow, which is a very small-effort solution. So we don't need to bring this onto the product's radar.

Comment by Suganthi Mani [ 23/Nov/22 ]

As per an offline discussion with esha.maharishi@mongodb.com, since this is a bug in both MTM and shard merge, I am moving this ticket out of the Shard Merge project and putting it into the serverless backlog.

Comment by Suganthi Mani [ 04/Apr/22 ]
  1. We decided not to proceed with the hack (GC delay > snapshot window) and instead to directly implement the cleaner solution: garbage collect (delete) the state document only when the node's current oldest timestamp is greater than rejectReadsBeforeTimestamp, without involving the TTL monitor (via the expireAt field).
  2. A couple of ways to implement the cleaner solution came out of the first round of brainstorming with Esha:

Approach#1: Use the recipient state machinery to asynchronously poll the node's oldest timestamp and delete the state document (see the sketch below).

  • con: We can still have a slight race where donor snapshot reads are retried on a stale recipient secondary whose oldest timestamp is not yet > rejectReadsBeforeTimestamp but whose state doc has already been deleted. To be noted, the primary communicates the majority commit point to secondaries through the heartbeat/oplog channel; only when the majority commit point advances do the stable and oldest timestamps move forward.
  • pro: Changes are confined to the recipient state machine code.
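
A rough sketch of Approach#1 under simplified, assumed interfaces (none of these types match the real server code): the state machine defers deleting the state document until the node's own oldest timestamp has passed rejectReadsBeforeTimestamp, rather than trusting a wall-clock expireAt.

{code:cpp}
#include <chrono>
#include <cstdint>
#include <thread>

using Timestamp = std::uint64_t;

// Stand-in for the storage engine; only the accessor matters here.
struct StorageEngine {
    Timestamp oldestTimestamp{0};
    Timestamp getOldestTimestamp() const { return oldestTimestamp; }
};

struct RecipientStateMachine {
    Timestamp rejectReadsBeforeTimestamp;
    StorageEngine* storage;
    bool stateDocDeleted = false;

    // Deleting the state document also destroys the in-memory RTAB.
    void deleteStateDocument() { stateDocDeleted = true; }

    // Poll the node's oldest timestamp instead of using the TTL monitor.
    void waitThenGarbageCollect() {
        while (storage->getOldestTimestamp() < rejectReadsBeforeTimestamp) {
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
        deleteStateDocument();
    }
};
{code}

Approach#3 below would run essentially the same check, just on a dedicated reaper thread pool instead of inside the state machine; the con above applies to both, since a lagging secondary's oldest timestamp can still trail the replicated delete of the state doc.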

Approach#2: Register a hook in WiredTigerKVEngine::setOldestTimestampFromStable to delete the state document when the current oldest timestamp is greater than rejectReadsBeforeTimestamp (sketched below).

  • con: Complexity in writing the new hook; doing scans and disk writes as part of the heartbeat or oplog application code path can lead to unintended consequences (not sure, though).
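
A hypothetical sketch of Approach#2 (the hook registry below is invented for illustration; the real change would live inside WiredTigerKVEngine::setOldestTimestampFromStable): fire a callback whenever the oldest timestamp advances, and delete the state document the first time it passes rejectReadsBeforeTimestamp.

{code:cpp}
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Timestamp = std::uint64_t;

// Invented registry; stands in for a hook point inside the KV engine's
// oldest-timestamp advancement path.
class OldestTimestampHooks {
public:
    using Hook = std::function<void(Timestamp newOldest)>;
    void registerHook(Hook h) { _hooks.push_back(std::move(h)); }

    // Called (hypothetically) from setOldestTimestampFromStable.
    void onOldestTimestampAdvanced(Timestamp newOldest) {
        for (auto& h : _hooks) h(newOldest);
    }

private:
    std::vector<Hook> _hooks;
};

void registerRecipientGC(OldestTimestampHooks& hooks,
                         Timestamp rejectReadsBeforeTimestamp,
                         std::function<void()> deleteStateDocument) {
    hooks.registerHook([=](Timestamp newOldest) {
        // The disk write should be scheduled off this code path; doing it
        // inline is exactly the "unintended consequences" concern above.
        if (newOldest >= rejectReadsBeforeTimestamp)
            deleteStateDocument();
    });
}
{code}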

Approach#3: Use a separate reaper service to monitor the oldest timestamp and delete the state document.

  • con: Use of a new thread pool for the reaper service.

 

Comment by Suganthi Mani [ 04/Apr/22 ]

Making GC delay > snapshot window can still be racy and lead to incorrect data responses for snapshot reads (see here for more discussion on it).

My understanding is that the TTL monitor expires/deletes the state document based on wall-clock time, but the minSnapshotHistory window doesn't move based on wall-clock time. Let me try to explain what I am thinking with an example (sorry if it's too complicated).

Let's say GC delay time is set to 6 mins > Snapshot history window (= 5 mins).

  1. R writes the state document's 'expireAt' field as 8:06 A.M. (wall-clock time + GC delay) at a timestamp whose wall-clock time is equivalent to 8:00 A.M.
  2. Say the majority replication lag is 3 mins. This delays the majority commit of the state document written above by 3 mins, which in turn implies that only at wall-clock time 8:03 A.M. will the stable timestamp on R reach a timestamp with wall-clock time equivalent to 8:00 A.M.
    1. This indicates that only after wall-clock time 8:08 A.M. would any read that tries to take a snapshot at a timestamp with wall-clock time equivalent to 8:00 A.M. get a "SnapshotTooOld" error (see here how they calculate the history window).
  3. Say our current wall-clock time is 8:07 A.M., so the TTL monitor now deletes the state document written in step #1. This results in the tenant access blocker being removed.
  4. Now, say our current wall-clock time is 8:08 A.M., and the client tries to read donor data at a snapshot timestamp with wall-clock time equivalent to 7:59 A.M. This would not return a "SnapshotTooOld" error, because we no longer have the access blocker, and that's the bug.
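
A small numeric model of this timeline (minute granularity and all constants are taken from the example above; the code itself is illustrative only):

{code:cpp}
#include <cassert>

// Times in minutes since 8:00 A.M. Timestamps carry a wall-clock time, but
// the TTL monitor compares expireAt against the *current* wall clock, while
// the snapshot window is keyed off stable-timestamp advancement.
constexpr int kGCDelay = 6;        // tenantMigrationGarbageCollectionDelayMS, as minutes
constexpr int kSnapshotWindow = 5; // minSnapshotHistoryWindowInSeconds, as minutes
constexpr int kMajorityLag = 3;    // majority replication lag

int main() {
    const int docWrittenAt = 0;                    // step 1: written at 8:00
    const int expireAt = docWrittenAt + kGCDelay;  // expireAt = 8:06

    // Step 2: the stable timestamp reaches the 8:00 write only at 8:03, so
    // a snapshot at 8:00 stays readable until 8:03 + 5 = 8:08.
    const int stableReachesWrite = docWrittenAt + kMajorityLag;
    const int snapshotExpires = stableReachesWrite + kSnapshotWindow;

    // Step 3: the TTL monitor deletes the doc (and blocker) at ~8:07.
    const int blockerDeletedAt = expireAt + 1;

    // The race window: the blocker is gone before the snapshot window for
    // reads at <= 8:00 has closed, so step 4's read returns bad data.
    assert(blockerDeletedAt < snapshotExpires);
    return 0;
}
{code}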

 

Comment by A. Jesse Jiryu Davis [ 27/Mar/22 ]

The existing tests cover the new Shard Merge functionality. However, we're planning to enable snapshot reads in Serverless, and we must ensure access blockers remain for minSnapshotHistoryWindowInSeconds. If tenantMigrationGarbageCollectionDelayMS < minSnapshotHistoryWindowInSeconds:

1. Client starts snapshot session on donor D.
2. D replies to client's first query with read timestamp T.
3. Client stores T, and continues querying D at T.
4. D donates all its data to recipient R and the migration commits.
5. Client attempts to query R at T, R correctly replies "SnapshotTooOld" since T predates the migration.
6. After tenantMigrationGarbageCollectionDelayMS, R deletes its access blockers.
7. If the client is somehow still trying to query R at T, R will attempt to read tenant data at T. R had no data for the tenant at that time, so it will reply with empty results. This is incorrect; R should reply with "SnapshotTooOld".

The default tenantMigrationGarbageCollectionDelayMS is 15 minutes and the default minSnapshotHistoryWindowInSeconds is 5 minutes, so this bug won't usually be possible. But let's either enforce this relationship, or set the effective GC delay to max(tenant GC delay, snapshot window) to be sure (a sketch follows below).
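
A minimal sketch of the second option (the clamping helper is hypothetical; only the two server parameter names are from this ticket):

{code:cpp}
#include <algorithm>
#include <chrono>

using namespace std::chrono;

// Never let access blockers be garbage collected before the snapshot
// history window has passed, regardless of how the GC delay is configured.
milliseconds effectiveGCDelay(milliseconds tenantMigrationGarbageCollectionDelayMS,
                              seconds minSnapshotHistoryWindowInSeconds) {
    return std::max(tenantMigrationGarbageCollectionDelayMS,
                    duration_cast<milliseconds>(minSnapshotHistoryWindowInSeconds));
}

int main() {
    // With the defaults above (15 min GC delay vs. 5 min window) the max()
    // is a no-op; it only bites if an operator shrinks the GC delay.
    auto delay = effectiveGCDelay(minutes(15), minutes(5));
    return delay == minutes(15) ? 0 : 1;
}
{code}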

Comment by A. Jesse Jiryu Davis [ 13/Jan/22 ]

Ensure that the TenantMigrationRecipientAccessBlocker rejects such reads with "SnapshotTooOld" if the readConcern's "atClusterTime" predates the merge. Ensure access blockers remain for at least minSnapshotHistoryWindowInSeconds.
