[SERVER-65478] Fix race condition when removing tenant migration blockers in shard split Created: 12/Apr/22  Updated: 06/Dec/22  Resolved: 08/Jun/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Didier Nadeau Assignee: [DO NOT USE] Backlog - Server Serverless (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-65236 Make tenant migration donor delete it... Closed
is related to SERVER-61717 Ensure a POS instance remains in the ... Open
Assigned Teams:
Serverless
Participants:

 Description   

The `ShardSplitDonorOpObserver` removes access blockers when the state document is deleted by the TTL index (`ShardSplitDonorOpObserver::onDelete`). However, it does not check whether the access blocker is currently in use by another shard split operation for the same tenant. This allows a race condition in which a previously aborted shard split removes the blocker for `tenant1` while an ongoing shard split is still using it.

Scenario:

  • commitShardSplit is started for tenant1 with UUID 1
  • commitShardSplit fails and the state document becomes "aborted"
  • forgetShardSplit is called for UUID 1, and the TTL index is activated
  • commitShardSplit is started for tenant1 with UUID 2
  • The TTL index removes the state document for commitShardSplit UUID 1. The same operation also removes the access blocker for tenant1.
  • commitShardSplit UUID 2 crashes due to an invariant failure (or exhibits other undefined behavior) because it expects an access blocker to exist.

This leads to a crash, but it can also lead to data inconsistency before the crash happens: writes succeed when they shouldn't, because the blocker has been removed.
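
The scenario above can be replayed with a small standalone model. This is only a sketch and not the server's code: `BlockerRegistry`, its methods, and `blockerSurvivesScenario` are hypothetical names, and it assumes that each blocker can be tagged with the UUID of the shard split that installed it and that a second split for the same tenant takes over the existing entry. It contrasts an unconditional removal on delete with a removal that first checks ownership.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the per-tenant access blocker registry.
// Each entry records which shard split (by UUID) installed the blocker.
class BlockerRegistry {
public:
    // Installs the blocker for a tenant on behalf of a split.
    // Assumption: a later split for the same tenant takes over the entry.
    void add(const std::string& tenantId, const std::string& splitId) {
        _ownerByTenant[tenantId] = splitId;
    }

    // Unguarded removal, as described in the ticket: the blocker is
    // dropped no matter which split currently uses it.
    void removeUnconditionally(const std::string& tenantId) {
        _ownerByTenant.erase(tenantId);
    }

    // Guarded removal: drops the blocker only if it still belongs to
    // the split whose state document is being deleted.
    bool removeIfOwnedBy(const std::string& tenantId, const std::string& splitId) {
        auto it = _ownerByTenant.find(tenantId);
        if (it == _ownerByTenant.end() || it->second != splitId) {
            return false;
        }
        _ownerByTenant.erase(it);
        return true;
    }

    bool hasBlocker(const std::string& tenantId) const {
        return _ownerByTenant.count(tenantId) > 0;
    }

private:
    std::map<std::string, std::string> _ownerByTenant;
};

// Replays the ticket's scenario and reports whether tenant1 still has
// a blocker once the TTL deletion for UUID 1 has been processed.
bool blockerSurvivesScenario(bool guarded) {
    BlockerRegistry registry;
    registry.add("tenant1", "uuid-1");  // commitShardSplit UUID 1 starts
    // UUID 1 aborts and forgetShardSplit is called (no registry change).
    registry.add("tenant1", "uuid-2");  // commitShardSplit UUID 2 starts
    assert(registry.hasBlocker("tenant1"));

    // The TTL index deletes the state document for UUID 1.
    if (guarded) {
        registry.removeIfOwnedBy("tenant1", "uuid-1");
    } else {
        registry.removeUnconditionally("tenant1");
    }
    return registry.hasBlocker("tenant1");
}
```

With `guarded` set to `false`, the blocker that UUID 2 depends on is gone after the TTL deletion, which is the state that triggers the invariant failure. With `guarded` set to `true`, the stale deletion for UUID 1 leaves the blocker in place. An ownership check is only one possible mitigation for the race described here.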



 Comments   
Comment by Matt Broadstone [ 08/Jun/22 ]

Closing, since the issues identified in this ticket will be resolved once we complete the work for SERVER-61717.

Comment by Esha Maharishi (Inactive) [ 08/Jun/22 ]

Marking as related to SERVER-65236, which will remove the TTL index and make the instance's run() delete the state doc (after a delay) itself.

Generated at Thu Feb 08 06:02:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.