[SERVER-66235] Clear sync source and buffer when applying recipient config during a shard split Created: 05/May/22  Updated: 29/Oct/23  Resolved: 10/May/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.1.0-rc0

Type: Task Priority: Major - P3
Reporter: Didier Nadeau Assignee: Didier Nadeau
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Participants:

 Description   

When the recipient nodes apply the split config to join the new replica set, they don't immediately clear and reset their sync source and the buffer/fetcher/applier. As a result, the nodes still receive oplog entries from the donor replica set for some time after joining the recipient replica set, even after a primary has been elected in the recipient set.

This was discovered through the following bug :

  • The recipient nodes apply the recipient config from the split config
    • They remove their TenantAccessBlockers at the same time
  • The donor primary notices the recipients have joined the recipient set and sends a replSetStepUp
  • A recipient is elected primary, and the donor primary updates its state document to "committed"
  • The recipient secondaries (not the primary) receive and apply the oplog entry "state document updated to committed".
    • This must necessarily come from the donor set as the recipient primary never sets the document to committed.
  • The recipient secondaries crash as the OpObserver tries to remove the TenantAccessBlocker when the state document is updated to committed.
    • The TenantAccessBlockers were already removed on reconfig in the recipient nodes.
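
The sequence above can be modeled with a small, self-contained sketch. This is not the server's code: `BlockerRegistry`, `RecipientNode`, and `replayBugSequence` are hypothetical names invented for illustration, and the real server hits an invariant failure where this sketch simply returns false.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for the set of TenantAccessBlockers held by a node.
class BlockerRegistry {
public:
    void add(const std::string& tenantId) { _blockers.insert(tenantId); }

    // Returns false when there is no blocker left to remove.
    bool remove(const std::string& tenantId) { return _blockers.erase(tenantId) == 1; }

private:
    std::set<std::string> _blockers;
};

// Hypothetical, heavily simplified recipient secondary.
struct RecipientNode {
    BlockerRegistry blockers;
    bool joinedRecipientSet = false;

    // Applying the recipient config removes the blocker at the same time.
    void applyRecipientConfig(const std::string& tenantId) {
        blockers.remove(tenantId);
        joinedRecipientSet = true;
    }

    // Models the OpObserver reacting to the "committed" state document update.
    // A false return corresponds to the crash described in the ticket.
    bool onStateDocumentCommitted(const std::string& tenantId) {
        return blockers.remove(tenantId);
    }
};

// Replays the bug: reconfig first, then a stale donor oplog entry is applied.
bool replayBugSequence() {
    RecipientNode node;
    node.blockers.add("tenantA");
    node.applyRecipientConfig("tenantA");
    assert(node.joinedRecipientSet);
    // The node is still syncing from the donor, so it applies this entry.
    return node.onStateDocumentCommitted("tenantA");
}
```

In this model `replayBugSequence()` returns false because the second removal finds nothing to remove, which is the condition that crashes the recipient secondaries.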

 

Fix :

On every batch of messages, `ReplicationCoordinatorImpl::shouldChangeSyncSource` is invoked. When it receives the first batch following the split reconfig, this method returns ChangeSyncSourceAction::kStopSyncingAndEnqueueLastBatch. As the name implies, it resets the sync source but still processes the batch of messages received. This makes sense in a normal reconfig/primary change, as the node would receive committed oplog entries and should therefore apply them even if the primary changed (the new primary would normally have these entries too). However, for a split we want a clean break and don't want any oplog entry received after the reconfig to be applied on the recipient. Therefore ReplicationCoordinatorImpl::shouldChangeSyncSource should return ChangeSyncSourceAction::kStopSyncingAndDropLastBatchIfPresent.
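
A minimal sketch of the intended decision follows. The enum value names come from this ticket; everything else (`chooseAction`, `enqueueLastBatch`, and the two boolean inputs) is a hypothetical simplification and not the actual signature or logic of `ReplicationCoordinatorImpl::shouldChangeSyncSource`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Value names as used in the ticket; the enum itself is redeclared here
// only so that the sketch is self-contained.
enum class ChangeSyncSourceAction {
    kContinueSyncing,
    kStopSyncingAndEnqueueLastBatch,
    kStopSyncingAndDropLastBatchIfPresent,
};

// Hypothetical helper: picks an action for the batch that was just fetched.
// sourceStillValid    - assumed flag: the current sync source is still usable.
// appliedSplitConfig  - assumed flag: the node just applied the recipient
//                       config of a shard split.
ChangeSyncSourceAction chooseAction(bool sourceStillValid, bool appliedSplitConfig) {
    if (appliedSplitConfig) {
        // Clean break: nothing fetched from the donor may be applied.
        return ChangeSyncSourceAction::kStopSyncingAndDropLastBatchIfPresent;
    }
    if (sourceStillValid) {
        return ChangeSyncSourceAction::kContinueSyncing;
    }
    // Ordinary reconfig or primary change: the last batch is still wanted.
    return ChangeSyncSourceAction::kStopSyncingAndEnqueueLastBatch;
}

// Hypothetical helper: appends the last batch to the buffer unless the action
// says to drop it. Returns the number of entries enqueued.
std::size_t enqueueLastBatch(ChangeSyncSourceAction action,
                             const std::vector<std::string>& lastBatch,
                             std::vector<std::string>& buffer) {
    if (action == ChangeSyncSourceAction::kStopSyncingAndDropLastBatchIfPresent) {
        return 0;
    }
    buffer.insert(buffer.end(), lastBatch.begin(), lastBatch.end());
    return lastBatch.size();
}
```

With `appliedSplitConfig` set, the batch carrying the donor's "committed" state document update never reaches the buffer, so the recipient secondaries never try to remove a TenantAccessBlocker that is already gone.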



 Comments   
Comment by Githook User [ 10/May/22 ]

Author:

{'name': 'Didier Nadeau', 'email': 'didier.nadeau@mongodb.com', 'username': 'nadeaudi'}

Message: SERVER-66235 Clear sync source following a shard split reconfig
Branch: master
https://github.com/mongodb/mongo/commit/bf72d48d9140117e4d15df48b0a611aeaff66f5e
