  1. Core Server
  2. SERVER-66235

Clear sync source and buffer when applying recipient config during a shard split

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 6.1.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • None
    • Fully Compatible

      When the recipient nodes apply the split config to join the new replica set, they don't immediately clear and reset their sync source and the buffer/fetcher/applier. As a result, the nodes still receive oplog entries from the donor replica set for some time after joining the recipient replica set, even after a primary has been elected in the recipient set.

      This was discovered through the following bug:

      • The recipient nodes apply the recipient config from the split config.
        • They remove their TenantAccessBlockers at the same time.
      • The donor primary notices that the recipients have joined the recipient set and sends a replSetStepUp.
      • A recipient is elected as primary, and the donor primary updates its state document to "committed".
      • The recipient secondaries (not the primary) receive and apply the oplog entry "state document updated to committed".
        • This entry must come from the donor set, as the recipient primary never sets the document to committed.
      • The recipient secondaries crash as the OpObserver tries to remove the TenantAccessBlocker when the state document is updated to committed.
        • The TenantAccessBlockers were already removed on reconfig in the recipient nodes.
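
      The sequence above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyRecipientNode`, its members, and `committedEntryCrashes` are hypothetical names invented for illustration and are not server code. An exception stands in for the crash on the real node.

      ```cpp
      #include <cassert>
      #include <set>
      #include <stdexcept>
      #include <string>

      // Hypothetical toy model of a recipient node (not server code).
      struct ToyRecipientNode {
          // Tenant ids that currently have a blocker registered.
          std::set<std::string> tenantAccessBlockers;

          // Applying the recipient config removes the blockers at the same time.
          void applyRecipientConfig() {
              tenantAccessBlockers.clear();
          }

          // Models the observer reacting to "state document updated to committed".
          // It expects a blocker to exist; throwing stands in for the crash.
          void onStateDocumentCommitted(const std::string& tenantId) {
              if (tenantAccessBlockers.erase(tenantId) == 0) {
                  throw std::logic_error("no TenantAccessBlocker for " + tenantId);
              }
          }
      };

      // Returns true if applying the donor's "committed" entry fails on the node.
      // reconfigFirst selects whether the recipient config was applied beforehand.
      bool committedEntryCrashes(bool reconfigFirst) {
          ToyRecipientNode node;
          node.tenantAccessBlockers.insert("tenantA");
          if (reconfigFirst) {
              node.applyRecipientConfig();
          }
          try {
              node.onStateDocumentCommitted("tenantA");
          } catch (const std::logic_error&) {
              return true;
          }
          return false;
      }
      ```

      In this model, `committedEntryCrashes(true)` reproduces the failure, while `committedEntryCrashes(false)` does not, which matches the observation that the crash only happens when a donor oplog entry arrives after the reconfig.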

       

      Fix:

      `ReplicationCoordinatorImpl::shouldChangeSyncSource` is invoked on every batch of messages. When the node receives the first batch following the split reconfig, this method returns ChangeSyncSourceAction::kStopSyncingAndEnqueueLastBatch. As the name implies, it resets the sync source but still processes the batch of messages that was received. This makes sense in a normal reconfig or primary change: the node would be receiving committed oplog entries, so it should apply them even if the primary changed (the new primary would normally have these entries too). For a split, however, we want a clean break and don't want any oplog entry after the reconfig to be applied on the recipient. Therefore ReplicationCoordinatorImpl::shouldChangeSyncSource should return ChangeSyncSourceAction::kStopSyncingAndDropLastBatchIfPresent in this case.
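
      The intended decision can be sketched roughly as follows. This is a simplified illustration, not the actual server implementation: the enum below is a local stand-in that reuses the action names from this ticket, and `SyncSourceCheck`, its fields, and `chooseAction` are hypothetical.

      ```cpp
      #include <cassert>

      // Local stand-in for the server's ChangeSyncSourceAction (assumption:
      // only the values needed for this sketch are modelled).
      enum class ChangeSyncSourceAction {
          kContinueSyncing,
          kStopSyncingAndEnqueueLastBatch,
          kStopSyncingAndDropLastBatchIfPresent,
      };

      // Hypothetical inputs to the decision; the real method takes different arguments.
      struct SyncSourceCheck {
          bool syncSourceStillValid;     // current sync source is still acceptable
          bool joinedRecipientViaSplit;  // node just applied the recipient (split) config
      };

      ChangeSyncSourceAction chooseAction(const SyncSourceCheck& check) {
          // Shard split: clean break, nothing fetched from the donor may be applied.
          if (check.joinedRecipientViaSplit) {
              return ChangeSyncSourceAction::kStopSyncingAndDropLastBatchIfPresent;
          }
          // Ordinary reconfig or primary change: the last batch holds committed
          // entries that the new primary would normally have too, so keep it.
          if (!check.syncSourceStillValid) {
              return ChangeSyncSourceAction::kStopSyncingAndEnqueueLastBatch;
          }
          return ChangeSyncSourceAction::kContinueSyncing;
      }
      ```

      The split check comes first in this sketch so that a node that has just left the donor set never falls through to the enqueue path.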

            Assignee:
            didier.nadeau@mongodb.com Didier Nadeau
            Reporter:
            didier.nadeau@mongodb.com Didier Nadeau
            Votes:
            0
            Watchers:
            2
