[SERVER-53027] Tenant migration donor should stop retrying on recipient responses that indicate the recipient wants to abort Created: 23/Nov/20  Updated: 24/Feb/21  Resolved: 24/Feb/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Esha Maharishi (Inactive) Assignee: Andrew Shuvalov (Inactive)
Resolution: Won't Fix Votes: 0
Labels: pm-1791_milestone-D, pm-1791_non-cloud-blocking
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-53133 recipientSyncData should return Tenan... Closed
Sprint: Sharding 2021-03-08
Participants:

 Description   

This ticket is to update the donor to stop retrying on recipient responses that indicate the recipient wants to abort the migration.

Right now the donor will keep sending recipientSyncData until success (or getting a shutdown or stepdown error, but jason.zhang will replace those conditions with checking if the instance's cancelation token has been canceled SERVER-51179).



 Comments   
Comment by Andrew Shuvalov (Inactive) [ 24/Feb/21 ]

As discussed above for now this change is not necessary.

Comment by Jack Mulrow [ 24/Feb/21 ]

With Chou's changes, we already implemented the first option from Esha's first comment, right? Like she mentioned, we'd need to make sure the only errors the recipient returns that aren't in the retryable category are "fatal" to the migration, which seems like reasonable behavior to me. That should already be the case, and if not, the stepdown testing we'll add inĀ SERVER-52713 should flush out any errors we missed, and we can handle them as they come up. So I'm fine closing this ticket without more work, lingzhi.deng, how does that sound to you?

Comment by Andrew Shuvalov (Inactive) [ 24/Feb/21 ]

Hi, I have 2 questions:
1. cheahuychou.mao already changed the shouldStopSendingRecipientCommand() here to check that the error from sending Sync to recipient is not retriable

2. The code above doesn't convert the error to TenantMigrationAborted as suggested in message #1. I checked the AsyncTry() implementation used in _sendCommandToRecipient() and it doesn't provide the hook to convert one type of error to another in the callback. So the trick with converting the error could complicated, I don't know how to do it right away but it might be not worth it.

Will all of the above, do you thing there is still anything to do with this ticket?

Comment by Esha Maharishi (Inactive) [ 30/Nov/20 ]

Oh yes, I forgot the recipient should do that because:

the recipient may receive recipientForgetMigration before recipientSyncData, since the donor may decide to abort before the recipient receives recipientSyncData. In this case, recipientForgetMigration inserts the state machine already marked as garbage collectable.

The only case we have right now where the donor can decide to abort is after the donor starts blocking, but what you said is probably a good idea in case that ever changes.

Comment by Lingzhi Deng [ 30/Nov/20 ]

That makes sense. My current implementation for the recipientForgetMigration (still in CR though) would still write a state doc marking the migration garbage collectable and return success even if the recipient hasn't received a recipientSyncData command. So we probably dont need the NoSuchTenantMigration part. But having it is fine in case I change that in my CR.
For recipientSyncData, I filed SERVER-53133.

Comment by Esha Maharishi (Inactive) [ 30/Nov/20 ]

lingzhi.deng , option 2 sounds good to me, and yes, repl will need to guarantee the recipient always returns TenantMigrationAborted for recipientSyncData.

I think for recipientForgetMigration, the donor should keep retrying until hearing back success or NoSuchTenantMigration.

Comment by Lingzhi Deng [ 30/Nov/20 ]

I think doing the second option might be cleaner and more explicit. In that case, I guess we will probably need another ticket for Repl to audit the error code returned by recipientSyncData and recipientForgetMigration command?

Comment by Esha Maharishi (Inactive) [ 29/Nov/20 ]

lingzhi.deng, heads up that I filed this new "cleanup" ticket. I see several options:

  • The donor stops retrying on any non-retryable error, where retryable is defined according to error_codes.yml
    • This means the recipient must only return "fatal" (to the migration) errors.
  • The recipient converts any "fatal" (to the migration) error to TenantMigrationAborted (while preserving the reason in the errmsg), and the donor stops retrying only on "TenantMigrationAborted" errors.
    • This is the TransactionCoordinator's model, where the coordinator only stops retrying on NoSuchTransaction or TransactionTooOld.
  • The donor stops retrying only on a list of "fatal" (to the migration) errors.
    • This makes mixed-version clusters harder to handle, since a newer version recipient may return a new type of "fatal" error.
Generated at Thu Feb 08 05:29:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.