[SERVER-72254] Chunk Migration should fail immediately when session migration fails.
Created: 19/Dec/22
Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Kshitij Gupta
Assignee: Backlog - Cluster Scalability
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
is caused by SERVER-56185 Investigate possible improvements wit... Closed
Related
Assigned Teams:
Cluster Scalability
Participants:
Case:
Story Points: 3

 Description   

The MigrationDestinationManager on the recipient starts fetching session information at the beginning of the move chunk process. This fetch runs on a separate thread. If SessionCatalogMigrationDestination fails for any reason (e.g. Operation Interrupted), we record the failure but do not abort the chunk migration.

 

MigrationDestinationManager does eventually check the status of session migration and fails if the status is ErrorOccurred, but this check is not done until the very end of the chunk migration. As a result, the chunk migration does not fail immediately even when session migration has already failed.
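
To make the difference between the late check and an immediate failure concrete, here is a minimal toy model in Python. It is not the server code: the class SessionMigration, the function migrate_chunk, the fail_fast flag, and the failed event are all hypothetical names invented for this sketch. Only the ErrorOccurred status comes from the description above. The worker thread is joined before the main loop purely to keep the toy deterministic.

```python
import threading
from enum import Enum, auto


class SessionState(Enum):
    MIGRATING = auto()
    DONE = auto()
    ERROR_OCCURRED = auto()  # mirrors the ErrorOccurred status in the ticket


class SessionMigration:
    """Toy stand-in for the session fetch that runs on its own thread."""

    def __init__(self, fail):
        self._fail = fail
        self._lock = threading.Lock()
        self.state = SessionState.MIGRATING
        # Hypothetical signal that a fail-fast design could use.
        self.failed = threading.Event()

    def run(self):
        new_state = SessionState.ERROR_OCCURRED if self._fail else SessionState.DONE
        with self._lock:
            self.state = new_state  # the failure is recorded...
        if self._fail:
            self.failed.set()  # ...and, in this sketch, also signalled


def migrate_chunk(session, steps, fail_fast):
    """Run a toy migration and return (outcome, steps_completed)."""
    worker = threading.Thread(target=session.run)
    worker.start()
    worker.join()  # toy simplification: the session outcome is known up front

    completed = 0
    for _ in range(steps):
        # Assumed fix: look at the session status before every step.
        if fail_fast and session.failed.is_set():
            return "aborted", completed
        completed += 1

    # Behaviour described in the ticket: the status is checked only at the end.
    if session.state is SessionState.ERROR_OCCURRED:
        return "aborted", completed
    return "committed", completed
```

With fail_fast=False, a failed session migration is only noticed after every step has run. With fail_fast=True, the same failure stops the toy migration before the first step.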

This can cause a chunk migration to get stuck for 6 hours (the timeout). One of the conditions for the donor to engage the critical section is that session migration succeeded, so the donor keeps waiting for the recipient to finish session migration while the recipient is waiting for the donor to engage the critical section. The donor keeps retrying until it times out after 6 hours.
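
The mutual wait can be illustrated with a small simulated-clock sketch in Python. This is only a model under stated assumptions, not the donor's actual retry logic: donor_wait, POLL_SECONDS, and the recipient_fails_fast flag are hypothetical, and the 60-second retry interval is chosen purely for illustration. Only the 6-hour timeout comes from the description above.

```python
TIMEOUT_SECONDS = 6 * 60 * 60  # the 6-hour timeout from the description
POLL_SECONDS = 60  # assumed retry interval, for illustration only


def donor_wait(session_failed, recipient_fails_fast):
    """Simulate the donor's retries; return (outcome, simulated seconds waited)."""
    elapsed = 0
    while elapsed < TIMEOUT_SECONDS:
        if not session_failed:
            # Session migration succeeded, so the donor may proceed.
            return "critical_section_entered", elapsed
        if recipient_fails_fast:
            # Assumed fix: the recipient surfaces the failure right away.
            return "migration_aborted", elapsed
        # The failed session migration never becomes ready, and the recipient
        # is itself blocked waiting for the critical section, so nothing changes.
        elapsed += POLL_SECONDS
    return "timed_out", elapsed
```

In this model, donor_wait(True, False) runs through the whole simulated timeout, while donor_wait(True, True) ends on the first check.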


Generated at Thu Feb 08 06:21:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.