[SERVER-37421] Include the cause for migration failure on the donor shard in the changelog Created: 02/Oct/18  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Diagnostics, Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Backlog - Cluster Scalability
Resolution: Unresolved Votes: 2
Labels: cs-subteam1, esha-summer-2019-neweng, max-triage, neweng, sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Cluster Scalability
Participants:
Case:
Story Points: 3

 Description   

Currently, if a donor shard fails an ongoing migration, the changelog entry does not include the error that caused the failure, which makes such failures difficult to diagnose. Whenever possible, if a migration fails, we should include the reason for the failure.

It might be most practical to do this in the destructor of MoveTimingHelper, if the last exception thrown can be read there.
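
The destructor-based approach suggested above could look roughly like the following sketch. It is not the server's code: `MigrationLogGuard`, `ChangeLogEntry`, `changeLog()` and `runMigration()` are hypothetical names, and the changelog is modelled as an in-memory vector. One caveat the sketch illustrates: in standard C++, a destructor that runs during stack unwinding can detect that an exception is in flight via `std::uncaught_exceptions()`, but `std::current_exception()` generally returns null there because no handler is active yet. The sketch therefore assumes the reason is handed to the guard explicitly before the exception propagates.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a changelog document.
struct ChangeLogEntry {
    std::string what;
    bool failed;
    std::string errmsg;
};

// Hypothetical in-memory changelog; the real changelog is a collection.
std::vector<ChangeLogEntry>& changeLog() {
    static std::vector<ChangeLogEntry> entries;
    return entries;
}

// Hypothetical RAII guard that writes one changelog entry when it goes out of scope.
class MigrationLogGuard {
public:
    explicit MigrationLogGuard(std::string what)
        : _what(std::move(what)), _uncaughtAtEntry(std::uncaught_exceptions()) {}

    // Called by the migration code at the point where the error is known.
    void setFailureReason(std::string reason) {
        _reason = std::move(reason);
    }

    ~MigrationLogGuard() {
        // More uncaught exceptions than at construction means this destructor
        // is running because the stack is being unwound.
        const bool unwinding = std::uncaught_exceptions() > _uncaughtAtEntry;
        const bool failed = unwinding || !_reason.empty();

        std::string errmsg = _reason;
        if (errmsg.empty() && unwinding) {
            // The in-flight exception is not readable here, so no detail is available.
            errmsg = "unknown (exception in flight)";
        }

        try {
            changeLog().push_back({_what, failed, errmsg});
        } catch (...) {
            // Never let logging throw out of a destructor.
        }
    }

private:
    std::string _what;
    std::string _reason;
    int _uncaughtAtEntry;
};

// Hypothetical migration body: records the reason, then lets the error propagate.
void runMigration(bool fail) {
    MigrationLogGuard guard("moveChunk.from");
    try {
        if (fail) {
            throw std::runtime_error("recipient shard unreachable");
        }
    } catch (const std::exception& ex) {
        guard.setFailureReason(ex.what());
        throw;
    }
}
```

Calling `runMigration(true)` throws, and the guard's destructor appends an entry whose `errmsg` carries the text of the exception; `runMigration(false)` appends an entry with `failed` set to false.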



 Comments   
Comment by Josef Ahmad [ 16/Nov/20 ]

Whilst populating the error message is straightforward, I found that the MoveTimingHelper destructor is unable to write the failure to the changelog on the donor because in most (or all?) failure cases the moveChunk operation context is in a killed state. I have a few ideas (temporarily push/pop the operation context state to allow logging; create a separate operation context for logging; or relax the state check when logging), but I haven't explored their feasibility. Suggestions welcome.
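
To make the second idea (a separate operation context for logging) concrete, here is a minimal, self-contained sketch. Everything in it is an assumption for illustration: `FakeOperationContext`, `logChange()`, `FailureLogger`, `loggedEntries()` and `failMigration()` are hypothetical and do not reflect the server's actual OperationContext or changelog APIs. It only shows the shape of the fallback: try the original context, and if it is killed, log through a fresh one.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified operation context.
struct FakeOperationContext {
    bool killed = false;
};

// Hypothetical in-memory changelog.
std::vector<std::string>& loggedEntries() {
    static std::vector<std::string> entries;
    return entries;
}

// Stand-in for a changelog write that refuses to run on a killed context.
bool logChange(const FakeOperationContext& opCtx, const std::string& entry) {
    if (opCtx.killed) {
        return false;
    }
    loggedEntries().push_back(entry);
    return true;
}

// Hypothetical helper that logs a failure from its destructor.
class FailureLogger {
public:
    explicit FailureLogger(FakeOperationContext& opCtx) : _opCtx(opCtx) {}

    void setError(std::string error) {
        _error = std::move(error);
    }

    ~FailureLogger() {
        if (_error.empty()) {
            return;  // nothing failed, nothing to log in this sketch
        }
        try {
            const std::string entry = "moveChunk.from failed: " + _error;
            if (!logChange(_opCtx, entry)) {
                // The original context is killed, so use a fresh context
                // whose only job is to carry the logging write.
                FakeOperationContext loggingCtx;
                logChange(loggingCtx, entry);
            }
        } catch (...) {
            // Never let logging throw out of a destructor.
        }
    }

private:
    FakeOperationContext& _opCtx;
    std::string _error;
};

// Hypothetical failure path that leaves the original context killed.
void failMigration(FakeOperationContext& opCtx, const std::string& reason) {
    FailureLogger logger(opCtx);
    logger.setError(reason);
    opCtx.killed = true;
}
```

In this sketch the entry is still written even though the original context ends up killed. Whether a fresh context is safe to create at that point in the real server is exactly the feasibility question raised in the comment.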

Generated at Thu Feb 08 04:45:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.