[SERVER-37421] Include the cause for migration failure on the donor shard in the changelog Created: 02/Oct/18 Updated: 12/Dec/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Diagnostics, Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Backlog - Cluster Scalability |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | cs-subteam1, esha-summer-2019-neweng, max-triage, neweng, sharding-nyc-subteam1 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Cluster Scalability |
| Case: | (copied to CRM) | ||||
| Story Points: | 3 | ||||
| Description |
|
Currently, if a donor shard fails an ongoing migration, the changelog entry does not include the error that caused the failure, which makes such failures difficult to diagnose. Whenever possible, a failed migration should record the reason for the failure. It might be most practical to do this in the destructor of MoveChunkHelper, if the last exception thrown can be read there. |
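As a rough illustration of the destructor idea, here is a minimal standalone C++ sketch. It is not MongoDB code: `ChangeLogSketch`, `MigrationLogHelperSketch`, `runMigrationSketch`, and the log strings are all hypothetical names invented for this example. It assumes the failure reason is handed to the helper from a catch block, because `std::uncaught_exceptions()` only tells a destructor that an exception is in flight, not what the exception is.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the changelog; not a MongoDB API.
struct ChangeLogSketch {
    std::vector<std::string> entries;
};

// Hypothetical RAII helper that records the outcome when it goes out of scope.
class MigrationLogHelperSketch {
public:
    explicit MigrationLogHelperSketch(ChangeLogSketch& log)
        : _log(log), _uncaughtAtEntry(std::uncaught_exceptions()) {}

    // Called from an error path where the actual error is still available.
    void setFailureReason(std::string reason) { _reason = std::move(reason); }

    // Called once the guarded work has completed without error.
    void markSucceeded() { _succeeded = true; }

    ~MigrationLogHelperSketch() {
        // True when this destructor runs because an exception is propagating.
        const bool unwinding = std::uncaught_exceptions() > _uncaughtAtEntry;
        try {
            if (_succeeded && !unwinding) {
                _log.entries.push_back("migration succeeded");
            } else {
                _log.entries.push_back("migration failed: " +
                                       (_reason.empty() ? "unknown" : _reason));
            }
        } catch (...) {
            // Never let a logging failure escape a destructor.
        }
    }

private:
    ChangeLogSketch& _log;
    int _uncaughtAtEntry;
    bool _succeeded = false;
    std::string _reason;
};

// Runs a pretend migration and returns what ended up in the changelog.
std::vector<std::string> runMigrationSketch(bool fail) {
    ChangeLogSketch log;
    try {
        MigrationLogHelperSketch helper(log);
        try {
            if (fail) {
                throw std::runtime_error("recipient shard unreachable");
            }
            helper.markSucceeded();
        } catch (const std::exception& ex) {
            helper.setFailureReason(ex.what());  // capture the cause here
            throw;                               // then let it propagate
        }
    } catch (const std::exception&) {
        // Swallowed only so the sketch can return the log contents.
    }
    return log.entries;
}
```

Calling `runMigrationSketch(true)` leaves a single entry containing the exception message, while `runMigrationSketch(false)` leaves a success entry. Capturing the reason explicitly keeps the destructor simple, at the cost of requiring every failure path to report its error to the helper.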
| Comments |
| Comment by Josef Ahmad [ 16/Nov/20 ] |
|
While populating the error message is straightforward, I've found that the MoveTimingHelper destructor is unable to write the failure to the changelog on the donor, because in most (or all?) failure cases the moveChunk's operation context is in a killed state. I have a few ideas in mind (temporarily push/pop the operation context state to allow logging; create a separate operation context for logging; or relax the state check when logging), but I haven't explored their feasibility. Suggestions welcome. |
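To make the second idea (a separate operation context for logging) concrete, here is a toy C++ sketch. Everything in it is hypothetical and deliberately simplified: `OpContextSketch`, `InterruptibleLogSketch`, and `logFailureWithFreshContext` are invented for illustration and do not correspond to the server's real operation context or logging interfaces. It only models the behavior described in the comment, namely that logging is refused when the context is killed.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, heavily simplified operation context: just a killed flag.
struct OpContextSketch {
    bool killed = false;
};

// Hypothetical changelog writer that refuses to run on a killed context.
class InterruptibleLogSketch {
public:
    void logChange(const OpContextSketch& ctx, const std::string& what) {
        if (ctx.killed) {
            throw std::runtime_error("operation was interrupted");
        }
        entries.push_back(what);
    }

    std::vector<std::string> entries;
};

// Tries the original context first and falls back to a fresh, un-killed
// context used only for logging. Returns true if the fallback was needed.
bool logFailureWithFreshContext(InterruptibleLogSketch& log,
                                const OpContextSketch& original,
                                const std::string& reason) {
    try {
        log.logChange(original, reason);
        return false;
    } catch (const std::runtime_error&) {
        OpContextSketch fresh;  // assumption: creating one here is permitted
        log.logChange(fresh, reason);
        return true;
    }
}
```

Whether a fresh context can actually be created at that point in the real server is exactly the open feasibility question raised above; the sketch only shows the control flow such an approach would need.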