[SERVER-63586] Retry to recover the sharding state until it succeeds Created: 11/Feb/22  Updated: 29/Oct/23  Resolved: 07/Mar/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 6.0.0-rc0

Type: Task Priority: Major - P3
Reporter: Antonio Fuschetto Assignee: Allison Easton
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Sprint: Sharding EMEA 2022-03-07, Sharding EMEA 2022-03-21
Participants:
Linked BF Score: 32

 Description   

When a shard starts, if the sharding state recovery document indicates that there were metadata change operations in flight, it contacts the primary config server in order to retrieve the most recent opTime.

This procedure should retry until it succeeds, but there is a corner case that causes the shard process to crash: when the returned command status is NamespaceExists (a perfectly expected scenario), the logic also checks the write concern status and possibly raises an error. If the primary config server stepped down, the write concern status would be InterruptedDueToReplStateChange; the caller converts this error to an exception and the process crashes.
 
A possible solution would be to retry the command against the primary config server when the write concern status is not ok and the command status is part of a specific list of errors (that includes NamespaceExists).
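
The proposed retry rule can be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not the server's actual C++ code: the names `Response`, `recover_op_time`, `TOLERATED_COMMAND_ERRORS` and the `max_attempts` parameter are hypothetical, the opTime is modelled as a plain integer, and the list of tolerated errors contains only NamespaceExists because that is the only member the description names.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-ins for server status codes; the names follow the ticket.
OK = "OK"
NAMESPACE_EXISTS = "NamespaceExists"
INTERRUPTED_DUE_TO_REPL_STATE_CHANGE = "InterruptedDueToReplStateChange"

# Assumed list of command errors that are expected during recovery.
# The ticket only confirms that NamespaceExists belongs to it.
TOLERATED_COMMAND_ERRORS = frozenset({NAMESPACE_EXISTS})


@dataclass
class Response:
    """Simplified reply from the primary config server (hypothetical shape)."""
    command_status: str
    write_concern_status: str
    op_time: Optional[int] = None  # opTime modelled as an integer for brevity


def recover_op_time(run_on_config_primary: Callable[[], Response],
                    max_attempts: Optional[int] = None) -> int:
    """Retry the command until both statuses are acceptable.

    max_attempts exists only so the sketch can be bounded in tests;
    None means "retry until it succeeds", as the ticket title asks.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        resp = run_on_config_primary()

        command_acceptable = (resp.command_status == OK
                              or resp.command_status in TOLERATED_COMMAND_ERRORS)
        if not command_acceptable:
            # Errors outside the tolerated list are out of scope here
            # and are surfaced to the caller.
            raise RuntimeError(f"unexpected command status: {resp.command_status}")

        if resp.write_concern_status == OK:
            return resp.op_time

        # Write concern failed (for example the config primary stepped down):
        # retry instead of raising. Retrying when the command status is OK
        # is an assumption of this sketch, not something the ticket states.

    raise RuntimeError("gave up before sharding state recovery succeeded")
```

For example, if the first reply carries NamespaceExists together with a write concern status of InterruptedDueToReplStateChange, the loop issues the command again rather than raising; once a later reply has an ok write concern status, the opTime from that reply is returned.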



 Comments   
Comment by Githook User [ 07/Mar/22 ]

Author:

{'name': 'Allison Easton', 'email': 'allison.easton@mongodb.com', 'username': 'allisoneaston'}

Message: SERVER-63586 Retry to recover the sharding state until it succeeds
Branch: master
https://github.com/mongodb/mongo/commit/57e6550a16d66d503ee2402046637b00409e8a0d

Generated at Thu Feb 08 05:58:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.