- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Sharding
- Labels:
- Fully Compatible
- Sharding EMEA 2022-03-07, Sharding EMEA 2022-03-21
32
When a shard starts, if the sharding state recovery document indicates that there were metadata change operations in flight, it contacts the primary config server in order to retrieve the most recent opTime.
This procedure should retry until it succeeds, but there is a corner case that causes the shard process to crash: when the returned command status is NamespaceExists (a perfectly expected scenario), the logic also checks the write concern status and may raise an error. If the primary config server has stepped down, the write concern status is InterruptedDueToReplStateChange; the caller converts this error into an exception and the process crashes.
A possible solution would be to retry the command against the primary config server when the write concern status is not OK and the command status is part of a specific list of errors (one that includes NamespaceExists).
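The proposed retry rule could be sketched roughly as follows. This is a standalone Python model for illustration only, not the server's C++ code: `CommandResponse`, `recover_config_op_time`, `TOLERATED_COMMAND_STATUSES`, and the `max_attempts` bound are all hypothetical names and choices, and the contents of the tolerated list beyond NamespaceExists are an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Status names taken from the ticket text; modelled as plain strings here.
OK = "OK"
NAMESPACE_EXISTS = "NamespaceExists"
INTERRUPTED = "InterruptedDueToReplStateChange"

# Assumption: command statuses that are acceptable outcomes of the request.
# The ticket only states that the list includes NamespaceExists.
TOLERATED_COMMAND_STATUSES = frozenset({OK, NAMESPACE_EXISTS})


@dataclass
class CommandResponse:
    """Hypothetical model of a config server reply."""
    command_status: str
    write_concern_status: str
    op_time: Optional[int] = None


def recover_config_op_time(
    send_command: Callable[[], CommandResponse],
    max_attempts: int = 10,
) -> Optional[int]:
    """Retry while the command status is tolerated but the write concern failed.

    The bound on attempts is an assumption made so the sketch terminates;
    the ticket says the real procedure should retry until it succeeds.
    """
    for _ in range(max_attempts):
        response = send_command()
        if response.command_status not in TOLERATED_COMMAND_STATUSES:
            # Unexpected command error: surface it to the caller.
            raise RuntimeError(f"command failed: {response.command_status}")
        if response.write_concern_status == OK:
            return response.op_time
        # Write concern not OK (for example after a config primary stepdown):
        # loop again and resend the command instead of raising.
    raise RuntimeError("gave up waiting for a successful write concern")
```

With this rule, a reply of NamespaceExists paired with InterruptedDueToReplStateChange triggers another attempt rather than an exception, and the first reply whose write concern status is OK ends the loop.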