- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 5.0.5
- Component/s: None
- Fully Compatible
- ALL
- v5.0
- Sharding EMEA 2022-02-21
In ShardingCatalogManager::removeShard, the shard membership lock is released before the topology time is updated on the control shard. If two removeShard commands finish draining at the same time, choose the same control shard, and commit their new topology times out of order, the older time overwrites the newer one, so the newest topology time is not stored in any of the config.shards entries. The shard registry refresh then cannot fulfill any promises, because the topology time returned from _lookup is read from config.shards and will be smaller than the time in store. The result is an infinite loop of shard registry lookups.
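The interleaving can be replayed deterministically with a small model. This is only a sketch, not the server code: `config_shards`, `commit_topology_time`, `lookup_topology_time`, and the integer times are hypothetical stand-ins, and it assumes the lookup reports the highest topology time found in the entries.

```python
# Hypothetical in-memory stand-in for config.shards: shard id -> topologyTime.
config_shards = {"controlShard": 10, "otherShard": 5}


def commit_topology_time(entries, control_shard, new_time):
    """Unconditionally overwrite the control shard's topologyTime (the racy write)."""
    entries[control_shard] = new_time


def lookup_topology_time(entries):
    """Assumed model of a lookup: the highest topologyTime present in the entries."""
    return max(entries.values())


# Two removeShard operations obtain times 11 and 12 and pick the same control shard.
# The newest time (12) is what the registry's store has already been advanced to.
time_in_store = 12

# The commits land out of order: the write for 12 first, the stale write for 11 last.
commit_topology_time(config_shards, "controlShard", 12)
commit_topology_time(config_shards, "controlShard", 11)

looked_up = lookup_topology_time(config_shards)
# A refresh can only complete if the looked-up time has caught up with the store.
refresh_can_complete = looked_up >= time_in_store
print(looked_up, refresh_can_complete)
```

In this model no entry holds 12 any more, so every further lookup returns 11 and the refresh never catches up with the time in store.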
We should hold the shard membership lock across these operations to eliminate the race condition, or at least ensure that the update of the control shard's topology time only ever increases the time and never decreases it.
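The weaker of the two options, a write that can only move the topology time forward, might look like the following. It is a minimal sketch under stated assumptions: `ControlShardEntry` and `advance_topology_time` are hypothetical names, integers stand in for topology times, and a `threading.Lock` stands in for whatever makes the compare-and-write atomic in the real system.

```python
import threading


class ControlShardEntry:
    """Hypothetical model of the control shard's config.shards entry."""

    def __init__(self, topology_time):
        self._lock = threading.Lock()
        self.topology_time = topology_time

    def advance_topology_time(self, new_time):
        """Apply new_time only if it is strictly newer; return True if it was applied."""
        with self._lock:
            if new_time <= self.topology_time:
                # Stale write from an operation that committed late: ignore it.
                return False
            self.topology_time = new_time
            return True


entry = ControlShardEntry(10)
# Same out-of-order commit as in the bug: 12 lands first, then the stale 11.
results = [entry.advance_topology_time(12), entry.advance_topology_time(11)]
print(entry.topology_time, results)
```

With the guard in place the stale write is rejected and the entry keeps the newest time, regardless of commit order. Holding the membership lock for the whole operation would instead prevent the out-of-order commit from occurring at all.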