removeShard checks are inherently racy

    • v4.4
    • Sharding 2020-03-23
      removeShard runs a series of checks before marking a shard as "draining" (i.e. slated for removal) on the config server, including:

      • only one shard should be "draining" at a time
      • can't remove the last shard
      • the shard to be removed should not already be "draining"

      Relevant code: https://github.com/mongodb/mongo/blob/907ed32a3a8bd19f883836013530f645522a75bc/src/mongo/s/catalog/replset/sharding_catalog_client_impl.cpp#L500-L544

      However, these checks are not guarded by a distributed lock (or even by an in-process lock within a single mongos). Two removeShard requests, whether sent to two different mongoses or to the same mongos, can therefore pass all of the checks concurrently and each mark its shard as "draining", so two shards end up being removed at once.

      This can be fixed by the new locking mechanism being added for the zone sharding project.
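
      To illustrate the race and the effect of a lock, here is a minimal Python sketch of the check-then-act pattern. It is not the mongos code: ConfigServer, remove_shard, run_two_requests and the returned status strings are all hypothetical names, the catalog is a plain in-memory dict, and a threading.Lock stands in for whatever distributed lock the real fix would use. A barrier is placed between the checks and the write purely to force the problematic interleaving.

```python
import contextlib
import threading


class ConfigServer:
    """Toy in-memory stand-in for the shard catalog (hypothetical)."""

    def __init__(self, shard_names):
        # Maps shard name -> True if the shard is marked "draining".
        self.shards = {name: False for name in shard_names}

    def draining(self):
        return sorted(name for name, flag in self.shards.items() if flag)


def remove_shard(config, name, lock=None, pause=None):
    """Run the pre-checks, then mark `name` as draining.

    lock:  optional lock held across the checks AND the write.
    pause: optional callable run between the checks and the write,
           used here only to force two requests to interleave.
    """
    guard = lock if lock is not None else contextlib.nullcontext()
    with guard:
        if len(config.shards) <= 1:
            return "cannot remove the last shard"
        if config.shards[name]:
            return "already draining"
        if any(config.shards.values()):
            return "another shard is draining"
        if pause is not None:
            pause()
        config.shards[name] = True
        return "draining started"


def run_two_requests(use_lock):
    """Issue two concurrent remove_shard calls for different shards."""
    config = ConfigServer(["shardA", "shardB", "shardC"])
    lock = threading.Lock() if use_lock else None
    barrier = threading.Barrier(2)

    def pause():
        # Wait until the other request has also passed its checks.
        # With the lock held, the other request never gets that far,
        # so the wait times out and this request simply continues.
        try:
            barrier.wait(timeout=1.0)
        except threading.BrokenBarrierError:
            pass

    results = {}

    def worker(name):
        results[name] = remove_shard(config, name, lock, pause)

    threads = [threading.Thread(target=worker, args=(n,))
               for n in ("shardA", "shardB")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return config.draining(), results


if __name__ == "__main__":
    print("unguarded:", run_two_requests(use_lock=False))
    print("guarded:  ", run_two_requests(use_lock=True))
```

      In the unguarded run both requests pass the checks before either one writes, so both shards end up draining. In the guarded run the second request only starts its checks after the first has written, so it is rejected by the "only one shard draining" check. The important detail is that the lock covers the checks and the write together; locking only the write would not close the window.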

