[SERVER-25053] removeShard checks are inherently racy
Created: 13/Jul/16  Updated: 09/Sep/20  Resolved: 16/Mar/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.3.9
Fix Version/s: 4.4.0-rc0, 4.7.0

Type: Bug
Priority: Major - P3
Reporter: Esha Maharishi (Inactive)
Assignee: Alexander Taskov (Inactive)
Resolution: Done
Votes: 0
Labels: PM-108, sharding-4.4-stabilization, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-32553 The `removeShard` command is not idem... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4
Sprint: Sharding 2020-03-23
Linked BF Score: 26

 Description   

removeShard does a series of checks before marking a shard as "draining" (aka to be removed) on the config server, including:

  • only one shard should be "draining" at a time
  • can't remove the last shard
  • the shard to be removed should not already be "draining"

Relevant code: https://github.com/mongodb/mongo/blob/907ed32a3a8bd19f883836013530f645522a75bc/src/mongo/s/catalog/replset/sharding_catalog_client_impl.cpp#L500-L544

However, these checks are not guarded by a distributed lock (or even an in-process lock for a single mongos). As a result, two removeShard requests, whether sent to two different mongoses or to the same mongos, can pass all checks concurrently and remove two shards at once.
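
To make the interleaving concrete, here is a minimal Python sketch of the unguarded check-then-act pattern. It is not the server code: the `shards` dictionary, the function name, and the barrier are all hypothetical stand-ins, and the barrier exists only to force the unlucky schedule in which both requests finish their checks before either one writes.

```python
import threading

# Hypothetical in-memory stand-in for the shard catalog on the config server.
shards = {
    "shardA": {"draining": False},
    "shardB": {"draining": False},
    "shardC": {"draining": False},
}

# Test-only device: makes both requests complete their checks before either
# one marks its shard, which is the interleaving described in the ticket.
checks_done = threading.Barrier(2)


def remove_shard_unguarded(name):
    # Check phase: no lock is held, so another request can run these checks
    # at the same time and see the same state.
    if len(shards) == 1:
        raise RuntimeError("can't remove the last shard")
    if shards[name]["draining"]:
        raise RuntimeError("shard is already draining")
    if any(s["draining"] for s in shards.values()):
        raise RuntimeError("another shard is already draining")

    checks_done.wait()

    # Act phase: the results of the checks above may already be stale.
    shards[name]["draining"] = True


threads = [
    threading.Thread(target=remove_shard_unguarded, args=(name,))
    for name in ("shardA", "shardB")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

draining = sorted(name for name, s in shards.items() if s["draining"])
print(draining)  # ['shardA', 'shardB']
```

Both requests pass the "only one shard draining" check because neither has written yet, so two shards end up marked as draining.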

This can be fixed by the new locking mechanism being added for the zone sharding project.



 Comments   
Comment by Githook User [ 26/Mar/20 ]

Author: Alex Taskov (alextaskov, alex.taskov@mongodb.com)

Message: SERVER-25053 Address race conditions in removeShard

(cherry picked from commit 2c19c31f910e5b336b7f3b206a3d57d202100ae6)
Branch: v4.4
https://github.com/mongodb/mongo/commit/64c81e2b7e9ee58958bb4644e04a20f904be8d31

Comment by Githook User [ 16/Mar/20 ]

Author: Alex Taskov (alextaskov, alex.taskov@mongodb.com)

Message: SERVER-25053 Address race conditions in removeShard
Branch: master
https://github.com/mongodb/mongo/commit/2c19c31f910e5b336b7f3b206a3d57d202100ae6

Comment by Esha Maharishi (Inactive) [ 19/Dec/19 ]

removeShard does take the new _kShardMembershipLock, but it takes it after checking if this is the last draining shard.

So, two concurrent removeShards could still both check that they are not the last draining shard, then both mark their shards as draining.

The lock should probably be taken before doing any checks.
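
As a rough illustration of that suggestion, the Python sketch below takes a lock before any check runs and holds it until the shard is marked. It is an in-process analogy only, not the change made in the linked commits: `shard_membership_lock`, the `shards` dictionary, and the returned status strings are hypothetical names chosen for this example.

```python
import threading

# Hypothetical in-memory stand-in for the shard catalog on the config server.
shards = {
    "shardA": {"draining": False},
    "shardB": {"draining": False},
    "shardC": {"draining": False},
}

# Hypothetical in-process analogue of the shard membership lock.
shard_membership_lock = threading.Lock()

results = {}


def remove_shard_guarded(name):
    # The lock is acquired BEFORE the checks and released only after the
    # write, so no other request can observe the state in between.
    with shard_membership_lock:
        if len(shards) == 1:
            results[name] = "rejected: last shard"
        elif shards[name]["draining"]:
            results[name] = "rejected: already draining"
        elif any(s["draining"] for s in shards.values()):
            results[name] = "rejected: another shard is draining"
        else:
            shards[name]["draining"] = True
            results[name] = "started"


threads = [
    threading.Thread(target=remove_shard_guarded, args=(name,))
    for name in ("shardA", "shardB")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

draining = [name for name, s in shards.items() if s["draining"]]
print(len(draining), sorted(results.values()))
```

Which request wins depends on scheduling, but exactly one of them starts draining and the other is rejected by the check it now runs under the lock.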
