[SERVER-26761] old ReplicaSetMonitor can be used on config when adding new shard with same setName as recently removed shard Created: 25/Oct/16  Updated: 19/Nov/16  Resolved: 14/Nov/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.4.0-rc4

Type: Bug Priority: Major - P3
Reporter: Esha Maharishi (Inactive) Assignee: Esha Maharishi (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-26785 rewrite addshard2.js to be able to un... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2016-11-21
Participants:
Linked BF Score: 0

 Description   

The ReplicaSetMonitor for a removed shard is removed from the ReplicaSetMonitorManager synchronously (i.e., at the end of removeShard()) only on the mongos that runs the removeShard().

All other processes remove the ReplicaSetMonitor only the next time they run ShardRegistry::reload() (which, in the worst case, may not happen for up to 30 seconds) and notice that the shard no longer exists in config.shards.
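A minimal C++ sketch of this lazy cleanup, under stated assumptions: ToyMonitor, MonitorCache and reloadAndPrune are hypothetical names, and the map keyed by setName only loosely models ReplicaSetMonitorManager. It shows that nothing drops a stale monitor from a process's cache until a reload on that process sees that its set name is gone from config.shards.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <set>
#include <string>

// Hypothetical stand-in for a ReplicaSetMonitor: only the hosts it targets.
struct ToyMonitor {
    std::set<std::string> hosts;
};

// Hypothetical per-process cache keyed by setName, loosely modeled on
// ReplicaSetMonitorManager.
using MonitorCache = std::map<std::string, std::shared_ptr<ToyMonitor>>;

// Sketch of the cleanup done during a periodic registry reload: drop every cached
// monitor whose set name no longer appears among the shards read from config.shards.
// Returns how many monitors were dropped.
std::size_t reloadAndPrune(MonitorCache& cache, const std::set<std::string>& shardSetNames) {
    std::size_t removed = 0;
    for (auto it = cache.begin(); it != cache.end();) {
        if (shardSetNames.count(it->first) == 0) {
            it = cache.erase(it);  // set no longer a shard: forget its monitor
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

Until reloadAndPrune() runs on a given process, the entry for the removed set name stays in that process's cache and can be handed out again.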

If, after the removeShard(), a new shard with the same replica set name is added before the config server has done a ShardRegistry::reload(), the config server will use the old shard's ReplicaSetMonitor to target the new shard (including for the addShard checks).

This is because ReplicaSetMonitorManager::getOrCreateMonitor() indexes ReplicaSetMonitor instances by setName instead of some unique id:

https://github.com/mongodb/mongo/blob/r3.4.0-rc1/src/mongo/client/replica_set_monitor_manager.cpp#L95-L99
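The keying problem can be illustrated with a small, hypothetical model. SetNameKeyedManager and ToyMonitor are illustrative names, not the server's classes, and the host names are taken from the error message in this ticket. Because the cache is keyed by set name alone, a request carrying the new shard's seeds gets back the monitor built from the old shard's seeds.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>
#include <string>

// Hypothetical stand-in for a ReplicaSetMonitor; it remembers the seeds it was created with.
struct ToyMonitor {
    std::string setName;
    std::set<std::string> seeds;
};

// Sketch of the problematic behaviour: monitors are cached by set name only.
class SetNameKeyedManager {
public:
    std::shared_ptr<ToyMonitor> getOrCreate(const std::string& setName,
                                            const std::set<std::string>& seeds) {
        auto it = _monitors.find(setName);
        if (it != _monitors.end()) {
            return it->second;  // cache hit: the caller's seeds are ignored
        }
        auto monitor = std::make_shared<ToyMonitor>(ToyMonitor{setName, seeds});
        _monitors.emplace(setName, monitor);
        return monitor;
    }

private:
    std::map<std::string, std::shared_ptr<ToyMonitor>> _monitors;
};
```

Keying the cache by something that changes when a shard is removed and re-added (the "unique id" suggested above) would make the second call create a fresh monitor; which id to use is left open here.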

1) If the old shard is still up, the addShard() will (incorrectly) fail with the error:

"in seed list mySet/hostname:15516, host hostname:15516 does not belong to replica set mySet; found { hosts: [ \"hostname:15515\" ], setName: \"mySet\", setVersion: 1, ismaster: true, secondary: false, primary: \"hostname:15515\",  ..."
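A hedged sketch of the kind of seed-list check that produces this error. checkSeedsBelongToSet and ReportedConfig are hypothetical, and the message only mirrors the prefix of the real one. When the reported host list comes from the old shard's monitor, the new seed hostname:15516 is rejected even though it belongs to the new set.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>

// Hypothetical view of the host list a (possibly stale) monitor reports for the set.
struct ReportedConfig {
    std::set<std::string> hosts;
};

// Sketch of an addShard-style validation: every seed host must appear in the host
// list reported for the set. Returns an error message on failure, std::nullopt on success.
std::optional<std::string> checkSeedsBelongToSet(const std::string& setName,
                                                 const std::set<std::string>& seeds,
                                                 const ReportedConfig& reported) {
    // Render the seed list as "setName/host1,host2" for the error message.
    std::string seedList = setName + "/";
    bool first = true;
    for (const auto& seed : seeds) {
        if (!first) seedList += ",";
        seedList += seed;
        first = false;
    }
    for (const auto& seed : seeds) {
        if (reported.hosts.count(seed) == 0) {
            return "in seed list " + seedList + ", host " + seed +
                   " does not belong to replica set " + setName;
        }
    }
    return std::nullopt;
}
```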

2) If the old shard was shut down, then thanks to a fortunate pair of additional bugs (see SERVER-26759 and SERVER-26760), the old ReplicaSetMonitor will be removed after the first HostUnreachable response from the old shard, a new ReplicaSetMonitor will be created on the retry, and the addShard will (correctly) succeed.
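A toy simulation of this path, assuming (as described above) that an unreachable response drops the cached monitor and that addShard retries once. ToyCluster, attempt() and addShardWithOneRetry() are hypothetical; this reproduces only the outcome stated here, not the actual behavior behind SERVER-26759 or SERVER-26760.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>
#include <string>

// Hypothetical stand-in for a ReplicaSetMonitor: the hosts it will target.
struct ToyMonitor {
    std::set<std::string> hosts;
};

// Hypothetical model of one process: its monitor cache (keyed by set name) plus
// the hosts that are actually reachable.
struct ToyCluster {
    std::map<std::string, std::shared_ptr<ToyMonitor>> monitors;
    std::set<std::string> liveHosts;

    // One addShard attempt. Reuses the cached monitor if present, otherwise creates
    // one from the seeds. If no targeted host is reachable (a HostUnreachable-style
    // failure), the cached monitor is dropped and the attempt fails.
    bool attempt(const std::string& setName, const std::set<std::string>& seeds) {
        auto it = monitors.find(setName);
        if (it == monitors.end()) {
            it = monitors.emplace(setName, std::make_shared<ToyMonitor>(ToyMonitor{seeds})).first;
        }
        for (const auto& host : it->second->hosts) {
            if (liveHosts.count(host) != 0) {
                return true;
            }
        }
        monitors.erase(it);
        return false;
    }
};

// Fail once, retry once: the retry builds a fresh monitor from the new seeds.
bool addShardWithOneRetry(ToyCluster& cluster, const std::string& setName,
                          const std::set<std::string>& seeds) {
    return cluster.attempt(setName, seeds) || cluster.attempt(setName, seeds);
}
```

The toy treats reaching any host as success; it does not model the seed-list validation from case 1.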



 Comments   
Comment by Githook User [ 14/Nov/16 ]

Author: Esha Maharishi (username: EshaMaharishi, email: esha.maharishi@mongodb.com)

Message: SERVER-26761 check and return early if shard already exists in addShard
Branch: master
https://github.com/mongodb/mongo/commit/e82b201cf7d17f64f54b57f58dc9668527ab49b1
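One possible reading of this commit, as a sketch: before any remote validation, addShard consults config.shards and returns early if the shard is already registered. ShardDoc, AddShardPrecheck and precheckAddShard are hypothetical, and treating "already exists" as "same name and same connection string" is an assumption; the actual commit may compare different fields.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified shape of a config.shards entry.
struct ShardDoc {
    std::string name;
    std::string host;  // connection string, e.g. "mySet/hostname:15516"
};

enum class AddShardPrecheck { AlreadyExists, ProceedWithValidation };

// Sketch of an early-return check run before any remote validation.
// Assumption: "already exists" means same name and same connection string.
AddShardPrecheck precheckAddShard(const std::vector<ShardDoc>& existing,
                                  const std::string& name, const std::string& host) {
    for (const auto& doc : existing) {
        if (doc.name == name && doc.host == host) {
            return AddShardPrecheck::AlreadyExists;
        }
    }
    return AddShardPrecheck::ProceedWithValidation;
}
```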

Comment by Esha Maharishi (Inactive) [ 25/Oct/16 ]

Potential fix: call ReplicaSetMonitor::remove() for the removed shard from the OpObserver that handles deletes from config.shards.
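A sketch of the proposed fix under stated assumptions: onDeleteObserved is a hypothetical stand-in for an OpObserver delete hook (its signature is invented, not the server's), cache.erase() stands in for ReplicaSetMonitor::remove(), and the set name is parsed from a connection string of the form setName/host1,host2, as in mySet/hostname:15516. The node that observes the delete drops the monitor immediately rather than at its next reload.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a ReplicaSetMonitor.
struct ToyMonitor {};

// Hypothetical per-process monitor cache keyed by set name.
using MonitorCache = std::map<std::string, std::shared_ptr<ToyMonitor>>;

// Extract the set name from a connection string "setName/host1,host2".
// Returns an empty string for a standalone host (no '/').
std::string setNameFromConnString(const std::string& connString) {
    auto slash = connString.find('/');
    return slash == std::string::npos ? std::string() : connString.substr(0, slash);
}

// Hypothetical delete hook: when a config.shards document is deleted, drop the
// matching monitor right away instead of waiting for a registry reload.
void onDeleteObserved(MonitorCache& cache, const std::string& ns,
                      const std::string& deletedShardHost) {
    if (ns != "config.shards") return;  // only shard removals matter here
    auto setName = setNameFromConnString(deletedShardHost);
    if (!setName.empty()) cache.erase(setName);
}
```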

Generated at Thu Feb 08 04:13:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.