[SERVER-23666] ShardRegistry lookup dictionaries should provide consistent erase Created: 12/Apr/16  Updated: 05/Apr/17  Resolved: 17/Jun/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Misha Tyulenev Assignee: Misha Tyulenev
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 16 (06/24/16)
Participants:

 Description   

Currently shardRegisty contain several dictionaries and it results in the inconsistent update and erase operations.
Need to provide a delete(ShardId) that will delete all related information. This will help to avoid calling getConnString when not needed avoiding locking conflicts.



 Comments   
Comment by Misha Tyulenev [ 17/Jun/16 ]

The issues with connectionStirng 1) and 2) are resolved. The maps reorg is not needed.

Comment by Misha Tyulenev [ 13/Apr/16 ]

The current design makes it easy to forget to clean up something from the map when shard need to be updated leaving shardRegistry in the inconsistent state. A better design can use RAII to make an update consistent - i.e. erasing a shard should be only allowed by Id and this should clean up host->shard and rsname->shard mapping.

Comment by Spencer Brody (Inactive) [ 13/Apr/16 ]

#1 and #3 make sense to me. I'm still not sure what you mean by #2 - cleaning up the maps is "inconsistent".

Comment by Misha Tyulenev [ 13/Apr/16 ]

spencer if you look at how _lookup, _hostLookup, and _rsLookup maps are maintained in https://github.com/mongodb/mongo/blob/master/src/mongo/s/client/shard_registry.cpp#L451 then it will be more clear what I mean.

   if (originalShard) {
        auto oldConnString = originalShard->getConnString();
 
        if (oldConnString.toString() != connString.toString()) {
            log() << "Updating ShardRegistry connection string for shard " << originalShard->getId()
                  << " from: " << oldConnString.toString() << " to: " << connString.toString();
        }
 
        for (const auto& host : oldConnString.getServers()) {
            _lookup.erase(host.toString());
            _hostLookup.erase(host);
        }
    }

i.e. it cleans up 2 maps that have dependencies on shardId. Later those maps are repopulated with the new data from the connection string.

There are three problems here.
1) connString of the shard can be stale because its stored in the shard object. When a shard is an RS the current connection string is known by RSMonitor
2) the clean up of all isolated maps is inconsistent.
3) the _configServerCS also should not be stored there for the same reason, as the shard connect string , instead it should be taken from the config shard.

solving those requires acquiring locks with more control e.g. using lock ordering techniques.

Current getConnString can not cause locking conflicts because its stored in the shard object - but the change which is currently in review changes it to call RSMonitor that requires to take a lock.

Comment by Spencer Brody (Inactive) [ 13/Apr/16 ]

Also, how does calling getConnString cause locking conflicts?

Comment by Spencer Brody (Inactive) [ 13/Apr/16 ]

misha.tyulenev, this is news to me - do you have a case of those maps getting out of sync?

Generated at Thu Feb 08 04:04:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.