Priority: Major - P3
Affects Version/s: None
Fix Version/s: None
Last comment by Customer: true
We decided to improve the documentation to help customers avoid this issue.
The recommended improvement is for the Restore a Sharded Cluster page. There are two restore procedures described on the page, and here are the recommended updates to each:
For the "Restore a Sharded Cluster with Filesystem Snapshots" procedure (the first procedure on the page), first modify step 6.1 to specify "without the --shardsvr option", like so:
Restart the replica set members for the shard with the recoverShardingState parameter set to false and without the --shardsvr option.
Then add a step between steps 6.2 and 6.3 that says:
If the config server's host:port has changed, update the document on each shard in admin.system.version where _id equals shardIdentity to have the new config server replica set connection string in the configsvrConnectionString field:
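The update statement for this step is not shown above. A minimal sketch of what it might look like in the mongo shell follows. The helper name `setConfigsvrConnectionString` and the connection string in the example call are illustrative assumptions, not part of the proposed documentation text.

```javascript
// Hypothetical helper: point a shard's shardIdentity document at a new
// config server replica set. `adminDb` is a handle to the shard's admin
// database, e.g. db.getSiblingDB("admin") in the mongo shell.
function setConfigsvrConnectionString(adminDb, newConnectionString) {
  return adminDb.getCollection("system.version").updateOne(
    { _id: "shardIdentity" },
    { $set: { configsvrConnectionString: newConnectionString } }
  );
}

// Example call on a shard primary (replica set name and hosts are placeholders):
// setConfigsvrConnectionString(
//   db.getSiblingDB("admin"),
//   "backupConfigRS/cfg1.example.net:27019,cfg2.example.net:27019"
// );
```

Passing the database handle in, rather than relying on the shell's global `db`, keeps the sketch usable regardless of which database the shell session currently has selected.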
For the "Restore a Sharded Cluster with Database Dumps" procedure (the second procedure on the page), add a step between steps 9 and 10 that says:
If the config servers' hostnames have changed, update the shardIdentity docs in each shard.
1. Restart all the shard mongod instances without the --shardsvr option.
2. Connect a mongo shell to the primary of each shard's replica set and update the document in admin.system.version where _id equals shardIdentity to have the new config server replica set connection string in the configsvrConnectionString field:
3. Shut down the shard mongod instances.
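Because step 2 has to be repeated on every shard, it could be scripted. The sketch below is one possible way to do that, under the assumption that the shard primaries are already known and were restarted without --shardsvr. The function name `updateAllShardIdentities`, its parameters, and the host names in the example call are hypothetical.

```javascript
// Hypothetical helper: apply the shardIdentity update on every shard primary.
// `connect` maps a "host:port" string to a connection object; in the mongo
// shell this could be function (hostPort) { return new Mongo(hostPort); }.
function updateAllShardIdentities(connect, shardPrimaries, newConnectionString) {
  var results = {};
  shardPrimaries.forEach(function (hostPort) {
    var adminDb = connect(hostPort).getDB("admin");
    // Same update as in step 2, run once per shard primary.
    results[hostPort] = adminDb.getCollection("system.version").updateOne(
      { _id: "shardIdentity" },
      { $set: { configsvrConnectionString: newConnectionString } }
    );
  });
  return results; // update result per primary, keyed by "host:port"
}

// Example call from the mongo shell (all host names are placeholders):
// updateAllShardIdentities(
//   function (hostPort) { return new Mongo(hostPort); },
//   ["shard0-a.example.net:27018", "shard1-a.example.net:27018"],
//   "backupConfigRS/cfg1.example.net:27019,cfg2.example.net:27019"
// );
```

Returning the per-shard results makes it easier to spot a shard where no shardIdentity document was matched.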
While debugging a customer issue with Christopher Harris, we found that it is fairly easy to create a backup cluster from a snapshot of a production cluster without first deleting the shardIdentity doc from the backup shards.
The backup cluster's shards will have the production cluster's config server connection string in their shardIdentity docs, so they will connect to the production cluster's config servers on startup without complaint. (We want them to connect to the backup cluster's config servers instead.)
It would help catch this issue if shards could verify, as part of sharding initialization, that they are listed in config.shards.
I can imagine this being tricky, though, because it might involve comparing connection strings, which we try to avoid. We can't compare shardIds, because the backup shard and production shard will have the same shardId in their shardIdentity. They will only differ in their host/port.
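To make the difficulty concrete, here is a rough sketch of such a check written as plain JavaScript rather than server code. It is not a proposed implementation. It assumes the shard's identifier is stored in a `shardName` field of the shardIdentity document and that config.shards entries have the form `{ _id, host }` with `host` formatted as `<replSetName>/<host:port>,...`. The function name and all host names are hypothetical.

```javascript
// Hypothetical check: is this node listed in config.shards under a host it
// actually runs as?
//   shardIdentity: the document from admin.system.version (assumed to carry shardName)
//   configShards:  documents read from config.shards on the config servers
//   localHosts:    "host:port" strings for this replica set's members
function isListedInConfigShards(shardIdentity, configShards, localHosts) {
  var entry = null;
  configShards.forEach(function (doc) {
    if (doc._id === shardIdentity.shardName) entry = doc;
  });
  // Matching on the shard id alone is not enough: a backup shard carries the
  // same id as the production shard it was copied from.
  if (entry === null) return false;
  // Strip the "<replSetName>/" prefix, if any, and compare host:port strings.
  var hosts = entry.host.slice(entry.host.indexOf("/") + 1).split(",");
  return hosts.some(function (h) { return localHosts.indexOf(h) !== -1; });
}

// Example (placeholder hosts): a backup shard checking production's config.shards.
// isListedInConfigShards(
//   { _id: "shardIdentity", shardName: "shard0000" },
//   [{ _id: "shard0000", host: "rs0/prod-a.example.net:27018" }],
//   ["backup-a.example.net:27018"]
// );
```

The comparison here is a naive string match, so it would treat two names for the same machine (an alias, or an IP address versus a hostname) as different hosts. That is the kind of connection string comparison we try to avoid.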