[SERVER-47169] Sharding initialization contacts config shard before ShardRegistry updated by RSM, preventing mongos from starting up Created: 28/Mar/20 Updated: 29/Oct/23 Resolved: 01/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.0-rc0, 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | Haley Connelly |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4 |
| Steps To Reproduce: | Apply the following patch to have mongos delay updating the ShardRegistry after the listener is notified about a confirmed replica set. |
| Sprint: | Sharding 2020-04-06 |
| Participants: | |
| Linked BF Score: | 27 |
| Description |
|
The ShardingNetworkConnectionHook causes a ShardNotFound error status to be returned if the HostAndPort isn't found in the ShardRegistry. This hook is run after a connection to the remote host has been established.
The connection string for the config shard may be updated while the sharding subsystem is initializing. (For reasons I still don't quite understand, this doesn't happen every time mongos is started, but I believe it is a necessary condition for the issue reported here to manifest.) Updating the connection string upon receiving isMaster responses from secondaries of the config shard (while the RSM still sees the primary as "Unknown") would remove the primary's HostAndPort from ShardRegistry::_hostLookup. The primary's HostAndPort is re-added to ShardRegistry::_hostLookup as part of ShardingReplicaSetChangeListener::onConfirmedSet(), which schedules a task on the fixed executor to do so. Because the ShardRegistry::_hostLookup map isn't updated synchronously, the RSM can view the now-confirmed primary as available for targeting primary-only reads while the validation hook that runs after the connection is established still fails. This leaves mongos unable to start up successfully.
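
The ordering described above can be illustrated with a small, deterministic model. This is only a sketch, not the server code: the set standing in for ShardRegistry::_hostLookup, the deque standing in for the fixed executor, the host names, and the function names are all hypothetical.

```python
from collections import deque


class ShardNotFound(Exception):
    """Stand-in for the ShardNotFound error status."""


def validate_host(host_lookup, host):
    """Models the post-connection hook: fail if the host is not in the lookup."""
    if host not in host_lookup:
        raise ShardNotFound(host)


def startup_with_async_update():
    # Toy stand-ins (assumptions): a set for ShardRegistry::_hostLookup and a
    # deque for the fixed executor, whose tasks run only when drained.
    host_lookup = {"cfg1:27019", "cfg2:27019", "cfg3:27019"}
    fixed_executor = deque()

    # 1. Secondaries answer first; the primary (cfg1) is still "Unknown",
    #    so the updated connection string drops it from the lookup.
    host_lookup.intersection_update({"cfg2:27019", "cfg3:27019"})

    # 2. The set is confirmed, but re-adding cfg1 is only scheduled.
    fixed_executor.append(lambda: host_lookup.add("cfg1:27019"))

    # 3. The RSM already targets cfg1; the hook runs before the task does.
    try:
        validate_host(host_lookup, "cfg1:27019")
        outcome = "ok"
    except ShardNotFound:
        outcome = "ShardNotFound"

    # 4. The scheduled task finally runs, too late for step 3.
    while fixed_executor:
        fixed_executor.popleft()()
    return outcome, "cfg1:27019" in host_lookup
```

In this model the hook fails even though the lookup is correct once the queued task has run, which matches the window between confirmation and the executor task described above.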
|
| Comments |
| Comment by Githook User [ 02/Apr/20 ] |
|
Author: {'name': 'Haley Connelly', 'email': 'haley.connelly@mongodb.com', 'username': 'haleyConnelly'}
Message: (cherry picked from commit 08351c1b12f3ca5c9ab99b6628e27d2083278011) |
| Comment by Githook User [ 01/Apr/20 ] |
|
Author: {'name': 'Haley Connelly', 'email': 'haley.connelly@mongodb.com', 'username': 'haleyConnelly'}
Message: |
| Comment by Max Hirschhorn [ 30/Mar/20 ] |
|
Lamont, Janna, Haley, and I discussed this issue today. The current plan is to try to change ShardingReplicaSetChangeListener::onConfirmedSet() in both mongod and mongos so that ShardRegistry::updateReplSetHosts() is called synchronously, i.e. outside the task being scheduled on the fixed executor. (We'll still want to schedule a separate task for updating the contents of the config database.) Doing so would ensure that whenever getHostOrRefresh() resolves to a HostAndPort, the ShardRegistry won't return an error from ShardingNetworkConnectionHook::validateHostImpl() after connecting to that host. |
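
A minimal sketch of the planned split, under loud assumptions: ToyListener, its attributes, and the host names are hypothetical and only model the idea of updating the host lookup synchronously while leaving the config database write on the executor. It is not the actual C++ change.

```python
from collections import deque


class ToyListener:
    """Hypothetical model of the change listener; not the real C++ class."""

    def __init__(self):
        self.host_lookup = set()       # stands in for ShardRegistry::_hostLookup
        self.fixed_executor = deque()  # queued tasks run only when drained
        self.config_db_hosts = None    # stands in for the config database contents

    def on_confirmed_set(self, hosts):
        # Synchronous part: the lookup is current before this method returns.
        self.host_lookup = set(hosts)
        # Asynchronous part: persisting the change stays a separate task.
        self.fixed_executor.append(lambda: self._update_config_db(hosts))

    def _update_config_db(self, hosts):
        self.config_db_hosts = list(hosts)

    def validate_host(self, host):
        # Models the post-connection check against the lookup.
        return host in self.host_lookup
```

With this ordering, a host that the confirmed set makes targetable already passes validation, regardless of when the queued config database task runs.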
| Comment by Max Hirschhorn [ 28/Mar/20 ] |
|
Based on some of the additional context I added to these error statuses, it appears pre-caching the routing table in mongos on startup ( I've tentatively marked this as an RC0 blocker because I feel it is something the automation team is likely to run into. |