[SERVER-47359] ShardRegistry reload can race with RSM updates to ShardRegistry Created: 06/Apr/20  Updated: 29/Oct/23  Resolved: 15/Apr/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.4.0-rc2, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Janna Golden Assignee: Janna Golden
Resolution: Fixed Votes: 0
Labels: KP44
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Sharding 2020-04-20
Participants:
Linked BF Score: 115

 Description   

During startup, the ShardRegistry stores all nodes in the initial config server string and then schedules its initial reload and the RSM begins to monitor the config server. During start up, if the RSM gets a topology change of type ReplicaSetNoPrimary, (meaning its only heard back from secondaries), it will update the ShardRegistry with this info and remove any nodes that are not yet confirmed. The following set of steps can lead to a ShardNotFound error:

1. ShardRegistry starts up with all config server nodes passed in the initial connection string.
2. RSM starts up and begins monitoring the config server.
3. RSM hears back from a secondary node (let's say node 1), the TopologyChangeEvent can look something like [node 0: type unknown, node 1: type secondary, node 2: type unknown]. The RSM notifies the ShardRegistry with this info. This will cause the ShardRegistry to remove node 0 and node 2 from the ShardRegistryData object `_data` owned by the ShardRegistry. (Specifically it removes hosts from the ShardRegistryData's `_hostLookup` and `_lookup` maps here)
4. ShardRegistry's initial refresh starts. It constructs a new ShardRegsitryData object which at this moment only has node 1 for the config server.
5. RSM gets a TopologyChangeEvent from the primary which can look something like [node 0: type primary, node 1: type secondary, node 2: type secondary]. The RSM now thinks each of these nodes is able to be targeted. It sends an update to the ShardRegistry with this info.
6. The ShardRegistry adds each of the nodes the RSM notified it about to the `_data` object.
7. The ShardRegistry reload swaps out it's `_data` object that was just updated in step 6 with the ShardRegistryData object in constructed in step 4.
8. The RSM thinks node 0 is able to be targeted, so returns it to some command to target. The ShardingNetworkConnectionHook then tries to target node 0 and grab it from the ShardRegistry, but it does not exist anymore so throws ShardNotFound.



 Comments   
Comment by Githook User [ 16/Apr/20 ]

Author:

{'name': 'jannaerin', 'email': 'golden.janna@gmail.com', 'username': 'jannaerin'}

Message: SERVER-47359 ShardRegistry reload can race with RSM updates to ShardRegistry

(cherry picked from commit 2ecb450c1a48775953160b50ae0ac6e9f0ae5723)
Branch: v4.4
https://github.com/mongodb/mongo/commit/3ef8b9d849ce65cf1deb9f37d9d3b1a2741864b9

Comment by Githook User [ 15/Apr/20 ]

Author:

{'name': 'jannaerin', 'email': 'golden.janna@gmail.com', 'username': 'jannaerin'}

Message: SERVER-47359 ShardRegistry reload can race with RSM updates to ShardRegistry
Branch: master
https://github.com/mongodb/mongo/commit/2ecb450c1a48775953160b50ae0ac6e9f0ae5723

Generated at Thu Feb 08 05:13:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.