[SERVER-39205] Force the cache refresh after the shard is removed. Created: 25/Jan/19  Updated: 29/Oct/23  Resolved: 19/Feb/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.1.9

Type: Bug Priority: Major - P3
Reporter: Martin Neupauer Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-39665 out_with_drop_shard.js should not nee... Closed
is related to SERVER-32871 ReplicaSetMonitorRemoved and ShardNot... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Linked BF Score: 12

 Comments   
Comment by Githook User [ 19/Feb/19 ]

Author: Martin Neupauer (MartinNeupauer) <martin.neupauer@mongodb.com>

Message: SERVER-39205 Force the cache refresh after the shard is removed.
Branch: master
https://github.com/mongodb/mongo/commit/854d9b63d4a5d8c164636d8543ab31eacdcdb3ee

Comment by Esha Maharishi (Inactive) [ 13/Feb/19 ]

The theory charlie.swanson and I came up with last Friday for why the test only fails sometimes without the flushRouterConfig is that the $out, which is run in a parallel shell, races with the currentOp.

If the $out attempts to establish the cursor on the removed shard before the currentOp forces a refresh of the ShardRegistry, the $out succeeds in targeting the removed shard and gets back a StaleShardVersion from it. The StaleShardVersion causes the router's CatalogCache to be refreshed and the $out to be retried, and the $out retry with the fresh cache succeeds (and the test passes).

If the currentOp forces the ShardRegistry refresh first, then the $out gets ShardNotFound on its first attempt to establish the cursor, and the $out fails (and the test fails).
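
To make the two interleavings concrete, here is a toy model in Python. It is only a sketch of the theory above, not server code: the Router class, the shard names, and the rule that only StaleShardVersion triggers a retry are all assumptions made for illustration.

```python
class ShardNotFound(Exception):
    """Hypothetical stand-in for the router's ShardNotFound error."""


class StaleShardVersion(Exception):
    """Hypothetical stand-in for the StaleShardVersion a shard returns."""


# State of the cluster after removeShard: only shard0 still owns data.
CLUSTER = {"shard0"}


class Router:
    def __init__(self):
        # Both router-side caches are stale and still list the removed shard1.
        self.shard_registry = {"shard0", "shard1"}
        self.catalog_cache = {"shard0", "shard1"}

    def current_op(self):
        # Per the theory, currentOp forces a ShardRegistry refresh.
        self.shard_registry = set(CLUSTER)

    def establish_cursors(self):
        for shard in sorted(self.catalog_cache):
            if shard not in self.shard_registry:
                raise ShardNotFound(shard)      # router cannot resolve the shard
            if shard not in CLUSTER:
                raise StaleShardVersion(shard)  # removed shard rejects the request

    def run_out(self):
        try:
            self.establish_cursors()
        except StaleShardVersion:
            # Assumed behavior: refresh the CatalogCache and retry once.
            self.catalog_cache = set(CLUSTER)
            self.establish_cursors()
        return "ok"


def simulate(out_first):
    """Run $out and currentOp in the given order; report the $out outcome."""
    router = Router()
    if not out_first:
        router.current_op()
    try:
        outcome = router.run_out()
    except ShardNotFound:
        outcome = "ShardNotFound"
    if out_first:
        router.current_op()
    return outcome


print(simulate(out_first=True))   # ok
print(simulate(out_first=False))  # ShardNotFound
```

In this model the only difference between the passing and failing runs is which of the two operations touches the router first, which matches the intermittent failure described above.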

We should confirm this theory by putting a several-second sleep just before the establishCursors line and verifying that the test always fails without the flushRouterConfig but always passes with the flushRouterConfig.
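
The proposed experiment can be sketched the same way in Python. Everything here is hypothetical: a threading.Event stands in for the several-second sleep so that the currentOp always wins the race, and flushRouterConfig is modeled as an immediate refresh of the router's routing table, which is an assumption rather than a description of the real command's internals.

```python
import threading


class ShardNotFound(Exception):
    """Hypothetical stand-in for the router's ShardNotFound error."""


class StaleShardVersion(Exception):
    """Hypothetical stand-in for the StaleShardVersion a shard returns."""


CLUSTER = {"shard0"}  # shard1 has already been removed


class Router:
    def __init__(self):
        # Both caches start out stale and still list shard1.
        self.shard_registry = {"shard0", "shard1"}
        self.catalog_cache = {"shard0", "shard1"}

    def flush_router_config(self):
        # Assumption: modeled as refreshing the routing table right away.
        self.catalog_cache = set(CLUSTER)

    def current_op(self):
        self.shard_registry = set(CLUSTER)

    def establish_cursors(self):
        for shard in sorted(self.catalog_cache):
            if shard not in self.shard_registry:
                raise ShardNotFound(shard)
            if shard not in CLUSTER:
                raise StaleShardVersion(shard)

    def run_out(self):
        try:
            self.establish_cursors()
        except StaleShardVersion:
            self.catalog_cache = set(CLUSTER)
            self.establish_cursors()


def run_experiment(with_flush):
    router = Router()
    registry_refreshed = threading.Event()
    result = {}

    def parallel_shell():
        # Stand-in for the sleep before establishCursors: block until the
        # currentOp has refreshed the ShardRegistry.
        registry_refreshed.wait()
        try:
            router.run_out()
            result["out"] = "ok"
        except ShardNotFound:
            result["out"] = "ShardNotFound"

    if with_flush:
        router.flush_router_config()
    worker = threading.Thread(target=parallel_shell)
    worker.start()
    router.current_op()
    registry_refreshed.set()
    worker.join()
    return result["out"]


print(run_experiment(with_flush=False))  # ShardNotFound
print(run_experiment(with_flush=True))   # ok
```

With the ordering pinned, the model fails every time without the flush and passes every time with it, which is the outcome the real experiment would need to show for the theory to hold.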

We can then add the flushRouterConfig to the test, preferably with a comment explaining the race induced by the test.

CC martin.neupauer

Generated at Thu Feb 08 04:51:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.