[SERVER-46131] Fix race in retry loops in MigrationSourceManager Created: 13/Feb/20  Updated: 29/Oct/23  Resolved: 14/Feb/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.3.4

Type: Task Priority: Major - P3
Reporter: Esha Maharishi (Inactive) Assignee: Esha Maharishi (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-44771 Allow operations in transactions to s... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2020-02-24
Participants:

 Description   

First, at least the refreshFilteringMetadataUntilSuccess loop is racy when it is used to test a stepdown while the thread hangs in the failpoint inside the loop, because the failpoint causes the loop to enter an interruptible sleep. The sleep is interruptible because an OperationContext is passed to it. Since the same OperationContext was used to take strong locks as part of forceShardFilteringMetadataRefresh (via its AutoGetDb/AutoGetCollection acquisitions), the stepdown interrupts the OperationContext, and the thread immediately enters the catch block, even before the failpoint is turned off. The race is that the stepdown may not have updated the memberState and term yet, so the assertion in the catch block passes and the loop starts a second iteration rather than failing on the first iteration.
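
The interleaving described above can be illustrated with a small deterministic model. This is only a sketch and not MongoDB code: `ReplState`, `Interrupted`, and `catch_block_decision` are hypothetical names, the check is a simplified stand-in for the assertion in the catch block, and modeling the state update as a term bump is an assumption.

```python
from dataclasses import dataclass


class Interrupted(Exception):
    """Hypothetical stand-in for an interrupted OperationContext."""


@dataclass
class ReplState:
    """Hypothetical, simplified view of the node's replication state."""
    is_primary: bool = True
    term: int = 1


def catch_block_decision(state_updated_before_wakeup: bool) -> str:
    """Model one loop iteration that a stepdown interrupts.

    Returns "fail" if the catch block observes the stepdown, or "retry" if
    it still observes the original member state and term.
    """
    state = ReplState()
    initial_term = state.term
    try:
        if state_updated_before_wakeup:
            # The stepdown updated the state before the sleeping thread woke.
            state.is_primary = False
            state.term += 1  # assumption: the update is modeled as a term bump
        # The interruptible sleep in the failpoint is woken by the interrupt.
        raise Interrupted()
    except Interrupted:
        if state.is_primary and state.term == initial_term:
            return "retry"  # racy outcome: a second iteration starts
        return "fail"  # intended outcome: fail on the first iteration
```

Calling `catch_block_decision(False)` models the racy ordering, in which the thread reaches the catch block before the state update; `catch_block_decision(True)` models the ordering the test expects.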

If the above race happens and the loop starts a second iteration, the migration_coordinator_failover.js test still passes "accidentally" if the same node is elected primary, because of a bug in the ShardServerCatalogCacheLoader (SERVER-45646). This bug causes the forceShardFilteringMetadataRefresh in the second iteration of the loop to throw NetworkInterfaceExceededTimeLimit, so the catch block is entered again and checks the assertion again, this time after the member state has been updated.

We can avoid the first race by making the failpoint use an uninterruptible sleep when it is used to pause the thread in order to induce a stepdown.
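
Why the kind of sleep matters can be sketched with a scripted timeline of events. This is a toy model, not the actual failpoint implementation: the event names, `wake_tick`, and `state_updated_before_wake` are hypothetical, and the timeline assumes the test turns the failpoint off only after the stepdown has finished updating the member state.

```python
# Assumed order of events while the thread is paused in the failpoint.
TIMELINE = ["interrupt", "state_updated", "failpoint_off"]


def wake_tick(timeline, interruptible):
    """Return the index of the event at which the paused thread wakes up."""
    for tick, event in enumerate(timeline):
        if event == "failpoint_off":
            return tick
        # Only an interruptible sleep reacts to the stepdown's interrupt.
        if interruptible and event == "interrupt":
            return tick
    raise ValueError("thread never leaves the failpoint in this timeline")


def state_updated_before_wake(timeline, interruptible):
    """True if the member state was updated before the thread woke up."""
    return timeline.index("state_updated") < wake_tick(timeline, interruptible)
```

In this model the interruptible sleep wakes on the first event, before the state update, while the uninterruptible sleep wakes only when the failpoint is turned off, by which point the catch block's check would see the updated state.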



 Comments   
Comment by Githook User [ 13/Feb/20 ]

Author: Esha Maharishi (EshaMaharishi) <esha.maharishi@mongodb.com>

Message: SERVER-46131 Fix race in retry loops in MigrationSourceManager
Branch: master
https://github.com/mongodb/mongo/commit/fa4944cf85dbf40319c61e9c286c4f37c07fd084
