[SERVER-65766] ShardingStateRecovery makes remote calls to config server while holding the RSTL
Created: 18/Apr/22  Updated: 01/Jul/22  Resolved: 01/Jul/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Jason Chan
Assignee: Jordi Serra Torrens
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-60110 Get rid of ShardingStateRecovery once... Closed
Related
is related to SERVER-38409 Shard can crash at step-up due to Fai... Closed
is related to SERVER-56756 Primary cannot stepDown when experien... Closed
Operating System: ALL
Sprint: Sharding EMEA 2022-05-02, Sharding EMEA 2022-05-16, Sharding EMEA 2022-05-30, Sharding EMEA 2022-06-13, Sharding EMEA 2022-06-27, Sharding EMEA 2022-07-11
Participants:
Linked BF Score: 137

 Description   

Currently, during stepUp, we recover the sharding state while holding the RSTL, and that recovery includes making remote calls to the config server. This seems non-ideal and is a potential liveness issue: if something has gone wrong with the config server, stepUp blocks on the remote call and keeps holding the RSTL for as long as it blocks.



 Comments   
Comment by Jordi Serra Torrens [ 01/Jul/22 ]

As pointed out in the earlier comment, ShardingStateRecovery does this by design. However, SERVER-60110 will phase out ShardingStateRecovery.

Comment by Jason Chan [ 18/Apr/22 ]

max.hirschhorn@mongodb.com pointed me to SERVER-38409, which implies that, by design, stepUp never completes when the config server primary is unavailable. However, it seemed reasonable to bring this up again in case something has changed. In SERVER-56756 we added an fassert that fires if any thread fails to acquire the RSTL within 15 seconds, and we started seeing consistent Antithesis failures after that change. From Replication's point of view, it seems reasonable to say stepUp should never take longer than 15 seconds.
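
The interaction described above can be illustrated with a small, self-contained sketch. This is not MongoDB server code: `rstl` is an ordinary `threading.Lock` standing in for the RSTL, `slow_config_server_call` is a hypothetical placeholder for the remote call made during sharding state recovery, and the timeouts are scaled down from 15 seconds so the example runs quickly. It only shows why a bounded lock acquisition fails when the holder is blocked on a slow remote call.

```python
import threading
import time

# Hypothetical stand-in for the RSTL; a plain mutex is enough for the illustration.
rstl = threading.Lock()


def slow_config_server_call(delay_s):
    """Simulates a remote call to a slow or unavailable config server."""
    time.sleep(delay_s)


def step_up_holding_lock(delay_s, lock_taken):
    """Mirrors the reported behavior: the remote call is made under the lock."""
    with rstl:
        lock_taken.set()  # signal that the lock is now held
        slow_config_server_call(delay_s)


def try_acquire(timeout_s):
    """Returns True if the lock could be acquired within timeout_s seconds."""
    acquired = rstl.acquire(timeout=timeout_s)
    if acquired:
        rstl.release()
    return acquired


def run_scenario(remote_delay_s, acquire_timeout_s):
    """Starts a simulated stepUp, then tries a bounded acquisition from this thread."""
    lock_taken = threading.Event()
    worker = threading.Thread(
        target=step_up_holding_lock, args=(remote_delay_s, lock_taken)
    )
    worker.start()
    lock_taken.wait()  # make sure the simulated stepUp already holds the lock
    acquired = try_acquire(acquire_timeout_s)
    worker.join()
    return acquired
```

Calling `run_scenario(0.5, 0.1)` models a remote call that outlasts the acquisition timeout, so the bounded acquisition fails; in the real server that is the situation in which the 15-second fassert would fire. Calling `run_scenario(0.05, 1.0)` models a responsive config server, and the acquisition succeeds.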

Generated at Thu Feb 08 06:03:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.