[SERVER-71026] Find better solution to check for stale config during step up (catalog shard POC) Created: 02/Nov/22  Updated: 27/Oct/23  Resolved: 09/Feb/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: [DO NOT USE] Backlog - Sharding NYC
Resolution: Gone away Votes: 0
Labels: sharding-nyc-subteam2, sharding-nyc-subteam2-catalog-poc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Sharding NYC
Participants:

 Description   

The problem is related to SERVER-70850 and SERVER-70487. The targeted scenario is a chunk migration that is interrupted by a simultaneous step down of the donor and the config server (read: catalog shard). In this case the recipient is stuck in the critical section, waiting to be told to release it.

The donor's responsibility is to send _configsvrEnsureChunkVersionIsGreaterThan to the config server and then _recvChunkReleaseCritSec to the recipient. Until the recipient receives _recvChunkReleaseCritSec, it remains deadlocked with the config server's Balancer: the Balancer, on starting, waits for all move chunk participants to exit the critical section, while the recipient will not exit the critical section until something tells it to. This may leave the Balancer stuck in the init() method for minutes.
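The wait dependency described above can be illustrated with a toy model. This is only a sketch: ToyRecipient, ToyConfigServer, balancerInitCanFinish and donorRecover are hypothetical names invented for illustration and are not server classes or functions. Only the two command names are taken from the description.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical and are not MongoDB classes.
struct ToyRecipient {
    bool inCriticalSection = true;  // still holding the migration critical section
    std::vector<std::string> received;

    void handle(const std::string& cmd) {
        received.push_back(cmd);
        if (cmd == "_recvChunkReleaseCritSec") {
            inCriticalSection = false;  // in this model, only this command releases it
        }
    }
};

struct ToyConfigServer {
    std::vector<std::string> received;
    void handle(const std::string& cmd) { received.push_back(cmd); }
};

// The Balancer's init() is modeled as a predicate: it can finish only once
// the migration participant has left the critical section.
bool balancerInitCanFinish(const ToyRecipient& recipient) {
    return !recipient.inCriticalSection;
}

// Donor-side recovery, in the order given in the description.
void donorRecover(ToyConfigServer& config, ToyRecipient& recipient) {
    config.handle("_configsvrEnsureChunkVersionIsGreaterThan");
    recipient.handle("_recvChunkReleaseCritSec");
}

// Usage: before recovery the Balancer is blocked; afterwards it is not.
inline void demo() {
    ToyConfigServer config;
    ToyRecipient recipient;
    assert(!balancerInitCanFinish(recipient));  // nobody has told the recipient yet
    donorRecover(config, recipient);
    assert(balancerInitCanFinish(recipient));
}
```

If the donor steps down before donorRecover runs and nothing else triggers it, the predicate stays false, which corresponds to the Balancer waiting in init().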

The attached fix is not very clean. It amends PeriodicShardedIndexConsistencyChecker::onStepUp() in the following way:

if (serverGlobalParams.clusterRole == ClusterRole::CatalogShard &&
    _shardedIndexConsistencyChecker.isValid()) {
    ...
    _shardedIndexConsistencyChecker.stop();
    _shardedIndexConsistencyChecker.detach();
    ...
}

The reason is that the index consistency checker scans all collections and generates a StaleConfigInfo for any collection whose shard is stuck in the critical section. This error then causes _recvChunkReleaseCritSec to be sent to the recipient. I think there should be a cleaner way to force a check that generates the StaleConfigInfo. The approach I took works but is not clean.
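The side effect the fix relies on can be sketched roughly as follows. This is a toy model under stated assumptions, not the server's code: ToyStaleConfig, ToyCatalog, touchCollection and scanAndRecover are hypothetical names, and clearing the flag in the catch block merely stands in for the recovery path that ends with _recvChunkReleaseCritSec being sent.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a stale-config error carrying the namespace.
struct ToyStaleConfig : std::runtime_error {
    explicit ToyStaleConfig(const std::string& ns)
        : std::runtime_error("stale config: " + ns), nss(ns) {}
    std::string nss;
};

// Hypothetical catalog: namespace -> "owning shard is still in the critical section".
using ToyCatalog = std::map<std::string, bool>;

// Accessing a collection whose shard is stuck raises the stale-config error.
void touchCollection(const ToyCatalog& catalog, const std::string& nss) {
    if (catalog.at(nss)) {
        throw ToyStaleConfig(nss);
    }
}

// Scan every collection; each stale-config error triggers the (modeled) release.
// Returns how many collections were unblocked by the scan.
int scanAndRecover(ToyCatalog& catalog) {
    int released = 0;
    for (auto& entry : catalog) {
        try {
            touchCollection(catalog, entry.first);
        } catch (const ToyStaleConfig&) {
            entry.second = false;  // stands in for sending _recvChunkReleaseCritSec
            ++released;
        }
    }
    return released;
}
```

The point of the model is that the release happens only as a by-product of a full scan, which is why a more direct way to provoke the same error for the affected collection would be cleaner.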



 Comments   
Comment by Jack Mulrow [ 09/Feb/23 ]

Issue in the POC implementation that has not been seen since, so closing as gone away.

Generated at Thu Feb 08 06:17:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.