Type: Task
Resolution: Gone away
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- sharding-nyc-subteam2
- sharding-nyc-subteam2-catalog-poc

Assigned Teams: Sharding NYC
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The problem is related to ~~SERVER-70850~~ and ~~SERVER-70487~~. The targeted scenario is a chunk migration that is interrupted by a simultaneous stepdown of the donor and the config server (that is, the catalog shard). In this case the recipient gets stuck in the critical section, waiting for the command that tells it to exit.

The donor is responsible for sending _configsvrEnsureChunkVersionIsGreaterThan to the config server and then _recvChunkReleaseCritSec to the recipient. Until the recipient is interrupted by this _recvChunkReleaseCritSec command, the system remains deadlocked: the config server's Balancer, while starting, waits for all move chunk participants to exit the critical section, and the recipient will not exit the critical section until something tells it to. This can leave the Balancer stuck in the init() method for minutes.
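The wait dependency described above can be sketched as a small toy model. This is not MongoDB code: the `Recipient` and `Donor` types, the `steppedDown` flag, and `balancerInitCanComplete()` are hypothetical names introduced only for illustration, and the two commands are represented as plain strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only; none of these types exist in the MongoDB code base.
struct Recipient {
    bool inCriticalSection = true;

    // Stands in for receiving _recvChunkReleaseCritSec.
    void onReleaseCritSec() { inCriticalSection = false; }
};

struct Donor {
    // Hypothetical flag: true while nothing on the donor drives recovery.
    bool steppedDown = false;

    // Sends the two commands in the order given in the description and
    // returns the names of the commands that were actually sent.
    std::vector<std::string> recover(Recipient& recipient) const {
        std::vector<std::string> sent;
        if (steppedDown) {
            return sent;  // nobody tells the recipient to leave the critical section
        }
        sent.push_back("_configsvrEnsureChunkVersionIsGreaterThan");
        sent.push_back("_recvChunkReleaseCritSec");
        recipient.onReleaseCritSec();
        return sent;
    }
};

// Stands in for the Balancer's startup wait: it can only finish once the
// participant has left the critical section.
bool balancerInitCanComplete(const Recipient& recipient) {
    return !recipient.inCriticalSection;
}
```

In this model, while `steppedDown` is true, `recover()` sends nothing and `balancerInitCanComplete()` stays false, which mirrors the stuck state. Once recovery runs, both commands are sent and the wait can finish.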

The attached fix is not super clean. It amends PeriodicShardedIndexConsistencyChecker::onStepUp() in the following way:

if (serverGlobalParams.clusterRole == ClusterRole::CatalogShard &&
    _shardedIndexConsistencyChecker.isValid()) {
    ...
    _shardedIndexConsistencyChecker.stop();
    _shardedIndexConsistencyChecker.detach();
    ...
}

The reason is that the index consistency checker scans all collections and generates a StaleConfigInfo error on the collection whose shard is stuck in the critical section. This error then causes _recvChunkReleaseCritSec to be sent to the recipient. I think there should be a cleaner way to force the check to generate the StaleConfigInfo. The one I implemented works but is not clean.
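As a rough illustration of why a full scan unblocks the recipient, the following toy model walks over all collections and raises a stale-config signal for each one whose shard is still in the critical section. It is only a sketch: `CollectionState`, `scanCollections()`, and `releaseCriticalSection()` are hypothetical names, and the real StaleConfigInfo handling is reduced to a callback.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy model only; these names are hypothetical and not part of MongoDB.
struct CollectionState {
    std::string ns;               // collection namespace, e.g. "db.coll"
    bool shardInCriticalSection;  // true if its shard is stuck in the critical section
};

// Scans every collection. For each one whose shard is in the critical section,
// invokes onStale (a stand-in for the StaleConfigInfo error path) and counts it.
int scanCollections(std::vector<CollectionState>& collections,
                    const std::function<void(CollectionState&)>& onStale) {
    int raised = 0;
    for (auto& coll : collections) {
        if (coll.shardInCriticalSection) {
            onStale(coll);
            ++raised;
        }
    }
    return raised;
}

// Stand-in for the recovery that ends with _recvChunkReleaseCritSec being sent.
void releaseCriticalSection(CollectionState& coll) {
    coll.shardInCriticalSection = false;
}
```

For example, scanning `{"db.a", false}` and `{"db.b", true}` with `releaseCriticalSection` as the callback raises one signal and clears the stuck state, and a second scan raises none.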

Assignee: [DO NOT USE] Backlog - Sharding NYC
Reporter: Andrew Shuvalov (Inactive)
Participants: [DO NOT USE] Backlog - Sharding NYC, Andrew Shuvalov, Jack Mulrow
Votes: 0
Watchers: 3

Created: Nov 02 2022 06:25:30 PM UTC
Updated: Oct 27 2023 08:45:05 PM UTC
Resolved: Feb 09 2023 10:19:24 PM UTC
