Core Server / SERVER-71026

Find better solution to check for stale config during step up (catalog shard POC)

    • Sharding NYC

      The problem is related to SERVER-70850 and SERVER-70487. The targeted scenario is a chunk migration interrupted by a simultaneous step down of the donor and the config server (read: catalog shard). In this case the recipient stays stuck in the critical section, waiting to be interrupted out of it.

      The donor's responsibility is to send _configsvrEnsureChunkVersionIsGreaterThan to the config server and then _recvChunkReleaseCritSec to the recipient. Until the recipient is interrupted by this _recvChunkReleaseCritSec command, the two sides are deadlocked: the config server's Balancer starts and waits for all move chunk participants to exit the critical section, while the recipient will not exit the critical section until something tells it to. This can leave the Balancer stuck in the init() method for minutes.
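
The wait relationship described above can be sketched with a small toy model. This is only an illustration under stated assumptions: RecipientModel, ConfigServerModel, runDonorRecovery and balancerInitCanFinish are hypothetical names invented here and do not mirror the server's classes. Only the two command names come from the ticket.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy types for illustration; not the server's real classes.
struct RecipientModel {
    bool inCriticalSection = true;

    // The recipient leaves the critical section only when told to.
    void handle(const std::string& cmd) {
        if (cmd == "_recvChunkReleaseCritSec") {
            inCriticalSection = false;
        }
    }
};

struct ConfigServerModel {
    std::vector<std::string> received;

    void handle(const std::string& cmd) {
        received.push_back(cmd);
    }
};

// The donor's two steps, in the order the ticket describes them.
void runDonorRecovery(ConfigServerModel& config, RecipientModel& recipient) {
    config.handle("_configsvrEnsureChunkVersionIsGreaterThan");
    recipient.handle("_recvChunkReleaseCritSec");
}

// In this model, Balancer init can finish only once the recipient has
// left the critical section.
bool balancerInitCanFinish(const RecipientModel& recipient) {
    return !recipient.inCriticalSection;
}
```

In this model, if runDonorRecovery never runs (the donor stepped down before sending anything), balancerInitCanFinish stays false indefinitely, which corresponds to the stuck init() described above.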

      The attached fix is not super clean. It amends PeriodicShardedIndexConsistencyChecker::onStepUp() in the following way:

      if (serverGlobalParams.clusterRole == ClusterRole::CatalogShard &&
          _shardedIndexConsistencyChecker.isValid()) {

      The reason is that the index consistency checker scans all collections and generates a StaleConfigInfo error for any collection whose shard is stuck in the critical section. This error then causes _recvChunkReleaseCritSec to be sent to the recipient. I think there should be a cleaner way to force the check to generate the StaleConfigInfo. The one I did works but is not clean.
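
As a rough sketch of why a full collection scan unblocks the recipient, the toy model below scans every collection, reports a stale-config-like error for each one whose shard is stuck, and reacts to each error by releasing the critical section. Everything here is an assumption made for illustration: StaleConfigModel, CritSecState, scanCollections and releaseStuckShards are hypothetical and are not the server's StaleConfigInfo type or the checker's real code path.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a stale-config error; not the server's type.
struct StaleConfigModel {
    std::string nss;  // namespace of the affected collection
};

// Maps a collection namespace to whether its shard is still stuck in
// the critical section (toy state, assumed for this sketch).
using CritSecState = std::map<std::string, bool>;

// Scan every collection and report the ones whose shard is stuck.
std::vector<StaleConfigModel> scanCollections(const CritSecState& state) {
    std::vector<StaleConfigModel> errors;
    for (const auto& entry : state) {
        if (entry.second) {
            errors.push_back(StaleConfigModel{entry.first});
        }
    }
    return errors;
}

// React to each reported error by releasing the critical section, which
// models _recvChunkReleaseCritSec being sent to the recipient.
int releaseStuckShards(CritSecState& state) {
    int released = 0;
    for (const auto& err : scanCollections(state)) {
        state[err.nss] = false;
        ++released;
    }
    return released;
}
```

The sketch also shows why the approach feels indirect: the release is a side effect of a scan whose stated purpose is something else, so a dedicated check that targets only the stuck collections would express the intent more directly.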

            backlog-server-sharding-nyc [DO NOT USE] Backlog - Sharding NYC
            andrew.shuvalov@mongodb.com Andrew Shuvalov (Inactive)