LockTimeout error on a resharding recipient during "creating-collection" or "building-index" state causes the entire resharding operation to abort


    • Cluster Scalability
    • None
    • 3
    • TBD

      The steps in the "creating-collection" state and the "building-index" state, namely _createTemporaryReshardingCollectionThenTransitionToCloning and _buildIndexThenTransitionToApplying, do not have their own retry logic. Instead, they rely on the top-level retry logic in _runUntilStrictConsistencyOrErrored, which uses primary_only_service_helpers::kDefaultRetryabilityPredicate, and that predicate does not include the LockTimeout error. So when a LockTimeout error occurs, e.g. while creating the temporary collection or while creating the indexes, the error causes the entire resharding operation to fail.
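
      To illustrate the gap, here is a minimal standalone sketch of step-level retry with a predicate that also accepts LockTimeout. It is not the server code: ErrorCode, Status, runWithRetry, defaultRetryable, retryableWithLockTimeout and makeFlakyStep are all hypothetical stand-ins, and the assumption that the default predicate only accepts a network-style error is made up for the example.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins for illustration only; these are NOT MongoDB's real types.
enum class ErrorCode { OK, LockTimeout, NetworkTimeout, InternalError };

struct Status {
    ErrorCode code;
    bool isOK() const { return code == ErrorCode::OK; }
};

using RetryPredicate = std::function<bool(const Status&)>;

// Assumption: models a default predicate that does not treat LockTimeout as retryable.
bool defaultRetryable(const Status& s) {
    return s.code == ErrorCode::NetworkTimeout;
}

// Step-level predicate that additionally treats LockTimeout as retryable.
bool retryableWithLockTimeout(const Status& s) {
    return defaultRetryable(s) || s.code == ErrorCode::LockTimeout;
}

// Runs `step` up to `maxAttempts` times. Stops early on success or on an
// error that `canRetry` rejects, and returns the last status observed.
Status runWithRetry(const std::function<Status()>& step,
                    const RetryPredicate& canRetry,
                    int maxAttempts) {
    assert(maxAttempts >= 1);
    Status last{ErrorCode::InternalError};
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        last = step();
        if (last.isOK() || !canRetry(last)) {
            return last;
        }
    }
    return last;
}

// Builds a step that fails with `code` for the first `failures` calls and
// succeeds afterwards, e.g. a collection creation that hits lock contention.
std::function<Status()> makeFlakyStep(int failures, ErrorCode code) {
    int remaining = failures;
    return [remaining, code]() mutable -> Status {
        if (remaining > 0) {
            --remaining;
            return Status{code};
        }
        return Status{ErrorCode::OK};
    };
}
```

      With defaultRetryable, a step that hits LockTimeout twice returns the error on the first attempt, which corresponds to the abort described above. With retryableWithLockTimeout, the same step is retried in place and succeeds on the third attempt, so the operation would not need to start over.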

      Currently, LockTimeout is considered a retryable error by the ShardingDDLCoordinator (ReshardCollectionCoordinator on the primary shard) because it is an Interruption error. So after the resharding operation aborts, the ShardingDDLCoordinator retries the _configsvrReshardCollection command, which initiates a new resharding operation. However, it is still not user-friendly for resharding to need to start over just because of a LockTimeout error.

            Assignee:
            Kruti Shah
            Reporter:
            Cheahuychou Mao
            Votes:
            0
            Watchers:
            3

              Created:
              Updated: