  1. Core Server
  2. SERVER-61976

[Resharding] Shards can error while refreshing their shard version following step-up, stalling the resharding operation

    • Fully Compatible
    • ALL
    • v5.2, v5.1, v5.0
    • Sharding 2021-12-27
    • 34
    • 2

      On step-up, shards clear the filtering metadata and schedule a shard version refresh for the source collection and the temporary resharding collection. The refresh triggered through onShardVersionMismatch(..., boost::none /* shardVersionReceived */) can fail with an error and leave the shard version refresh incomplete. This can leave a recipient shard waiting to learn that all donor shards are prepared to donate, or leave a donor shard waiting to learn that all recipient shards have finished cloning.

      Shards must retry onShardVersionMismatch() until it succeeds so that the DonorStateMachines and RecipientStateMachines can make forward progress. Wrapping the call in an AsyncTry is the likely implementation.
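
      The retry requirement above can be illustrated with a minimal, standalone sketch. It does not use MongoDB's AsyncTry or any server code; it is a synchronous stand-in that only shows the "retry until success, unless told to stop" shape. RefreshResult, RetryOutcome, retryUntilSuccess, and shouldStop are hypothetical names introduced for illustration, and the backoff values are arbitrary assumptions.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical stand-in for the outcome of one refresh attempt.
// This is NOT MongoDB's Status type.
struct RefreshResult {
    bool ok;
    std::string error;  // e.g. "InterruptedDueToReplStateChange"
};

// Hypothetical summary of the whole retry loop.
struct RetryOutcome {
    bool succeeded;
    int attempts;
};

// Calls `refresh` repeatedly until it reports success, or until `shouldStop`
// returns true (for example, because the node is no longer primary).
// Sleeps between failed attempts using capped exponential backoff.
inline RetryOutcome retryUntilSuccess(
    const std::function<RefreshResult()>& refresh,
    const std::function<bool()>& shouldStop,
    std::chrono::milliseconds initialBackoff = std::chrono::milliseconds(1),
    std::chrono::milliseconds maxBackoff = std::chrono::milliseconds(100)) {
    int attempts = 0;
    auto backoff = initialBackoff;
    while (!shouldStop()) {
        ++attempts;
        if (refresh().ok) {
            return {true, attempts};
        }
        // The attempt failed: wait, then double the delay up to the cap.
        std::this_thread::sleep_for(backoff);
        backoff = std::min<std::chrono::milliseconds>(backoff * 2, maxBackoff);
    }
    return {false, attempts};
}
```

      In this sketch, a caller would pass a lambda that performs the refresh as `refresh` and a lambda that checks for cancellation as `shouldStop`. An asynchronous utility would schedule the retries on an executor instead of blocking a thread, which this stand-in does not attempt to model.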

      [js_test:resharding_secondary_recovers_temp_ns_metadata] d21274| 2021-12-05T15:08:21.280+00:00 I  SHARDING 22720   [ShardServerCatalogCacheLoader::getChunksSince] "Command failed with a retryable error and will be retried","attr":{"command":{"_flushRoutingTableCacheUpdates":"reshardingDb.system.resharding.f187cc87-e5c0-48b8-8908-2a539ad6d709"},"error":"InterruptedDueToReplStateChange: operation was interrupted"}
      [js_test:resharding_secondary_recovers_temp_ns_metadata] d21274| 2021-12-05T15:08:21.280+00:00 I  SH_REFR  4619903 [CatalogCache-0] "Error refreshing cached collection","attr":{"namespace":"reshardingDb.system.resharding.f187cc87-e5c0-48b8-8908-2a539ad6d709","durationMillis":336,"error":"InterruptedDueToReplStateChange: operation was interrupted"}
      ...
      [js_test:resharding_secondary_recovers_temp_ns_metadata] d21274| 2021-12-05T15:08:21.287+00:00 W  RESHARD  5498101 [TriggerReshardingRecovery] "Error on deferred shardVersion recovery execution","attr":{"error":"InterruptedDueToReplStateChange: operation was interrupted"}
      ...
      [js_test:resharding_secondary_recovers_temp_ns_metadata] d21274| 2021-12-05T15:08:23.137+00:00 I  REPL     5123005 [ReshardingRecipientService-2] "Rebuilding PrimaryOnlyService due to stepUp","attr":{"service":"ReshardingRecipientService"}
      

            Assignee:
            brett.nawrocki@mongodb.com Brett Nawrocki
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn
            Votes:
            0
            Watchers:
            2
