[SERVER-61976] [Resharding] Shards can error while refreshing their shard version following step-up, stalling the resharding operation Created: 09/Dec/21  Updated: 29/Oct/23  Resolved: 21/Dec/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.0, 5.1.0, 5.2.0-rc0
Fix Version/s: 5.3.0, 5.1.2, 5.0.6, 5.2.0-rc5

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Brett Nawrocki
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v5.2, v5.1, v5.0
Sprint: Sharding 2021-12-27
Participants:
Linked BF Score: 34
Story Points: 2

 Description   

On step-up, shards clear the filtering metadata and schedule a shard version refresh for both the source collection and the temporary resharding collection. The refresh triggered through onShardVersionMismatch(..., boost::none /* shardVersionReceived */) can fail with an error before it completes. This can leave a recipient shard waiting to learn that all donor shards are prepared to donate, or a donor shard waiting to learn that all recipient shards have finished cloning.

Shards must reattempt calling onShardVersionMismatch() until it succeeds to ensure forward progress for the DonorStateMachines and RecipientStateMachines. Wrapping the call in an AsyncTry is the likely implementation solution.
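A minimal sketch of what such an AsyncTry wrapper might look like (illustrative only, not the actual patch; the helper name, the ThreadClient/opCtx plumbing, and the executor and token arguments are assumptions, while onShardVersionMismatch is the helper named above):

    #include "mongo/util/future_util.h"  // AsyncTry

    // Hypothetical helper: keep re-running the shard version refresh for `nss`
    // until it succeeds, so a transient error such as
    // InterruptedDueToReplStateChange cannot stall the DonorStateMachines or
    // RecipientStateMachines.
    ExecutorFuture<void> refreshShardVersionWithRetry(
        ServiceContext* serviceContext,
        const NamespaceString& nss,
        std::shared_ptr<executor::TaskExecutor> executor,
        const CancellationToken& cancelToken) {
        return AsyncTry([serviceContext, nss] {
                   ThreadClient tc("RecoverRefreshThread", serviceContext);
                   auto opCtx = tc->makeOperationContext();
                   onShardVersionMismatch(
                       opCtx.get(), nss, boost::none /* shardVersionReceived */);
               })
            .until([](const Status& status) {
                // Retry on any error; stop only once the refresh has completed.
                return status.isOK();
            })
            .on(executor, cancelToken);
    }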

[js_test:resharding_secondary_recovers_temp_ns_metadata] d21274| 2021-12-05T15:08:21.280+00:00 I  SHARDING 22720   [ShardServerCatalogCacheLoader::getChunksSince] "Command failed with a retryable error and will be retried","attr":{"command":{"_flushRoutingTableCacheUpdates":"reshardingDb.system.resharding.f187cc87-e5c0-48b8-8908-2a539ad6d709"},"error":"InterruptedDueToReplStateChange: operation was interrupted"}
[js_test:resharding_secondary_recovers_temp_ns_metadata] d21274| 2021-12-05T15:08:21.280+00:00 I  SH_REFR  4619903 [CatalogCache-0] "Error refreshing cached collection","attr":{"namespace":"reshardingDb.system.resharding.f187cc87-e5c0-48b8-8908-2a539ad6d709","durationMillis":336,"error":"InterruptedDueToReplStateChange: operation was interrupted"}
...
[js_test:resharding_secondary_recovers_temp_ns_metadata] d21274| 2021-12-05T15:08:21.287+00:00 W  RESHARD  5498101 [TriggerReshardingRecovery] "Error on deferred shardVersion recovery execution","attr":{"error":"InterruptedDueToReplStateChange: operation was interrupted"}
...
[js_test:resharding_secondary_recovers_temp_ns_metadata] d21274| 2021-12-05T15:08:23.137+00:00 I  REPL     5123005 [ReshardingRecipientService-2] "Rebuilding PrimaryOnlyService due to stepUp","attr":{"service":"ReshardingRecipientService"}



 Comments   
Comment by Githook User [ 07/Jan/22 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-61976 Retry failed shard version refreshes on step up

On step-up, shards will clear the filtering metadata and schedule a
shard version refresh for the source collection and the temporary
resharding collection. It is possible for the shard version refresh
triggered through onShardVersionMismatch() to error and not complete the
shard version refresh. This can leave a recipient shard waiting to learn
all donor shards are prepared to donate or can leave a donor shard
waiting to learn all recipient shards have finished cloning.

Therefore, shards now will retry on errors until the refresh
successfully completes.

(cherry picked from commit 70417bcbe6ca27b9e20455de5e77313ef68c648a)
(cherry picked from commit 00591f7a441e452d70af288a4376272a52fcd638)
Branch: v5.0
https://github.com/mongodb/mongo/commit/7f6d0bad957c5b538538b41a41a16102ee71357d

Comment by Githook User [ 07/Jan/22 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-61976 Retry failed shard version refreshes on step up

On step-up, shards will clear the filtering metadata and schedule a
shard version refresh for the source collection and the temporary
resharding collection. It is possible for the shard version refresh
triggered through onShardVersionMismatch() to error and not complete the
shard version refresh. This can leave a recipient shard waiting to learn
all donor shards are prepared to donate or can leave a donor shard
waiting to learn all recipient shards have finished cloning.

Therefore, shards now will retry on errors until the refresh
successfully completes.

(cherry picked from commit 70417bcbe6ca27b9e20455de5e77313ef68c648a)
(cherry picked from commit 00591f7a441e452d70af288a4376272a52fcd638)
Branch: v5.2
https://github.com/mongodb/mongo/commit/6dca8bdd52c435df69daa58ee173b2aad512c950

Comment by Githook User [ 07/Jan/22 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-61976 Retry failed shard version refreshes on step up

On step-up, shards will clear the filtering metadata and schedule a
shard version refresh for the source collection and the temporary
resharding collection. It is possible for the shard version refresh
triggered through onShardVersionMismatch() to error and not complete the
shard version refresh. This can leave a recipient shard waiting to learn
all donor shards are prepared to donate or can leave a donor shard
waiting to learn all recipient shards have finished cloning.

Therefore, shards now will retry on errors until the refresh
successfully completes.

(cherry picked from commit 70417bcbe6ca27b9e20455de5e77313ef68c648a)
(cherry picked from commit 00591f7a441e452d70af288a4376272a52fcd638)
Branch: v5.1
https://github.com/mongodb/mongo/commit/7318c3b4f85288322d5dcb7b670a692b94c7ccd4

Comment by Githook User [ 21/Dec/21 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-61976 Clarify semantics of step up shard version refresh retries

The previous commit for SERVER-61976 added an AsyncTry to retry failed
shard version refreshes on step up. That AsyncTry was using the
cancellation token from an opCtx that goes out of scope shortly after
the AsyncTry is created, and therefore will never be cancelled. Since
the AsyncTry could functionally never be cancelled anyway, this commit
makes that clear by using CancellationToken::uncancelable() instead.
Branch: master
https://github.com/mongodb/mongo/commit/00591f7a441e452d70af288a4376272a52fcd638
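Concretely, the change described here comes down to which cancellation token is passed to the retry loop; a schematic before/after (fragment only, names are illustrative):

    // Before (illustrative): the token comes from an opCtx that is destroyed
    // shortly after the AsyncTry is scheduled, so cancellation could never
    // actually take effect.
    //     .on(executor, opCtx->getCancellationToken());

    // After: make the "can never be cancelled" semantics explicit.
    //     .on(executor, CancellationToken::uncancelable());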

Comment by Githook User [ 20/Dec/21 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-61976 Retry failed shard version refreshes on step up

On step-up, shards will clear the filtering metadata and schedule a
shard version refresh for the source collection and the temporary
resharding collection. It is possible for the shard version refresh
triggered through onShardVersionMismatch() to error and not complete the
shard version refresh. This can leave a recipient shard waiting to learn
all donor shards are prepared to donate or can leave a donor shard
waiting to learn all recipient shards have finished cloning.

Therefore, shards now will retry on errors until the refresh
successfully completes.
Branch: master
https://github.com/mongodb/mongo/commit/70417bcbe6ca27b9e20455de5e77313ef68c648a

Comment by Pierlauro Sciarelli [ 14/Dec/21 ]

"is it worth Sharding NYC addressing SERVER-61976 by adding an AsyncTry to resharding::clearFilteringMetadata()?"

I think it's better to solve the two bugs under separate tickets since they have different purposes. I am happy to review the addition of the AsyncTry if NYC has some bandwidth for it.

By the way, it's worth mentioning that there is also a flow in which the RecoverRefreshThread is never interrupted: if the stepdown-stepup sequence happens between the start of its execution and the refresh itself, the operation context would not get interrupted because we would miss the onStepdown/onStepUp interruption hooks.

That seems very unlikely when the same node steps down and back up, since elections are not that fast. But it could certainly happen when a different secondary is stepping up.

Comment by Max Hirschhorn [ 13/Dec/21 ]

Randolph pointed out that it is odd for the newly-scheduled shard version refresh on step-up to have been killed because we would have already joined the RstlKillOpThread by the time the resharding::clearFilteringMetadata() function is called. Max believes that we're seeing the effects of the primary joining a shard version refresh which had been initiated while the node was still secondary, prior to the step-up.

ShardServerCatalogCacheLoader::onStepUp() would have interrupted the OperationContext actively running forcePrimaryCollectionRefreshAndWaitForReplication() and propagated that interruption notification back through CatalogCache::getCollectionRoutingInfoWithRefresh() and the RecoverRefreshThread's future chain. joinShardVersionOperation() would have then propagated the interruption notification back through onShardVersionMismatch() and abandoned doing a new shard version refresh after the in-progress one was killed on step-up.

pierlauro.sciarelli, kaloian.manassiev, is it worth Sharding NYC addressing SERVER-61976 by adding an AsyncTry to resharding::clearFilteringMetadata()? Or would the resolution for SERVER-61879 also solve the problem of joining refreshes getting spurious interruption notifications?
