[SERVER-61052] Resharding Donor & Recipient's Coordinator Doc Updates Can Time Out Waiting for Replication on Coordinator Doc, Leading to Fatal Assertion Created: 27/Oct/21  Updated: 06/Nov/23  Resolved: 29/Oct/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.0, 5.1.0-rc2
Fix Version/s: 5.2.0, 5.0.4, 5.1.0-rc3

Type: Bug Priority: Major - P3
Reporter: Luis Osta (Inactive) Assignee: Luis Osta (Inactive)
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-57686 We need test coverage that runs resha... Closed
Related
related to SERVER-82838 ReshardingOplogApplier uses {w: "majo... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v5.1, v5.0
Sprint: Sharding 2021-11-01
Participants:
Story Points: 1

 Description   

Cause
If the write to the config server that updates the coordinator document takes longer to replicate than the specified wtimeout, the update fails with WriteConcernFailed and the node hits a fatal assertion.

Context
Both the ReshardingRecipientService and the ReshardingDonorService need to update the coordinator document on the config server with the latest state of the recipient and donor, respectively. Both retry errors that occur, but not every error is retryable. One of the non-retryable errors is WriteConcernFailed, which is raised when waiting for replication times out.
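
The retry behavior described above can be modeled with a minimal Python sketch. This is an illustration only, not the server's C++ implementation: `CommandError`, `run_with_retries`, and the contents of `RETRYABLE` are hypothetical names and assumptions chosen for the example. It shows why a WriteConcernFailed error escapes the retry loop immediately while a transient error is retried.

```python
# Illustrative model only; these names are hypothetical, not the server's C++ API.

class CommandError(Exception):
    """Carries a MongoDB-style error code name, e.g. "WriteConcernFailed"."""

    def __init__(self, code_name):
        super().__init__(code_name)
        self.code_name = code_name


# Assumption: an example set of transient errors that the caller retries.
RETRYABLE = {"HostUnreachable", "NetworkTimeout"}


def run_with_retries(operation, max_attempts=5):
    """Call operation() until it succeeds; re-raise non-retryable errors at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except CommandError as err:
            if err.code_name not in RETRYABLE or attempt == max_attempts:
                raise


def flaky_then_ok():
    """Fails once with a retryable error, then succeeds."""
    flaky_then_ok.calls += 1
    if flaky_then_ok.calls == 1:
        raise CommandError("NetworkTimeout")
    return "updated"


flaky_then_ok.calls = 0


def write_concern_timeout():
    """Always fails with the non-retryable error from this ticket."""
    raise CommandError("WriteConcernFailed")


result = run_with_retries(flaky_then_ok)

try:
    run_with_retries(write_concern_timeout)
    escaped = None
except CommandError as err:
    escaped = err.code_name
```

In this model the transient failure is absorbed by the loop, whereas the WriteConcernFailed error propagates to the caller on the first attempt.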

Problem
The resharding components simply wait for the result of these updates, so a write that takes a long time to replicate should not raise an error.

In the ReshardingDonorService, the updateCoordinatorDocument function uses the CatalogClient to update the coordinator document on the config server. Remove the use of `ShardingCatalogClient::kMajorityWriteConcern` in its call to the catalog client's updateConfigDocument. *It should be replaced with {w: "majority"}, specified with no wtimeout.*

In the RecipientStateMachineExternalStateImpl, the updateCoordinatorDocument function uses the CatalogClient to update the coordinator document on the config server. Remove the use of `ShardingCatalogClient::kMajorityWriteConcern` in its call to the catalog client's updateConfigDocument. *It should be replaced with {w: "majority"}, specified with no wtimeout.*
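
The effect of dropping wtimeout can be sketched with a small Python model. It is an illustration under stated assumptions, not the actual C++ change: `majority_write_concern` and `wait_for_replication` are hypothetical helpers, and the 60000 ms value is taken from the write concern shown in the log lines in this ticket.

```python
# Illustrative model only; the helper names are hypothetical.

class WriteConcernFailed(Exception):
    """Models the error raised when waiting for replication times out."""


def majority_write_concern(wtimeout_ms=None):
    """Build a write concern document, adding wtimeout only when one is given."""
    write_concern = {"w": "majority"}
    if wtimeout_ms is not None:
        write_concern["wtimeout"] = wtimeout_ms
    return write_concern


def wait_for_replication(write_concern, replication_ms):
    """Model a write that needs replication_ms to reach a majority of nodes."""
    limit = write_concern.get("wtimeout")
    # A missing wtimeout (or a wtimeout of 0) means wait indefinitely.
    if limit and replication_ms > limit:
        raise WriteConcernFailed("waiting for replication timed out")
    return "acknowledged"


before_fix = majority_write_concern(wtimeout_ms=60000)
after_fix = majority_write_concern()

slow_replication_ms = 90_000  # hypothetical slow write, longer than the timeout

try:
    outcome_before = wait_for_replication(before_fix, slow_replication_ms)
except WriteConcernFailed:
    outcome_before = "WriteConcernFailed"

outcome_after = wait_for_replication(after_fix, slow_replication_ms)
```

With the timeout in place the slow write surfaces as an error. Without it, the caller keeps waiting until the write is majority-acknowledged, which is the behavior this ticket asks for.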

 

Source
The following log lines appear in the Evergreen patch:

756555:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.274+00:00"},"s":"F",  "c":"RESHARD",  "id":5160600, "ctx":"ReshardingDonorService-1","msg":"Unrecoverable error occurred past the point donor was prepared to complete the resharding operation","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 60000, provenance: \"clientSupplied\" } }"}}
756556:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.274+00:00"},"s":"F",  "c":"ASSERT",   "id":23089,   "ctx":"ReshardingDonorService-1","msg":"Fatal assertion","attr":{"msgid":5160600,"file":"src/mongo/db/s/resharding/resharding_donor_service.cpp","line":459}}
756557:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.275+00:00"},"s":"F",  "c":"ASSERT",   "id":23090,   "ctx":"ReshardingDonorService-1","msg":"\n\n***aborting after fassert() failure\n\n"}
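
For triage, the structured part of these log lines can be pulled apart with a few lines of Python. This is a minimal sketch that assumes each line has a plain-text prefix followed by a single JSON object, as in the excerpt above. `parse_log_line` is a hypothetical helper, and the sample is the fatal assertion line copied from this ticket.

```python
import json


def parse_log_line(line):
    """Split a log line into its plain-text prefix and its decoded JSON payload."""
    prefix, brace, rest = line.partition("{")
    if not brace:
        raise ValueError("no JSON payload found in log line")
    return prefix.strip(), json.loads(brace + rest)


# The fatal assertion line from the excerpt above, copied verbatim.
SAMPLE = (
    r'756556:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.274+00:00"},'
    r'"s":"F",  "c":"ASSERT",   "id":23089,   "ctx":"ReshardingDonorService-1",'
    r'"msg":"Fatal assertion","attr":{"msgid":5160600,'
    r'"file":"src/mongo/db/s/resharding/resharding_donor_service.cpp","line":459}}'
)

prefix, entry = parse_log_line(SAMPLE)

severity = entry["s"]                # "F" marks a fatal log entry
assert_site = (entry["attr"]["file"], entry["attr"]["line"])
linked_msgid = entry["attr"]["msgid"]  # matches the "id" of the RESHARD entry
```

The `msgid` in the assertion entry matches the `id` of the preceding RESHARD entry, which ties the fatal assertion to the WriteConcernFailed error.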



 Comments   
Comment by Githook User [ 29/Oct/21 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-61052 Remove wtimeout from resharding write concern
Branch: v5.1
https://github.com/mongodb/mongo/commit/1623ffcd7b9465796b0d03ce8ec5647975e3273e

Comment by Githook User [ 29/Oct/21 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-61052 Remove wtimeout from resharding write concern
Branch: v5.0
https://github.com/mongodb/mongo/commit/033b5745d2b3a15d2a5ec7aa2894339a3c6158f4

Comment by Githook User [ 29/Oct/21 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-61052 Remove wtimeout from resharding write concern
Branch: master
https://github.com/mongodb/mongo/commit/116469f19e778e69c42fd868815f36e2f455cf36

Generated at Thu Feb 08 05:51:26 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.