Type: Task
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Cluster Scalability
3
In general, resharding uses wtimeout: 0 when passing in the write concern options. However, catalog cache refreshes sometimes force a no-op write, which uses a wtimeout of 60 seconds. If a resharding service gets a write concern timeout, it treats the error as fatal. This was seen in AF-1904 and fixed in SERVER-102452 for the specific case hit in that ticket.

This ticket is for the follow-up work to make all resharding services retry on write concern errors (WCEs) in general. We should be able to do this by making kRetryabilityPredicateIncludeWriteConcernTimeout, added in SERVER-102452, the default for resharding::withAutomaticRetry. As part of this, we should audit the idempotency of anything retried by resharding::withAutomaticRetry: SERVER-102452 found that we would reset at least some in-memory state (like metrics) when we really shouldn't.
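To illustrate the intended behavior, here is a minimal Python sketch of a retry helper whose default predicate also treats write concern timeouts as retryable. It is not the server code: `with_automatic_retry`, both predicates, the exception classes, `Metrics`, and `make_finish_step` are hypothetical stand-ins, and the bounded `max_attempts` loop is a simplification chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class NetworkError(Exception):
    """Hypothetical stand-in for an error that is already retried."""


class WriteConcernTimeout(Exception):
    """Hypothetical stand-in for a write concern timeout."""


def default_predicate(exc: Exception) -> bool:
    # Old default in this sketch: a write concern timeout is fatal.
    return isinstance(exc, NetworkError)


def predicate_including_wc_timeout(exc: Exception) -> bool:
    # Proposed default in this sketch: write concern timeouts are retried too.
    return isinstance(exc, (NetworkError, WriteConcernTimeout))


def with_automatic_retry(
    op: Callable[[], T],
    is_retryable: Callable[[Exception], bool] = predicate_including_wc_timeout,
    max_attempts: int = 5,  # assumption: bounded only to keep the sketch finite
) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception as exc:
            # Re-raise when the error is fatal or the attempts are used up.
            if attempt == max_attempts or not is_retryable(exc):
                raise
    raise AssertionError("unreachable")


@dataclass
class Metrics:
    documents_copied: int = 0
    finish_attempts: int = 0


def make_finish_step(metrics: Metrics, failures: int) -> Callable[[], str]:
    """Build a step that times out `failures` times and then succeeds."""
    remaining = [failures]

    def finish() -> str:
        # Idempotency: a retried step must not recreate or zero the metrics.
        metrics.finish_attempts += 1
        if remaining[0] > 0:
            remaining[0] -= 1
            raise WriteConcernTimeout("waiting for replication timed out")
        return "done"

    return finish


metrics = Metrics(documents_copied=100)
result = with_automatic_retry(make_finish_step(metrics, failures=2))
```

In this sketch the step succeeds on the third attempt and `documents_copied` keeps its value across retries, which is the property the idempotency audit would check for each retried operation. Passing `is_retryable=default_predicate` reproduces the fatal behavior described above.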
is related to
- SERVER-102452 Make ReshardingDonorService retry on WriteConcernFailure when finishing (In Code Review)

related to
- SERVER-92936 Writes with a {w: majority, wtimeout:0} fail in case of network errors. (Backlog)
- SERVER-104258 Resharding Can Hang If Recipient Fails During ShardsvrReshardRecipientClone (Backlog)
- SERVER-102452 Make ReshardingDonorService retry on WriteConcernFailure when finishing (In Code Review)