[SERVER-46425] Consider increasing wtimeout for cloneCatalogData or no timeout Created: 26/Feb/20  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Randolph Tan
Assignee: Backlog - Cluster Scalability
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-32142 `movePrimary` can leave orphaned data... Closed
is related to SERVER-46424 _cloneCatalogData remote call is labe... Closed
Assigned Teams:
Cluster Scalability
Operating System: ALL
Participants:

 Description   

The current setting is majority write concern with a 60-second wtimeout. However, the clone can generate many writes and index builds, which can cause it to time out while waiting for replication. On current master, _movePrimary will retry because write concern errors are treated as retryable. Since the collections were already cloned by the earlier attempt, the retry gets a namespace already exists error, which is not retryable and causes the entire _movePrimary command to fail. This can leave the cloned data behind as orphans and eventually cause the issue described in SERVER-32142.
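
The failure sequence described above can be illustrated with a small toy model. This is only a sketch of the control flow, not server code: the `Recipient` class, `move_primary` function, exception names, and the `replication_secs` parameter are all hypothetical and exist only for illustration.

```python
class WriteConcernTimeout(Exception):
    """Hypothetical retryable error: written locally, majority wait timed out."""


class NamespaceExists(Exception):
    """Hypothetical non-retryable error: the collection is already present."""


class Recipient:
    """Toy stand-in for the shard that receives the cloned catalog data."""

    def __init__(self, replication_secs, wtimeout_secs):
        self.collections = set()
        self.replication_secs = replication_secs  # assumed time to reach majority
        self.wtimeout_secs = wtimeout_secs        # None models "no timeout"

    def clone_catalog_data(self, names):
        # A second clone attempt finds the collections created by the first one.
        for name in names:
            if name in self.collections:
                raise NamespaceExists(name)
        # The local writes succeed before the write concern wait begins.
        self.collections.update(names)
        if self.wtimeout_secs is not None and self.replication_secs > self.wtimeout_secs:
            raise WriteConcernTimeout("timed out waiting for majority")


def move_primary(recipient, names, max_attempts=3):
    """Retry the clone on retryable errors; stop on the first non-retryable one."""
    for _ in range(max_attempts):
        try:
            recipient.clone_catalog_data(names)
            return "ok"
        except WriteConcernTimeout:
            continue  # treated as retryable
        except NamespaceExists:
            return "failed: namespace already exists"
    return "failed: retries exhausted"


slow = Recipient(replication_secs=90, wtimeout_secs=60)
print(move_primary(slow, ["db.a", "db.b"]))
print(sorted(slow.collections))
```

In this model the first attempt writes the collections and then times out, the retry fails on the existing namespaces, and the cloned collections remain on the recipient. Constructing the recipient with a larger `wtimeout_secs`, or with `None`, lets the first attempt finish, which is the change the ticket title proposes.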

