[SERVER-31372] Change shardCollection to use Shard::runCommand Created: 03/Oct/17 Updated: 06/Dec/22 Resolved: 29/Jul/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | neweng, sharding-interns-2019 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Assigned Teams: | Sharding |
| Sprint: | Sharding 2019-06-17, Sharding 2019-07-01, Sharding 2019-07-15, Sharding 2019-07-29, Sharding 2019-08-12 | ||||
| Participants: | |||||
| Linked BF Score: | 0 | ||||
| Description |
|
Currently, the shardCollection command sends _configsvrShardCollection to the config server primary using Shard::runCommandWithFixedRetryAttempts, which retries only on retryable errors (such as NotMaster or InterruptedDueToReplStateChange) and only up to 2 times. This can cause failures in the continuous config server stepdown suite if shardCollection is interrupted by stepdowns enough times. Changing shardCollection to use Shard::runCommand instead would make it retry on retryable errors until the command succeeds or the maxTimeMS limit is reached. Several other mongos commands (such as drop, addShard, and splitChunk) also appear to use Shard::runCommandWithFixedRetryAttempts, so it is worth investigating whether they should be changed as well. |
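To make the difference between the two retry policies concrete, here is a minimal Python sketch. It is not the server's C++ implementation: `RetryableError`, `run_with_fixed_retries`, `run_until_deadline`, and `make_flaky_op` are hypothetical names invented for illustration, and the retry cap is left as a parameter rather than tied to any particular server constant.

```python
import time


class RetryableError(Exception):
    """Hypothetical stand-in for retryable errors such as NotMaster."""


def run_with_fixed_retries(op, max_retries):
    """Run op, retrying on RetryableError at most max_retries times."""
    retries = 0
    while True:
        try:
            return op()
        except RetryableError:
            if retries >= max_retries:
                raise  # cap reached: surface the error to the caller
            retries += 1


def run_until_deadline(op, max_time_s, clock=time.monotonic):
    """Run op, retrying on RetryableError until success or the deadline."""
    deadline = clock() + max_time_s
    while True:
        try:
            return op()
        except RetryableError:
            if clock() >= deadline:
                raise  # time budget exhausted: surface the error


def make_flaky_op(failures):
    """Return an op that fails `failures` times (e.g. stepdowns), then succeeds."""
    state = {"calls": 0}

    def op():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RetryableError("interrupted by stepdown")
        return state["calls"]  # number of the call that finally succeeded

    return op
```

With three consecutive simulated stepdowns, `run_with_fixed_retries(make_flaky_op(3), 2)` raises `RetryableError`, whereas `run_until_deadline(make_flaky_op(3), 5.0)` keeps retrying and succeeds on the fourth call. The trade-off is that the deadline-based policy has no upper bound on how many times a potentially expensive operation is restarted, other than the time limit itself.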
| Comments |
| Comment by Kaloian Manassiev [ 10/Jun/19 ] |
|
The drivers spec says that retryable writes should only be retried once, and that the second failure should be passed back to the user. Requiring nodes that perform command routing (such as the routers or the config server) to cap the number of retries has its pros and cons. The pros are that we don't have to think about it and that there is an upper bound on how many times we will restart potentially expensive commands, such as shardCollection. The con is that we might not be able to completely "mask" planned upgrades of sharded cluster nodes that take slightly longer. Given that we already support 3 retries, I think that's a good trade-off, and leaving it at 3 retries as it is now sounds good. The remaining question is what to do with the BF. |
| Comment by Blake Oler [ 05/Jun/19 ] |
|
kaloian.manassiev do we still want to do this? jack.mulrow thinks it's the right behavior for commands to stop retrying if they can't succeed after three attempts. |