[SERVER-44369] shardCollection can fail on config server failover if primary shard finishes _shardsvrShardCollection before the stepdown thread kills ops Created: 01/Nov/19  Updated: 29/Oct/23  Resolved: 31/May/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.0.0

Type: Bug Priority: Major - P3
Reporter: Janna Golden Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Fixed Votes: 0
Labels: sharding-DDL-bugs, sharding-csrs-stepdown-also
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Participants:
Linked BF Score: 21

 Description   

A shardCollection command can fail in the following scenario:
1. Config server primary sends _shardsvrShardCollection to primary shard
2. The stepdown thread starts running on the config server
3. _shardsvrShardCollection writes to config.chunks and config.collections on the new primary config server
4. _shardsvrShardCollection finishes and returns to the original config primary before the stepdown thread begins killing operations, so that config server reads a stale routing table

In this case, the primary shard has already written the new chunks to config.chunks and marked the collection as sharded in config.collections on the new config server primary, so a user can retry and the command should succeed immediately.
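
The retry workaround described above could be wrapped on the client side roughly as follows. This is a minimal sketch, not part of the server fix: `retry_shard_collection`, its parameters, and the choice of which exceptions count as retryable are all hypothetical, and the callable that actually issues shardCollection is supplied by the caller.

```python
import time
from typing import Callable, Tuple, Type


def retry_shard_collection(
    shard_collection: Callable[[], dict],
    retryable: Tuple[Type[Exception], ...] = (Exception,),
    max_attempts: int = 3,
    delay_seconds: float = 0.5,
) -> dict:
    """Call shard_collection(), retrying when it raises a retryable error.

    All names here are hypothetical. The caller decides which exception
    types are safe to retry; the default of retrying on any Exception is
    only a placeholder for this sketch.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return shard_collection()
        except retryable:
            # Out of attempts: surface the last failure to the caller.
            if attempt == max_attempts:
                raise
            # Short pause to give the config server failover time to settle.
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")
```

With PyMongo, the callable might be something like `lambda: client.admin.command("shardCollection", "mydb.mycoll", key={"_id": "hashed"})`, where `client`, the namespace, and the shard key are placeholders. Because the chunks and the collection entry were already written in this scenario, the second attempt is expected to return quickly.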

