[SERVER-48500] _shardsvrShardCollection after manual intervention can succeed without writing chunks
Created: 29/May/20  Updated: 29/Oct/23  Resolved: 31/May/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.0.0

Type: Bug
Priority: Major - P3
Reporter: Jack Mulrow
Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Fixed
Votes: 0
Labels: sharding-causes-bfs-hard
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Linked BF Score: 15

 Description   

 The following sequence of events can cause _shardsvrShardCollection to return ok:1 without actually writing chunks if a user follows the recommended procedure for handling a ManualInterventionRequired error:

  1. Config primary sends _shardsvrShardCollection to primary shard primary node
  2. Primary shard writes chunks to config.chunks with majority write concern
  3. Primary shard steps down immediately after sending config.collections update to config server
  4. Config primary retries _shardsvrShardCollection on the primary shard's new primary node (because of the idempotent retry policy) before the config.collections update arrives or is majority committed
  5. Primary shard reads from config.collections with majority read concern and continues with sharding the collection because it does not see the config.collections write
  6. Primary shard throws ManualInterventionRequired when it finds chunks already exist for the namespace (from the first attempt)
  7. A user deletes the namespace's chunks before retrying shardCollection
  8. Config primary sends _shardsvrShardCollection to primary shard primary node
  9. Primary shard reads config.collections after the write from the first attempt is majority committed, assumes the collection is sharded, and returns ok:1, leaving the collection without any chunks
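
The sequence above can be sketched as a small simulation. This is a toy model, not the server's implementation: `ToyConfigServer`, `shardsvr_shard_collection`, `reproduce`, the `step_down_after_update` flag, and the namespace `test.coll` are all hypothetical names introduced for illustration, and the step-down is modelled simply as a `ConnectionError` raised after the config.collections update has been sent.

```python
class ManualInterventionRequired(Exception):
    """Stand-in for the server error of the same name."""


class ToyConfigServer:
    """Toy model of config.chunks and config.collections (not real MongoDB code)."""

    def __init__(self):
        self.chunks = {}        # namespace -> list of chunk names
        self.pending = set()    # config.collections writes not yet majority committed
        self.committed = set()  # majority-committed config.collections entries

    def majority_commit(self):
        """Make every pending config.collections write visible to majority reads."""
        self.committed |= self.pending
        self.pending.clear()


def shardsvr_shard_collection(config, ns, step_down_after_update=False):
    """Hypothetical model of one _shardsvrShardCollection attempt."""
    # Majority read of config.collections: pending writes are invisible.
    if ns in config.committed:
        return {"ok": 1}  # assumes the collection is already sharded
    if config.chunks.get(ns):
        raise ManualInterventionRequired(f"chunks already exist for {ns}")
    config.chunks[ns] = ["chunk-0"]  # write chunks
    config.pending.add(ns)           # update sent, not yet majority committed
    if step_down_after_update:
        raise ConnectionError("primary stepped down")
    config.majority_commit()
    return {"ok": 1}


def reproduce():
    config = ToyConfigServer()
    ns = "test.coll"

    # Steps 1-3: first attempt writes chunks, sends the update, then steps down.
    try:
        shardsvr_shard_collection(config, ns, step_down_after_update=True)
    except ConnectionError:
        pass

    # Steps 4-6: the retry does not see the pending update and finds the chunks.
    try:
        shardsvr_shard_collection(config, ns)
        raised = False
    except ManualInterventionRequired:
        raised = True

    # Step 7: the user deletes the namespace's chunks.
    config.chunks[ns] = []

    # The first attempt's config.collections write becomes majority committed.
    config.majority_commit()

    # Steps 8-9: the next attempt sees the collection entry and returns ok:1.
    result = shardsvr_shard_collection(config, ns)
    return raised, result, config.chunks[ns]
```

In this model, `reproduce()` ends with an `ok: 1` result while the chunk list for the namespace is empty, which mirrors the state described in step 9.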
