[SERVER-40815] Updating the shard key can conflict with in-progress migrations Created: 24/Apr/19  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matthew Saltz (Inactive) Assignee: Backlog - Cluster Scalability
Resolution: Unresolved Votes: 1
Labels: sharding-common-backlog, stop-orphaning-fallout
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File repro.js    
Issue Links:
Related
related to SERVER-40483 Changing the shard key could lead to ... Closed
Assigned Teams:
Cluster Scalability
Operating System: ALL
Participants:

 Description   

If you update a document's shard key such that the document would move from Shard A to Shard B, the update is turned into a delete on Shard A and an insert on Shard B. If this happens concurrently with a migration that has already copied the same document from Shard A to Shard B but has not yet committed, the insert on Shard B can fail with a DuplicateKey error on _id, because the "orphan" copy of the document left by the in-flight migration is already present there.

A detailed scenario:

  1. Shard a collection on field shardKey, and move chunks such that Shard 0 contains Chunk A with range [shardKey: $minKey, shardKey: 0) and Shard 1 contains Chunk B with range [shardKey: 0, shardKey: $maxKey) (chunk ranges are min-inclusive, max-exclusive)
  2. Insert a document { _id: 0, shardKey: -100 }, which lands in Chunk A on Shard 0

  3. Begin migrating Chunk A from Shard 0 to Shard 1, but pause before entering the critical section. The document will now be present on Shard 1 in the storage engine, but the chunk migration will not yet have committed.
  4. Start a transaction
  5. Do a delete with query { _id: 0, shardKey: -100 }. This executes on Shard 0, since the migration has not yet committed.

  6. Insert { _id: 0, shardKey: 1000 }. This targets Chunk B on Shard 1 and fails with a DuplicateKey error on _id, since the original document { _id: 0, shardKey: -100 } still exists in the storage engine on Shard 1 because of the uncommitted migration.

Note that this is not specific to the server's internal handling of shard key updates; it can happen even if the user updates the shard key on their own with a delete and insert in a transaction, as described in the detailed scenario above and sketched below.
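For reference, a jstest-style sketch of the scenario above, in the spirit of the attached repro.js. This is a sketch, not a verified repro: the failpoint name moveChunkHangAtStep5, and the assumption that it pauses the donor before the critical section, are version-dependent.

    load("jstests/libs/fail_point_util.js");  // for configureFailPoint()

    const st = new ShardingTest({shards: 2});
    const ns = "test.foo";
    const mongos = st.s;

    // Step 1: Chunk A [$minKey, 0) on shard0, Chunk B [0, $maxKey) on shard1.
    assert.commandWorked(mongos.adminCommand({enableSharding: "test"}));
    assert.commandWorked(mongos.adminCommand({shardCollection: ns, key: {shardKey: 1}}));
    assert.commandWorked(mongos.adminCommand({split: ns, middle: {shardKey: 0}}));
    assert.commandWorked(mongos.adminCommand(
        {moveChunk: ns, find: {shardKey: 1000}, to: st.shard1.shardName}));

    // Step 2: the document lands in Chunk A on shard0.
    assert.commandWorked(mongos.getCollection(ns).insert({_id: 0, shardKey: -100}));

    // Step 3: migrate Chunk A to shard1, hanging the donor before the critical
    // section (which moveChunkHangAtStep* matches that point varies by version).
    const fp = configureFailPoint(st.shard0, "moveChunkHangAtStep5");
    const awaitMigration = startParallelShell(
        'assert.commandWorked(db.adminCommand({moveChunk: "test.foo",' +
        ' find: {shardKey: -100}, to: "' + st.shard1.shardName + '"}));',
        mongos.port);
    fp.wait();

    // Steps 4-6: user-driven shard key update as delete + insert in a transaction.
    const session = mongos.startSession();
    const sessionColl = session.getDatabase("test").getCollection("foo");
    session.startTransaction();
    assert.commandWorked(sessionColl.remove({_id: 0, shardKey: -100}));  // shard0
    // Targets Chunk B on shard1, where the uncommitted migration already left a
    // copy of _id: 0 in the storage engine => DuplicateKey on _id.
    assert.writeErrorWithCode(sessionColl.insert({_id: 0, shardKey: 1000}),
                              ErrorCodes.DuplicateKey);
    session.abortTransaction_forTesting();  // txn already aborted server-side

    fp.off();
    awaitMigration();
    st.stop();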



 Comments   
Comment by Sheeri Cabral (Inactive) [ 20/Feb/20 ]

The workaround for users is to try again: as soon as the migration commits, the update will succeed. The orphan here is transient, not permanent.

This can be a confusing error for users, however, especially if it surfaces as part of a batch write. There is no easy solution here.
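A minimal shell sketch of that retry loop, assuming coll was obtained from the session so the writes run inside the transaction; updateShardKeyWithRetry is a hypothetical helper name, not anything the server provides.

    // Hypothetical helper: retry the delete + insert when the insert hits a
    // DuplicateKey caused by a transient orphan from an in-flight migration.
    function updateShardKeyWithRetry(session, coll, oldDoc, newDoc, maxAttempts) {
        for (let attempt = 1; attempt <= maxAttempts; attempt++) {
            session.startTransaction();
            coll.remove(oldDoc);
            const res = coll.insert(newDoc);
            if (!res.hasWriteError()) {
                session.commitTransaction();
                return;  // success
            }
            // A write error inside a transaction aborts it server-side; clean
            // up the shell-side state without asserting.
            try { session.abortTransaction(); } catch (e) { /* already aborted */ }
            if (res.getWriteError().code !== ErrorCodes.DuplicateKey) {
                throw new Error("unexpected write error: " + tojson(res.getWriteError()));
            }
            // Transient orphan: once the migration commits and the orphan is
            // cleaned up, the retry should succeed.
            sleep(100 * attempt);
        }
        throw new Error("shard key update still conflicting after " + maxAttempts + " attempts");
    }

    // Usage (hypothetical values matching the scenario above):
    const session = db.getMongo().startSession();
    const coll = session.getDatabase("test").getCollection("foo");
    updateShardKeyWithRetry(session, coll,
                            {_id: 0, shardKey: -100}, {_id: 0, shardKey: 1000}, 5);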
