[SERVER-54711] Invariant on shardVersion, atClusterTime relationship for sharded write transactions Created: 22/Feb/21  Updated: 29/Oct/23  Resolved: 08/Mar/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Task Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Sergi Mateo Bellido
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Participants:
Linked BF Score: 59

 Description   

When a moveChunk on a collection commits, the shardVersion of the collection is advanced and a timestamp is chosen such that:

  • Reads of moved documents after that timestamp on the recipient are considered "owned"
  • Reads of moved documents after that timestamp on the donor are considered "orphaned"

Also note that a donor does not eagerly delete or mark its orphans in a way that would generate a storage engine write conflict. The orphans are eventually deleted at some arbitrary point in the future.

Updating documents that are concurrently participating in a moveChunk result in those updates being transferred ("xfer mods") from the donor -> recipient. Those updates must be applied in a critical section before the moveChunk can be committed.

Sharded transactions, however operate against a snapshot defined by the atClusterTime attached by a mongoS. Performing a transactional sharded update on a document that's concurrently being moved with a stale atClusterTime can result in write skew and lost data.

This correctness problem is protected by the use of a shard version. But first consider this scenario that uses a stale read timestamp without a shard versioning protocol:

| Client                                                 | Config                              | Shard0            | Shard1            |
|--------------------------------------------------------+-------------------------------------+-------------------+-------------------|
| Insert A                                               |                                     | Insert A @ TS(10) |                   |
|                                                        | MoveChunkStarts                     |                   |                   |
|                                                        |                                     |                   | Insert A @ TS(20) |
|                                                        | MoveChunkCommits ValidAfter: TS(30) |                   |                   |
| Delete A (broadcast transaction) atClusterTime: TS(15) |                                     | Delete A @ 40     | `A` Not Found     |

In this example, a client inserts and deletes a document and received a success for both requests. However the document still remains as it wasn't deleted from the proper shard.

In this example, the move chunk committing generates a new shard version which we'll call SV. If the client request is operating with an atClusterTime: TS(15), but also claims to be using the fresh shard version: SV, the same bug would manifest.

Thus we can conclude there is a relationship that must be maintained for correctness of the system with chunk migrations and transactions that do updates:
For a shard receiving a startTransaction request with a shardVersion SV, the request's atClusterTime must be at least the validAfter timestamp associated SV.


Generated at Thu Feb 08 05:34:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.