Details
-
Task
-
Resolution: Fixed
-
Major - P3
-
None
-
Fully Compatible
-
59
Description
When a moveChunk on a collection commits, the shardVersion of the collection is advanced and a timestamp is chosen such that:
- Reads of moved documents after that timestamp on the recipient are considered "owned"
- Reads of moved documents after that timestamp on the donor are considered "orphaned"
Also note that a donor does not eagerly delete or mark its orphans in a way that would generate a storage engine write conflict. The orphans are eventually deleted at some arbitrary point in the future.
Updating documents that are concurrently participating in a moveChunk result in those updates being transferred ("xfer mods") from the donor -> recipient. Those updates must be applied in a critical section before the moveChunk can be committed.
Sharded transactions, however operate against a snapshot defined by the atClusterTime attached by a mongoS. Performing a transactional sharded update on a document that's concurrently being moved with a stale atClusterTime can result in write skew and lost data.
This correctness problem is protected by the use of a shard version. But first consider this scenario that uses a stale read timestamp without a shard versioning protocol:
| Client | Config | Shard0 | Shard1 |
|
|--------------------------------------------------------+-------------------------------------+-------------------+-------------------|
|
| Insert A | | Insert A @ TS(10) | |
|
| | MoveChunkStarts | | |
|
| | | | Insert A @ TS(20) |
|
| | MoveChunkCommits ValidAfter: TS(30) | | |
|
| Delete A (broadcast transaction) atClusterTime: TS(15) | | Delete A @ 40 | `A` Not Found |
|
In this example, a client inserts and deletes a document and received a success for both requests. However the document still remains as it wasn't deleted from the proper shard.
In this example, the move chunk committing generates a new shard version which we'll call SV. If the client request is operating with an atClusterTime: TS(15), but also claims to be using the fresh shard version: SV, the same bug would manifest.
Thus we can conclude there is a relationship that must be maintained for correctness of the system with chunk migrations and transactions that do updates:
For a shard receiving a startTransaction request with a shardVersion SV, the request's atClusterTime must be at least the validAfter timestamp associated SV.