- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Sharding
- Fully Compatible
- 59
When a moveChunk on a collection commits, the shardVersion of the collection is advanced and a timestamp is chosen such that:
- Reads of moved documents after that timestamp on the recipient are considered "owned"
- Reads of moved documents after that timestamp on the donor are considered "orphaned"
Also note that a donor does not eagerly delete or mark its orphans in a way that would generate a storage engine write conflict. The orphans are eventually deleted at some arbitrary point in the future.
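As a rough illustration of the ownership rule, the sketch below classifies a read of a moved document by shard and read timestamp. The names `CommittedMigration` and `owns_moved_doc` are hypothetical and are not server code. The sketch also assumes that the boundary is inclusive (a read at exactly the chosen timestamp counts as "after") and that the donor owns the document for earlier reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommittedMigration:
    """Hypothetical record of a committed moveChunk (illustration only)."""
    donor: str
    recipient: str
    valid_after: int  # timestamp chosen when the moveChunk commits


def owns_moved_doc(shard: str, read_ts: int, m: CommittedMigration) -> bool:
    """Return True if a read of a moved document on `shard` at `read_ts` is "owned"."""
    # Assumption: the boundary is inclusive, so read_ts == valid_after is "after".
    at_or_after_commit = read_ts >= m.valid_after
    if shard == m.recipient:
        return at_or_after_commit
    if shard == m.donor:
        # Reads at or after the commit timestamp see an orphan on the donor.
        return not at_or_after_commit
    return False


m = CommittedMigration(donor="Shard0", recipient="Shard1", valid_after=30)
print(owns_moved_doc("Shard0", 15, m))  # donor, before the commit timestamp
print(owns_moved_doc("Shard1", 15, m))  # recipient, before the commit timestamp
print(owns_moved_doc("Shard1", 40, m))  # recipient, after the commit timestamp
```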
Updates to documents that are concurrently participating in a moveChunk are transferred ("xfer mods") from the donor to the recipient. Those updates must be applied in a critical section before the moveChunk can be committed.
Sharded transactions, however, operate against a snapshot defined by the atClusterTime that a mongoS attaches to the request. A transactional sharded update that uses a stale atClusterTime on a document that is concurrently being moved can result in write skew and lost data.
A shard version protects against this correctness problem. But first, consider this scenario, which uses a stale read timestamp without a shard versioning protocol:
| Client                                                 | Config                              | Shard0            | Shard1            |
|--------------------------------------------------------+-------------------------------------+-------------------+-------------------|
| Insert A                                               |                                     | Insert A @ TS(10) |                   |
|                                                        | MoveChunkStarts                     |                   |                   |
|                                                        |                                     |                   | Insert A @ TS(20) |
|                                                        | MoveChunkCommits ValidAfter: TS(30) |                   |                   |
| Delete A (broadcast transaction) atClusterTime: TS(15) |                                     | Delete A @ 40     | `A` Not Found     |
In this example, a client inserts and then deletes a document and receives a success response for both requests. However, the document still exists, because it was never deleted from the proper shard.
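The scenario can be replayed with a toy model. The sketch below is not how the server stores data. It assumes each shard keeps a timestamped history per document and that a snapshot read at `read_ts` sees the latest write at or before `read_ts`. The `Shard` class and its methods are hypothetical names used only for illustration.

```python
from typing import Dict, List, Tuple


class Shard:
    """Toy multi-version store: key -> list of (timestamp, exists) writes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.history: Dict[str, List[Tuple[int, bool]]] = {}

    def write(self, key: str, ts: int, exists: bool) -> None:
        self.history.setdefault(key, []).append((ts, exists))

    def visible(self, key: str, read_ts: int) -> bool:
        """Snapshot read: state of `key` after the latest write at or before read_ts."""
        state = False
        for ts, exists in sorted(self.history.get(key, [])):
            if ts <= read_ts:
                state = exists
        return state

    def snapshot_delete(self, key: str, read_ts: int, commit_ts: int) -> bool:
        """Delete `key` only if it is visible in the snapshot at read_ts."""
        if not self.visible(key, read_ts):
            return False  # `A` Not Found
        self.write(key, commit_ts, False)
        return True


shard0, shard1 = Shard("Shard0"), Shard("Shard1")
shard0.write("A", ts=10, exists=True)  # client insert lands on the donor
shard1.write("A", ts=20, exists=True)  # migration copies A to the recipient
# The moveChunk commits with ValidAfter: TS(30); Shard1 owns A from then on.

# Broadcast transactional delete with the stale atClusterTime: TS(15).
deleted_on_shard0 = shard0.snapshot_delete("A", read_ts=15, commit_ts=40)
deleted_on_shard1 = shard1.snapshot_delete("A", read_ts=15, commit_ts=40)

print(deleted_on_shard0, deleted_on_shard1)
print(shard1.visible("A", read_ts=50))  # is A still present on the owning shard?
```

The delete succeeds only against the donor's orphaned copy, while the recipient's snapshot at TS(15) predates its copy of the document, so the copy on the owning shard survives.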
The moveChunk commit in this example generates a new shard version, which we'll call SV. If the client request operated with atClusterTime: TS(15) but also claimed to be using the fresh shard version SV, the same bug would manifest.
Thus we can conclude there is a relationship that must be maintained for correctness of the system with chunk migrations and transactions that do updates:
For a shard receiving a startTransaction request with a shardVersion SV, the request's atClusterTime must be at least the validAfter timestamp associated with SV.
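A minimal sketch of how a shard might enforce this relationship follows. It assumes the shard can look up the validAfter timestamp for a given shard version. The function `check_start_transaction`, the exception `StaleSnapshotError`, and the lookup table are hypothetical names for illustration and do not reflect the server's actual API or error codes.

```python
from typing import Dict


class StaleSnapshotError(Exception):
    """Hypothetical error: atClusterTime is older than the shard version allows."""


def check_start_transaction(
    at_cluster_time: int,
    shard_version: str,
    valid_after_by_version: Dict[str, int],
) -> None:
    """Reject a startTransaction whose snapshot predates its shard version."""
    # Assumption: the shard knows the validAfter timestamp for every version it serves.
    valid_after = valid_after_by_version[shard_version]
    if at_cluster_time < valid_after:
        raise StaleSnapshotError(
            f"atClusterTime {at_cluster_time} is before validAfter "
            f"{valid_after} of shard version {shard_version}"
        )


valid_after_by_version = {"SV": 30}

# A snapshot at or after validAfter is accepted.
check_start_transaction(30, "SV", valid_after_by_version)

# The stale request from the scenario (atClusterTime: TS(15)) is rejected.
try:
    check_start_transaction(15, "SV", valid_after_by_version)
    rejected = False
except StaleSnapshotError:
    rejected = True
print(rejected)
```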