[SERVER-40791] Chunk migration clone blocks behind prepared transactions Created: 23/Apr/19 Updated: 29/Oct/23 Resolved: 20/Jun/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.2.0-rc2, 4.3.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Blake Oler | Assignee: | Blake Oler |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.2 |
| Sprint: | Sharding 2019-06-03, Sharding 2019-06-17, Sharding 2019-07-01 |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
|
The current iteration of chunk migration was built on the assumption that cloning can happen in parallel with prepared transactions on the collection; in practice, it cannot. As part of its default command behavior, the moveChunk command blocks on reads of documents that are in the prepare state. If any documents in the initial index scan belong to prepared transactions, the scan blocks until those transactions complete before it can read them. This means that the beginning of the chunk migration clone phase will block indefinitely in the presence of transactions that:
This ticket is to evaluate if this behavior is acceptable, and if not, figure out a way around this behavior. If we decide to allow the moveChunk command to ignore prepare conflicts, then we will need additional machinery to track these prepared transactions. |
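The trade-off described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not server code: `ConflictPolicy`, `Doc`, `WouldBlock`, and `initial_clone_scan` are hypothetical names invented for illustration, and raising an exception stands in for a read that would actually wait on the prepared transaction.

```python
from dataclasses import dataclass
from enum import Enum


class ConflictPolicy(Enum):
    """Hypothetical stand-in for how a read treats prepared documents."""
    ENFORCE = "enforce"  # default: the read must wait for the transaction
    IGNORE = "ignore"    # read past the prepared document without waiting


@dataclass
class Doc:
    key: int
    prepared: bool = False  # True while a prepared transaction has a pending write


class WouldBlock(Exception):
    """Raised instead of waiting, so the sketch stays runnable."""


def initial_clone_scan(docs, policy):
    """Walk the documents in key order, as the clone's initial scan would.

    Returns (cloned_keys, pending_keys). pending_keys lists the prepared
    documents that were read past and would need separate tracking.
    """
    cloned, pending = [], []
    for doc in sorted(docs, key=lambda d: d.key):
        if doc.prepared:
            if policy is ConflictPolicy.ENFORCE:
                raise WouldBlock(f"scan stalled at key {doc.key}")
            pending.append(doc.key)
        cloned.append(doc.key)
    return cloned, pending
```

With `ENFORCE`, a single prepared document stalls the whole scan. With `IGNORE`, the scan finishes, but the keys in `pending` are exactly the cases that the "additional machinery" mentioned above would have to follow up on once the transactions commit or abort.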
| Comments |
| Comment by Githook User [ 24/Jun/19 ] |
|
Author: {'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}Message: (cherry picked from commit 35424844fd9e10b042c435c83a8f1e23e42fb9e4) |
| Comment by Githook User [ 20/Jun/19 ] |
|
Author: {'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}Message: |
| Comment by Blake Oler [ 13/Jun/19 ] |
|
Reverted because the current usage of ignoring prepare conflicts doesn't allow us to write to the local subsystem. Investigating writes that are happening and will repush with proper behavior. |
| Comment by Githook User [ 13/Jun/19 ] |
|
Author: {'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}Message: Revert " This reverts commit dfa8658c18142c560447c7bf6f34a6f788593d28. |
| Comment by Githook User [ 12/Jun/19 ] |
|
Author: {'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}Message: |
| Comment by Siyuan Zhou [ 30/Apr/19 ] |
|
I investigated transaction_participant.cpp. I think we could add a new OpObserver function, since storage transaction commit can never fail, but I'm concerned that this requires the OpObserver and its caller to know a lot about each other's details, which violates the abstraction of the observer pattern and makes OpObserver harder to maintain in the future.

Another idea is to move onPreparedTransactionCommit() before committing the storage transaction. Since we don't allow storage transaction commit to fail, both the data WT transaction and the oplog WT transaction will commit. To make sure the oplog write isn't visible before the data change, so that afterClusterTime reads are respected, we need to reserve an OplogSlot without using it and release it after both the data and oplog writes commit, as we will do in

The third idea is to release the recovery unit and the locker from the WUOW and manage them manually on prepared commit, so that we only commit the recovery unit in _commitStorageTransaction() instead of committing the WUOW, and the locker outlives both the data and oplog writes. We'll need the storage team to weigh in on this.

I'd prefer Blake's second solution, which is to always add recovery unit onCommit handlers on multi-statement transaction CRUD ops, if we can avoid unowned BSON copies. When onCommit() is called on the recovery unit, it is by design the right time to check committed data, and this avoids blurring the boundary between sharding and transactions as we saw in the other alternatives. |
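The preferred option in the comment above, registering commit handlers on the recovery unit, can be pictured with a toy model. This is a sketch under assumptions, not the server's implementation: `ToyRecoveryUnit`, `ToyMigrationTracker`, and `transactional_write` are hypothetical names, and the handler captures only a document id to mirror the wish to avoid copying whole documents.

```python
class ToyRecoveryUnit:
    """Hypothetical model: callbacks run only if the storage transaction commits."""

    def __init__(self):
        self._on_commit = []

    def on_commit(self, handler):
        # Queue a callable that takes the commit timestamp.
        self._on_commit.append(handler)

    def commit(self, commit_ts):
        # Detach the queue first so each handler runs at most once.
        handlers, self._on_commit = self._on_commit, []
        for handler in handlers:
            handler(commit_ts)

    def abort(self):
        # An aborted transaction never notifies anyone.
        self._on_commit.clear()


class ToyMigrationTracker:
    """Hypothetical model of the migration's list of documents to transfer."""

    def __init__(self):
        self.to_transfer = []  # (doc_id, commit_ts) pairs


def transactional_write(recovery_unit, tracker, doc_id):
    # Capture only the id, not the document body, when registering the handler.
    recovery_unit.on_commit(
        lambda ts, doc_id=doc_id: tracker.to_transfer.append((doc_id, ts))
    )
```

In this model the tracker learns about a write only after commit, and an aborted transaction leaves it untouched, which is the property the comment describes as checking committed data at the right time.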
| Comment by Blake Oler [ 30/Apr/19 ] |
|
After some in-person discussion with siyuan.zhou, kaloian.manassiev, renctan, and myself, we have decided on the following: We want to know the amount and complexity of work it would take to allow migrations to start cloning while ignoring prepare conflicts.
After our separate investigations, we will decide which approach is the better approach to pursue. |