[SERVER-40791] Chunk migration clone blocks behind prepared transactions Created: 23/Apr/19  Updated: 29/Oct/23  Resolved: 20/Jun/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.2.0-rc2, 4.3.1

Type: Bug Priority: Major - P3
Reporter: Blake Oler Assignee: Blake Oler
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File prepare_transaction_then_migrate.js    
Issue Links:
Backports
Depends
Related
related to SERVER-68361 LogTransactionOperationsForShardingHa... Closed
related to SERVER-78414 Recipient shard in chunk migration ca... Closed
related to SERVER-78415 Avoid sending unrelated operations fr... Backlog
is related to SERVER-71028 MigrationChunkClonerSourceLegacy::nex... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2
Sprint: Sharding 2019-06-03, Sharding 2019-06-17, Sharding 2019-07-01
Participants:
Linked BF Score: 0

 Description   

The current iteration of chunk migration was built on the assumption that cloning can happen in parallel with prepared transactions on the collection; however, this is not what happens in practice.

As part of default command behavior, the moveChunk command will block on reads for documents that are in the prepare state. If any documents in the initial index scan are in prepared transactions, the index scan will block until those transactions complete before it can read these documents.

This means that the beginning of the chunk migration clone phase will block indefinitely in the presence of transactions that:

  1. Have documents that exist in the chunk being migrated, and
  2. Are already in the prepare state before the chunk cloner started.

This ticket is to evaluate if this behavior is acceptable, and if not, figure out a way around this behavior. If we decide to allow the moveChunk command to ignore prepare conflicts, then we will need additional machinery to track these prepared transactions.
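
The two conditions above can be illustrated with a toy model. This is only a sketch: the names `PreparedTxn` and `blocking_txns`, the integer shard-key values, and the integer timestamps are all hypothetical and are not taken from the server code. It assumes the usual half-open chunk range [min, max).

```python
from dataclasses import dataclass


@dataclass
class PreparedTxn:
    """Hypothetical prepared transaction: the shard-key values it wrote and when it prepared."""
    name: str
    keys: set
    prepare_time: int


def blocking_txns(chunk_min, chunk_max, clone_start_time, txns):
    """Return the names of transactions the initial clone scan would wait on.

    Toy rule mirroring the two conditions in the description: the transaction
    touches a key in [chunk_min, chunk_max) and was already prepared when
    the cloner started.
    """
    return [
        t.name
        for t in txns
        if t.prepare_time < clone_start_time
        and any(chunk_min <= k < chunk_max for k in t.keys)
    ]


txns = [
    PreparedTxn("in_chunk_prepared_early", {5}, prepare_time=1),
    PreparedTxn("outside_chunk", {50}, prepare_time=1),
    PreparedTxn("in_chunk_prepared_late", {7}, prepare_time=9),
]

# Only the first transaction meets both conditions for the chunk [0, 10).
print(blocking_txns(0, 10, clone_start_time=3, txns=txns))
```

In this model the clone of chunk [0, 10) waits only on the transaction that both touches the chunk and prepared before cloning began; the other two do not hold up the initial scan.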



 Comments   
Comment by Githook User [ 24/Jun/19 ]

Author:

{'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}

Message: SERVER-40791 Track multi-statement transaction operations for migrations at commit time

(cherry picked from commit 35424844fd9e10b042c435c83a8f1e23e42fb9e4)
Branch: v4.2
https://github.com/mongodb/mongo/commit/6386b168ec0e701ad8649b8cec58f8913a9f076a

Comment by Githook User [ 20/Jun/19 ]

Author:

{'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}

Message: SERVER-40791 Track multi-statement transaction operations for migrations at commit time
Branch: master
https://github.com/mongodb/mongo/commit/35424844fd9e10b042c435c83a8f1e23e42fb9e4

Comment by Blake Oler [ 13/Jun/19 ]

Reverted because the current usage of ignoring prepare conflicts doesn't allow us to write to the local subsystem. Investigating writes that are happening and will repush with proper behavior.

Comment by Githook User [ 13/Jun/19 ]

Author:

{'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}

Message: Revert "SERVER-40791 Track multi-statement transaction operations for migrations at commit time"

This reverts commit dfa8658c18142c560447c7bf6f34a6f788593d28.
Branch: master
https://github.com/mongodb/mongo/commit/68d111cd7800e1d91b41d4955cf8cf5921f34130

Comment by Githook User [ 12/Jun/19 ]

Author:

{'name': 'Blake Oler', 'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake'}

Message: SERVER-40791 Track multi-statement transaction operations for migrations at commit time
Branch: master
https://github.com/mongodb/mongo/commit/dfa8658c18142c560447c7bf6f34a6f788593d28

Comment by Siyuan Zhou [ 30/Apr/19 ]

I investigated transaction_participant.cpp. I think we could add a new opObserver function since storage transaction commit can never fail, but I'm concerned that it requires the OpObserver and its caller to know a lot of details about each other, violating the abstraction of the observer pattern and making the maintenance of OpObserver harder in the future.

Another idea is to move onPreparedTransactionCommit() before committing the storage transaction. Since we don't allow storage transaction commit to fail, both the data WT transaction and the oplog WT transaction will commit. To make sure the oplog write isn't visible before the data change, so that afterClusterTime reads are respected, we need to reserve an OplogSlot without using it and release it after both data and oplog writes commit, as we will do in SERVER-40870.

The third idea is to release the recovery unit and the locker from the WUOW and manage them manually on prepared commit, so that we only commit the recovery unit in _commitStorageTransaction() instead of committing the WUOW and have the locker outlive both data and oplog writes. We'll need storage team to weigh in about this.

I'd prefer Blake's second solution to always add recovery unit onCommit handlers on multi-statement transaction CRUD ops if we can avoid unowned BSON copies, since when onCommit() is called on recovery unit, it's the right time to check committed data by design, avoiding blurring the boundary of sharding and transaction as we saw in other alternatives.

Comment by Blake Oler [ 30/Apr/19 ]

After some in-person discussion with siyuan.zhou, kaloian.manassiev, renctan, and myself, we have decided on the following:

We want to know the amount and complexity of work it would take to allow migrations to start cloning while ignoring prepare conflicts.

  1. siyuan.zhou will investigate if it is possible to create a new opObserver function that can run before we commit the storage transaction on prepared transaction commit. This will allow us to avoid locking issues that were found in SERVER-39926.
  2. I will investigate if it is possible to forgo hooking onto the onCommit/onPrepare observers completely. I will mock up a POC to see if it is possible and performant to always add RecoveryUnit onCommit handlers on multi-statement transaction CRUD ops. This way, we could check if a migration exists while executing callback handlers, guaranteeing that we will never miss committed transaction writes that concern a chunk being migrated.

After our separate investigations, we will decide which approach is better to pursue.
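
The second approach (always registering commit handlers on transactional CRUD ops and checking for a migration when they fire) can be sketched roughly as follows. This is a toy model in Python, not the server's C++ code: the classes `RecoveryUnit` and `ActiveMigration` and the function `transactional_write` are hypothetical stand-ins written for illustration, and the key values are made up.

```python
class RecoveryUnit:
    """Toy stand-in for a storage recovery unit that runs callbacks on commit."""

    def __init__(self):
        self._on_commit = []

    def on_commit(self, handler):
        self._on_commit.append(handler)

    def commit(self):
        # In this model, handlers run only once the commit has happened.
        for handler in self._on_commit:
            handler()
        self._on_commit.clear()

    def abort(self):
        # Aborted transactions never notify anyone.
        self._on_commit.clear()


class ActiveMigration:
    """Hypothetical record of a chunk being migrated, covering [min_key, max_key)."""

    def __init__(self, min_key, max_key):
        self.min_key, self.max_key = min_key, max_key
        self.transferred_ops = []

    def covers(self, key):
        return self.min_key <= key < self.max_key


def transactional_write(ru, op, key, get_migration):
    """Register a commit-time check for one CRUD op in a multi-statement transaction."""

    def notify():
        migration = get_migration()  # looked up at commit time, not at write time
        if migration is not None and migration.covers(key):
            migration.transferred_ops.append((op, key))

    ru.on_commit(notify)


state = {"migration": None}
ru = RecoveryUnit()
transactional_write(ru, "insert", 5, lambda: state["migration"])
transactional_write(ru, "insert", 50, lambda: state["migration"])
state["migration"] = ActiveMigration(0, 10)  # migration starts after the writes
ru.commit()
print(state["migration"].transferred_ops)
```

Because the migration lookup happens inside the handler, a write made before the migration started is still picked up as long as the transaction commits while the migration is active, which is the property the comment above is after.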

Generated at Thu Feb 08 04:55:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.