[SERVER-39096] Prepared transactions and DDL operations can deadlock on a secondary, if a reader blocks on a prepared document Created: 18/Jan/19 Updated: 18/Apr/19 Resolved: 04/Feb/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Siyuan Zhou | Assignee: | Xiangyu Yao (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||
| Sprint: | Storage NYC 2019-02-11 | ||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||
| Linked BF Score: | 9 | ||||||||||||||||||||||||||||
| Description |
|
When a transaction is prepared on secondary, a DBHash transaction on the same collection will have to wait for the prepared transaction to either commit or abort, while holding a global lock in IX mode. If a DDL operation comes in at this moment, the DDL operation will be blocked since it needs the global lock in X mode. There might be commit or abort command coming after the DDL in the oplog, however, they don't have a chance to apply. This isn't a problem on primary, since a concurrent commit / abort command can come in while DBHash is blocked, however secondaries serialize all operations. This won't be solved by acquiring weaker locks for DDL operations either, because DBHash may be holding a IX lock on one collection, but waiting for a prepared transaction in another collection. For example, there are two collection C1 and C2. The sequence of operations on a secondary is:
The prepared transaction and the DDL operation should both have succeeded on primary since they don't conflict. Only snapshot read on secondary causes this problem since they cannot yield. There might be a way to change DBHash to use a special snapshot isolation mechanism instead of read-only transaction. Locking, prepare-waiting and DDL operations (e.g. dropCollection) have to be considered when designing such an approach. In the short term, we perhaps need to disable read-only transaction on secondaries. |
| Comments |
| Comment by Xiangyu Yao (Inactive) [ 04/Feb/19 ] |
|
Discussed offline with geert.bosch and we decided a better solution for the case judah.schvimer described would be to make secondary acquire finer-grained locks (same as the primary). Filed Closing this ticket as 'Won't Fix' because we are not going to implement lock yielding during prepare conflict retry. |
| Comment by Judah Schvimer [ 04/Feb/19 ] |
|
createIndexes was just an example. I expect collMod, create, drop, dropIndexes, and renameCollection all could. |
| Comment by Eric Milkie [ 04/Feb/19 ] |
|
We now have code like this for secondary batch application: https://github.com/mongodb/mongo/blob/8cbbac8636b5cfae5dd84f308ca9092b0da8b549/src/mongo/db/repl/sync_tail.cpp#L372 |
| Comment by Judah Schvimer [ 04/Feb/19 ] |
|
To clarify the proposed problem, for (2) this isn't necessarily a DBHash, this could be any secondary read with an afterClusterTime. You make a really good point. The original problem was predicated on the DDL operation not conflicting with the prepared transaction. This was possible because the DBHash blocked on the prepared collection (through a prepare conflict) and the DDL operation's collection (through a lock conflict) separately. Due to two-phase locking, DBHash held onto the DDL operation's lock while blocked on the prepared transaction. Without 2PL (i.e. outside of a transaction) are there any secondary reads (like DBHash maybe?) that hold onto one collection lock while reading from another collection at the same time? Furthermore, we often take stronger locks on secondaries than we do on primaries. If we have a createIndex operation that takes a collection lock for collection A on the primary but a global lock on the secondary, I feel like this can still happen. This is able to happen on the primary because the createIndex takes a more granular lock. |
| Comment by Xiangyu Yao (Inactive) [ 01/Feb/19 ] |
|
judah.schvimer, I don't quite understand why this deadlock can still happen with normal afterClusterTime reads on secondary.
As described, the sequence is like the following:
Imagine we don't have that afterClusterTime read, then the serialized operations are:
But such sequence can never be generated by the primary: on primary, Step 1 holds an IX lock on "a" until Step 3; therefore, Step 2 (which requires X lock on "a") could never be done in between Step 1 and 3. |
| Comment by Geert Bosch [ 24/Jan/19 ] |
|
It may not always be safe to yield locks (for example inside DBDirectClient). Maybe safer to throw an exception in this case and wait somewhere in query. |
| Comment by Eric Milkie [ 24/Jan/19 ] |
|
Please proceed; I'm fine with that plan. |
| Comment by Siyuan Zhou [ 24/Jan/19 ] |
|
Given that secondary transaction is banned in |
| Comment by Judah Schvimer [ 23/Jan/19 ] |
|
My understanding is the WCE retry happens slightly differently from prepare conflict retrying, but I think yes, assuming both are included I think it would solve the issue by letting the DDL op acquire the lock it needs to complete. |
| Comment by Eric Milkie [ 23/Jan/19 ] |
|
I wonder if we should add lock yielding behavior to all Write Conflict Exception retrying, for plan executors that support yielding. (Not for every WCE retry, but yielding locks according to the yield policy.) Would that solve the issue here? |
| Comment by Judah Schvimer [ 23/Jan/19 ] |
|
Even removing secondary transactions, per conversation with tess.avitabile and max.hirschhorn, we think this can still happen with normal afterClusterTime reads on secondaries. The afterClusterTime read can hold a collection IS lock, and block on a prepare conflict. A DDL op being applied can then block on the afterCLusterTime read to release its lock, and the prepared transaction will block on the DDL op to finish being applied for its commit oplog entry to be applied. One idea for fixing this is to make reads yield their locks when they block on prepare conflicts. louis.williams and geert.bosch, do you see any problem with this? |
| Comment by Tess Avitabile (Inactive) [ 22/Jan/19 ] |
|
max.hirschhorn, I think it is important that we disable secondary transactions, since they are a drain on productivity, both for BF investigation and for design decisions, as siyuan.zhou described. It was a great thing for 4.0 that we were able to perform background dbhash checks, but now that secondaries keep transaction state, we cannot support secondary transactions without a holistic design. I propose that we disable secondary transactions and background dbhash checks now and design a solution for background dbhash checks at the end of the Prepare project if we have time. |
| Comment by Max Hirschhorn [ 18/Jan/19 ] |
siyuan.zhou, why does the dbhash command need to wait for the prepared transaction to commit or abort? I feel like we should be able to either (a) have the run_check_repl_dbhash_background.js hook pick a clusterTime that is earlier than the timestamp associated with any prepared transactions, or (b) change the dbhash command to return SnapshotUnavailable if there's a prepared transaction on the collection. |