[SERVER-40594] Range deleter in prepare conflict retry loop blocks step down Created: 11/Apr/19  Updated: 29/Oct/23  Resolved: 23/May/19

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: None
Fix Version/s: 4.1.12

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Matthew Saltz (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-40700 Deadlock between read prepare conflic... Closed
related to SERVER-41035 Rollback should kill all user operati... Closed
is related to SERVER-39096 Prepared transactions and DDL operati... Closed
is related to SERVER-40586 step up instead of stepping down in s... Closed
is related to SERVER-40641 Ensure TTL delete in prepare conflict... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Note that this repro relies on a sleep so the collection range deleter runs after the transaction is prepared and before the step down attempt; it may need to be rerun a few times to trigger the hang.

(function() {
    "use strict";
 
    TestData.skipCheckingUUIDsConsistentAcrossCluster = true;
 
    // Helper to add generic txn fields to a command.
    function addTxnFieldsToCmd(cmd, lsid, txnNumber) {
        return Object.extend(
            cmd, {lsid, txnNumber: NumberLong(txnNumber), stmtId: NumberInt(0), autocommit: false});
    }
 
    const dbName = "test";
    const collName = "foo";
    const ns = dbName + "." + collName;
 
    const st = new ShardingTest({shards: 2, config: 1});
 
    // Set up sharded collection with two chunks - [-inf, 0), [0, inf)
    assert.commandWorked(st.s.adminCommand({enableSharding: dbName}));
    assert.commandWorked(st.s.adminCommand({movePrimary: dbName, to: st.shard0.shardName}));
    assert.commandWorked(st.s.adminCommand({shardCollection: ns, key: {_id: 1}}));
    assert.commandWorked(st.s.adminCommand({split: ns, middle: {_id: 0}}));
 
    // Move a chunk away from Shard0 (the donor) so its range deleter will asynchronously delete the
    // chunk's range. Flush its metadata to avoid StaleConfig during the later transaction.
    assert.commandWorked(
        st.s.adminCommand({moveChunk: ns, find: {_id: 10}, to: st.shard1.shardName}));
    assert.commandWorked(st.rs0.getPrimary().adminCommand({_flushRoutingTableCacheUpdates: ns}));
 
    // Insert a doc into the chunk still owned by the donor shard in a transaction then prepare the
    // transaction so readers of that doc will enter a prepare conflict retry loop.
    const lsid = {id: UUID()};
    const txnNumber = 0;
    assert.commandWorked(st.s.getDB(dbName).runCommand(addTxnFieldsToCmd(
        {insert: collName, documents: [{_id: -5}], startTransaction: true}, lsid, txnNumber)));
 
    assert.commandWorked(st.rs0.getPrimary().adminCommand(
        addTxnFieldsToCmd({prepareTransaction: 1}, lsid, txnNumber)));
 
    // Wait for range deleter to run. It should get stuck in a prepare conflict retry loop.
    sleep(1000);
 
    // Attempt to step down the primary. As in the description, this will fail with a lock timeout
    // if the range deleter ran after the above transaction was prepared.
    assert.commandFailedWithCode(
        st.rs0.getPrimary().adminCommand({replSetStepDown: 5, force: true}),
        ErrorCodes.LockTimeout);
 
    // Cleanup the transaction so the sharding test can shut down.
    assert.commandWorked(st.rs0.getPrimary().adminCommand(
        addTxnFieldsToCmd({abortTransaction: 1}, lsid, txnNumber)));
 
    st.stop();
})();
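
A more deterministic variant of the wait (a sketch only; the currentOp filter assumes the range deleter's operation carries an identifiable description, which may differ by version) would replace the sleep(1000) with an assert.soon that polls currentOp before attempting the step down:

    // Sketch: instead of a fixed sleep, wait until the range deleter's operation is
    // visible in currentOp. The substring matched against op.desc is illustrative.
    assert.soon(() => {
        const inprog = st.rs0.getPrimary().getDB("admin").currentOp(true).inprog;
        return inprog.some(op => op.desc && op.desc.toLowerCase().includes("range"));
    }, "timed out waiting for the range deleter operation to appear");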

Sprint: Sharding 2019-05-06, Sharding 2019-05-20, Sharding 2019-06-03
Participants:
Linked BF Score: 19

 Description   

Replication step down requires the ReplicationStateTransitionLock (RSTL) in MODE_X and kills user operations, but it does not kill internal operations, such as those run by the collection range deleter. If the range deleter runs and enters a prepare conflict retry loop (which waits without yielding locks), it will hang until the prepared transaction modifying the data it is reading commits or aborts. Because the RSTL cannot be taken in exclusive mode until the range deleter operation finishes, every step down attempt during this window times out waiting for the RSTL.

This should also be a problem for step up (and other operations that require the RSTL) and may be triggered by other internal operations that can read prepared data, but so far I've only seen it with step down and the range deleter. The step up case might be worse, because a prepared transaction cannot commit or abort (and so unblock an internal operation) if there is no primary.
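
While the step down attempt is blocked, the situation can be seen in currentOp output on the shard primary. The snippet below is a sketch only: the exact desc string for the range deleter thread and the waitingForLock field are assumptions and may vary by version.

    // Sketch: list the stuck range deleter operation and the pending replSetStepDown
    // command while the step down is blocked. Field names shown are illustrative.
    const inprog = st.rs0.getPrimary().getDB("admin").currentOp(true).inprog;
    inprog.filter(op => (op.desc && op.desc.toLowerCase().includes("range")) ||
                        (op.command && op.command.hasOwnProperty("replSetStepDown")))
          .forEach(op => printjson({desc: op.desc, command: op.command,
                                    waitingForLock: op.waitingForLock}));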



 Comments   
Comment by Githook User [ 23/May/19 ]

Author: Matthew Saltz <matthew.saltz@mongodb.com> (username: saltzm)

Message: SERVER-40594 Make range deleter interruptible
Branch: master
https://github.com/mongodb/mongo/commit/645c02e7a139171aec96376cedb8b159e00a0aa8
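
With the range deleter made interruptible, the expectation is that a step down can interrupt the stuck range deleter operation rather than time out on the RSTL. A hedged sketch of how the repro's final assertion would change after this commit (not verified against the patch):

    // Sketch: after the fix, the same step down attempt is expected to interrupt
    // the range deleter and succeed instead of failing with LockTimeout.
    assert.commandWorked(
        st.rs0.getPrimary().adminCommand({replSetStepDown: 5, force: true}));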

Comment by Judah Schvimer [ 13/May/19 ]

If an unconditional stepdown happens between prepare and commit, the node that needs to step down would effectively have to wait forever for the transaction to be committed (since it will only receive the commit as a secondary).

Comment by Matthew Saltz (Inactive) [ 13/May/19 ]

judah.schvimer I think the reason is that the situation described in this ticket for step down isn't a true deadlock - it's only problematic as long as the transaction never commits or aborts. As soon as the transaction completes, step down can continue and succeed. Since our tests are written to not leave prepared transactions hanging (AFAIK), I wouldn't expect to run into this in the concurrency suites. jack.mulrow can you confirm that this understanding is correct?

Comment by Judah Schvimer [ 08/May/19 ]

As part of this ticket, I would like to investigate why we aren't catching these deadlocks in our concurrency_sharded_with_stepdowns_and_balancer suite. Are chunk migrations or range deletion not getting prepare conflicts for some reason? Are we not actually moving any chunks from the balancer? CC max.hirschhorn

Comment by Suganthi Mani [ 07/May/19 ]

This ticket is not blocked on the other work (SERVER-40700); it is an independent task, just like SERVER-40641, and targets 4.2.0-rc0.

Comment by Judah Schvimer [ 07/May/19 ]

Are there any other parts of chunk migration that risk the same deadlock?

Comment by Judah Schvimer [ 22/Apr/19 ]

I'm marking this as blocked until SERVER-40700 is designed.

Comment by Jack Mulrow [ 11/Apr/19 ]

"What internal operations are you thinking of?"

None in particular - I only put that down since it came up in the sharded txn / prepare standup that this could (theoretically) also affect step up and I didn't want to forget that.

Comment by Judah Schvimer [ 11/Apr/19 ]

Note a similar discussion was held around SERVER-39096 where we decided not to let prepare conflict retry loops yield locks. SERVER-40586 was filed to test this behavior more.

Comment by Judah Schvimer [ 11/Apr/19 ]

jack.mulrow, step up doesn't kill operations at all, but also doesn't expect arbitrary internal processes to be reading or writing to user data. What internal operations are you thinking of?
