[SERVER-54681] Resharding recipient shards which are also donors may fail retryable writes with IncompleteTransactionHistory too early Created: 20/Feb/21  Updated: 29/Oct/23  Resolved: 18/Mar/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Alexander Taskov (Inactive)
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-config-txn-clone
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
    Depends
        is depended on by SERVER-55683 Remove waiting for minimum duration f... Closed
    Related
        related to SERVER-55214 Resharding txn cloner can miss config... Closed
        is related to SERVER-54626 Retryable writes may execute more tha... Closed
        is related to SERVER-52921 Integrate config.transactions cloner ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

python buildscripts/resmoke.py run --suite=sharding resharding_inplace_retryable_writes.js

resharding_inplace_retryable_writes.js

(function() {
"use strict";
 
load("jstests/sharding/libs/resharding_test_fixture.js");
 
const reshardingTest = new ReshardingTest({numDonors: 2, numRecipients: 2, reshardInPlace: true});
reshardingTest.setup();
 
const donorShardNames = reshardingTest.donorShardNames;
const sourceCollection = reshardingTest.createShardedCollection({
    ns: "reshardingDb.coll",
    shardKeyPattern: {oldKey: 1},
    chunks: [
        {min: {oldKey: MinKey}, max: {oldKey: 0}, shard: donorShardNames[0]},
        {min: {oldKey: 0}, max: {oldKey: MaxKey}, shard: donorShardNames[1]},
    ],
});
 
assert.commandWorked(sourceCollection.insert([
    {_id: "stays on shard0", oldKey: -10, newKey: -10, counter: 0},
    {_id: "moves to shard0", oldKey: 10, newKey: -10, counter: 0},
]));
 
const mongos = sourceCollection.getMongo();
// retryWrites: false so the shell doesn't assign its own txnNumber; the test
// supplies one explicitly below to make the update a retryable write.
const session = mongos.startSession({causalConsistency: false, retryWrites: false});
const sessionCollection = session.getDatabase(sourceCollection.getDB().getName())
                              .getCollection(sourceCollection.getName());
 
// Reissues the same update with the same txnNumber and verifies each document
// was only incremented once, i.e. that the write remained retryable.
function runRetryableWrite(phase, expectedErrorCode = ErrorCodes.OK) {
    const res = sessionCollection.runCommand("update", {
        updates: [
            {q: {_id: "stays on shard0"}, u: {$inc: {counter: 1}}},
            {q: {_id: "moves to shard0"}, u: {$inc: {counter: 1}}},
        ],
        txnNumber: NumberLong(1)
    });
 
    if (expectedErrorCode === ErrorCodes.OK) {
        assert.commandWorked(res);
    } else {
        assert.commandFailedWithCode(res, expectedErrorCode);
    }
 
    const docs = sourceCollection.find().toArray();
    assert.eq(2, docs.length, {docs});
 
    for (const doc of docs) {
        assert.eq(1,
                  doc.counter,
                  {message: `retryable write executed more than once ${phase}`, id: doc._id, docs});
    }
}
 
runRetryableWrite("before resharding");
 
const recipientShardNames = reshardingTest.recipientShardNames;
reshardingTest.withReshardingInBackground(  //
    {
        newShardKeyPattern: {newKey: 1},
        newChunks: [
            {min: {newKey: MinKey}, max: {newKey: 0}, shard: recipientShardNames[0]},
            {min: {newKey: 0}, max: {newKey: MaxKey}, shard: recipientShardNames[1]},
        ],
    },
    () => {
        runRetryableWrite("during resharding");
 
        assert.soon(() => {
            const coordinatorDoc = mongos.getCollection("config.reshardingOperations").findOne({
                nss: sourceCollection.getFullName()
            });
 
            return coordinatorDoc !== null && coordinatorDoc.fetchTimestamp !== undefined;
        });
 
        runRetryableWrite("during resharding after fetchTimestamp was chosen");
 
        assert.soon(() => {
            const coordinatorDoc = mongos.getCollection("config.reshardingOperations").findOne({
                nss: sourceCollection.getFullName()
            });
 
            return coordinatorDoc !== null && coordinatorDoc.state === "applying";
        });
 
        runRetryableWrite("during resharding after collection cloning had finished");
    });
 
runRetryableWrite("after resharding", ErrorCodes.IncompleteTransactionHistory);
 
reshardingTest.teardown();
})();

Sprint: Sharding 2021-03-08, Sharding 2021-03-22
Participants:
Story Points: 1

Description

The RecipientStateMachine skips constructing a ReshardingTxnCloner for its own shard when the recipient shard is also a donor shard. The intention was to allow retryable writes started before the resharding operation to remain retryable during it. However, skipping only its own cloner is insufficient: while cloning the config.transactions entries of the other donor shards, the recipient may still write the kIncompleteHistoryStmtId entry for a retryable write it had itself participated in as a donor, because another donor shard had also participated in that same retryable write.

void ReshardingRecipientService::RecipientStateMachine::_initTxnCloner(
    OperationContext* opCtx, const Timestamp& fetchTimestamp) {
    auto catalogCache = Grid::get(opCtx)->catalogCache();
    auto routingInfo = catalogCache->getShardedCollectionRoutingInfo(opCtx, _recipientDoc.getNss());
    std::set<ShardId> shardList;
 
    const auto myShardId = ShardingState::get(opCtx)->shardId();
    routingInfo.getAllShardIds(&shardList);
    // Erasing this shard's own id is what skips constructing a
    // ReshardingTxnCloner for itself when it is also a donor.
    shardList.erase(myShardId);
 
    for (const auto& shard : shardList) {
        _txnCloners.push_back(
            std::make_unique<ReshardingTxnCloner>(ReshardingSourceId(_id, shard), fetchTimestamp));
    }
}

https://github.com/mongodb/mongo/blob/2b397f74e6a05b7c11e63a61d4aaf9443f58b150/src/mongo/db/s/resharding/resharding_recipient_service.cpp#L369



Comments
Comment by Githook User [ 18/Mar/21 ]

Author:

{'name': 'Alex Taskov', 'email': 'alex.taskov@mongodb.com', 'username': 'alextaskov'}

Message: SERVER-54681 Delay start of txnCloners to prevent early write of kIncompleteHistoryStmtId during resharding operation
Branch: master
https://github.com/mongodb/mongo/commit/272954c26742e40884d5532366d7ec419d1b13d4

Comment by Max Hirschhorn [ 20/Feb/21 ]

I think the only solution is to

  1. Change recipient shards to construct a ReshardingTxnCloner for themselves when they are also donor shards.
  2. Wait reshardingMinimumOperationDurationMillis on the recipient shards before starting any of the ReshardingTxnCloners. It would be good to overlap reshardingMinimumOperationDurationMillis with the time spent by ReshardingCollectionCloner::run() rather than having the two execute sequentially (see the sketch after this list).
    • The reshardingMinimumOperationDurationMillis value should come from the config server via the config.collections entry rather than from the recipient shard consulting its own server parameter value.
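
A minimal sketch of item 1, written against the _initTxnCloner snippet quoted in the description: the only change is dropping the shardList.erase(myShardId) call, so a ReshardingTxnCloner is constructed for every donor shard, this one included. This illustrates the proposal rather than the change as actually committed.

void ReshardingRecipientService::RecipientStateMachine::_initTxnCloner(
    OperationContext* opCtx, const Timestamp& fetchTimestamp) {
    auto catalogCache = Grid::get(opCtx)->catalogCache();
    auto routingInfo = catalogCache->getShardedCollectionRoutingInfo(opCtx, _recipientDoc.getNss());
    std::set<ShardId> shardList;

    routingInfo.getAllShardIds(&shardList);
    // Intentionally no shardList.erase(myShardId): this shard also gets a
    // ReshardingTxnCloner when it is both a donor and a recipient.

    for (const auto& shard : shardList) {
        _txnCloners.push_back(
            std::make_unique<ReshardingTxnCloner>(ReshardingSourceId(_id, shard), fetchTimestamp));
    }
}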

The following sequence demonstrates a recipient shard writing the kIncompleteHistoryStmtId entry too early because it doesn't know that it will also eventually participate in the same retryable write.

| Client 1 | Client 2 | Shard 1 (both a donor and a recipient) | Shard 2 (both a donor and a recipient) |
| --- | --- | --- | --- |
| Client 1 performs retryable write | | | |
| ... | | Shard 1 (as a donor) records stmtId 1 | |
| Client 1 gets a network error and will retry later | | | |
| | Client 2 runs reshardCollection | | |
| | | Shard 1 (as a donor) chooses minFetchTimestamp | Shard 2 (as a donor) chooses minFetchTimestamp |
| | | | Shard 2 (as a recipient) records kIncompleteHistoryStmtId for the retryable write performed on Shard 1 (as a donor) without realizing that it may also eventually become a participant for that same retryable write |
| Client 1 retries the retryable write | | | |
| ... | | Shard 1 (as a donor) acks it has already done stmtId 1 | |
| ... | | | Shard 2 (as a donor) errors with IncompleteTransactionHistory |

With the solution of stretching out the kCloning portion of resharding on the recipient shards to last at least reshardingMinimumOperationDurationMillis, Shard 2 (as a recipient) wouldn't record kIncompleteHistoryStmtId for the retryable write performed on Shard 1 (as a donor) for reshardingMinimumOperationDurationMillis amount of time, and therefore Client 1's retry of the retryable write would succeed.

Another way to frame the changes to RecipientStateMachine is to have ReshardingCollectionCloner::run() and ReshardingOplogFetcher::schedule() be called immediately in _cloneThenTransitionToApplying(), and ReshardingTxnCloner::run() be called after a delay of 5 minutes (the default reshardingMinimumOperationDurationMillis value). The transition to kApplying would still wait for the futures returned by ReshardingCollectionCloner::run() and ReshardingTxnCloner::run() to all become ready. A sketch of this scheduling follows.
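
Below is a minimal, self-contained sketch of that scheduling, modeled with std::async and std::future instead of the server's executor machinery. The names cloneCollection, fetchOplog, cloneTransactions, and kMinimumOperationDuration are hypothetical stand-ins, and the durations are shortened so the sketch runs quickly.

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

using namespace std::chrono_literals;

// Hypothetical stand-ins for ReshardingCollectionCloner::run(),
// ReshardingOplogFetcher::schedule(), and ReshardingTxnCloner::run().
void cloneCollection() { std::this_thread::sleep_for(2s); }
void fetchOplog() { std::this_thread::sleep_for(1s); }
void cloneTransactions() { std::this_thread::sleep_for(1s); }

// Stand-in for reshardingMinimumOperationDurationMillis (5 minutes by default
// on a real server; shortened here).
constexpr auto kMinimumOperationDuration = 3s;

int main() {
    // Collection cloning and oplog fetching start immediately, so the minimum
    // duration overlaps with them rather than running sequentially afterwards.
    auto collectionCloner = std::async(std::launch::async, cloneCollection);
    auto oplogFetcher = std::async(std::launch::async, fetchOplog);

    // The txn cloner is delayed, so kIncompleteHistoryStmtId entries are not
    // written before outstanding retryable writes have had time to be retried.
    auto txnCloner = std::async(std::launch::async, [] {
        std::this_thread::sleep_for(kMinimumOperationDuration);
        cloneTransactions();
    });

    // The transition to kApplying still waits for all the futures to be ready.
    collectionCloner.get();
    oplogFetcher.get();
    txnCloner.get();
    std::cout << "transitioning to kApplying\n";
    return 0;
}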
