[SERVER-46669] moveChunk may succeed but not respect waitForDelete=true if replica set shard primary steps down Created: 06/Mar/20  Updated: 06/Dec/22  Resolved: 23/Aug/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.0, 3.6.0, 4.0.0, 4.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File server42192.patch    
Issue Links:
Problem/Incident
is caused by SERVER-26307 MigrationManager can keep a migration... Closed
Related
related to SERVER-53094 Tests which use {waitForDelete:true} ... Closed
related to SERVER-66716 WaitForDelete may not be honored in c... Closed
related to SERVER-64181 Remove TODO listed in SERVER-46669 Closed
is related to SERVER-25999 Mongos applies errors received from c... Closed
is related to SERVER-42192 Write a concurrency workload to test ... Closed
Assigned Teams:
Sharding EMEA
Operating System: ALL
Steps To Reproduce:

Apply server42192.patch to allow the moveChunk command to be automatically retried in the presence of failovers, then run the agg_with_chunk_migrations.js FSM workload. The --repeat flag is necessary because, while this concurrency test reproduces the issue often, it doesn't do so every time.

python buildscripts/resmoke.py --suite=concurrency_sharded_multi_stmt_txn_terminate_primary jstests/concurrency/fsm_workloads/agg_with_chunk_migrations.js --repeat=10

Participants:

Description

The changes from cc8e8a1 as part of SERVER-26307 made it so a BalancerInterrupted error response is no longer returned when the moveChunk command fails due to a retryable error on the replica set shard primary. Additionally, the changes from 53efde3 as part of SERVER-25999 made it so an OperationFailed error status is returned by MigrationManager::_processRemoteCommandResponse(); however, any non-BalancerInterrupted error status is converted to an ok=1 response so long as the chunk has successfully been migrated. That conversion never checks whether _waitForDelete=true was specified in the moveChunk command request, so the command can report success even though it may not have waited long enough for the range to be cleaned up.

We should either (a) actually wait long enough for the range deletion to complete, or (b) preserve the OperationFailed error response as a way to inform the user. The conversion in question is shown below; a sketch of option (b) follows the snippet.

Status commandStatus = _processRemoteCommandResponse(
    remoteCommandResponse, &statusWithScopedMigrationRequest.getValue());
 
// Migration calls can be interrupted after the metadata is committed but before the command
// finishes the waitForDelete stage. Any failovers, therefore, must always cause the moveChunk
// command to be retried so as to assure that the waitForDelete promise of a successful command
// has been fulfilled.
if (chunk->getShardId() == migrateInfo.to && commandStatus != ErrorCodes::BalancerInterrupted) {
    return Status::OK();
}
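
One possible shape for option (b), sketched as a minimal, self-contained C++ program rather than as a patch against MigrationManager. The Status and ErrorCodes types below are simplified stand-ins for the real server classes, and the waitForDelete parameter is an assumption standing in for however the request's _waitForDelete flag would be threaded through to this decision point:

#include <iostream>
#include <string>

enum class ErrorCodes { OK, BalancerInterrupted, OperationFailed };

// Simplified stand-in for mongo::Status.
struct Status {
    ErrorCodes code;
    std::string reason;
    bool isOK() const { return code == ErrorCodes::OK; }
    static Status OK() { return {ErrorCodes::OK, ""}; }
};

// Decide what to return to the moveChunk caller once commandStatus is known.
// chunkOnDestination mirrors the original `chunk->getShardId() == migrateInfo.to`
// check; waitForDelete mirrors the request field the current code never consults.
Status finalMoveChunkStatus(const Status& commandStatus,
                            bool chunkOnDestination,
                            bool waitForDelete) {
    if (commandStatus.isOK() || !chunkOnDestination) {
        return commandStatus;
    }
    if (commandStatus.code == ErrorCodes::BalancerInterrupted) {
        return commandStatus;  // current behavior: never swallowed
    }
    // The chunk did migrate, but the command failed partway through (e.g. a
    // failover). If the caller asked to wait for the range deletion, we cannot
    // claim that wait happened, so surface the error instead of converting it.
    if (waitForDelete) {
        return commandStatus;  // e.g. OperationFailed from the stepdown
    }
    return Status::OK();  // migration itself committed; nothing else was owed
}

int main() {
    Status failover{ErrorCodes::OperationFailed, "primary stepped down"};

    // Without waitForDelete, the error is still converted to ok=1 ...
    std::cout << finalMoveChunkStatus(failover, true, false).isOK() << "\n";  // 1
    // ... but with waitForDelete=true the caller now sees the failure.
    std::cout << finalMoveChunkStatus(failover, true, true).isOK() << "\n";   // 0
}

This keeps the existing "migration committed, so report success" behavior for the common case and only makes waitForDelete=true an exception, matching option (b); option (a) would instead re-issue or resume the wait for range deletion on the new primary before answering.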



Comments
Comment by Cris Insignares Cuello [ 23/Aug/22 ]

Customers use waitForDelete to slow down migrations, not for duplicate key verification.

Comment by Max Hirschhorn [ 06/Mar/20 ]

Please note that I filled in the affects version based on my analysis of the C++ code. I didn't actually attempt to reproduce this issue on any stable branches. We may end up wanting to write a targeted test along with any code changes because backporting the new concurrency testing from SERVER-42192 may prove to be difficult.
