[SERVER-70850] Replication signalDrainComplete is stuck in catalog shard POC Created: 25/Oct/22  Updated: 29/Oct/23  Resolved: 27/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: Andrew Shuvalov (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Participants:
Story Points: 3

 Description   

Reproduced with jstests/sharding/delete_range_deletion_tasks_on_stepup_after_drop_collection.js.

The test performs a stepdown and then a stepup on the catalog shard while a chunk migration that is expected to fail is in progress. After the stepup, the former primary becomes primary again, ReplicationCoordinatorImpl::signalDrainComplete() is invoked, and it never completes until the test ends.

The side effect is that _makeHelloResponse() keeps returning "i am secondary", which causes the Hello reply consumer to drop the response.
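For context on why the node keeps advertising itself as secondary: the sketch below is a minimal, self-contained model (not the actual _makeHelloResponse() implementation; NodeState and makeHelloResponse are invented names) of a hello response that only reports a writable primary once drain mode has ended.

// Illustrative model only: a node that has won the election but never finishes
// draining keeps answering hello as "not primary".
#include <iostream>
#include <string>

struct NodeState {
    bool electedPrimary = false;   // topology says we are primary
    bool drainComplete = false;    // signalDrainComplete() has finished
};

// Hypothetical stand-in for _makeHelloResponse(): the node only advertises
// itself as a writable primary once drain mode has ended.
std::string makeHelloResponse(const NodeState& s) {
    if (s.electedPrimary && s.drainComplete) {
        return "isWritablePrimary: true";
    }
    return "isWritablePrimary: false";  // consumer reads this as "i am secondary"
}

int main() {
    NodeState stuck;
    stuck.electedPrimary = true;
    stuck.drainComplete = false;  // signalDrainComplete() never returns here
    // Every hello reply looks like this, so the consumer waiting for a primary drops it.
    std::cout << makeHelloResponse(stuck) << "\n";  // isWritablePrimary: false
}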

There is a logical deadlock when the chunk migration logic resumes on stepup (see the sketch after this call chain):
1. ReplicationCoordinatorExternalStateImpl::onTransitionToPrimary()
2. ReplicationCoordinatorExternalStateImpl::_shardingOnTransitionToPrimaryHook()
3. ShardingStateRecovery::recover()
4. // Need to fetch the latest uptime from the config server, so do a logging write
   ShardingLogging::get(opCtx)->logChangeChecked(..., kMajorityWriteConcern)
   At this point it should already be clear that a majority write concern during stepup, before writes are allowed, cannot succeed.
5. ShardingLogging::_log()
6. Grid::get(opCtx)->catalogClient()->insertConfigDocument(..., kMajorityWriteConcern)
7. ShardLocal::_runCommand()
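As a rough illustration of the cycle, the following self-contained C++ sketch (not MongoDB code; names such as majorityWrite and writesAllowed are invented for the model) shows a step-up hook that issues a majority-acknowledged write which can only be acknowledged after step-up itself has finished.

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex mtx;
std::condition_variable cv;
bool writesAllowed = false;  // set only after the step-up hook has returned

// Stand-in for insertConfigDocument(..., kMajorityWriteConcern): blocks until
// the node is allowed to accept (and majority-commit) writes.
bool majorityWrite(std::chrono::seconds timeout) {
    std::unique_lock<std::mutex> lk(mtx);
    return cv.wait_for(lk, timeout, [] { return writesAllowed; });
}

// Stand-in for the _shardingOnTransitionToPrimaryHook() -> ShardingStateRecovery
// path: it issues the majority write *before* writes are allowed.
void onTransitionToPrimary() {
    std::cout << "step-up hook: logging sharding recovery doc with majority WC...\n";
    if (!majorityWrite(std::chrono::seconds(2))) {
        std::cout << "deadlock: the majority write cannot commit until step-up "
                     "completes, and step-up cannot complete until the write returns\n";
        return;
    }
    // Only reached if the write succeeded, which cannot happen in this scenario.
    std::lock_guard<std::mutex> lk(mtx);
    writesAllowed = true;
    cv.notify_all();
}

int main() {
    std::thread stepUp(onTransitionToPrimary);
    stepUp.join();
}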

In my opinion, step 4, which writes the recovery document and fetches the latest uptime, cannot use majority write concern here. It should just do a local write if the node is primary.
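A minimal sketch of that proposal, continuing the model above (the function names are hypothetical; the real change would land somewhere around ShardingStateRecovery::recover() / ShardingLogging::logChangeChecked()):

#include <iostream>

// Stand-in for a local (w:1) write: the primary applies and acknowledges it
// itself, without waiting for replication, so it cannot block against step-up.
bool localWrite() {
    std::cout << "recovery doc written locally on the new primary\n";
    return true;
}

void onTransitionToPrimaryFixed() {
    if (localWrite()) {
        std::cout << "step-up hook returns; drain completes; "
                     "hello starts reporting isWritablePrimary: true\n";
    }
}

int main() {
    onTransitionToPrimaryFixed();
}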

