[SERVER-70850] Replication signalDrainComplete is stuck in catalog shard POC Created: 25/Oct/22 Updated: 29/Oct/23 Resolved: 27/Oct/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Andrew Shuvalov (Inactive) | Assignee: | Andrew Shuvalov (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible |
| Participants: | |
| Story Points: | 3 |
| Description |
|
Reproduced with jstests/sharding/delete_range_deletion_tasks_on_stepup_after_drop_collection.js. The test steps the catalog shard primary down and back up while running a chunk migration that is expected to fail. After the step-up, the former primary becomes primary again and ReplicationCoordinatorImpl::signalDrainComplete() is invoked, but it does not complete until the test ends. As a side effect, _makeHelloResponse() keeps returning "i am secondary", which causes the consumer of the Hello reply to drop it. There is a logical deadlock when the chunk migration logic resumes on step-up: In my opinion, step 4, which writes the recoveryDoc and fetches the latest uptime, cannot use majority. It should just do a local write if the node is primary. |
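
The circular wait described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server code: ToyNode, find_primary, recovery_with_majority, recovery_local and signal_drain_complete are hypothetical names, and the premise that the majority path needs a discoverable primary is an assumption inferred from the dropped Hello replies. Where the real code would block, the model simply returns False.

```python
class ToyNode:
    """Hypothetical stand-in for a node that has just been re-elected primary."""

    def __init__(self):
        # Drain mode is still active right after step-up.
        self.drain_complete = False

    def make_hello_response(self):
        # Until drain completes, the node keeps advertising itself as secondary.
        return {
            "isWritablePrimary": self.drain_complete,
            "secondary": not self.drain_complete,
        }


def find_primary(hello):
    # Assumption: the Hello consumer drops any reply that says "secondary".
    return hello if hello["isWritablePrimary"] else None


def recovery_with_majority(node):
    # Assumption: the majority path needs a discoverable primary first.
    return find_primary(node.make_hello_response()) is not None


def recovery_local(node):
    # Proposed alternative: a local write that does not wait on Hello state.
    return True


def signal_drain_complete(node, recovery_step):
    # Drain only finishes once the step-up recovery step has succeeded.
    if recovery_step(node):
        node.drain_complete = True
    return node.drain_complete


stuck = ToyNode()
print(signal_drain_complete(stuck, recovery_with_majority))

fixed = ToyNode()
print(signal_drain_complete(fixed, recovery_local))
```

In this model the majority variant can never succeed, because it waits on a state that only becomes true after it succeeds, while the local-write variant lets the drain finish and the Hello response flip to primary.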