[SERVER-73915] TransactionCoordinatorService may stall primary step-up from completing when replica set shard steps down and back up quickly Created: 11/Feb/23 Updated: 29/Oct/23 Resolved: 02/Aug/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.4.0, 5.0.0, 6.0.0, 6.3.0-rc0 |
| Fix Version/s: | 7.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | David Chen (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam2 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Assigned Teams: |
Sharding NYC
|
||||
| Backwards Compatibility: | Minor Change | ||||
| Operating System: | ALL | ||||
| Participants: | |||||
| Linked BF Score: | 5 | ||||
| Story Points: | 3 | ||||
| Description |
|
All TransactionCoordinators from the previous term in which the node was primary must have exited before the node can finish stepping up as primary. The mechanisms for interrupting TransactionCoordinators involve interrupting the active OperationContext and shutting down the txn::AsyncWorkScheduler's TaskExecutor. However, the TransactionCoordinator also waits through the WaitForMajorityService, and that wait isn't guaranteed to be interrupted. This results in the node completing its member state transition to PRIMARY but being unable to exit "drain mode" and start accepting writes as primary. One visible symptom of this behavior is for the following message to be logged every 5 seconds.
|
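The stall described above can be illustrated with a minimal, self-contained sketch. It is not MongoDB code: `MajorityWaiter`, `run_coordinator`, and `step_up` are hypothetical names, and Python threads stand in for the server's executors. It only shows the general pattern in which a wait that is not tied to the interruption signal keeps an old-term task alive, so that whatever waits for that task to exit cannot make progress.

```python
import threading


class MajorityWaiter:
    """Hypothetical stand-in for a service that waits for majority commit."""

    def __init__(self):
        self._committed = threading.Event()

    def commit(self):
        # Simulates the awaited write becoming majority-committed.
        self._committed.set()

    def wait(self, cancel=None, poll_secs=0.01):
        # Returns True on majority commit, False if interrupted via `cancel`.
        # With cancel=None the wait ignores interruption entirely.
        while not self._committed.wait(poll_secs):
            if cancel is not None and cancel.is_set():
                return False
        return True


def run_coordinator(waiter, cancel, honor_cancel, done):
    # An old-term coordinator blocked on the majority wait.
    waiter.wait(cancel if honor_cancel else None)
    done.set()


def step_up(honor_cancel, timeout_secs=0.2):
    """Returns True if the old-term coordinator exited within the timeout."""
    waiter = MajorityWaiter()
    cancel = threading.Event()
    done = threading.Event()
    worker = threading.Thread(
        target=run_coordinator, args=(waiter, cancel, honor_cancel, done)
    )
    worker.start()
    cancel.set()  # step-down interrupts old-term work
    exited = done.wait(timeout_secs)  # step-up waits for old coordinators
    waiter.commit()  # release the worker so the sketch terminates cleanly
    worker.join()
    return exited


if __name__ == "__main__":
    print("wait tied to interruption, coordinator exited:", step_up(True))
    print("wait not tied to interruption, coordinator exited:", step_up(False))
```

In this sketch, the coordinator exits promptly only when its majority wait observes the interruption signal. When it does not, the step-up side times out, which loosely mirrors a node that cannot leave drain mode.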
| Comments |
| Comment by Githook User [ 03/Aug/23 ] |
|
Author: {'name': 'David Chen', 'email': 'david.chen@mongodb.com', 'username': ''} Message:
| Comment by Githook User [ 02/Aug/23 ] |
|
Author: {'name': 'David Chen', 'email': 'david.chen@mongodb.com', 'username': ''} Message: