[SERVER-41355] Step down should call yieldLocksForPreparedTransactions w/o holding repl mutex lock (ReplicationCoordinatorImpl::_mutex). Created: 29/May/19 Updated: 29/Oct/23 Resolved: 15/Jul/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.2.0-rc3, 4.3.1 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Suganthi Mani | Assignee: | Suganthi Mani |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.2 |
| Sprint: | Repl 2019-06-03, Repl 2019-06-17, Repl 2019-07-01, Repl 2019-07-15, Repl 2019-07-29 |
| Linked BF Score: | 10 |
| Description |
|
Currently, step down calls yieldLocksForPreparedTransactions while holding both the RSTL and the repl mutex (ReplicationCoordinatorImpl::_mutex). As a result, it can deadlock with prepared txn threads that have checked out the session. Consider the case below. 1) Thread A (txn cmd) has checked out the session.
|
| Comments |
| Comment by Githook User [ 15/Jul/19 ] |
|
Author: {'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'} Message: (cherry picked from commit cc1a75e4a6d8de8478e7253da7bd6376052d57a6) |
| Comment by Githook User [ 15/Jul/19 ] |
|
Author: {'name': 'Suganthi Mani', 'username': 'smani87', 'email': 'suganthi.mani@mongodb.com'} Message: |
| Comment by Suganthi Mani [ 18/Jun/19 ] |
|
Investigation: For the step down cmd, we call yieldLocksForPreparedTransactions() only after stepping down (i.e. TopologyCoordinator::_role is set to Role::kFollower and TopologyCoordinator::_leaderMode is set to LeaderMode::kNotLeader), at which point the member state in the ReplicationCoordinator has not yet been updated.
Concurrent step up:
Concurrent step down:
Note that releasing the repl mutex before calling yieldLocksForPreparedTransactions() is not a problem for the unconditional step down code paths, because we have not yet stepped down at that point (i.e. the _role or _leader value of the topologyCoordinator has not yet changed). So, no concurrent step ups or step downs can happen.
Solution: 1) TopologyCoordinator::attemptStepDown should not perform the step down (i.e. should not change the role or leaderMode). Instead, it upgrades the attempt from a conditional step down to an unconditional step down (i.e. the leaderMode transitions from kAttemptingStepDown to kSteppingDown). This prevents any concurrent step ups and step downs. It also guarantees that the step down cmd won't fail after this point, so it is safe to release the repl mutex while still holding the RSTL. |
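The ordering proposed in the solution can be sketched with a toy model. This is only an illustration, not the server code: `ToyTopologyCoordinator`, `ToyReplCoordinator`, the `whileYielding` callback (standing in for yieldLocksForPreparedTransactions()), and the `LeaderMode::kLeader` value are hypothetical names, the RSTL is not modeled, and the conditions under which a real conditional step down can fail are omitted.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Hypothetical stand-ins for illustration; these are not the real MongoDB types.
enum class Role { kLeader, kFollower };
enum class LeaderMode { kLeader, kAttemptingStepDown, kSteppingDown, kNotLeader };

class ToyTopologyCoordinator {
public:
    // Begin a conditional step down; rejected if one is already in progress.
    bool prepareForStepDownAttempt() {
        if (_leaderMode != LeaderMode::kLeader) return false;
        _leaderMode = LeaderMode::kAttemptingStepDown;
        return true;
    }
    // Upgrade conditional -> unconditional. Role is deliberately left unchanged.
    bool attemptStepDown() {
        if (_leaderMode != LeaderMode::kAttemptingStepDown) return false;
        _leaderMode = LeaderMode::kSteppingDown;
        return true;
    }
    // Only here do the role and leader mode actually change.
    void finishUnconditionalStepDown() {
        assert(_leaderMode == LeaderMode::kSteppingDown);
        _role = Role::kFollower;
        _leaderMode = LeaderMode::kNotLeader;
    }
    Role role() const { return _role; }
    LeaderMode leaderMode() const { return _leaderMode; }

private:
    Role _role = Role::kLeader;
    LeaderMode _leaderMode = LeaderMode::kLeader;
};

class ToyReplCoordinator {
public:
    // whileYielding is a hypothetical hook that plays the part of
    // yieldLocksForPreparedTransactions(); it runs with _mutex released.
    bool stepDown(const std::function<void()>& whileYielding = [] {}) {
        std::unique_lock<std::mutex> lk(_mutex);
        if (!_topCoord.prepareForStepDownAttempt() || !_topCoord.attemptStepDown())
            return false;
        lk.unlock();      // safe: kSteppingDown blocks concurrent step downs
        whileYielding();  // may wait on threads that need _mutex
        lk.lock();
        _topCoord.finishUnconditionalStepDown();
        return true;
    }
    Role role() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _topCoord.role();
    }
    LeaderMode leaderMode() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _topCoord.leaderMode();
    }

private:
    std::mutex _mutex;
    ToyTopologyCoordinator _topCoord;
};
```

In this sketch, a callback passed to `stepDown` can itself take the mutex (for example by calling `leaderMode()` or a second `stepDown()`) without deadlocking. The second `stepDown()` is rejected because the mode is already `kSteppingDown`, while the role remains `kLeader` until `finishUnconditionalStepDown()` runs.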