[SERVER-42772] race between TransactionCoordinatorService::joinPreviousRound and coordinator destruction can trigger invariant Created: 12/Aug/19 Updated: 29/Oct/23 Resolved: 21/Aug/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.1, 4.2.7 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Randolph Tan | Assignee: | Matthew Saltz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: |
v4.2
|
||||||||
| Sprint: | Sharding 2019-08-26 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 12 | ||||||||
| Description |
|
Example scenario (from cpp test): The coordinator gets stored in the function object during commit, promises gets fulfilled, the main test gets unblocked here, coordinator gets removed from catalog and when the CoordinatorService tries to join, it will see that there are no coordinator in the catalog, and then it will still trigger the _quiesced invariant because one of the txnCoordinator shared_ptr is still alive in the function object call back inside the PromiseFuture. Probably very unlikely to happen in an actual mongod process since it takes a long time before a node would transition to primary after becoming a secondary. |
| Comments |
| Comment by Githook User [ 22/Apr/20 ] |
|
Author: {'name': 'Matthew Saltz', 'email': 'matthew.saltz@mongodb.com', 'username': 'saltzm'}Message: (cherry picked from commit 05a99879a6369d6220356d099ac5625d359ffd4d) |
| Comment by Githook User [ 21/Aug/19 ] |
|
Author: {'username': 'saltzm', 'email': 'matthew.saltz@mongodb.com', 'name': 'Matthew Saltz'}Message: |
| Comment by Esha Maharishi (Inactive) [ 13/Aug/19 ] |
|
kaloian.manassiev Hmm. In that case, the TransactionCoordinator should set an error on the completion promise, I think. There is no reason for a TransactionCoordinator to complete with neither a decision nor error. |
| Comment by Kaloian Manassiev [ 13/Aug/19 ] |
|
SGTM as well, but I just can't remember why we had the completion and the decision as separate methods. I have suspicion that this was because you can have a TC complete, but not produce decision (such as for example on step down or early cancelation). |
| Comment by Randolph Tan [ 13/Aug/19 ] |
|
esha.maharishi the plan sounds good to me. |
| Comment by Esha Maharishi (Inactive) [ 13/Aug/19 ] |
|
Hmm. Maybe we should just make the TransactionCoordinator's completion promises Future<txn::CommitDecision> rather than Future<void>, so that TransactionCoordinatorService::coordinateCommit does not need to capture the coordinator in a lambda. renctan, kaloian.manassiev, what do you think? |