[SERVER-55379] Invariant failure _requests.empty() at src/mongo/db/concurrency/lock_state.cpp 289
Created: 19/Mar/21 | Updated: 06/Dec/22 | Resolved: 31/Mar/22
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.2.12 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kelsey Schubert | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Storage Execution |
| Operating System: | ALL |
| Backport Requested: | v5.0, v4.4, v4.2 |
| Sprint: | Execution Team 2021-05-03, Execution Team 2021-05-17, Execution Team 2021-05-31, Execution Team 2021-06-14, Execution Team 2021-06-28, Execution Team 2021-07-12, Execution Team 2021-07-26 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 150 |
| Description |
|
Observed on MongoDB 4.2.12:
|
| Comments |
| Comment by Kelsey Schubert [ 31/Mar/22 ] |
|
After investigation, we have determined that this issue was resolved by
| Comment by Kelsey Schubert [ 19/Aug/21 ] |
|
connie.chen, I'm confused. If the bug is still present, shouldn't this ticket remain open? An open ticket is a lot easier to find (currently, the Jira metadata makes it look like the bug has been resolved). I'm planning to reopen this ticket, park it on the backlog, and move it back into needs scheduling when we have a recurrence with the new logging. Let me know if I'm missing something with my approach.
| Comment by Connie Chen [ 09/Jul/21 ] |
|
Logging has already been committed on master and will be backported to 4.2 and 4.4. Reopen this ticket if the invariant fails again.
| Comment by Dianna Hohensee (Inactive) [ 02/Jun/21 ] |
|
So the TransactionCoordinator appears to take only mutexes in its implementation. The TransactionCoordinator operation fails, as Dan says above, in a DBDirectClient command to do a write. The logs have expired, but we've still got Dan's analysis. It's unclear to me what could have gone wrong. The TransactionParticipant is the only component in the codebase using Locker instances outside of RAII types, but the TransactionCoordinator isn't involved with the TransactionParticipant at all as far as I can tell. We could add some logging information to the invariant failure. It isn't clear what information would be helpful, so basically whatever we can. ~LockerImpl checks that the lock requests, which are stored as a map of ResourceId to LockRequest, are empty; we could log some of this information.
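A minimal sketch of the logging idea above, not the change that was actually committed. It assumes simplified stand-in types: the `ResourceId` and `LockRequest` structs, their fields, and the `describeOutstandingRequests` helper are all hypothetical and far smaller than the real classes in src/mongo/db/concurrency. The point is only that the map can be summarized into one string and logged just before the emptiness check fails.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for the server's ResourceId; only an ordering is
// needed so it can serve as a std::map key.
struct ResourceId {
    std::uint64_t value;
    bool operator<(const ResourceId& other) const { return value < other.value; }
};

// Hypothetical stand-in for LockRequest; the field names are illustrative.
struct LockRequest {
    std::string mode;    // e.g. "IX"
    std::string status;  // e.g. "granted"
};

// Summarizes every outstanding request on one line, so the text can be
// logged immediately before a check that the map is empty.
std::string describeOutstandingRequests(
    const std::map<ResourceId, LockRequest>& requests) {
    std::ostringstream out;
    out << requests.size() << " outstanding lock request(s)";
    for (const auto& entry : requests) {
        out << "; resource=" << entry.first.value
            << " mode=" << entry.second.mode
            << " status=" << entry.second.status;
    }
    return out.str();
}
```

In a destructor, the summary would be emitted only when the map is non-empty, so the normal path pays no formatting cost.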
| Comment by Daniel Gottlieb (Inactive) [ 29/Mar/21 ] |
|
I'm not the expert in this, but I don't think the transaction code is breaking any API contracts that led to this invariant. I'm passing this off to execution to take a look. My observations:
| Comment by Kaloian Manassiev [ 20/Mar/21 ] |
|
This is likely in the 2PC transactions path, because of the presence of the AsyncWorkScheduler in the stacks. Moving to Sharding-NYC. |