[SERVER-61334] Replication batcher uninterruptible lock deadlocks with storage change Created: 09/Nov/21 Updated: 29/Oct/23 Resolved: 11/Nov/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 5.2.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthew Russotto | Assignee: | Matthew Russotto |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Backwards Compatibility: | Fully Compatible | ||||
| Operating System: | ALL | ||||
| Sprint: | Replication 2021-11-15 | ||||
| Participants: | |||||
| Linked BF Score: | 135 | ||||
| Description |
|
The storage change code takes a global lock, interrupts all opCtxs, and waits for each opCtx to be destroyed. An uninterruptible global lock acquisition can deadlock with this: the storage change does not release its global lock until the opCtx is destroyed, and the opCtx, which ignores the interrupt, waits forever for that lock. Most uses of uninterruptible locks (e.g. prepared transactions) do not run during initial sync and aren't a problem, but for some reason the ReplBatcher thread runs continuously even in initial sync, and it requires an uninterruptible lock. This can be fixed by having the batcher take an interruptible global lock before the uninterruptible section. |
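
The proposed fix can be illustrated with a toy model. This is only a sketch in Python using `threading`, not the server's C++ code: `OpCtx`, `Interrupted`, `lock_interruptibly`, `batcher_iteration`, and `run_demo` are all hypothetical names invented for this example, and a plain `threading.Lock` stands in for the global lock. It shows that when the batcher's first acquisition is interruptible, a storage change holding the global lock can interrupt it and see the opCtx destroyed instead of waiting forever.

```python
import threading


class Interrupted(Exception):
    """Raised when an opCtx is interrupted while waiting for a lock."""


class OpCtx:
    """Toy stand-in for an operation context (hypothetical, not the server class)."""

    def __init__(self):
        self.interrupted = threading.Event()
        self.destroyed = threading.Event()


def lock_interruptibly(lock, op_ctx, poll_secs=0.01):
    """Acquire `lock`, but raise Interrupted if op_ctx is interrupted while waiting."""
    while not lock.acquire(timeout=poll_secs):
        if op_ctx.interrupted.is_set():
            raise Interrupted()


def batcher_iteration(global_lock, op_ctx):
    """One pass of the fixed batcher: the interruptible acquisition comes first,
    so the uninterruptible section only runs while the lock is already held."""
    try:
        lock_interruptibly(global_lock, op_ctx)
        try:
            # Uninterruptible section would go here. In this model the lock
            # is already held, so nothing in this section waits on it.
            return "ran batch"
        finally:
            global_lock.release()
    except Interrupted:
        return "interrupted"
    finally:
        op_ctx.destroyed.set()  # the opCtx is destroyed on either path


def run_demo():
    """Model the storage change: hold the global lock, interrupt the opCtx,
    and wait (with a timeout) for it to be destroyed."""
    global_lock = threading.Lock()
    op_ctx = OpCtx()
    outcome = []

    global_lock.acquire()  # storage change takes the global lock first
    batcher = threading.Thread(
        target=lambda: outcome.append(batcher_iteration(global_lock, op_ctx)))
    batcher.start()

    op_ctx.interrupted.set()               # interrupt all opCtxs
    finished = op_ctx.destroyed.wait(2.0)  # wait for the opCtx to be destroyed
    global_lock.release()
    batcher.join()
    return finished, outcome[0]
```

In this model `run_demo()` returns `(True, "interrupted")`: the batcher gives up its wait, its opCtx is destroyed, and the storage change can proceed. If `lock_interruptibly` ignored the interrupt flag, the batcher would keep polling until the lock was released, and the storage change would only get past its wait because of the 2-second timeout added here for safety.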
| Comments |
| Comment by Githook User [ 01/Dec/21 ] |
|
Author: {'name': 'Matthew Russotto', 'email': 'matthew.russotto@mongodb.com', 'username': 'mtrussotto'} Message: Before This fix takes RSTL IX and Global IS, then takes those locks recursively, and holds them a bit |
| Comment by Githook User [ 11/Nov/21 ] |
|
Author: {'name': 'Matthew Russotto', 'email': 'matthew.russotto@mongodb.com', 'username': 'mtrussotto'}Message: |