[SERVER-15750] Deadlock cycle in replication among oplog producer, oplog application and replication executor threads Created: 20/Oct/14 Updated: 11/Jul/16 Resolved: 21/Oct/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 2.7.8 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Andy Schwerin | Assignee: | Andy Schwerin |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: |
| Description |
|
What follows is a description of a deadlock cycle observed while running maintenance_non-blocking.js. In summary, the error is that one cannot wait on the replication executor while holding the bgsync mutex. The oplog application thread periodically calls tryToGoLiveAsSecondary(), which acquires the global lock in shared (S) mode and then calls getMaintenanceMode() on the replication coordinator, which schedules a callback on the replication executor and waits for it. The oplog producer thread locks the bgsync mutex (BackgroundSync::_mutex) and then tries to acquire the global lock in intent exclusive (IX) mode, blocking behind the oplog application thread. A third thread runs setMaintenanceMode, which blocks in the replication executor while trying to clear the sync source in the producer thread, an operation that requires the bgsync mutex. So the executor is blocked in setMaintenanceModeHelper waiting for the bgsync mutex; the bgsync mutex is held by the oplog producer, which is waiting for the global lock in IX mode; and that request is blocked by the oplog application thread, which holds the global lock in S mode and is waiting for a callback to run through the replication executor. |
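The cycle described above can be pictured as a wait-for graph. The following is only an illustrative Python sketch, not server code: the dictionary layout and the `find_cycle` helper are hypothetical names introduced here, and the edges are transcribed from the description rather than taken from the actual implementation.

```python
# Wait-for graph transcribed from the description (illustrative model only).
# Each blocked party maps to (what it is waiting for, who holds or serves it).
WAITS_FOR = {
    "oplog application": ("executor callback", "replication executor"),
    "replication executor": ("BackgroundSync::_mutex", "oplog producer"),
    "oplog producer": ("global lock (IX)", "oplog application"),
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`; return the cycle as a list, or None."""
    path = []
    position = {}  # node -> index in path, to locate where the cycle begins
    node = start
    while node in waits_for:
        if node in position:
            return path[position[node]:]
        position[node] = len(path)
        path.append(node)
        node = waits_for[node][1]
    return None  # reached a party that is not waiting on anything


cycle = find_cycle(WAITS_FOR, "oplog application")
print(" -> ".join(cycle + [cycle[0]]))

# Hypothetical variant: if the producer were not waiting on the global lock
# while holding the mutex, the chain would end and no cycle would remain.
without_producer_wait = {k: v for k, v in WAITS_FOR.items() if k != "oplog producer"}
print(find_cycle(without_producer_wait, "oplog application"))
```

In this model, removing any single edge breaks the cycle, which is consistent with the summary that the wait on the replication executor and the hold on the bgsync mutex must not overlap.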
| Comments |
| Comment by Githook User [ 21/Oct/14 ] |
|
Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com> Message: |