[SERVER-18994] producer thread can continue producing after a node becomes primary Created: 16/Jun/15 Updated: 16/Nov/15 Resolved: 07/Jul/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.0.1 |
| Fix Version/s: | 3.0.5, 3.1.6 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matt Dannenberg | Assignee: | Matt Dannenberg |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Completed: | |
| Sprint: | RPL 6 07/17/15 |
| Participants: | |
| Description |
|
Ops that the producer thread fetches after the node becomes primary are not thrown away, and they can lead to a deadlock between the applier and producer threads. The applier thread believes it has finished and is waiting for the producer thread to signal that it has paused. Meanwhile, the producer thread is waiting for the late-arriving op (which the applier is unaware of) to be applied. |
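The circular wait described above can be illustrated with a minimal sketch. This is a toy model, not the server code: the `waits_for` mapping and the `find_wait_cycle` helper are hypothetical names introduced only to show that each thread is blocked on something only the other thread can provide.

```python
def find_wait_cycle(waits_for, start):
    """Follow 'X waits for Y' edges from start; return the cycle if one exists."""
    seen = []
    node = start
    while node in waits_for and node not in seen:
        seen.append(node)
        node = waits_for[node]
    # If we came back to a thread we already visited, that tail is the cycle.
    return seen[seen.index(node):] if node in seen else []


# Hypothetical encoding of the situation in this ticket:
# the applier waits for the producer to signal that it has paused,
# the producer waits for the applier to apply the late op.
waits_for = {"applier": "producer", "producer": "applier"}
cycle = find_wait_cycle(waits_for, "applier")

# Without the late op the producer is not waiting on anyone, so no cycle forms.
no_cycle = find_wait_cycle({"applier": "producer"}, "applier")
```

In this model `cycle` contains both threads, while `no_cycle` is empty, which matches the description: the deadlock only arises when an op arrives after the applier believes it is done.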
| Comments |
| Comment by Githook User [ 07/Jul/15 ] |
|
Author: matt dannenberg (username: dannenberg, email: matt.dannenberg@10gen.com) Message: |
| Comment by Githook User [ 07/Jul/15 ] |
|
Author: matt dannenberg (username: dannenberg, email: matt.dannenberg@10gen.com) Message: |
| Comment by Eric Milkie [ 06/Jul/15 ] |
|
In addition, we should stop setting _isWaitingForDrainToComplete to false in _updateMemberStateFromTopologyCoordinator_inlock(). It is not really valid to do this, as the only time it will be true is when there are still ops to process. The next thing that should happen is that all connections are closed, which will cause the producer thread to jump back up to the top of produce(), where it will detect that we are in drain mode and call pause(). Finally, the applier thread will eventually block waiting for the producer thread to call pause(), and then clear _isWaitingForDrainToComplete. |
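The sequence proposed above can be sketched as a small step-by-step model. This is an illustration under stated assumptions, not the actual implementation: `ReplState`, `close_connections`, `produce_once`, and `applier_finish_drain` are hypothetical names, and only `waiting_for_drain` is meant to stand in for `_isWaitingForDrainToComplete`. Blocking is modeled by returning `False` rather than by real thread synchronization.

```python
from dataclasses import dataclass, field


@dataclass
class ReplState:
    waiting_for_drain: bool = True   # stands in for _isWaitingForDrainToComplete
    producer_paused: bool = False
    connections_open: bool = True
    log: list = field(default_factory=list)


def close_connections(state):
    """Closing connections sends the producer back to the top of its loop."""
    state.connections_open = False
    state.log.append("connections closed")


def produce_once(state):
    """One pass through the top of a produce()-like loop (hypothetical)."""
    if state.waiting_for_drain:
        # Drain mode detected: pause instead of fetching more ops.
        state.producer_paused = True
        state.log.append("producer paused")
        return False
    state.log.append("producer fetched ops")
    return True


def applier_finish_drain(state):
    """Returns False where the real applier would block waiting for pause."""
    if not state.producer_paused:
        return False
    # Only the applier clears the flag, and only after the producer paused.
    state.waiting_for_drain = False
    state.log.append("drain complete")
    return True


state = ReplState()
blocked_first = applier_finish_drain(state)   # producer has not paused yet
close_connections(state)
fetched = produce_once(state)                 # detects drain mode, pauses
finished = applier_finish_drain(state)        # can now clear the flag
```

The point of the ordering is that the flag is cleared in exactly one place, after the producer has paused, so no op can be fetched between the applier finishing and the flag being reset.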
| Comment by Eric Milkie [ 02/Jul/15 ] |
|
I suggest taking out the place in produce() where it waits for the applier to drain the buffer. It was added for
| Comment by Scott Hernandez (Inactive) [ 26/Jun/15 ] |
|
After discussion, we have a plan to return errors if draining and role==leader when asked to transition to secondary. This should cover the following transition points, at least:
|