[SERVER-18994] producer thread can continue producing after a node becomes primary Created: 16/Jun/15  Updated: 16/Nov/15  Resolved: 07/Jul/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.1
Fix Version/s: 3.0.5, 3.1.6

Type: Bug Priority: Major - P3
Reporter: Matt Dannenberg Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-21474 Remove stepdown_while_draining test a... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: RPL 6 07/17/15
Participants:

 Description   

Ops that the producer thread fetches after the node becomes primary are not thrown away and can lead to a deadlock between the applier and producer threads. The applier thread believes it has finished and is waiting for the producer thread to signal that it has paused. Meanwhile, the producer thread is waiting for the op that arrived late (which the applier is unaware of) to be applied.
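
The wait-for cycle described above can be illustrated with a toy model. This is only a sketch, not the server's code: `DrainState`, its fields, and the helper functions are hypothetical names invented for this example, and the real threads coordinate through mutexes and condition variables rather than plain flags.

```python
from dataclasses import dataclass


@dataclass
class DrainState:
    """Hypothetical snapshot of the two threads during drain mode."""
    buffered_ops: int            # ops in the producer's buffer, including late arrivals
    applier_believes_done: bool  # applier has stopped applying and thinks the buffer is empty
    producer_paused: bool        # producer has signalled that it paused


def applier_is_waiting(state: DrainState) -> bool:
    # The applier waits for the producer's "paused" signal once it thinks it is done.
    return state.applier_believes_done and not state.producer_paused


def producer_is_waiting(state: DrainState) -> bool:
    # The producer will not pause until every buffered op has been applied.
    return state.buffered_ops > 0 and not state.producer_paused


def is_deadlocked(state: DrainState) -> bool:
    # Each thread waits on a condition that only the other thread can satisfy.
    return applier_is_waiting(state) and producer_is_waiting(state)
```

With one late op in the buffer, `is_deadlocked(DrainState(1, True, False))` is true; with an empty buffer the producer is free to pause and the cycle does not form.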



 Comments   
Comment by Githook User [ 07/Jul/15 ]

Author: matt dannenberg (dannenberg) <matt.dannenberg@10gen.com>

Message: SERVER-18994 rework applier draining to avoid possible deadlock
Branch: master
https://github.com/mongodb/mongo/commit/b373e66d9aca09e73040c8bbeb54bacdb91883fb

Comment by Githook User [ 07/Jul/15 ]

Author: matt dannenberg (dannenberg) <matt.dannenberg@10gen.com>

Message: SERVER-18994 rework applier draining to avoid possible deadlock
Branch: v3.0
https://github.com/mongodb/mongo/commit/a07e5b9e9c31bba5d8d6da61c17c2231cb396323

Comment by Eric Milkie [ 06/Jul/15 ]

In addition, we should stop setting _isWaitingForDrainToComplete to false in _updateMemberStateFromTopologyCoordinator_inlock(). Doing so is not really valid, because the only time the flag will be true is when there are still ops to process.

The next thing that should happen is that all connections are closed, which will cause the producer thread to jump back up to the top of produce(), where it will detect we're in drain mode and call pause(). Finally, the applier thread will block waiting for the producer thread to call pause(), and then clear _isWaitingForDrainToComplete.
In this way, we don't need to block heartbeat stepdowns.
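
The ordering proposed above can be sketched with two Python threads and a pair of events. This is an illustrative model under stated assumptions, not the server implementation: `run_drain_handoff`, the event names, and the log strings are hypothetical, the `waiting_for_drain` entry merely stands in for _isWaitingForDrainToComplete, and the applier sets drain mode itself only to keep the example small.

```python
import threading


def run_drain_handoff(timeout=5.0):
    """Run one producer/applier handoff; return (event log, final flag value)."""
    events = []
    events_lock = threading.Lock()
    drain_mode = threading.Event()       # set once the node is in drain mode
    producer_paused = threading.Event()  # set when the producer pauses
    state = {"waiting_for_drain": True}  # stand-in for _isWaitingForDrainToComplete

    def record(message):
        with events_lock:
            events.append(message)

    def producer():
        # Closed connections send the producer back to the top of its loop,
        # where it notices drain mode and pauses instead of fetching more ops.
        if drain_mode.wait(timeout):
            record("producer: pause")
            producer_paused.set()

    def applier():
        drain_mode.set()
        # Block until the producer has paused, then clear the flag.
        if producer_paused.wait(timeout):
            state["waiting_for_drain"] = False
            record("applier: drain complete")

    threads = [threading.Thread(target=producer), threading.Thread(target=applier)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events, state["waiting_for_drain"]
```

Because the applier only clears the flag after the producer has signalled, the producer's pause is always logged first, and neither thread waits on buffered ops.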

Comment by Eric Milkie [ 02/Jul/15 ]

I suggest removing the code in produce() that waits for the applier to drain the buffer. It was added for SERVER-8070, but there appears to be no reason to keep doing it now. Instead, we can just check for drain mode at bgsync.cpp:213 and return if we're draining.
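
A minimal sketch of that suggestion, assuming a simplified produce step: the function returns early when the node is draining instead of waiting on the applier. `produce_once`, `is_draining`, and `fetch_batch` are hypothetical names for this example and do not correspond to functions in bgsync.cpp.

```python
def produce_once(is_draining, fetch_batch, buffer):
    """One pass of a hypothetical produce loop.

    Returns True if ops were fetched, False if the pass bailed out
    because the node is in drain mode.
    """
    if is_draining():
        # Early return: no waiting for the applier to empty the buffer,
        # and no new ops are added while draining.
        return False
    buffer.extend(fetch_batch())
    return True
```

The design point is that the producer never blocks on the applier here, so the applier cannot end up waiting on a producer that is itself waiting on the applier.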

Comment by Scott Hernandez (Inactive) [ 26/Jun/15 ]

After discussion, we have a plan to return errors if draining and role==leader when asked to transition to secondary. This should cover the following transition points, at least:

  • Reconfig
  • StepDown
  • Heartbeats
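
The plan in this comment amounts to a guard at each transition point, roughly as sketched below. The names (`TransitionRejected`, `request_transition_to_secondary`) and the string values for roles and sources are assumptions made for illustration, not the replication coordinator's API.

```python
class TransitionRejected(Exception):
    """Hypothetical error returned when a transition is refused."""


def request_transition_to_secondary(role, draining, source):
    """Refuse the transition while a leader is still draining.

    `source` names the caller, e.g. "reconfig", "stepdown", or "heartbeat".
    """
    if draining and role == "leader":
        raise TransitionRejected(
            f"cannot transition to secondary via {source} while draining"
        )
    return "secondary"
```

For example, a call with `role="leader"` and `draining=True` raises `TransitionRejected`, while the same call after draining completes returns `"secondary"`.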