[SERVER-39112] Primary drain mode can be unnecessarily slow Created: 21/Jan/19 Updated: 29/Oct/23 Resolved: 27/Feb/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.4.18, 3.6.10, 4.0.5, 4.1.7 |
| Fix Version/s: | 4.2.4, 4.3.4, 4.0.17 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | William Schultz (Inactive) | Assignee: | Matthew Russotto |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.2, v4.0 |
| Sprint: | Repl 2020-02-10, Repl 2020-02-24, Repl 2020-03-09 |
| Participants: | |
| Description |
After a replica set node wins an election and transitions to PRIMARY state, it enters drain mode. In this mode, it applies any oplog operations that were still left in its buffer from its time as a secondary. While in drain mode, a node is in PRIMARY state but cannot yet accept writes; that is, it reports isMaster:false. When the drain process completes, the oplog application logic in SyncTail signals the ReplicationCoordinator. If there are no operations left to apply in drain mode, though, the newly elected primary should be able to complete drain mode immediately and begin accepting writes. Instead, this process may take up to a second or more because of a hard-coded 1-second timeout in the oplog application loop. This is wasted downtime during which the primary could be accepting writes but is instead waiting for the timeout to fire, and it limits how quickly a node can step up and begin accepting writes. We should consider making this timeout configurable via a server parameter, or hard-coding it at something smaller, e.g. 100 milliseconds. Alternatively, the ReplicationCoordinator could signal the oplog application loop when it transitions to PRIMARY, letting it check right away whether drain mode can complete. |
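The shape of the wait described above can be sketched as a condition-variable wait with a fixed timeout. This is a minimal illustrative sketch, not the actual server code; `OplogBufferSketch` and its method names are hypothetical:

```cpp
// Sketch of a blocking oplog buffer whose consumer polls with a fixed
// timeout. When the buffer is empty (e.g. nothing to drain after an
// election), waitForData() cannot return until the timeout fires, which
// models the wasted second described in this ticket.
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

class OplogBufferSketch {
public:
    // Blocks until an op arrives or the (hard-coded, by default 1s)
    // timeout expires. Returns true if an op is available.
    bool waitForData(std::chrono::milliseconds timeout =
                         std::chrono::milliseconds(1000)) {
        std::unique_lock<std::mutex> lk(_mutex);
        return _cv.wait_for(lk, timeout, [this] { return !_queue.empty(); });
    }

    void push(int op) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _queue.push_back(op);
        }
        _cv.notify_all();  // wake any waiting consumer
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<int> _queue;
};
```

With an empty queue, a caller observes the full timeout before it can re-check whether drain mode is complete; this is the latency the ticket proposes to eliminate.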
| Comments |
| Comment by Githook User [ 27/Feb/20 ] |
Author: Matthew Russotto (mtrussotto, matthew.russotto@10gen.com) Message: |
| Comment by Githook User [ 27/Feb/20 ] |
Author: Matthew Russotto (mtrussotto, matthew.russotto@10gen.com) Message: |
| Comment by Githook User [ 27/Feb/20 ] |
Author: Matthew Russotto (mtrussotto, matthew.russotto@10gen.com) Message: |
| Comment by Matthew Russotto [ 05/Feb/20 ] |
Since the OplogBuffer itself allows only one producer, a possible implementation might be a drain mode for the oplog buffer itself. We could enter oplog buffer drain mode from the producer; this would interrupt any in-progress waits and cause future waits to return immediately. Then, when drain was complete or when we restarted the oplog fetcher, we could exit oplog buffer drain mode. This avoids adding any new synchronization between the fetcher and the batcher. |
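The buffer-level drain mode suggested in this comment might look roughly like the following. This is a hedged sketch under the comment's assumptions, with illustrative names (`DrainableOplogBuffer`, `enterDrainMode`, `exitDrainMode` are hypothetical, not the server's actual API):

```cpp
// Sketch: a blocking buffer with a drain-mode flag set by the producer.
// Entering drain mode wakes any current waiter and makes subsequent waits
// return immediately; exiting it (e.g. when the oplog fetcher restarts)
// restores normal blocking behavior. No extra synchronization between
// fetcher and batcher is needed beyond the buffer's own mutex.
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

class DrainableOplogBuffer {
public:
    void enterDrainMode() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _draining = true;
        }
        _cv.notify_all();  // interrupt any in-progress wait
    }

    void exitDrainMode() {
        std::lock_guard<std::mutex> lk(_mutex);
        _draining = false;  // future waits block again
    }

    // Returns true if an op is available. While drain mode is active and
    // the buffer is empty, returns false without waiting the full timeout.
    bool waitForData(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait_for(lk, timeout,
                     [this] { return _draining || !_queue.empty(); });
        return !_queue.empty();
    }

    void push(int op) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _queue.push_back(op);
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<int> _queue;
    bool _draining = false;
};
```

The key property is that the 1-second timeout is preserved for the normal (non-drain) case, so the cpu-burning-spin concern raised below still has its guard.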
| Comment by Matthew Russotto [ 05/Feb/20 ] |
A solution gets a bit tricky. What we want is: when we are in drain mode and the oplog fetcher is not running, the ReplBatcher should stop waiting and should not wait again until we drop out of drain mode. Drain mode is not currently synchronized with the ReplBatcher; we might be able to add that, but we would have to be very careful with lock ordering. We cannot simply make the OplogBufferBlockingQueue skip waiting whenever the oplog fetcher is not running, because the ReplBatcher also runs when we are not SECONDARY, and we depend on that 1-second wait to avoid a full cpu-burning spin. |
| Comment by Matthew Russotto [ 05/Feb/20 ] |
There are two waits in sync_tail. The first is waiting for a batch to be available from the ReplBatcher. The other is waiting for a batch to become available from the oplog fetcher; this one is definitely causing a delay. The oplog fetcher knows when it will no longer be adding anything to the queue (this is when it calls the finalizer callback); however, the underlying queue has no way of interrupting a wait. This may be a good reason to implement an interruptible wait. This only explains up to 1 second of delay, though; I believe the multi-second delays seen in the linked help ticket may be due to a DNS infrastructure problem. (edit: fixed line number of second wait) |
| Comment by Matthew Russotto [ 04/Feb/20 ] |
The message "Oplog buffer has been drained" (https://github.com/mongodb/mongo/blob/b5134bcb006e74e2102de6e171a28931f84ea12f/src/mongo/db/repl/oplog_batcher.cpp#L322) means we're done; we could ping the condition variable that terminates the wait there. However, that would only save 10-20ms, so it's not the main issue. Edit: actually, we've already exited that wait by this point, so it would not help at all. |
| Comment by William Schultz (Inactive) [ 16/Dec/19 ] |
I ran another quick test on my local workstation after the changes from this ticket, comparing the performance of drain mode when there are no ops to apply on master vs. v4.2, using this test (drain.js). On master the mean over 3 runs was 222ms; on v4.2 the mean over 3 runs was 698ms. This definitely seems to have improved, but it's not clear whether it could still be faster. I would also want to test it on a different machine. |
| Comment by Siyuan Zhou [ 11/Oct/19 ] |
Sounds good! Thank you. |
| Comment by William Schultz (Inactive) [ 10/Oct/19 ] |
siyuan.zhou I saw your analysis on |
| Comment by Siyuan Zhou [ 09/Oct/19 ] |
william.schultz, the hard-coded one-second wait can be signaled when the oplog buffer is drained in |
| Comment by Siyuan Zhou [ 04/Feb/19 ] |
I don't think the transition from secondary to primary will leave any ops in the buffer, since the states of both bgsync and the applier are set atomically in _enterDrainMode_inlock, where bgsync changes its state to ProducerState::Stopped. The fetcher is then canceled. On receiving a new batch, BackgroundSync::_enqueueDocuments checks the state again. It is possible to start bgsync with ops already in the buffer, though. As explained by this comment, bgsync will see ops left in the buffer if the node steps down during drain mode. Although I don't think there's a correctness concern here, I do think it's worthwhile to improve the signaling. The state transitions of bgsync and the applier need to be considered, and the locking rules around bgsync's mutex and the coordinator's mutex need to be followed. |
| Comment by Judah Schvimer [ 23/Jan/19 ] |
The bug you describe around operations sticking around in the buffer seems very plausible. Additionally, do we make decisions based on lastFetched, assuming that those ops will always be applied? |
| Comment by William Schultz (Inactive) [ 23/Jan/19 ] |
From a correctness perspective, it doesn't seem harmful to throw away some ops that were in your buffer during drain mode; you would simply become primary with some prefix of the log entries you could have had. Like you said, this may mitigate the benefits of drain and/or catchup mode, though. One issue I can think of is if those operations stick around in the buffer while you are primary and remain there until you become secondary again. That could be problematic if the applier starts pulling ops out of the queue in FIFO order after you've written a bunch of oplog entries as a primary. Perhaps we clean out this buffer at some point between these state transitions, though. |
| Comment by Judah Schvimer [ 22/Jan/19 ] |
It sounds like we have an existing bug in this case, since waiting 1 second doesn't provide any assurances beyond what exiting drain mode immediately would. If we currently abandon some operations, what's the harm? Unnecessary rollbacks? |
| Comment by Eric Milkie [ 22/Jan/19 ] |
Yes, something like that would fix this. Synchronization between the producer and the applier used to be tricky (some bugs happened in this area), but I am hopeful the code is in a better state today to make this happen. |
| Comment by William Schultz (Inactive) [ 22/Jan/19 ] |
Perhaps the producer and the applier could coordinate to signal the completion of drain mode more efficiently. To milkie's point about waiting for operations captured by the network thread, it seems that after entering drain mode we want to ensure that (1) the producer has stopped and is not going to fetch any more operations, and (2) once (1) holds, the applier has no more ops to apply in its buffer. Perhaps the producer, once it is shut down and knows it will not fetch any more operations, could signal the applier to do one last check to drain its operations buffer. If the buffer is empty, the applier can signal drain completion immediately; otherwise, it can drain the remaining ops and then signal. |
| Comment by Eric Milkie [ 22/Jan/19 ] |
I went back and checked the original code to see why it was written this way, and it turns out we've lost an important comment that accompanied it. The reason we need the delay here is that the network receiving thread and the applier thread run asynchronously, and we wanted to wait to ensure that all the operations that could be captured by the network thread are processed in drain mode. If we end drain mode immediately when the buffer appears to be empty, we might abandon some operations that are still pending which the network thread could have captured. |
| Comment by Eric Milkie [ 22/Jan/19 ] |
I don't see why we can't just call signalDrainComplete() immediately when the buffer is first known to be empty. Then we can keep the 1-second timeout the same for the other, less important checks. You'd have to change the batcher to support this behavior, but it shouldn't be too hard. |
| Comment by Alyson Cabral (Inactive) [ 22/Jan/19 ] |
This proposed change seems very impactful. Is there a tradeoff to lowering the timeout? Even 100ms of wasted time seems high. |
| Comment by William Schultz (Inactive) [ 21/Jan/19 ] |
When running the simple demo script replset_startup.js:
If we alter the aforementioned timeout to 100 milliseconds, the intervals shrink to around ~100 milliseconds or less. This is something I noticed in testing, and I was confused by why the primary step-up process appeared so slow. I think that making this process faster could be beneficial for our test infrastructure (e.g. making setup times in ReplSetTest quicker), and it may also be important for production systems that are sensitive to lost write availability. The election handoff project, for example, made planned failovers significantly faster. I think that removing an additional, wasted second from that process would be valuable. |