Core Server / SERVER-39112

Primary drain mode can be unnecessarily slow

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: 3.4.18, 4.0.5, 3.6.10, 4.1.7
    • Fix Version/s: Backlog
    • Component/s: Replication
    • Labels:
      None

      Description

      After a replica set node wins an election and transitions to PRIMARY state, it enters drain mode. In this mode, it applies any oplog operations still left in its buffer from its time as a secondary. While in drain mode, a node is in PRIMARY state but cannot yet accept writes, i.e. it reports isMaster:false. When the drain process has completed, the oplog application logic in SyncTail signals the ReplicationCoordinator.

      If there are no operations to apply in drain mode, the newly elected primary should be able to complete drain mode immediately and begin accepting writes. In practice this may take a second or more, because of a hard-coded 1 second timeout in the oplog application loop. This is wasted downtime: the primary could be accepting writes but is instead waiting for the timeout to fire, which limits how quickly a node can step up and begin accepting writes.

      We should consider making this timeout configurable via an external parameter, or hard-coding it to something smaller, e.g. 100 milliseconds. Perhaps the ReplicationCoordinator could also signal the oplog application loop when it transitions to PRIMARY, letting it know it can check right away whether drain mode can complete.
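
      To illustrate the difference between waiting out the timeout and being signaled on step-up, here is a minimal Python sketch. It is a toy model, not MongoDB code: ApplierLoop, enter_drain_mode, and drain_latency are hypothetical names, the buffer is assumed to be always empty, and the loop is assumed to check for drain completion only after its timed wait returns.

```python
import threading
import time


class ApplierLoop:
    """Toy stand-in for an oplog application loop (hypothetical, not MongoDB code)."""

    def __init__(self, batch_timeout_s):
        self.batch_timeout_s = batch_timeout_s
        self._cond = threading.Condition()
        self._draining = False
        self._woken = False
        self._stop = False
        self.drain_done = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        with self._cond:
            while not self._stop:
                # The buffer is assumed empty, so block until the batch
                # timeout expires or another thread wakes the loop.
                self._cond.wait_for(
                    lambda: self._woken or self._stop,
                    timeout=self.batch_timeout_s,
                )
                self._woken = False
                if self._draining:
                    # Nothing left to apply: drain mode can complete.
                    self._draining = False
                    self.drain_done.set()

    def enter_drain_mode(self, signal):
        """Called on step-up; optionally wakes the loop right away."""
        with self._cond:
            self._draining = True
            if signal:
                self._woken = True
                self._cond.notify()

    def stop(self):
        with self._cond:
            self._stop = True
            self._cond.notify()
        self._thread.join()


def drain_latency(batch_timeout_s, signal):
    """Seconds between entering drain mode and the loop reporting completion."""
    loop = ApplierLoop(batch_timeout_s)
    start = time.monotonic()
    loop.enter_drain_mode(signal)
    loop.drain_done.wait()
    elapsed = time.monotonic() - start
    loop.stop()
    return elapsed
```

      Calling drain_latency(1.0, signal=False) should take roughly the full timeout in this model, while drain_latency(1.0, signal=True) should return almost immediately. Lowering batch_timeout_s shrinks the unsignaled delay but does not remove it, which is why the signaling approach is sketched alongside the shorter timeout.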
