Core Server / SERVER-68514

Delay announcement of new primary until first oplog entry in term is majority committed

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Replication

      While investigating majority-write unavailability during planned maintenance, we observed unavailability windows lasting several seconds. These were associated with clients issuing a high volume of w:1 writes when the scheduled failover happened.

      The cause appears to be very heavy write load on the new primary as soon as it announces its own write-availability. This load makes both the establishment of the OplogFetcher connections and the subsequent oplog query very slow: retrieving 16MB of data takes from hundreds of milliseconds to several seconds.

      A POC was implemented to alleviate this issue by delaying the announcement of the new primary (via the 'hello' command) until the commit point of the set reaches the new term. It reduced majority unavailability (measured to the first user write) by 40-50%. The delay gives replication a chance to set up the spanning tree before the write load hits the new primary.
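
      The gating rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the server's implementation: `ReplState`, `announces_writable_primary`, `hello_response`, and their field names are hypothetical, and only the `isWritablePrimary` field of the 'hello' reply is modeled.

```python
from dataclasses import dataclass


@dataclass
class ReplState:
    """Hypothetical, simplified view of a node's replication state."""
    is_primary: bool        # node has won the election for current_term
    current_term: int       # term in which this node was elected
    commit_point_term: int  # term of the set's majority commit point


def announces_writable_primary(state: ReplState) -> bool:
    """Return True if 'hello' should report this node as writable.

    Assumed rule, taken from the POC description: hold back the
    announcement until the commit point has reached the new term, i.e.
    the first oplog entry written in this term is majority committed.
    """
    return state.is_primary and state.commit_point_term >= state.current_term


def hello_response(state: ReplState) -> dict:
    """Build a reduced, illustrative 'hello' reply."""
    return {"isWritablePrimary": announces_writable_primary(state)}


# Newly elected in term 5, but the commit point is still in term 4:
# the node does not yet announce itself as primary.
state = ReplState(is_primary=True, current_term=5, commit_point_term=4)
print(hello_response(state))

# Once the first term-5 entry is majority committed, it does.
state.commit_point_term = 5
print(hello_response(state))
```

      In this sketch clients keep seeing no writable primary until the commit point catches up, which is what keeps the write load off the node while secondaries establish their sync connections.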

      The longest delay in the mixed-workloads-with-stepdowns test was around 5400ms without this change and around 2700ms with it. Averaged over terms with load, the delay fell from 640ms to 370ms.
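
      As a quick check, the relative reductions implied by these figures can be computed directly. The helper name `percent_reduction` is illustrative only; the inputs are the numbers quoted above.

```python
def percent_reduction(before_ms: float, after_ms: float) -> float:
    """Relative reduction from before_ms to after_ms, in percent."""
    return 100.0 * (before_ms - after_ms) / before_ms


longest = percent_reduction(5400, 2700)  # longest observed delay
average = percent_reduction(640, 370)    # average over terms with load

print(f"longest delay reduced by {longest:.1f}%")
print(f"average delay reduced by {average:.1f}%")
```

      Both values (50% and roughly 42%) fall within the 40-50% range reported for the POC.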

      The disadvantages include an increase in w:1 write and primary-only read unavailability. On average this was minor (from ~10ms to ~15ms), but there were a few unexplained jumps to 250ms or 500ms. Another disadvantage is that if the majority commit point of the set is already lagging for any reason, no w:1 writes can occur until it catches up, which may result in long w:1 unavailability in applications that do not use w:majority.

      This would supersede SERVER-53813, which would only delay majority reads on the new primary, not announcement of it for all purposes.

            Assignee: backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter: Matthew Russotto (matthew.russotto@mongodb.com)
            Votes: 0
            Watchers: 11
