  1. Documentation
  2. DOCS-13929

Investigate changes in SERVER-43904: When stepping down, step up doesn't filter out frozen nodes

      Description

      Downstream Change Summary

      I'm not sure whether we track what heartbeats consist of; sorry if this doesn't actually need downstream team attention!

      I added a field, 'electable', to heartbeat responses. It tells the recipient of the heartbeat response whether the sending node is electable as primary. When the primary looks for a secondary to step up during election handoff, it skips any node that has 'electable' set to false, since that node is not electable.
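
As a rough illustration of the filtering described above, here is a minimal Python sketch. It is not the server's implementation: the `HeartbeatInfo` type, its field names other than `electable`, and the "highest priority wins" tie-break are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class HeartbeatInfo:
    """Hypothetical summary of the latest heartbeat response from one member."""
    host: str
    priority: float
    electable: bool  # mirrors the 'electable' heartbeat field described above


def choose_handoff_candidate(
    members: Sequence[HeartbeatInfo],
) -> Optional[HeartbeatInfo]:
    """Pick a secondary to step up, skipping members that are not electable."""
    # Priority-0 members can never become primary, so exclude them as well.
    eligible = [m for m in members if m.electable and m.priority > 0]
    if not eligible:
        return None
    # Assumption for this sketch: prefer the highest-priority eligible member.
    return max(eligible, key=lambda m: m.priority)
```

With one frozen (non-electable) member and one electable member, the sketch returns the electable one even if the frozen member has the higher priority.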

      Description of Linked Ticket

      One of the recommended ways [0] to force a particular node to become primary is to freeze all non-candidate nodes and then call replSetStepDown on the primary. As of MongoDB 3.6, the stepdown code attempts to step up a candidate (by calling replSetStepUp). However, that code doesn't exclude frozen nodes, and an attempt to step up a frozen node simply fails ("2019-10-09T00:24:05.517+0000 I REPL [conn352334] Not starting an election for a replSetStepUp request, since we are not electable due to: Not standing for election because I am still waiting for stepdown period to end at 2019-10-09T00:33:59.473+0000 (mask 0x20)"). This isn't particularly bad, since the unfrozen node will still call for, and win, an election, but it does make failovers slower (up to electionTimeoutMillis slower, presumably).
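
The freeze-then-step-down procedure above can be sketched as an ordered list of admin commands. This is only a planning helper under stated assumptions: `plan_forced_failover`, the host names, and the 120-second durations are hypothetical, and actually sending each command to its host (for example through a driver's admin command call) is left out.

```python
from typing import Dict, List, Sequence, Tuple

# (target host, command document to run against that host's admin database)
Command = Tuple[str, Dict[str, int]]


def plan_forced_failover(
    primary: str,
    candidate: str,
    secondaries: Sequence[str],
    freeze_secs: int = 120,    # arbitrary example value
    stepdown_secs: int = 120,  # arbitrary example value
) -> List[Command]:
    """Return the commands, in order, for the freeze-based approach."""
    if candidate == primary or candidate not in secondaries:
        raise ValueError("candidate must be one of the secondaries")
    # Freeze every secondary except the candidate so it cannot seek election.
    plan: List[Command] = [
        (host, {"replSetFreeze": freeze_secs})
        for host in secondaries
        if host != candidate
    ]
    # Then ask the current primary to step down.
    plan.append((primary, {"replSetStepDown": stepdown_secs}))
    return plan
```

Note that before the change in SERVER-43904, the primary could still pick one of the frozen nodes for the step-up attempt, which is the slowdown this ticket describes.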

      An alternative approach that we're using, which isn't explicitly documented, is to increase the priority of both the current primary and the candidate node, and then run replSetStepDown. I've verified both in code and in logs that this is effective at getting MongoDB to step up the candidate node consistently. It might be nice to document this approach, since I think it offers improvements over both approaches currently mentioned. Increasing the priority on just the candidate works, but tends to be slower, since the "priority takeover" mechanism takes a few seconds to trigger, and it provides less control than an explicit replSetStepDown.
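
A minimal sketch of the configuration change behind the priority-based approach, assuming a replica set config document shaped like the output of replSetGetConfig (a `version` number and a `members` list with `host` and `priority`). The helper name `raise_priorities` and the priority value of 10 are hypothetical; submitting the result with replSetReconfig and then running replSetStepDown on the primary are not shown.

```python
import copy
from typing import Any, Dict


def raise_priorities(
    config: Dict[str, Any],
    primary_host: str,
    candidate_host: str,
    new_priority: float = 10.0,  # arbitrary example value
) -> Dict[str, Any]:
    """Return a new config with both nodes' priorities raised."""
    new_config = copy.deepcopy(config)  # leave the caller's config untouched
    hosts = {m["host"] for m in new_config["members"]}
    missing = {primary_host, candidate_host} - hosts
    if missing:
        raise ValueError(f"hosts not in config: {sorted(missing)}")
    for member in new_config["members"]:
        if member["host"] in (primary_host, candidate_host):
            member["priority"] = new_priority
    # A reconfig needs a config version higher than the current one.
    new_config["version"] += 1
    return new_config
```

Raising both nodes keeps the current primary in place until the explicit step-down, rather than relying on priority takeover to move the primary.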

      [0] https://docs.mongodb.com/manual/tutorial/force-member-to-be-primary/

      Scope of changes

      Impact to Other Docs

      MVP (Work and Date)

      Resources (Scope or Design Docs, Invision, etc.)

            Assignee:
            Unassigned
            Reporter:
            Backlog - Core Eng Program Management Team (backlog-server-pm)

              Created:
              Updated:
              Resolved:
              2 years, 39 weeks, 3 days ago