Core Server / SERVER-41781

Configure shard to have deterministic failover in sharding_multiple_ns_rs.js

    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Steps to Reproduce:

      1. run sharding_multiple_ns_rs.js
      2. trigger an additional election after the second primary is elected, but before connPoolStats returns up-to-date information

    • Sprint: Repl 2019-07-01, Sharding 2019-08-26, Sharding 2019-09-23, Sharding 2019-10-07, Sharding 2019-10-21, Sharding 2019-11-04

      The test has one sharded collection and one unsharded collection. It inserts data into both, waits for replication, and then kills the primary of the shard. The test then fails while waiting for the new primary to be recognized, which it checks by calling the "connPoolStats" command.
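
      For orientation, the relevant flow of the test looks roughly like the following. This is a simplified sketch, not the verbatim test; the collection names, node counts, and option values are illustrative:

          // Simplified sketch of the failing portion of sharding_multiple_ns_rs.js.
          // Names and values here are illustrative, not copied from the actual test.
          var st = new ShardingTest({shards: {rs0: {nodes: 3}}, mongos: 1});
          var db = st.s.getDB("test");

          assert.commandWorked(st.s.adminCommand({enableSharding: "test"}));
          assert.commandWorked(st.s.adminCommand({shardCollection: "test.sharded", key: {_id: 1}}));

          // Insert into both a sharded and an unsharded collection, then wait
          // for the writes to replicate to the secondaries.
          for (var i = 0; i < 100; i++) {
              assert.writeOK(db.sharded.insert({_id: i}));
              assert.writeOK(db.unsharded.insert({_id: i}));
          }
          st.rs0.awaitReplication();

          // Kill the shard's primary; the test then waits for the mongos to
          // notice whichever node wins the subsequent election.
          st.rs0.stop(st.rs0.getPrimary());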

      After the original primary is killed, the test gets the current primary by calling st.rs0.getPrimary() (d21521 at this point). It then waits up to 5 minutes for "connPoolStats" to reflect this node as the new primary. In the failing run, the wait started at 2019-06-06T16:07:22 (line 2368).
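
      The wait goes through the awaitRSClientHosts helper from the jstests shell library, roughly as below; the exact signature and argument values are from memory and may differ by branch:

          // The test snapshots the new primary once, right after the kill...
          var newPrimary = st.rs0.getPrimary();  // d21521 in the failing run

          // ...then polls connPoolStats on the mongos for up to 5 minutes,
          // waiting for that one fixed node to show up as primary.
          awaitRSClientHosts(st.s, newPrimary, {ok: true, ismaster: true}, st.rs0, 5 * 60 * 1000);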

      According to the logs, a new primary, d21521, was promoted at 16:07:20 (line 2183). However, this was never reflected in the output of the connPoolStats command. At 16:04:44 (line 3471), node d21522 decided to start a third election. This is strange, since there was a heartbeat between the two nodes immediately beforehand (line 3417). Regardless, d21522 won the election and was promoted to primary at 2019-06-06T16:07:44 (line 3605).

      The "connPoolStats" command reported d21520 (the original primary), as the primary up until 2019-06-06T16:07:50 (line 4210). After this, d21522 was returned as primary by this command. Since the test is still waiting for d21521 to become primary the test will ultimately fail after the timeout period.

      Since multiple elections can occur, this failure could be avoided by modifying awaitRSClientHosts so that it re-checks who the shard's current primary is on each polling attempt, not just when the function is initially called. Perhaps awaitRSClientHosts could accept either a host string or a function that produces the currently expected primary; a sketch of this idea follows.
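
      A minimal sketch of that proposal, assuming a hypothetical variant of the helper (awaitRSClientHostsDynamic is an invented name; the real fix would presumably extend awaitRSClientHosts itself):

          // Hypothetical: re-resolve the expected primary on every poll, so an
          // extra election between polls cannot strand the test on a stale node.
          function awaitRSClientHostsDynamic(conn, hostOrFn, hostOk, rs, timeoutMs) {
              assert.soon(function() {
                  // If a function was supplied, ask it for the expected primary now.
                  var expected = (typeof hostOrFn === "function") ? hostOrFn() : hostOrFn;
                  var rsStats = conn.adminCommand({connPoolStats: 1}).replicaSets[rs.name];
                  if (!rsStats)
                      return false;
                  return rsStats.hosts.some(function(h) {
                      return h.addr === expected.host && h.ok === hostOk.ok &&
                          h.ismaster === hostOk.ismaster;
                  });
              }, "connPoolStats never matched the expected primary", timeoutMs);
          }

          // Usage: pass a function so the expectation tracks any further elections.
          awaitRSClientHostsDynamic(st.s, function() { return st.rs0.getPrimary(); },
                                    {ok: true, ismaster: true}, st.rs0, 5 * 60 * 1000);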

            Assignee:
            Lamont Nelson (lamont.nelson@mongodb.com)
            Reporter:
            Lamont Nelson (lamont.nelson@mongodb.com)
            Votes:
            0
            Watchers:
            3
