SERVER-29153 (project: Core Server)

Make sure replica set nodes agree on which node is primary before doing writes in ShardingTest initialization


    Details

    • Backwards Compatibility:
      Fully Compatible
    • Backport Requested:
      v4.4, v4.2, v4.0, v3.6
    • Sprint:
      Sharding 2020-04-20
    • Linked BF Score:
      12

      Description

      The following sequence can occur during a replica set primary election:

      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:06.376+0000 c20512| 2017-03-29T17:21:06.375+0000 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:06.381+0000 c20512| 2017-03-29T17:21:06.381+0000 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 1
      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:06.381+0000 c20512| 2017-03-29T17:21:06.381+0000 I REPL     [ReplicationExecutor] transition to PRIMARY
      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:06.381+0000 c20512| 2017-03-29T17:21:06.381+0000 D REPL_HB  [ReplicationExecutor] Cancelling all heartbeats.
      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:06.382+0000 c20512| 2017-03-29T17:21:06.381+0000 I REPL     [ReplicationExecutor] Could not access any nodes within timeout when checking for additional ops to apply before finishing transition to primary. Will move forward with becoming primary anyway.
      

      Followed by

      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:18.428+0000 c20512| 2017-03-29T17:21:18.428+0000 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:18.428+0000 c20512| 2017-03-29T17:21:18.428+0000 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:18.428+0000 c20512| 2017-03-29T17:21:18.428+0000 I REPL     [replExecDBWorker-0] transition to SECONDARY
      

      roughly 10 seconds later.
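      A primary keeps its role only while it can communicate with a majority of the set, which is why c20512 relinquishes primary above. The sketch below shows only the majority arithmetic behind the "can't see a majority of the set" message. It assumes every member has exactly one vote and ignores arbiters, non-voting members, and heartbeat timing. majorityOf and canRemainPrimary are hypothetical names, not server functions.

```javascript
// Simplified model of the majority test behind
// "can't see a majority of the set, relinquishing primary".
// Assumptions: every member has one vote; arbiters, non-voting
// members and heartbeat/liveness timing are not modeled.

// Smallest number of votes that is strictly more than half.
function majorityOf(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// The primary counts itself plus every other member it currently hears from.
function canRemainPrimary(votingMembers, reachableOthers) {
  return 1 + reachableOthers >= majorityOf(votingMembers);
}

// Hypothetical 3-member set: a primary that reaches nobody must step down,
// while reaching a single other member is enough to stay primary.
console.log(canRemainPrimary(3, 0)); // prints false
console.log(canRemainPrimary(3, 1)); // prints true
```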

      ShardingTest initialization performs writes against the config server and against a shard. These writes can fail if the config server primary or the shard primary, respectively, steps down.

      Calling awaitNodesAgreeOnPrimary() before performing the necessary config server writes seems appropriate.
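      The condition awaitNodesAgreeOnPrimary() waits for can be modeled roughly as follows: every node, asked for its own view of the set, names the same single primary. The sketch below is a simplified, hypothetical model, not the mongo shell helper itself. It assumes each view is a replSetGetStatus-style document and reads only members[].name and members[].stateStr. The host names are made up. A test helper would retry this check until it holds or a timeout expires; the sketch shows only the check.

```javascript
// Simplified model of "all nodes agree on which node is primary".
// Each entry in `views` stands for a replSetGetStatus-style result as seen
// from one node; only members[].name and members[].stateStr are modeled.

// Returns the primary a single node reports, or null if it reports
// zero primaries or more than one.
function primaryAsSeenBy(statusDoc) {
  const primaries = statusDoc.members.filter((m) => m.stateStr === "PRIMARY");
  return primaries.length === 1 ? primaries[0].name : null;
}

// Returns the agreed-upon primary, or null if any node disagrees
// or no node reports a primary.
function agreedPrimary(views) {
  if (views.length === 0) return null;
  const first = primaryAsSeenBy(views[0]);
  if (first === null) return null;
  for (const view of views.slice(1)) {
    if (primaryAsSeenBy(view) !== first) return null;
  }
  return first;
}

// Hypothetical two-node set in which both nodes see n0 as primary.
const views = [
  { members: [{ name: "n0:20512", stateStr: "PRIMARY" }, { name: "n1:20513", stateStr: "SECONDARY" }] },
  { members: [{ name: "n0:20512", stateStr: "PRIMARY" }, { name: "n1:20513", stateStr: "SECONDARY" }] },
];
console.log(agreedPrimary(views)); // prints n0:20512
```

In the failure above, c20512 may report itself as PRIMARY while the other nodes, which it could not reach, still report no primary. In that case the check returns null, and writes gated on it would wait rather than go to a primary that is about to step down.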

      The shard write does not appear to be necessary. It looks like the write followed by awaitReplication() was just a hack to wait for the replica set to finish setting up (https://github.com/mongodb/mongo/commit/69207258a19b68fbbbc1377f61b00a9124d78c90#diff-bcff736df8cd6507ce081e800d9d1dd0R168). If so, we can remove the write, which is outdated and no longer useful, and replace awaitSecondaryNodes() with awaitNodesAgreeOnPrimary(). The latter states the intent much more clearly and has the same effect.


            People

            Assignee:
            jack.mulrow Jack Mulrow
            Reporter:
            dianna.hohensee Dianna Hohensee
            Votes:
            0
            Watchers:
            3
