Core Server / SERVER-29153

Make sure replica set nodes agree on which node is primary before doing writes in ShardingTest initialization


Details

    • Fully Compatible
    • v4.4, v4.2, v4.0, v3.6
    • Sharding 2020-04-20
    • 12

    Description

      This can occur during Replica Set primary election:

      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:06.376+0000 c20512| 2017-03-29T17:21:06.375+0000 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:06.381+0000 c20512| 2017-03-29T17:21:06.381+0000 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 1
      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:06.381+0000 c20512| 2017-03-29T17:21:06.381+0000 I REPL     [ReplicationExecutor] transition to PRIMARY
      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:06.381+0000 c20512| 2017-03-29T17:21:06.381+0000 D REPL_HB  [ReplicationExecutor] Cancelling all heartbeats.
      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:06.382+0000 c20512| 2017-03-29T17:21:06.381+0000 I REPL     [ReplicationExecutor] Could not access any nodes within timeout when checking for additional ops to apply before finishing transition to primary. Will move forward with becoming primary anyway.
      

      Followed by

      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:18.428+0000 c20512| 2017-03-29T17:21:18.428+0000 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:18.428+0000 c20512| 2017-03-29T17:21:18.428+0000 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
      [js_test:mapReduce_nonSharded] 2017-03-29T17:21:18.428+0000 c20512| 2017-03-29T17:21:18.428+0000 I REPL     [replExecDBWorker-0] transition to SECONDARY
      

      roughly 10 seconds later.

      ShardingTest initialization performs writes here and here that can fail if the config server or shard primary steps down, respectively.

      awaitNodesAgreeOnPrimary() seems appropriate to call before doing the necessary config server writes.

      The shard writes (in the link above) do not appear to actually be necessary. It looks like the write followed by awaitReplication() was just a hack to wait for the replica set to finish setting up (https://github.com/mongodb/mongo/commit/69207258a19b68fbbbc1377f61b00a9124d78c90#diff-bcff736df8cd6507ce081e800d9d1dd0R168). In that case, we can just remove the write, which is outdated and no longer useful, and replace awaitSecondaryNodes() with awaitNodesAgreeOnPrimary(), which makes the intent much clearer and has the same effect.
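      To illustrate the idea, here is a minimal sketch of the kind of polling that "agree on primary" implies: repeatedly ask every node which member it believes is primary, and only proceed once all answers name the same node. The node objects, hostnames, and isMaster-style accessor below are mocks for illustration, not the real test-harness implementation.

      ```javascript
      // Sketch of awaitNodesAgreeOnPrimary-style logic against mock nodes.
      // Polls each node's view of the primary until every view names the
      // same, non-null member, or the timeout expires.
      function awaitNodesAgreeOnPrimary(nodes, timeoutMs) {
          const deadline = Date.now() + timeoutMs;
          while (Date.now() < deadline) {
              // Each node reports which member it currently believes is primary.
              const views = nodes.map((n) => n.isMaster().primary);
              // Agreement: every node names the same, non-null primary.
              if (views[0] !== null && views.every((p) => p === views[0])) {
                  return views[0];
              }
          }
          throw new Error("nodes did not agree on a primary within " + timeoutMs + "ms");
      }

      // Mock replica set: the third node lags for a few polls before it
      // learns about the new primary, as a secondary might after an election.
      let polls = 0;
      const nodes = [
          {isMaster: () => ({primary: "host:20512"})},
          {isMaster: () => ({primary: "host:20512"})},
          {isMaster: () => ({primary: ++polls < 3 ? null : "host:20512"})},
      ];

      const primary = awaitNodesAgreeOnPrimary(nodes, 5000);
      console.log(primary);  // "host:20512"
      ```

      Writes issued only after such a barrier cannot race an in-progress election the way the log excerpt above shows, because no write is attempted until every member acknowledges the same primary.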


          People

            jack.mulrow@mongodb.com Jack Mulrow
            dianna.hohensee@mongodb.com Dianna Hohensee
            Votes: 0
            Watchers: 3
