  1. Core Server
  2. SERVER-40336

ReplicationCoordinatorImpl::_random isn't robust to replica set members being started at the same time


Details

    • Fully Compatible
    • ALL
    • v4.0, v3.6, v3.4
    • Repl 2019-04-08, Repl 2019-04-22
    • 0

    Description

      We've observed multiple cases in the sys-perf and sys-perf-4.0 Evergreen projects where a 2-shard cluster of 2-node replica sets is restarted and one of the shard replica sets fails to elect a primary after 11 attempts spanning ~2 minutes. Both nodes in that replica set repeatedly ran for election at the same time, and each time both nodes had already voted for themselves in that term, so neither could win. Random jitter is added to the election timeout, but it is drawn from a PseudoRandom that is seeded with the current time on startup. The performance infrastructure spawns mongod processes concurrently, so the startup time, and thus the seed for ReplicationCoordinatorImpl::_random, appears to end up identical on both nodes.
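
      The failure mode described above can be illustrated with a minimal standalone sketch. This is not the server code: std::mt19937_64 stands in for PseudoRandom, and the names jitterSequence, mixedSeed, and the 1500 ms jitter bound are hypothetical, chosen only for illustration. It shows that two generators seeded with the same startup timestamp yield the same jitter on every attempt, and one possible way to perturb the seed with additional entropy.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: draw `n` election-timeout jitter values (in ms) from a
// PRNG seeded with `seed`. std::mt19937_64 is a stand-in for PseudoRandom.
std::vector<std::int64_t> jitterSequence(std::uint64_t seed, int n, std::int64_t maxJitterMs) {
    assert(maxJitterMs > 0);
    std::mt19937_64 prng(seed);
    std::vector<std::int64_t> out;
    out.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        // Modulo is good enough for a sketch; it keeps results in [0, maxJitterMs).
        out.push_back(static_cast<std::int64_t>(prng() % static_cast<std::uint64_t>(maxJitterMs)));
    }
    return out;
}

// One possible mitigation (an assumption, not necessarily the fix adopted for
// this ticket): mix the startup time with entropy from std::random_device so
// that processes started in the same millisecond still get different seeds.
std::uint64_t mixedSeed(std::uint64_t startupTimeMs) {
    std::random_device rd;
    const std::uint64_t hi = rd();
    const std::uint64_t lo = rd();
    return startupTimeMs ^ ((hi << 32) | lo);
}
```

      If both nodes call jitterSequence with the same startup timestamp, all 11 attempts get the same jitter on both nodes, so their election timeouts stay in lockstep. Seeding each node with mixedSeed(startupTimeMs) instead makes matching sequences very unlikely, provided std::random_device is backed by a non-deterministic source on the platform.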

          People

            siyuan.zhou@mongodb.com Siyuan Zhou
            max.hirschhorn@mongodb.com Max Hirschhorn
            Votes: 0
            Watchers: 6
