Core Server / SERVER-40336

ReplicationCoordinatorImpl::_random isn't robust to replica set members being started at the same time

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.0, v3.6, v3.4
    • Sprint:
      Repl 2019-04-08, Repl 2019-04-22
    • Linked BF Score:
      0

      Description

      We've observed multiple cases in the sys-perf and sys-perf-4.0 Evergreen projects where a 2-shard cluster, in which each shard is a 2-node replica set, is restarted and one of the shards fails to elect a primary after 11 attempts spanning ~2 minutes. Both nodes in the affected replica set repeatedly ran for election at the same time, and on each attempt found that the other node had already voted for itself in that term, so neither candidate could win a majority. Random jitter is added to the election timeout, but it is drawn from a PseudoRandom that is seeded with the current time on startup. The performance infrastructure spawns mongod processes concurrently, and the nodes appear to end up with the same startup time and therefore the same seed for ReplicationCoordinatorImpl::_random, which makes their jitter identical.
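
      The failure mode described above can be illustrated with a minimal Python sketch. It is not the server's C++ code: the function name `election_timeouts`, the 10-second base timeout, the 1.5-second jitter ceiling, and the startup timestamp are all assumptions chosen for illustration. It shows that two generators seeded with the same startup time produce identical jitter on every attempt, while seeds drawn from an OS entropy source (one possible mitigation, not necessarily the fix adopted for this ticket) diverge.

```python
import random
import secrets

# Assumed values for illustration only; not read from any server configuration.
ELECTION_TIMEOUT_MS = 10_000
MAX_JITTER_MS = 1_500


def election_timeouts(seed, attempts=11):
    """Return the jittered election timeouts one node would use, given its PRNG seed."""
    rng = random.Random(seed)
    return [ELECTION_TIMEOUT_MS + rng.randrange(MAX_JITTER_MS) for _ in range(attempts)]


# Hypothetical startup time (seconds) shared by two nodes spawned concurrently.
startup_time = 1_554_000_000

# Same seed on both nodes: every attempt uses the same timeout, so both nodes
# keep standing for election at the same moment.
node_a = election_timeouts(startup_time)
node_b = election_timeouts(startup_time)

# Seeds drawn independently from the OS entropy source: the sequences differ.
node_c = election_timeouts(secrets.randbits(64))
node_d = election_timeouts(secrets.randbits(64))

print("time-seeded nodes collide:", node_a == node_b)
print("entropy-seeded nodes collide:", node_c == node_d)
```

      With identical seeds, retrying does not help, because the jitter sequence is deterministic and both nodes walk through it in lockstep; this is consistent with all 11 attempts failing in the same way.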

            People

            Assignee:
            siyuan.zhou Siyuan Zhou
            Reporter:
            max.hirschhorn Max Hirschhorn
            Votes:
            0
            Watchers:
            7