[SERVER-40336] ReplicationCoordinatorImpl::_random isn't robust to replica set members being started at the same time
Created: 26/Mar/19  Updated: 29/Oct/23  Resolved: 08/Apr/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.4.22, 3.6.14, 4.1.10, 4.0.11

Type: Bug
Priority: Major - P3
Reporter: Max Hirschhorn
Assignee: Siyuan Zhou
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.0, v3.6, v3.4
Sprint: Repl 2019-04-08, Repl 2019-04-22
Participants:
Linked BF Score: 0

 Description   

We've observed multiple cases in the sys-perf and sys-perf-4.0 Evergreen projects where a 2-shard cluster whose shards are 2-node replica sets is restarted and one of the shards fails to elect a primary after 11 attempts spanning ~2 minutes. Both nodes in the affected replica set repeatedly ran for election at the same time, so on every attempt each node had already voted for itself in that term and could not vote for the other. Random jitter is added to the election timeout, but it comes from a PseudoRandom that is seeded with the current time on startup. The performance infrastructure spawns mongod processes concurrently, and it appears that the startup time, and thus the seed for ReplicationCoordinatorImpl::_random, can end up being the same on both nodes.
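
The failure mode can be illustrated with a minimal Python sketch. It is not the server code: `random.Random` stands in for the server's PseudoRandom, and the timeout, jitter bound, seed value, and the `election_timeouts` helper are assumptions chosen for illustration only.

```python
import random

ELECTION_TIMEOUT_MS = 10_000   # assumed base election timeout
MAX_JITTER_MS = 1_500          # assumed upper bound on the random jitter


def election_timeouts(seed, attempts):
    """Return the jittered timeouts one node would use for `attempts` elections."""
    rng = random.Random(seed)  # stand-in for the node's time-seeded PRNG
    return [ELECTION_TIMEOUT_MS + rng.randrange(MAX_JITTER_MS)
            for _ in range(attempts)]


# Both processes start within the same clock tick, so they derive the same seed.
startup_time_seed = 1553558400  # hypothetical startup timestamp, in seconds
node_a = election_timeouts(startup_time_seed, attempts=11)
node_b = election_timeouts(startup_time_seed, attempts=11)

# Identical seeds give identical jitter on every attempt, so the nodes keep
# timing out together and each votes for itself before hearing from the other.
collisions = sum(a == b for a, b in zip(node_a, node_b))
print(f"{collisions} of {len(node_a)} attempts used the same timeout")
```

Under these assumptions the jitter never separates the two nodes, no matter how many attempts are made, which is consistent with the repeated split votes described above.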



 Comments   
Comment by Githook User [ 14/Jun/19 ]

Author: Siyuan Zhou <siyuan.zhou@mongodb.com> (visualzhou)

Message: SERVER-40336 Use SecureRandom to seed the random number generator in replication coordinator.

(cherry picked from commit c600aa9d7423eca8151daf626e2799d9a6c7b31c)
Branch: v4.0
https://github.com/mongodb/mongo/commit/51547d1484c8d885a206d6087c164f9cf3e87e64

Comment by Githook User [ 14/Jun/19 ]

Author: Siyuan Zhou <siyuan.zhou@mongodb.com> (visualzhou)

Message: SERVER-40336 Use SecureRandom to seed the random number generator in replication coordinator.

(cherry picked from commit c600aa9d7423eca8151daf626e2799d9a6c7b31c)
Branch: v3.6
https://github.com/mongodb/mongo/commit/89f180375cdde8ed4c5ed72c74af4c48d9a3f401

Comment by Githook User [ 14/Jun/19 ]

Author: Siyuan Zhou <siyuan.zhou@mongodb.com> (visualzhou)

Message: SERVER-40336 Use SecureRandom to seed the random number generator in replication coordinator.

(cherry picked from commit c600aa9d7423eca8151daf626e2799d9a6c7b31c)
Branch: v3.4
https://github.com/mongodb/mongo/commit/9d0331eb17fddb3ae2215648363223cc0ae03a0f

Comment by Githook User [ 08/Apr/19 ]

Author: Siyuan Zhou <siyuan.zhou@mongodb.com> (visualzhou)

Message: SERVER-40336 Use SecureRandom to seed the random number generator in replication coordinator.
Branch: master
https://github.com/mongodb/mongo/commit/c600aa9d7423eca8151daf626e2799d9a6c7b31c

Comment by Andy Schwerin [ 01/Apr/19 ]

Ha! That was probably constructed before we had a good implementation of SecureRandom. I suggest we initialize the PRNG with the output of an instance of SecureRandom as a short-term fix. I suspect we could also adjust the constructor of ReplicationCoordinatorImpl to accept an RNG instead of a seed for a PRNG it constructs itself. If RCI uses its random number generator infrequently enough, we might consider just letting it use SecureRandom in production and keeping the PRNG for deterministic unit testing only.
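
The options in this comment can be sketched in Python, as a rough illustration rather than the actual change. `ElectionTimer`, the factory functions, and the jitter bound are hypothetical names and values; `secrets` and `random.SystemRandom` stand in for SecureRandom, and `random.Random` for the PRNG.

```python
import random
import secrets


class ElectionTimer:
    """Hypothetical stand-in for the part of the coordinator that picks jitter."""

    def __init__(self, rng, max_jitter_ms=1_500):
        self._rng = rng                      # injected generator, not a seed
        self._max_jitter_ms = max_jitter_ms  # assumed jitter bound

    def next_jitter_ms(self):
        return self._rng.randrange(self._max_jitter_ms)


def make_production_timer():
    # Short-term fix: keep the fast PRNG, but seed it from the OS entropy
    # source instead of the startup time.
    return ElectionTimer(random.Random(secrets.randbits(64)))


def make_secure_timer():
    # Alternative: draw every value from the OS entropy source.
    return ElectionTimer(random.SystemRandom())


def make_test_timer(seed=42):
    # Deterministic generator for unit tests.
    return ElectionTimer(random.Random(seed))
```

Passing the generator in, rather than a seed, lets the caller decide between reproducibility (tests) and unpredictability (production) without the timer needing to know which one it has.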
