[SERVER-43778] db_repl_coordinator_test relies on specific PseudoRandom sequence Created: 03/Oct/19  Updated: 29/Oct/23  Resolved: 08/Oct/19

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: 4.3.1

Type: Bug Priority: Major - P3
Reporter: Billy Donahue Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Gantt Dependency
has to be done before SERVER-43641 platform/random.h causing bugs, upgra... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

https://logkeeper.mongodb.org/lobster/build/963b4308c9524506b0757603f70ff063/test/5d942c83c2ab68380ca0e7a0#bookmarks=0%2C38606&l=1

Sprint: Dev Tools 2019-10-07, Dev Tools 2019-10-21
Participants:

 Description   

An upgrade to PseudoRandom was reverted due to a test relying on the specific bits output by PseudoRandom(0). This is not a good situation (SERVER-43641).

Tests must not hardcode this sort of thing. If we do, we can never make improvements to the generators without updating all such tests.

Regarding db/repl/replication_coordinator_impl_elect_v1_test.cpp:
3 tests from the TakeoverTest/ suite are affected:

CatchupTakeoverCallbackCanceledIfElectionTimeoutRuns
DontCallForPriorityTakeoverWhenLaggedDifferentSecond
DontCallForPriorityTakeoverWhenLaggedSameSecond

These only seem to work reliably when fed a (now legacy) PseudoRandom
initialized with a seed of 0. Otherwise the election timeouts are randomized in such a way that the test doesn't reach the desired state, and it fails.
This is extremely fragile and should be fixed asap.

The ReplCoordinatorImpl takes a seed in its constructor. From this seed it makes a PseudoRandom which it uses to generate electionTimeout intervals. This is very hit-or-miss, and a test would have to hope to find a seed that puts the RS into a desired state, and such a seed, if found, would need to be updated with every little tweak of the random number generator or the interval upperBound, etc. Tests really need to directly control the election timeout durations in order to get the RS into their desired state. So really the ctor should take a Duration generator rather than a seed.

For the moment I'm going to bring the entire PseudoRandom "XorShift" implementation into the test as a generator.

PS: Another way to go here would be to use a FailPoint to inject an electionTimeout result, overriding the randomly generated result.

See also SERVER-43767 (related issue in another test)



 Comments   
Comment by Githook User [ 08/Oct/19 ]

Author:

{'username': 'BillyDonahue', 'email': 'billy.donahue@mongodb.com', 'name': 'Billy Donahue'}

Message: SERVER-43778 remove randomness fragility from ReplicationCoordinatorImpl tests
Branch: master
https://github.com/mongodb/mongo/commit/f1a3b974c9b8f9519bbc317859a8ff9bde62e72a

Comment by Billy Donahue [ 03/Oct/19 ]

Sorry, I don't see how it will fix the problem.
A specific sequence of timeout variations seems necessary to the test.
Just reducing those variations to zero doesn't yield a passing test.

Comment by William Schultz (Inactive) [ 03/Oct/19 ]

billy.donahue If there are certain tests that require election timeouts to be deterministic, I think the simplest solution would be to explicitly set the electionTimeoutOffsetLimitFraction to 0 for the tests that need it. This parameter controls the size of the random offset added to the base election timeout. If it is 0 then the election timeout values will be constant, since there is no random offset added. You can see an example of us doing this in another unit test here. I agree it's not great that certain tests rely on the specific values of the election timeout, but I think this is probably the easiest fix to avoid the problems you are seeing with the new random number generator.

Comment by Billy Donahue [ 03/Oct/19 ]

judah.schvimer I'm not entirely satisfied with the workaround I've implemented in the CR, but OTOH I don't know enough about the subject matter here to design a logical sequence of electionTimeouts to inject. The workaround unblocks the changes I need to make to random.h. What do you think?

Comment by Billy Donahue [ 03/Oct/19 ]

http://mongodbcr.appspot.com/481660028

Generated at Thu Feb 08 05:04:05 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.