[SERVER-45151] Skip call to awaitNodesAgreeOnAppliedOptime during initiate if high slave delay or in multiversion test Created: 13/Dec/19 Updated: 29/Oct/23 Resolved: 16/Dec/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Samyukta Lanka | Assignee: | Samyukta Lanka |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | ALL | ||||||||||||
| Sprint: | Repl 2019-12-16, Repl 2019-12-30 | ||||||||||||
| Participants: | |||||||||||||
| Linked BF Score: | 24 | ||||||||||||
| Description |
|
In a test with a secondary that has a high slave delay, the secondary can exit initial sync and then an insert into the system.keys collection can advance the primary's lastApplied. When awaitNodesAgreeOnAppliedOpTime is then called, the secondary does not apply that write until its delay has elapsed, so the call keeps waiting until the test times out. In a multiversion test, we skip shortening the heartbeat period. If the noop writer is turned on with an interval of 1 second, awaitNodesAgreeOnAppliedOpTime can time out because just as a node advances to meet the other node's expectation, the expectation advances. Nodes learn the other nodes' lastApplied through heartbeats, and because the heartbeat interval stays at 2 seconds, they cannot update their view of the other nodes before the next noop write happens. |
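The multiversion timing problem can be illustrated with a toy model. The sketch below is not the server's or the test fixture's code: the function name `agreement_fraction`, the 200 ms "shortened" heartbeat, the heartbeat offset, and the assumption that the secondary replicates instantly are all hypothetical. It only estimates how often the primary's heartbeat-driven view of the secondary matches the primary's own lastApplied while a noop writer keeps advancing it.

```python
def agreement_fraction(heartbeat_ms, noop_ms, duration_ms=60_000,
                       heartbeat_offset_ms=500, step_ms=100):
    """Fraction of sampled instants at which the primary's view of the
    secondary's lastApplied equals the primary's own lastApplied.

    Toy model (all assumptions, not server behaviour):
      * a noop write lands on the primary every `noop_ms`;
      * the secondary applies each write instantly;
      * the primary refreshes its view of the secondary only when a
        heartbeat fires, every `heartbeat_ms`, starting at the offset.
    """
    primary_applied = 0            # number of noop writes applied on the primary
    secondary_applied = 0          # the secondary's own lastApplied
    primary_view_of_secondary = 0  # what the primary believes about the secondary
    agree = 0
    samples = 0

    for t in range(0, duration_ms, step_ms):
        # The noop writer advances the primary; the secondary follows at once.
        if t > 0 and t % noop_ms == 0:
            primary_applied += 1
            secondary_applied = primary_applied

        # A heartbeat is the only way the primary's view gets refreshed.
        if t >= heartbeat_offset_ms and (t - heartbeat_offset_ms) % heartbeat_ms == 0:
            primary_view_of_secondary = secondary_applied

        samples += 1
        if primary_view_of_secondary == primary_applied:
            agree += 1

    return agree / samples


if __name__ == "__main__":
    # 2 s heartbeat with a 1 s noop writer, as in the multiversion case.
    print("2000 ms heartbeat:", agreement_fraction(2000, 1000))
    # Hypothetical shortened heartbeat, for comparison only.
    print(" 200 ms heartbeat:", agreement_fraction(200, 1000))
```

Even in this simplified model the window in which the views line up is much narrower with the 2 second heartbeat than with the shortened one. The real check needs every node's view to agree at the same polling instant, so the actual window is likely narrower still.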
| Comments |
| Comment by Githook User [ 16/Dec/19 ] |
|
Author: {'name': 'Samyukta Lanka', 'email': 'samy.lanka@mongodb.com', 'username': 'lankas'} Message: |
| Comment by Samyukta Lanka [ 16/Dec/19 ] |
|
Sorry judah.schvimer, it's safe to skip because the function call is an optimization added in … Multiversion tests set the failPointsSupported flag to false, so the heartbeat interval is not turned down. This means that multiversion tests would not see a benefit from this optimization anyway. Tests that run with a high slave delay must avoid waiting for replication in all cases, so this optimization doesn't make sense for them either. |
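The skip condition described in this comment could be sketched roughly as follows. This is only an illustration of the decision, not the actual change: `MemberConfig`, `should_await_applied_optime_agreement`, and the 30 second cutoff for a "high" slave delay are hypothetical names and values that do not come from the ticket.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical cutoff for a "high" slave delay; the ticket does not state one.
HIGH_SLAVE_DELAY_SECS = 30


@dataclass
class MemberConfig:
    """Hypothetical stand-in for one replica set member's configuration."""
    host: str
    slave_delay_secs: int = 0


def should_await_applied_optime_agreement(
    members: Iterable[MemberConfig],
    fail_points_supported: bool,
    high_delay_secs: int = HIGH_SLAVE_DELAY_SECS,
) -> bool:
    """Return True only when waiting for the nodes to agree is worthwhile."""
    # Multiversion tests set failPointsSupported to false, so the heartbeat
    # interval is never turned down and the optimization brings no benefit.
    if not fail_points_supported:
        return False

    # A member with a high slave delay would not catch up before the test
    # times out, so such tests must not wait for replication here.
    if any(m.slave_delay_secs >= high_delay_secs for m in members):
        return False

    return True
```

A caller would run the wait during initiate only when this returns True, for example `should_await_applied_optime_agreement([MemberConfig("a"), MemberConfig("b", slave_delay_secs=3600)], True)` evaluates to False.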
| Comment by Judah Schvimer [ 16/Dec/19 ] |
|
samy.lanka, why is it safe to skip the function call? Why was the function call added originally, and will skipping it reintroduce a bug? Do we ever need to call that function? |