[SERVER-45151] Skip call to awaitNodesAgreeOnAppliedOptime during initiate if high slave delay or in multiversion test Created: 13/Dec/19  Updated: 29/Oct/23  Resolved: 16/Dec/19

Status: Closed
Project: Core Server
Component/s: Replication, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.3.3

Type: Bug Priority: Major - P3
Reporter: Samyukta Lanka Assignee: Samyukta Lanka
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-43766 Investigate the slowest sections of R... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2019-12-16, Repl 2019-12-30
Participants:
Linked BF Score: 24

 Description   

In a test where a secondary has a high slave delay, the secondary can exit initial sync and then an insert into the system.keys collection can occur, advancing the primary's lastApplied. When awaitNodesAgreeOnAppliedOpTime is then called, the delayed secondary cannot apply that insert until its delay has elapsed, so the call keeps waiting until the test times out.
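
The timing argument above can be sketched as simple arithmetic. This is a minimal illustration, not code from the test framework: the function name, its parameters, and the numbers in the usage note are hypothetical, and it assumes the delayed secondary applies a write exactly slave-delay seconds after the primary does.

```python
def secondary_agrees_before_timeout(write_time_s, slave_delay_s,
                                    wait_start_s, timeout_s):
    """Return True if a delayed secondary applies the write before the wait expires.

    All names and the timing model are assumptions made for illustration.
    """
    # Earliest moment the delayed secondary may apply the primary's write.
    caught_up_at_s = write_time_s + slave_delay_s
    # Moment at which the waiting call gives up.
    deadline_s = wait_start_s + timeout_s
    return caught_up_at_s <= deadline_s
```

For example, with an illustrative 600-second wait and a one-hour slave delay, `secondary_agrees_before_timeout(5, 3600, 6, 600)` is false, while the same call with a 2-second delay is true.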

In a multiversion test, we skip shortening the heartbeat period. If the noop writer is turned on with an interval of 1 second, awaitNodesAgreeOnAppliedOpTime can time out because, just as a node advances to meet the other node's expectation, the expectation advances. Nodes update their understanding of the other nodes' lastApplied through heartbeats, and because the heartbeat interval stays at 2 seconds, they cannot refresh their view of the other nodes before the next noop write happens.
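
The race between heartbeats and noop writes can be illustrated with a toy model. This is a sketch under loudly stated assumptions, not the server's or the test framework's logic: the function name and parameters are hypothetical, peers are assumed to apply each noop write instantly, and an observer learns a peer's lastApplied only at that peer's periodic heartbeat. The 500 ms interval in the usage note is an illustrative value, not the one the framework uses.

```python
def views_ever_agree(heartbeat_ms, noop_ms, phases_ms, duration_ms, step_ms=50):
    """Return True if, at some sampled instant, the observer's view of every
    peer equals the primary's current lastApplied.

    Toy model (assumption): lastApplied is counted in noop writes, and each
    peer's heartbeat arrives at phase, phase + heartbeat_ms, and so on.
    """
    for now in range(max(phases_ms), duration_ms, step_ms):
        current = now // noop_ms  # primary's lastApplied at this instant
        views = []
        for phase in phases_ms:
            # Time of the most recent heartbeat received from this peer.
            last_hb = now - ((now - phase) % heartbeat_ms)
            # The peer's lastApplied as reported in that heartbeat.
            views.append(last_hb // noop_ms)
        if all(view == current for view in views):
            return True
    return False
```

With a 2-second heartbeat, a 1-second noop interval, and two peers whose heartbeats are offset by one second (`views_ever_agree(2000, 1000, [200, 1200], 60000)`), the views never line up in this model. Shortening the heartbeat to 500 ms (`views_ever_agree(500, 1000, [200, 450], 60000)`) lets every peer report within the same noop window, so agreement is reached.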



 Comments   
Comment by Githook User [ 16/Dec/19 ]

Author: Samyukta Lanka <samy.lanka@mongodb.com> (username: lankas)

Message: SERVER-45151 Skip call to awaitNodesAgreeOnAppliedOptime during initiate if high slave delay or in multiversion test
Branch: master
https://github.com/mongodb/mongo/commit/0b0e23318cc5f15297a94524b5e3c1c51decbff1

Comment by Samyukta Lanka [ 16/Dec/19 ]

Sorry judah.schvimer, it's safe to skip because the function call is an optimization added in SERVER-43766. We don't need to call that function as a part of initiate, but it improves performance because we wait for replication while having a lower heartbeat interval.

Multiversion tests set the failPointsSupported flag to false, so the heartbeat interval is not turned down. This means that multiversion tests would not see a benefit from this optimization anyway.

Tests that run with a high slave delay must avoid waiting for replication in all cases, so this optimization doesn't make sense for them either.
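
The skip decision described in this comment could be sketched as follows. This is a hypothetical illustration, not the code from the linked commit: the function name, the threshold parameter, and the member dictionaries are assumptions, with `slaveDelay` taken to be the per-member delay in seconds.

```python
def should_await_applied_optime(member_configs, fail_points_supported,
                                high_delay_threshold_s):
    """Decide whether initiate should wait for nodes to agree on lastApplied.

    Hypothetical helper: the threshold for a "high" delay is an assumption.
    """
    # Multiversion tests cannot shorten the heartbeat interval, so the
    # optimization brings no benefit there.
    if not fail_points_supported:
        return False
    # Tests with a highly delayed member must not wait for replication.
    return all(member.get("slaveDelay", 0) <= high_delay_threshold_s
               for member in member_configs)
```

For instance, `should_await_applied_optime([{"_id": 0}, {"_id": 1, "slaveDelay": 3600}], True, 30)` is false, so the wait would be skipped for that configuration.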

Comment by Judah Schvimer [ 16/Dec/19 ]

samy.lanka, why is it safe to skip the function call? Why was the function call added originally, and will skipping it reintroduce some bug? Do we ever need to call that function?
