[SERVER-46907] Speed up config replication acknowledgement Created: 17/Mar/20  Updated: 29/Oct/23  Resolved: 25/May/22

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 6.1.0-rc0

Type: Improvement Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: Matt Broadstone
Resolution: Fixed Votes: 0
Labels: former-quick-wins, safe-reconfig-related
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-45095 Measure the running time of safe reco... Closed
Backwards Compatibility: Fully Compatible
Participants:

 Description   

A series of safe replica set reconfigs incurs an unnecessary pause of about 2 seconds (one heartbeat interval) between consecutive reconfigs. The primary that receives the replSetReconfig command takes around 2 seconds to receive acknowledgement from a majority of members that they have replicated the new config. After storing a new config, the primary immediately sends a new round of heartbeat requests to all members, and the secondaries immediately fetch and install the new config upon seeing the newer configVersion. The primary, however, satisfies the config replication check only once it has learned about the newly installed configs via heartbeat responses. Because the primary does not send another round of heartbeat requests for 2 seconds, even though it and the secondaries have already installed the new config, the check takes ~2 seconds to pass. Ideally this unnecessary waiting can be eliminated.
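
To make the timing concrete, here is a rough model of the delay described above. It is only a sketch under simplifying assumptions (zero network latency, heartbeat rounds sent at fixed multiples of the interval, the primary counting itself immediately); the function name `ack_time` and its parameters are hypothetical and do not correspond to anything in the server code.

```python
import math

HEARTBEAT_INTERVAL_S = 2.0  # heartbeat interval cited in this ticket


def ack_time(install_times, heartbeat_interval=HEARTBEAT_INTERVAL_S):
    """Estimate when the primary observes a majority on the new config.

    install_times: for each secondary, seconds after the reconfig at which
    it installs the new config. The primary counts itself at t=0.
    Assumption: the primary only learns of an install from the response to
    a heartbeat round sent after the install, and rounds go out at
    t = 0, interval, 2 * interval, ...
    """
    total_members = len(install_times) + 1
    majority = total_members // 2 + 1
    needed = majority - 1  # secondaries required in addition to the primary
    if needed == 0:
        return 0.0
    # Install time of the secondary that completes the majority.
    kth_install = sorted(install_times)[needed - 1]
    # The round at t=0 precedes every install, so at least one more round
    # is always required before the primary can see the new config.
    rounds = max(1, math.ceil(kth_install / heartbeat_interval))
    return rounds * heartbeat_interval


# Three-node set where both secondaries install within 50 ms:
# the primary still waits for the next heartbeat round.
print(ack_time([0.05, 0.05]))
```

Under this model the wait is set by the heartbeat schedule rather than by how quickly the secondaries install the config, which is the delay this ticket aims to remove.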



 Comments   
Comment by Githook User [ 25/May/22 ]

Author: Matt Broadstone <mbroadst@mongodb.com> (mbroadst)

Message: SERVER-46907 Speed up reconfig replication acknowledgement
Branch: master
https://github.com/mongodb/mongo/commit/25c8241437807544cd15d520a702b092146f4ace

Comment by Judah Schvimer [ 06/Apr/20 ]

We will address this if users complain about it.

Comment by Siyuan Zhou [ 17/Mar/20 ]

Agreed, this is a valuable improvement. It can be solved by updating the primary's view of each node's config version/term when it receives heartbeat requests from that node, rather than relying only on heartbeat responses as mentioned by william.schultz. If I remember correctly, jesse proposed temporarily shortening the heartbeat interval on the primary as an alternative.
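
A minimal sketch of the first idea, assuming the primary keeps a per-member record of the latest config it has heard about. All names here (`ConfigId`, `ConfigAckTracker`, and its methods) are hypothetical and are not the server's actual classes; ordering configs by term first and then version is also an assumption of this sketch.

```python
from typing import Dict, NamedTuple, Optional


class ConfigId(NamedTuple):
    # Assumption: configs are ordered by term first, then version.
    term: int
    version: int


class ConfigAckTracker:
    """Tracks which members are known to have installed the target config."""

    def __init__(self, member_ids, self_id, target: ConfigId):
        self.target = target
        self.known: Dict[str, Optional[ConfigId]] = {m: None for m in member_ids}
        self.known[self_id] = target  # the primary has already stored it

    def _record(self, member, config_id: ConfigId):
        if member not in self.known:
            return  # ignore nodes outside the current config
        prev = self.known[member]
        # Keep only the newest config reported; stale messages are ignored.
        if prev is None or config_id > prev:
            self.known[member] = config_id

    def on_heartbeat_response(self, member, config_id: ConfigId):
        # Existing path: learn from responses to the primary's own heartbeats.
        self._record(member, config_id)

    def on_heartbeat_request(self, member, config_id: ConfigId):
        # Proposed path: also learn from heartbeat requests sent by the member.
        self._record(member, config_id)

    def majority_replicated(self) -> bool:
        acks = sum(
            1 for c in self.known.values() if c is not None and c >= self.target
        )
        return acks >= len(self.known) // 2 + 1


target = ConfigId(term=5, version=3)
tracker = ConfigAckTracker(["a", "b", "c"], self_id="a", target=target)
print(tracker.majority_replicated())  # primary alone is not a majority
tracker.on_heartbeat_request("b", target)
print(tracker.majority_replicated())  # one incoming request completes it
```

With both paths feeding the same record, the check can pass as soon as any message from a secondary carries the new config, without waiting for the primary's next heartbeat round.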

Comment by Judah Schvimer [ 17/Mar/20 ]

This seems like a very valuable perf improvement, especially since I expect we experience this in our tests a lot.

Comment by William Schultz (Inactive) [ 17/Mar/20 ]

As I understand it, the primary must learn of the newly installed configs via heartbeat responses, not requests from other nodes. So, I believe the biggest delay is caused by the primary not sending out new heartbeat requests for ~2 seconds (a heartbeat interval) even after all nodes may have installed the config very quickly.
