[SERVER-46907] Speed up config replication acknowledgement Created: 17/Mar/20 Updated: 29/Oct/23 Resolved: 25/May/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 6.1.0-rc0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | A. Jesse Jiryu Davis | Assignee: | Matt Broadstone |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | former-quick-wins, safe-reconfig-related |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Description |
|
A series of safe replica set reconfigs must pause for an unnecessary ~2 seconds (one heartbeat interval) between reconfigs, because the primary that receives the replSetReconfig command takes around 2 seconds to receive acknowledgement from a majority of members that they have replicated the new config. After storing a new config, the primary immediately sends a new round of heartbeat requests to all members, and the secondaries immediately fetch and install the new config upon seeing the newer configVersion. The primary, however, satisfies the config replication check only once it has learned about the newly installed configs via heartbeat responses. Because the primary waits 2 seconds before sending its next round of heartbeat requests, even though it and the secondaries may have installed the new config almost immediately, it takes ~2 seconds to satisfy the config replication check. Ideally this unnecessary waiting can be eliminated. |
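The timing described above can be illustrated with a rough model. This is only a sketch under simplifying assumptions, not the server's logic: network latency is ignored, heartbeat rounds are assumed to go out at exact multiples of the interval starting at the moment the config is stored, and the function names are hypothetical.

```python
import math

HEARTBEAT_INTERVAL_SECS = 2.0  # heartbeat interval cited in the description


def first_ack_time(install_delay_secs, interval_secs=HEARTBEAT_INTERVAL_SECS):
    """Time at which the primary first sees a response reporting the new config.

    Assumed model: the round sent at t=0 only triggers the secondary to fetch
    the config, so the earliest useful response belongs to the first later
    round sent at or after the secondary finished installing.
    """
    rounds_to_wait = max(1, math.ceil(install_delay_secs / interval_secs))
    return rounds_to_wait * interval_secs


def majority_ack_time(secondary_install_delays, interval_secs=HEARTBEAT_INTERVAL_SECS):
    """Time at which a majority of members (primary included) is known to have the config."""
    total_members = len(secondary_install_delays) + 1  # secondaries plus the primary
    majority = total_members // 2 + 1
    needed_secondaries = majority - 1  # the primary already has the config
    if needed_secondaries == 0:
        return 0.0
    ack_times = sorted(first_ack_time(d, interval_secs) for d in secondary_install_delays)
    return ack_times[needed_secondaries - 1]


# Two secondaries install within tens of milliseconds, yet the primary
# still waits a full heartbeat interval before it can observe that.
print(majority_ack_time([0.01, 0.02]))
```

In this model the wait is dominated by the heartbeat interval rather than by how quickly the secondaries install the config, which is the inefficiency the ticket describes.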
| Comments |
| Comment by Githook User [ 25/May/22 ] |
|
Author: Matt Broadstone (mbroadst, mbroadst@mongodb.com) Message: |
| Comment by Judah Schvimer [ 06/Apr/20 ] |
|
We will address this if users complain about it. |
| Comment by Siyuan Zhou [ 17/Mar/20 ] |
|
Agreed this is a valuable improvement. This can be solved by updating the recorded config version/term of other nodes upon receiving heartbeat requests from them, rather than relying only on heartbeat responses as mentioned by william.schultz. If I remember correctly, jesse proposed temporarily shortening the heartbeat interval on the primary as an alternative. |
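A minimal sketch of the first idea, learning a member's config from its incoming heartbeat requests as well as from its responses. The class and method names are hypothetical, and the sketch assumes a heartbeat request carries the sender's config version and term; it is not the server's implementation or necessarily the fix that was committed.

```python
class PrimaryConfigTracker:
    """Hypothetical bookkeeping of which members are known to have the current config."""

    def __init__(self, self_id, member_ids, config):
        self.config = config  # (config_version, config_term) the primary just stored
        # Last config each member is known to have; None means unknown.
        self.known = {member_id: None for member_id in member_ids}
        self.known[self_id] = config  # the primary has its own config

    def on_heartbeat_response(self, member_id, config):
        # Existing path per the description: learn from responses.
        self._record(member_id, config)

    def on_heartbeat_request(self, member_id, config):
        # Proposed path: also learn from requests other nodes send to the primary.
        self._record(member_id, config)

    def _record(self, member_id, config):
        if member_id in self.known:
            self.known[member_id] = config

    def majority_has_config(self):
        matching = sum(1 for c in self.known.values() if c == self.config)
        return matching >= len(self.known) // 2 + 1


tracker = PrimaryConfigTracker("p", ["p", "s1", "s2"], (2, 1))
print(tracker.majority_has_config())   # only the primary is known to have the config
tracker.on_heartbeat_request("s1", (2, 1))
print(tracker.majority_has_config())   # a request from s1 was enough to reach a majority
```

Under this sketch the primary no longer has to wait for its own next heartbeat round; any request arriving from a secondary that has installed the config counts toward the majority.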
| Comment by Judah Schvimer [ 17/Mar/20 ] |
|
This seems like a very valuable perf improvement, especially since I expect we experience this in our tests a lot. |
| Comment by William Schultz (Inactive) [ 17/Mar/20 ] |
|
As I understand it, the primary must learn of the newly installed configs via heartbeat responses, not requests from other nodes. So, I believe the biggest delay is caused by the primary not sending out new heartbeat requests for ~2 seconds (a heartbeat interval) even after all nodes may have installed the config very quickly. |