Major - P3
5.2.0, 5.3.1, 5.2.1, 5.0.9, 4.4.15, 4.2.21
(copied to CRM)
Starting in MongoDB v4.2 the Flow Control Mechanism was introduced in order to limit the rate at which the primary applies its writes with the goal of keeping the majority committed lag under a configurable maximum value of flowControlTargetLagSeconds.
At the same time, and whenever replication chaining is enabled, the sync source of a secondary will be changed if the most recent OpTime of the sync source is more than maxSyncSourceLagSecs seconds behind another member's latest oplog entry. This ensures that the sync source is not too far behind other nodes in the set. maxSyncSourceLagSecs is a server parameter and has a default value of 30 seconds.
The problem is that the value of maxSyncSourceLagSecs is bigger (3x) than the default value of 10 seconds for flowControlTargetLagSeconds and that can result in primary nodes being throttled by the Flow Control mechanism just because one secondary lags behind while enough secondary nodes to make up a majority also replicate from it. Imagine the following scenario:
- Pri (DC1), Sec (DC1), Lagged-Sec (DC2), Chained-Sec (DC3), Chained-Sec (DC3)
- Sec syncs from Pri, Lagged-Sec syncs from Pri, and Chained-Sec syncs from Lagged-Sec
- If there are any issues with Lagged-Sec that results in accumulating replication lag between 10 to 30 seconds, it ends up with 3 nodes having lag of above 10 seconds and kicking off Flow Control
- The above can result in severe impact to applications and it could go on and on if the lag floats between 10 to 30 seconds, or the lag presents as isolated spikes on that very same range.
If MongoDB were to consider the interplay between maxSyncSourceLagSecs and flowControlTargetLagSeconds in enviornments with chained replication enabled and revaluate its sync source before hitting flowControlTargetLagSeconds (or maybe shortly after?), then situations like the above would be avoided.
Some options I thought of:
- Gossip the value of flowControlTargetLagSeconds from the current Primary to the other replica set members and automatically adjust maxSyncSourceLagSecs as a percentage of the former value. This would apply only when replication chaining is enabled.
- Consider this dynamic adjustment only for nodes that have votes and priority set to 1 or above, meaning that they count towards the majority committed lag/point.