Major - P3
A collection of PRIMARY and SECONDARY servers, where the query load on the SECONDARY servers is high and the replication load is also very high.
This problem exists in all current Master-Slave servers that I use today. MongoDB has the exact same issues with Replication that I see in MySQL.
The situation is:
1) Primary is adding/updating data at a rate that is near or at the MAX for the hardware.
2) A Secondary server is starting to LAG due to loads other than Replication. (Other Secondary servers may be OK at the same time.)
3) "Read" requests are allowed on the Secondary servers to provide more read performance.
The Secondary starts to LAG and does not keep up with the flow of updates and inserts from the Primary. LAG just grows and grows.
Provide ways for the drivers and applications to "back off" and reduce the impact on the Secondary so it can "catch up". For example, select a Secondary that has lower LAG.
1) Add to the slaveOk request a condition of how long the LAG is allowed to be before the request must be performed by some "other" server.
2) Do not allow requests to be served by servers that are over the MAX_SLAVE_LAG set in each server's conf file.
3) Push back on the Primary if a high load on ALL Secondary servers would stop replication for all known servers. Could be a Read-Only mode.
4) Stop allowing read requests from servers with high Loads or LAGS.
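As a rough illustration of proposals 1 to 4 above, here is a minimal Python sketch of how a driver might route reads by LAG. It is only a sketch under assumptions: the Secondary class, the lag_seconds field and both function names are hypothetical, made up for this example, and are not part of any MongoDB driver or server API. It also assumes the driver already knows each Secondary's current LAG.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Secondary:
    """Hypothetical driver-side view of one Secondary server."""
    host: str
    lag_seconds: float  # how far this server trails the Primary


def pick_secondary(
    secondaries: Sequence[Secondary], max_lag_seconds: float
) -> Optional[Secondary]:
    """Return the least-lagged Secondary within the limit, or None.

    Servers over the limit are skipped (proposals 1, 2 and 4), so they
    receive no reads and get a chance to catch up.
    """
    eligible = [s for s in secondaries if s.lag_seconds <= max_lag_seconds]
    if not eligible:
        # Caller decides: send the read to some other server or fail it.
        return None
    return min(eligible, key=lambda s: s.lag_seconds)


def all_secondaries_lagging(
    secondaries: Sequence[Secondary], max_lag_seconds: float
) -> bool:
    """True when every known Secondary is over the limit (proposal 3).

    A Primary could use a signal like this to push back on writes.
    """
    return bool(secondaries) and all(
        s.lag_seconds > max_lag_seconds for s in secondaries
    )
```

For example, with one Secondary at 5 seconds of LAG and another at 400 seconds, pick_secondary(nodes, 60.0) returns the first server, and the second receives no reads until it catches up. If only the 400-second server is known, pick_secondary returns None and all_secondaries_lagging returns True.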
I know that the Root Cause of this issue is an overloaded cluster. What I am asking for is a nice, easy "push-back" from MongoDB and not a crash.
Today a MySQL Slave will also just stop replication if the OpLog runs too long/late. If the OpLog is very large, the LAG in this situation can last for hours or days.
If the application is OK with a LAG that is huge, this needs to be allowed. But for applications that require a LAG of, say, "under 60 sec.", a MongoDB feature that helps provide that service would be a great feature.
The cause of the LAG may also be a Secondary server that is used for backups and was not working for a moment due to a backup request. Today all of the Secondary servers get an equal number of requests.
Please call any time:
Edward M. Goldberg
SERVER-4936 Server support for "maxStalenessMS" read preference option
- is related to
SERVER-12861 Introduce a maxStalenessMS option when querying secondaries
SERVER-4935 Mark node Recovering when replication lag exceeds a configured threshold