[SERVER-3346] MAX SLAVE LAG - Features to provide a more stable Replication Set under high load Created: 29/Jun/11  Updated: 23/Jul/16  Resolved: 11/Jul/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Edward M. Goldberg Assignee: A. Jesse Jiryu Davis
Resolution: Duplicate Votes: 17
Labels: LAG, replication, slaveOk
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

A collection of PRIMARY and SECONDARY servers, where the query load is high on the SECONDARY servers and the replication load is also very high.


Issue Links:
Depends
Duplicate
duplicates SERVER-4936 Server support for "maxStalenessMS" r... Closed
Related
is related to SERVER-12861 Introduce a maxStalenessMS option whe... Closed
is related to SERVER-4935 Mark node Recovering when replication... Closed
Participants:

 Description   

This problem exists in all current Master-Slave servers that I use today. MongoDB has the exact same issues with Replication that I see in MySQL.

The situation is:

1) Primary is adding/updating data at a rate that is near or at the MAX for the hardware.
2) Secondary server is starting to LAG due to loads other than Replication. (Other Secondary servers may be OK at the same time.)
3) Read requests are allowed on the Secondary servers to provide more read performance.

Effect:

The Secondary starts to LAG and does not keep up with the flow of updates and inserts from the Primary. LAG just grows and grows.

Solution:

Provide ways for the drivers and applications to "back off" and reduce the impact on the Secondary so it can "catch up". For example, select a Secondary that has lower LAG.

Ideas:

1) Add to the slaveOk request a condition of how long the LAG is allowed to be before the request must be performed by some "other" server.
2) Do not allow requests from servers that are over the MAX_SLAVE_LAG set by each server's conf file.
3) Push back on the Primary if a high load on ALL Secondary servers would stop replication for all known servers. Could be a Read-Only mode.
4) Stop allowing read requests from servers with high Loads or LAGS.
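
Ideas 1 and 4 could be sketched on the driver side roughly as follows. This is a minimal Python sketch under stated assumptions, not an existing driver API: the `Member` record, `pick_read_target`, and the `max_lag_secs` parameter are hypothetical names, the host names are made up, and the lag values are assumed to be measured elsewhere.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    """Hypothetical view of one replica set member as a driver might track it."""
    name: str
    is_primary: bool
    lag_secs: float  # seconds behind the primary; 0 for the primary itself


def pick_read_target(members: List[Member], max_lag_secs: Optional[float]) -> Member:
    """Return the least-lagged secondary within max_lag_secs, else the primary.

    max_lag_secs=None means the application accepts any amount of lag.
    """
    secondaries = [m for m in members if not m.is_primary]
    if max_lag_secs is not None:
        secondaries = [m for m in secondaries if m.lag_secs <= max_lag_secs]
    if secondaries:
        return min(secondaries, key=lambda m: m.lag_secs)
    # No acceptable secondary: the read must be performed by some "other" server.
    for m in members:
        if m.is_primary:
            return m
    raise RuntimeError("no eligible member for this read")


# Made-up example data: db3 is far behind, for instance because of a backup.
members = [
    Member("db1:27017", True, 0.0),
    Member("db2:27017", False, 12.0),
    Member("db3:27017", False, 4200.0),
]
```

With a limit of 60 seconds the read goes to db2; with a limit of 5 seconds no secondary qualifies and the read falls back to the primary. Passing `None` keeps the "huge LAG is OK" case available to applications that want it.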

I know that the Root Cause of this issue is an overloaded cluster. What I am asking for is a nice easy "push-back" from the MongoDB and not a crash.

Today a MySQL Slave will also just stop replication if the OpLog runs too long/late. If the OpLog is very huge this situation can LAG for hours/days.

If the application is OK with a LAG that is huge, this needs to be allowed. But for applications that require a LAG of, say, "under 60 sec.", a MongoDB feature that helps provide that service would be a great feature.

The cause of the LAG may also be a Secondary server that is used for backups and was not working for a moment due to a backup request. Today all of the Secondary servers get an equal number of requests.

Please call any time:
Cell:     916-202-1600
Skype:  EdwardMGoldberg

Edward M. Goldberg
http://myCloudWatcher.com/
e.m.g.



 Comments   
Comment by Ramon Fernandez Marina [ 23/Jul/16 ]

I have marked this ticket as a duplicate of SERVER-4936, since SERVER-12861 and this ticket were marked as duplicates of each other.

Users interested in this feature can tune to SERVER-4936 for updates.

Regards,
Ramón.

Comment by Andy Schwerin [ 11/Jul/16 ]

I believe that this request is effectively duplicated by a combination of SERVER-12861 (for maximum tolerable lag in queries) and SERVER-24980 (for having the primary slow down or restrict writes when secondaries cannot keep up).

Comment by Kevin Rice [ 01/Mar/16 ]

This also applies to situations where the slave is running on slower hardware and batches are infrequent.
1. primary gets a big set of updates in a batch for a short time. Primary is on fast hardware (SSDs) and can process very fast.
2. slave is on spinning disk. Slave sees oplog and starts to update, takes much longer.

This is okay if the rate of updates is not big and we are only using the slave as a "backup" server, or a read-assist server during the day, and the updates happen at night.

BUT: if the rate of updates gets too large, the slaves can't catch up by the start of business in the morning, or, perhaps ever.

Having the mix of hardware would allow cheaper slave servers with the tradeoff that the info might be up to time-lag seconds old. Once the time-lag was too large, though, there should be a throttle on updates so it all can catch up instead of falling over and dying.
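
A throttle of this kind could be approximated in the writing application while no server-side support exists. The following is a minimal Python sketch, assuming the application can observe the current replication lag; `write_delay_secs` and its limits are hypothetical names and values, not MongoDB settings.

```python
def write_delay_secs(lag_secs: float, soft_limit: float = 60.0,
                     hard_limit: float = 600.0, max_delay: float = 1.0) -> float:
    """Delay to apply before each write batch, given the current replication lag.

    At or below soft_limit there is no delay. Between the limits the delay
    grows linearly. At or above hard_limit the delay is max_delay.
    """
    if lag_secs <= soft_limit:
        return 0.0
    if lag_secs >= hard_limit:
        return max_delay
    fraction = (lag_secs - soft_limit) / (hard_limit - soft_limit)
    return max_delay * fraction
```

A batch loader would call `time.sleep(write_delay_secs(lag))` before each batch. Which lag to feed in is a design choice: using the lag of the least-lagged slave only slows writes when every slave is behind, while using the most-lagged slave protects the slowest hardware at the cost of write throughput.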

Comment by hongyu.bi [ 21/Dec/15 ]

+1

Comment by Žygimantas Stauga [ 17/Dec/15 ]

+1

Comment by Juho Mäkinen [ 10/Sep/13 ]

A slave can be lagging behind, for example, when it has been resurrected from backup. Currently the DBA needs to alter the configuration and mark the slave as hidden until it has caught up, a task which this would make obsolete.

+1

Comment by Colin Howe [ 08/May/12 ]

This would be awesome. We are hosted on EC2 and sometimes our slaves get a little bit more behind than normal due to the hardware playing up. If that happens we want to stop the reads automatically. Currently, we're thinking of a watch process that changes the slave to 'hidden' if the lag goes over a certain value.
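
The decision logic of such a watch process might look like the sketch below. It is a Python sketch under assumptions: it works on plain dicts shaped like the output of the `replSetGetStatus` command (members with `name`, `stateStr`, `optimeDate`) and like the replica set configuration document (members with `host`, plus a `version`). The function names are hypothetical, and fetching the status and applying the new configuration (for example with `rs.reconfig`) are left out.

```python
import copy
from datetime import timedelta


def lagging_secondaries(status: dict, max_lag: timedelta) -> list:
    """Names of SECONDARY members whose optimeDate trails the primary's by more than max_lag."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        return []  # no primary visible, so lag cannot be measured this way
    return [
        m["name"]
        for m in members
        if m["stateStr"] == "SECONDARY"
        and primary["optimeDate"] - m["optimeDate"] > max_lag
    ]


def hide_members(config: dict, hosts: list) -> dict:
    """Return a copy of the replica set config with the given hosts marked hidden.

    Hidden members must have priority 0. The version is incremented here on the
    assumption that the result is submitted as a new configuration.
    """
    new_config = copy.deepcopy(config)
    for member in new_config["members"]:
        if member["host"] in hosts:
            member["hidden"] = True
            member["priority"] = 0
    new_config["version"] += 1
    return new_config
```

A real watch process would also need the reverse step (un-hiding a member once it has caught up) and some hysteresis so a member hovering around the threshold does not flap between hidden and visible.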

Comment by Christian Ribe [ 03/Sep/11 ]

Indeed, managing replication lag needs to be added.
There is no easy way to handle lag currently.
+1

Generated at Thu Feb 08 03:02:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.