  1. Core Server
  2. SERVER-3346

MAX SLAVE LAG - Features to provide a more stable Replication Set under high load

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 1.8.1
    • Fix Version/s: None
    • Component/s: Replication
    • Environment:
      A collection of PRIMARY and SECONDARY servers, where the query load on the SECONDARY servers is high and the replication load is also very high.

      Description

      This problem exists in all current Master-Slave servers that I use today. MongoDB has the exact same issues with Replication that I see in MySQL.

      The situation is:

      1) The Primary is adding/updating data at a rate that is at or near the MAX for the hardware.
      2) A Secondary server is starting to LAG due to loads other than Replication. (Other Secondary servers may be OK at the same time.)
      3) "Read" requests are allowed on the Secondary servers to provide more read performance.

      Effect:

      The Secondary starts to LAG and does not keep up with the flow of updates and inserts from the Primary. The LAG just grows and grows.

      Solution:

      Provide ways for the drivers and applications to "back off" and reduce the impact on the Secondary so it can "catch up". For example, select a Secondary that has lower LAG.

      Ideas:

      1) Add to the slaveOk request a condition for how long the LAG is allowed to be before the request must be performed by some "other" server.
      2) Do not allow reads to be served by servers that are over a MAX_SLAVE_LAG set in each server's conf file.
      3) Push back on the Primary if a high load on ALL Secondary servers would stop replication for all known servers. This could be a Read-Only mode.
      4) Stop allowing reads to be served by servers with high Loads or LAGs.
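
      Ideas 1, 2 and 4 could be sketched on the driver side roughly as follows. This is only an illustration, not an existing MongoDB or driver API: the Member class, lag_seconds and pick_read_target are hypothetical names, and falling back to the Primary when every Secondary is too stale is an assumption about what "some other server" should mean.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    """Hypothetical view of one replica set member, as a driver might track it."""
    name: str
    is_primary: bool
    last_applied: float  # time (seconds) of the last operation this member applied


def lag_seconds(member: Member, primary: Member) -> float:
    """Replication lag: how far the member's last applied op trails the primary's."""
    return max(0.0, primary.last_applied - member.last_applied)


def pick_read_target(members: List[Member], max_lag: Optional[float]) -> Member:
    """Return the least-lagged secondary within max_lag, else the primary.

    max_lag=None means the application accepts any amount of lag.
    """
    primary = next(m for m in members if m.is_primary)
    secondaries = [m for m in members if not m.is_primary]
    eligible = [
        m for m in secondaries
        if max_lag is None or lag_seconds(m, primary) <= max_lag
    ]
    if not eligible:
        # Assumption: when every secondary is too stale, read from the primary.
        return primary
    return min(eligible, key=lambda m: lag_seconds(m, primary))


# Made-up example data: db2 is 5 s behind, db3 is 300 s behind.
members = [
    Member("db1", True, 1000.0),
    Member("db2", False, 995.0),
    Member("db3", False, 700.0),
]
target = pick_read_target(members, max_lag=60.0)
```

      With a 60 second limit the lagging server db3 is skipped, which gives it room to catch up. Sending reads back to an already busy Primary has its own cost, so an application might prefer to fail the read instead.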

      I know that the Root Cause of this issue is an overloaded cluster. What I am asking for is a nice, easy "push-back" from MongoDB and not a crash.
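
      The "push-back" in idea 3 might look something like the sketch below. It is a guess at one possible policy, not a description of any MongoDB behavior: WriteMode, primary_write_mode and the two limits are hypothetical, and the rule only reacts when ALL Secondary servers are behind, by looking at the healthiest one.

```python
from enum import Enum
from typing import Iterable


class WriteMode(Enum):
    NORMAL = "normal"
    THROTTLE = "throttle"
    READ_ONLY = "read-only"


def primary_write_mode(secondary_lags: Iterable[float],
                       soft_limit: float = 60.0,
                       hard_limit: float = 600.0) -> WriteMode:
    """Decide how hard the primary should push back on writers.

    secondary_lags holds the lag (seconds) of each secondary. The limits are
    illustrative defaults, not values taken from MongoDB.
    """
    lags = list(secondary_lags)
    if not lags:
        return WriteMode.NORMAL  # no secondaries to protect
    best = min(lags)  # lag of the healthiest secondary
    if best > hard_limit:
        return WriteMode.READ_ONLY  # every secondary is far behind
    if best > soft_limit:
        return WriteMode.THROTTLE   # every secondary is behind; slow writes down
    return WriteMode.NORMAL
```

      As long as one Secondary is keeping up, writes continue at full speed, so a single slow server (for example one running a backup) does not slow the whole set.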

      Today a MySQL Slave will also just stop replication if the OpLog runs too long/late. If the OpLog is very large, the LAG can last for hours or days.

      If the application is OK with a LAG that is huge, this needs to be allowed. But for applications that require a LAG of, say, "under 60 sec.", a MongoDB feature that helps provide that service would be great.

      The cause of the LAG may also be a Secondary server that is used for backups and was not working for a moment due to a backup request. Today all of the Secondary servers get an equal number of requests.
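
      Instead of an equal number of requests per Secondary, the share of reads could shrink as LAG grows. The sketch below is one hypothetical weighting (read_weights and its linear formula are my own illustration, not something MongoDB or its drivers provide): a server at zero lag gets full weight, and a server at or beyond max_lag gets none.

```python
from typing import Dict


def read_weights(lags: Dict[str, float], max_lag: float) -> Dict[str, float]:
    """Map each secondary to a share of read traffic that shrinks as lag grows.

    lags maps server name to lag in seconds. Shares sum to 1.0 unless every
    server is at or over max_lag, in which case all shares are 0.0.
    """
    if max_lag <= 0:
        raise ValueError("max_lag must be positive")
    # Linear falloff: weight 1.0 at no lag, 0.0 at max_lag or worse.
    raw = {name: max(0.0, 1.0 - lag / max_lag) for name, lag in lags.items()}
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in lags}
    return {name: weight / total for name, weight in raw.items()}


# Made-up example: db4 is paused for a backup and is 120 s behind.
weights = read_weights({"db2": 0.0, "db3": 30.0, "db4": 120.0}, max_lag=60.0)
```

      Here db2 receives two thirds of the reads, db3 one third, and db4 none until its backup finishes and it catches up.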

      Please call any time:
      Cell:     916-202-1600
      Skype:  EdwardMGoldberg

      Edward M. Goldberg
      http://myCloudWatcher.com/
      e.m.g.

              People

              • Votes: 17
              • Watchers: 26
