Core Server / SERVER-3346

MAX SLAVE LAG - Features to provide a more stable Replication Set under high load

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: 1.8.1
    • Fix Version/s: planned but not scheduled
    • Component/s: Replication/Pairing
    • Environment:
      A collection of PRIMARY and SECONDARY servers, where the query load on the SECONDARY servers is high and the replication load is also very high.
    • Backport:
      No
    • # Replies:
      3
    • Last comment by Customer:
      true

      Description

      This problem exists in all current Master-Slave servers that I use today. MongoDB has the exact same issues with Replication that I see in MySQL.

      The situation is:

      1) The Primary is adding/updating data at a rate that is at or near the MAX for the hardware.
      2) A Secondary server is starting to LAG due to loads other than Replication. (Other Secondary servers may be OK at the same time.)
      3) "Read" requests are allowed on the Secondary servers to provide more read performance.

      Effect:

      The Secondary starts to LAG and does not keep up with the flow of updates and inserts from the Primary. The LAG just grows and grows.

      Solution:

      Provide ways for the drivers and applications to "back off" and reduce the impact on the Secondary so it can "catch up". For example, select a Secondary that has lower LAG.

      Ideas:

      1) Add to the slaveOk request a condition on how long the LAG is allowed to be before the request must be performed by some "other" server.
      2) Do not allow requests to servers that are over a MAX_SLAVE_LAG set in each server's conf file.
      3) Push back on the Primary if a high load on ALL Secondary servers would stop replication for all known servers. This could be a Read-Only mode.
      4) Stop allowing read requests to servers with high loads or LAG.
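
A minimal sketch of ideas 1, 2, and 4 in Python, assuming the driver or application can read the output of the replSetGetStatus command (member fields name, stateStr, health, optimeDate). The function names, host names, and the 60-second limit are hypothetical and only illustrate the request; this is not an existing driver feature.

```python
from datetime import datetime, timedelta


def secondary_lags(status):
    """Return {member name: lag in seconds} for each healthy SECONDARY.

    `status` is assumed to look like the replSetGetStatus result: a dict
    with a "members" list whose entries carry "name", "stateStr",
    "health", and "optimeDate" (a datetime).
    """
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        return {}  # no primary visible, so lag cannot be measured
    lags = {}
    for m in members:
        if m["stateStr"] == "SECONDARY" and m.get("health", 1) == 1:
            delta = primary["optimeDate"] - m["optimeDate"]
            lags[m["name"]] = delta.total_seconds()
    return lags


def pick_secondary(status, max_lag_seconds):
    """Return the least-lagged secondary within the limit, or None."""
    lags = secondary_lags(status)
    eligible = {name: lag for name, lag in lags.items() if lag <= max_lag_seconds}
    if not eligible:
        return None  # caller falls back to the primary or waits
    return min(eligible, key=eligible.get)


# Hypothetical three-member set: db2 is 5 s behind, db3 is 300 s behind.
_now = datetime(2011, 6, 30, 12, 0, 0)
example_status = {
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY", "health": 1,
         "optimeDate": _now},
        {"name": "db2:27017", "stateStr": "SECONDARY", "health": 1,
         "optimeDate": _now - timedelta(seconds=5)},
        {"name": "db3:27017", "stateStr": "SECONDARY", "health": 1,
         "optimeDate": _now - timedelta(seconds=300)},
    ]
}
```

With a limit of 60 seconds, pick_secondary(example_status, 60) selects db2:27017 and skips db3:27017. When no Secondary is within the limit it returns None, which leaves the "back off" decision to the application.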

      I know that the Root Cause of this issue is an overloaded cluster. What I am asking for is a nice, easy "push-back" from MongoDB and not a crash.

      Today a MySQL Slave will also just stop replication if the OpLog runs too long/late. If the OpLog is very large, this situation can LAG for hours/days.

      If the application is OK with a LAG that is huge, this needs to be allowed. But for applications that require a LAG of, say, "under 60 sec.", a MongoDB feature that helps provide that service would be a great feature.

      The cause of the LAG may also be a Secondary server that is used for backups and was not working for a moment due to a backup request. Today all of the Secondary servers get an equal number of requests.

      Please call any time:
      Cell:     916-202-1600
      Skype:  EdwardMGoldberg

      Edward M. Goldberg
      http://myCloudWatcher.com/
      e.m.g.

          Activity

          Christian Ribe
          added a comment -

          Indeed, managing replication lag needs to be added.
          There is no easy way to handle lag currently.
          +1

          Colin Howe
          added a comment -

          This would be awesome. We are hosted on EC2 and sometimes our slaves get a little bit more behind than normal due to the hardware playing up. If that happens we want to stop the reads automatically. Currently, we're thinking of a watch process that changes the slave to 'hidden' if the lag goes over a certain value.
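
The watch-process idea could be sketched roughly as below, in Python. It assumes the replica set configuration document has the usual shape (a version number and a members list with host, hidden, and priority fields) and that per-member lag in seconds has already been measured elsewhere. The function name, the managed_hosts parameter, and the host names are hypothetical. The sketch only computes the new configuration; a real watcher would still have to submit it through replSetReconfig (rs.reconfig() in the shell).

```python
import copy


def apply_lag_policy(config, lags, max_lag_seconds, managed_hosts):
    """Return (new_config, changed) with lagging members marked hidden.

    config          replica set config dict ("version", "members")
    lags            {host: lag in seconds}; hosts without an entry are skipped
    managed_hosts   hosts this watcher may hide or unhide, so members a
                    DBA hid on purpose (e.g. a backup node) are left alone
    """
    new_config = copy.deepcopy(config)  # never mutate the caller's copy
    changed = False
    for member in new_config["members"]:
        host = member["host"]
        if host not in managed_hosts or host not in lags:
            continue
        should_hide = lags[host] > max_lag_seconds
        if member.get("hidden", False) != should_hide:
            member["hidden"] = should_hide
            if should_hide:
                # Hidden members are required to have priority 0.
                member["priority"] = 0
            changed = True
    if changed:
        # A reconfig must carry a higher version than the current config.
        new_config["version"] += 1
    return new_config, changed
```

One limitation of this sketch: when a member catches up and is unhidden, its priority stays at 0, because the original priority is not recorded anywhere. A real watcher would need to remember and restore it.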

          Juho Mäkinen
          added a comment -

          A slave can be lagging behind, for example, when it has been resurrected from backup. Currently a DBA needs to alter the configuration and mark the slave as hidden until it has caught up, a task which this feature would make obsolete.

          +1


            People

            • Votes: 13
            • Watchers: 19

              Dates

              • Created:
              • Updated:
              • Days since reply: 32 weeks, 2 days ago
              • Date of 1st Reply: