  Core Server / SERVER-27952

Replication fails under heavy load - Oplog timeout should be configurable

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 2.6.0, 3.4.2
    • Component/s: Replication
    • Labels: None
    • Operating System: ALL

      Renaming of https://jira.mongodb.org/browse/SERVER-19605

      FROM INITIAL REPORT

      We just encountered a situation where all secondaries in two of our replica sets had ceased replication and were 1-2 days behind the primary. This appears to have been caused in part by the fact that the initial oplog query from SECONDARY->PRIMARY times out after 30 seconds, while that query takes more than 5 minutes to run. Some searching led me to SERVER-6733, where the timeout was reduced from 10 minutes to 30 seconds.
      As a workaround, we are building a custom binary with an increased oplog timeout so that the initial oplog query is allowed to complete and so our secondaries have a chance to catch up.
      Ideally, this value would be configurable with a flag or configuration option to avoid the need to recompile, and to allow users to customize the timeout for their particular situation.

      FROM SUBSEQUENT REPORT

      This just bit us in one of our production replica sets. The primary was sufficiently busy, our oplog sufficiently large, and we inserted enough documents in a short enough period that scanning the oplog to respond to the secondary took not just more than 30 seconds but upwards of 200 seconds. Our secondary stopped syncing because of the receive timeouts and stayed that way until I compiled a custom mongod binary with tcp_timeout set to 600 seconds in oplogreader.h and ran that binary on the secondary, after which it happily started synchronizing with the primary again.
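      For context, the hard-coded limit being patched here is a socket-level receive timeout on the secondary's connection to its sync source: if the primary sends no data within the window, the read fails even though the oplog query is still running server-side. The sketch below illustrates that mechanism with plain POSIX sockets; it is not MongoDB's actual oplogreader.h code, and the values are assumptions taken from the reports.

          #include <cstdio>
          #include <sys/socket.h>
          #include <sys/time.h>

          int main() {
              // Stand-in for the secondary's connection to its sync source.
              int fd = socket(AF_INET, SOCK_STREAM, 0);
              if (fd < 0) { perror("socket"); return 1; }

              // If no bytes arrive within this window, recv() fails with EAGAIN/EWOULDBLOCK,
              // which is how a slow oplog query on the primary surfaces as a "timeout" on the
              // secondary even though the query is still running.
              timeval timeout{};
              timeout.tv_sec = 30;   // the reports above raise this to 600 in custom builds
              timeout.tv_usec = 0;
              if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout)) != 0) {
                  perror("setsockopt");
                  return 1;
              }
              std::printf("receive timeout set to %ld seconds\n", static_cast<long>(timeout.tv_sec));
              return 0;
          }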

      FROM SUBSEQUENT REPORT

      We have been hitting this problem continually for the last several months. Our cluster is currently 1+3 on 2.6.12, running on the biggest, baddest AWS I/O-optimized instances. Our data size is about 3 TB per cluster, ingesting about 220-250K records/min; average record size is 500 bytes (range 200-1000 bytes).
      As Jonathan mentions above, ours is also a pretty busy cluster with high ingestion rates, and our oplog size matches that. What we find is the following:
      1/ Cluster is happy and replicating with lag < 5 seconds
      2/ Sudden high ingestion happens - about 1.1-1.5x the ingestion rate of the steady happy state.
      3/ getmore requests are delayed - response times jump to above 30 seconds. This causes the secondaries to hit the read timeout and start afresh.
      4/ Starting afresh unfortunately seems to cause a complete oplog scan on the primary - and that takes on the order of minutes given the ingestion rate. This query also times out on the secondaries in 30 seconds.
      5/ Each secondary repeatedly times out and issues a new query every 30 seconds. The primary accumulates all of these queries and gets slower and slower.
      Unless ingestion is stopped entirely, there is no way to recover. We ended up with a huge number of queries pending on the primary - we had to stop the entire ingestion pipeline and kill all the long-running oplog queries before things started back up - but the cluster was 18 hours behind.
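      Rough arithmetic on why this snowballs, using the numbers above: if a full oplog scan takes several minutes (the initial report saw more than 5) and each secondary abandons its read and re-issues the query every 30 seconds, each secondary can have roughly ten of these scans in flight on the primary at once (the abandoned queries keep running until they complete or are killed, as the cleanup described above required); with three secondaries, that is on the order of 30 concurrent full oplog scans, each of which makes the next one slower.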
      Observations/Questions:
      1/ The recovery step a secondary takes when a getmore request fails seems to cause even more load on the primary. There have to be some design choices here to handle these errors more gracefully.
      2/ Full oplog scan queries kept increasing in duration in the above state (while ingestion was still happening). Is there some tuning that can be done to service these queries faster, since they seem to be critical for a secondary to get started?
      3/ Is there any reason why the oplog queries from secondary to primary don't have a query-level timeout (like maxTimeMS in 2.6.x) when the underlying connection read timeout is 30 seconds? (See the sketch below.)
      In the meantime we will follow Jonathan's approach and bump the hard-coded timeout to get more mileage.
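      On question 3, a query-level limit (maxTimeMS) bounds how long the server itself spends on a find or getMore, independent of the client's socket-level read timeout discussed above. A minimal sketch of what that looks like from a driver, here the C++ driver (mongocxx 3.x) tailing the oplog of an assumed localhost member, purely for illustration:

          #include <chrono>
          #include <iostream>
          #include <bsoncxx/builder/basic/document.hpp>
          #include <bsoncxx/json.hpp>
          #include <mongocxx/client.hpp>
          #include <mongocxx/instance.hpp>
          #include <mongocxx/options/find.hpp>
          #include <mongocxx/uri.hpp>

          int main() {
              mongocxx::instance inst{};
              // Assumed connection string; point this at a replica set member.
              mongocxx::client conn{mongocxx::uri{"mongodb://localhost:27017"}};
              auto oplog = conn["local"]["oplog.rs"];

              mongocxx::options::find opts;
              opts.cursor_type(mongocxx::cursor::type::k_tailable_await);
              // Server-side limit on the initial find (maxTimeMS), distinct from any
              // socket-level receive timeout on the connection itself.
              opts.max_time(std::chrono::milliseconds{60000});
              // Server-side limit on how long each getMore waits for new oplog entries.
              opts.max_await_time(std::chrono::milliseconds{5000});

              auto cursor = oplog.find(bsoncxx::builder::basic::make_document(), opts);
              for (auto&& entry : cursor) {
                  std::cout << bsoncxx::to_json(entry) << std::endl;
              }
              return 0;
          }

      The 3.4 log excerpt later in this ticket shows the oplog fetcher already attaching maxTimeMS: 5000 to its getMores.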

      FROM SUBSEQUENT REPORT

      A/ Our "ingestion" in to the cluster includes inserts, updates AND deletes (yes, we are working to rework this part and not do a search and delete).
      B/ When the sum of records (or documents) added/changed/deleted crosses around 370K/min, it alters the getmore responses so much so that it jumps past 30 seconds.
      C/ In the most recent event, with inserts and updates running around 150K/min, a delete operation deleted about 650K records in 132 seconds. That was enough to push getmore response over the 30 second limit.
      In the past sudden surge of inserts also have triggered similar behavior. Once this happened, with deletes continuing for another 10-15 minutes (we have a monitoring script that stops deletes if replication lag increases beyond a threshold) it was enough for oplog full scan queries to take a long time and from there everything went downhill even though all the deletes had stopped and only inserts were happening at a constant clip.
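      Rough arithmetic from the figures above: 650K deletes in 132 seconds is about 4,900 ops/sec, and 150K inserts/updates per minute is about 2,500 ops/sec, so the combined rate of roughly 7,400 ops/sec is comfortably above the ~6,200 ops/sec (370K/min) threshold noted in point B.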

      FROM SUBSEQUENT REPORT

      Shut down all failing secondaries.
      Flushed one "weak" secondary and did full resync.
      On weak secondary: Moved all data indexes to another drive. Also moved local and journal dirs to another drive.
      Kept "strong" (non-failing) secondaries running.
      printSlaveReplicationInfo() on PRIMARY shows this "weak" secondary and all strong secondaries are 0 seconds behind PRIMARY.
      Small operations work fine (del, updt, insert), hundreds or thousands, over hours, no problem.
      So, everything looks fine so far.
      Did a truncate on a collection with over 100k documents, then immediately a mass document insert (batched 100 documents at a time), probably inserting 500 to 1k documents per second on PRIMARY (a sketch of this insert pattern follows below). This is when the trouble always begins.
      For the first few seconds after the start of the mass insert (and during the course of it, which takes several minutes), strong and weak secondaries fall behind by no more than 15 seconds and sporadically catch up. However, after about 30 seconds the weak secondary begins to fall behind and rarely catches up (though we have seen it catch up a few times); the strong secondaries keep falling behind by 15 seconds or so and then catching up, over and over, and end up fully caught up after the mass insert of up to 2 million documents. CPU load on PRIMARY (32 cores) never really changes during this entire time; CPU load on the strong secondaries also never changes much; but CPU load on the weak secondary goes to 4 or higher (4 cores) and then settles down after the mass insert on PRIMARY ends, while failures continue to be logged on the weak secondary.
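      A minimal sketch of the batched mass insert described above, again with the C++ driver (mongocxx 3.x); the database and collection names, document shape, and payload size are assumptions, while the batch size and total come from the report.

          #include <string>
          #include <vector>
          #include <bsoncxx/builder/basic/document.hpp>
          #include <mongocxx/client.hpp>
          #include <mongocxx/instance.hpp>
          #include <mongocxx/uri.hpp>

          int main() {
              using bsoncxx::builder::basic::kvp;
              using bsoncxx::builder::basic::make_document;

              mongocxx::instance inst{};
              mongocxx::client conn{mongocxx::uri{"mongodb://localhost:27017"}};  // assumed primary
              auto coll = conn["test"]["mass_insert_repro"];                      // hypothetical names

              const int kTotalDocs = 2000000;  // "up to 2 million documents" per the report
              const int kBatchSize = 100;      // "batched 100 documents at a time"

              std::vector<bsoncxx::document::value> batch;
              batch.reserve(kBatchSize);
              for (int i = 0; i < kTotalDocs; ++i) {
                  // Arbitrary payload; the 500-byte size is an assumption for illustration.
                  batch.push_back(make_document(kvp("seq", i), kvp("payload", std::string(500, 'x'))));
                  if (static_cast<int>(batch.size()) == kBatchSize) {
                      coll.insert_many(batch);  // one round trip per 100 documents
                      batch.clear();
                  }
              }
              if (!batch.empty()) {
                  coll.insert_many(batch);
              }
              return 0;
          }
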
      Errors logged on the weak secondary (repeating every 10 seconds):
      [replication-2] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 13296 - target:10.11.22.33:27017 db:local expDate:2017-02-08T01:25:02.627-0500 cmd:{ getMore: 15975151664, collection: "oplog.rs", maxTimeMS: 5000, term: 7, lastKnownCommittedOpTime: { ts: Timestamp 1486464780000|35813, t: 7 } }. Last fetched optime (with hash): { ts: Timestamp 1486533979000|316, t: 7 }[-4870081738431509318]. Restarts remaining: 3

        Attachment: screenshot-1.png (37 kB, Søren Laursen)

            Assignee: Unassigned
            Reporter: Michael Brenden (michaelbrenden)
            Votes: 2
            Watchers: 10
