Core Server / SERVER-19605

Oplog timeout should be configurable

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 2.6.7, 3.0.4
    • Fix Version/s: 3.4.11, 3.6.0-rc0
    • Component/s: Replication
    • Backwards Compatibility: Fully Compatible
    • Backport Requested: v3.4
    • Sprint: Repl 2017-10-02
    • Linked BF Score: 0

      Description

      We just encountered a situation where all secondaries in two of our replica sets had stopped replicating and were 1-2 days behind the primary. This appears to have been caused in part by the initial oplog query from SECONDARY->PRIMARY timing out after 30 seconds, while the query itself takes more than 5 minutes to run. Some searching led me to SERVER-6733, where the timeout was reduced from 10 minutes to 30 seconds.

      As a workaround, we are building a custom binary with an increased oplog timeout so that the initial oplog query is allowed to complete and so our secondaries have a chance to catch up.

      Ideally, this value would be configurable with a flag or configuration option to avoid the need to recompile, and to allow users to customize the timeout for their particular situation.
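To make the request concrete, here is a minimal sketch of a timeout that defaults to the current 30 seconds but can be overridden at startup. It is an illustration only: the flag name `--oplogQueryTimeoutSecs` and the helper `resolve_timeout` are hypothetical and are not existing mongod options or server code.

```python
import argparse

# Current hard-coded value, as described in this report.
DEFAULT_OPLOG_TIMEOUT_SECS = 30


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing a single, hypothetical timeout flag."""
    parser = argparse.ArgumentParser(description="Illustrative sketch only")
    parser.add_argument(
        "--oplogQueryTimeoutSecs",  # hypothetical name, not a real mongod flag
        dest="timeout_secs",
        type=int,
        default=DEFAULT_OPLOG_TIMEOUT_SECS,
        help="seconds to wait for the initial oplog query",
    )
    return parser


def resolve_timeout(argv: list[str]) -> int:
    """Return the configured timeout, falling back to the 30-second default."""
    timeout = build_parser().parse_args(argv).timeout_secs
    if timeout <= 0:
        raise ValueError("oplog query timeout must be a positive number of seconds")
    return timeout


if __name__ == "__main__":
    print(resolve_timeout([]))                                  # default
    print(resolve_timeout(["--oplogQueryTimeoutSecs", "600"]))  # override
```

With something along these lines, a deployment like ours could raise the timeout without recompiling, while everyone else keeps the existing default.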

      We have a fairly large oplog:

      > db.printReplicationInfo()
      configured oplog size:   143477.3826171875MB
      log length start to end: 1620689secs (450.19hrs)
      oplog first event time:  Wed Jul 08 2015 23:11:24 GMT+0000 (UTC)
      oplog last event time:   Mon Jul 27 2015 17:22:53 GMT+0000 (UTC)
      now:                     Mon Jul 27 2015 17:22:53 GMT+0000 (UTC)
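For a sense of scale, the figures above can be turned into a window length and an average write rate. This is a rough sketch: it assumes the oplog is completely full, which the output does not state, and the function name `oplog_stats` is made up for illustration.

```python
# Figures copied from the db.printReplicationInfo() output above.
OPLOG_SIZE_MB = 143477.3826171875
OPLOG_WINDOW_SECS = 1620689


def oplog_stats(size_mb: float, window_secs: int) -> dict:
    """Derive window length and average write rate.

    Assumption: the oplog is full, so the configured size approximates
    the amount of data written during the window.
    """
    hours = window_secs / 3600
    return {
        "size_gb": size_mb / 1024,
        "window_hours": hours,
        "window_days": hours / 24,
        "avg_mb_per_hour": size_mb / hours,
    }


stats = oplog_stats(OPLOG_SIZE_MB, OPLOG_WINDOW_SECS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under that assumption, this works out to roughly 140 GB covering a little under 19 days.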
      

      Here are some sample queries issued by the secondaries that are timing out:

      Mon Jul 27 16:32:44.469 [conn5987144] query local.oplog.rs query: { ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368021807027379 ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 keyUpdates:0 numYields:33130 locks(micros) r:38390680 nreturned:101 reslen:25310 1361497ms
      Mon Jul 27 16:32:45.037 [conn5987146] query local.oplog.rs query: { ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368020207769978 ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 keyUpdates:0 numYields:33131 locks(micros) r:38186447 nreturned:101 reslen:25310 1362020ms
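The gap between these queries and the 30-second timeout is easier to see once the counters are pulled out of the log lines. The sketch below parses the two lines above with regular expressions written only for this log format; the helper name `parse_slow_query` is hypothetical.

```python
import re

# The two log lines quoted above, verbatim.
LOG_LINES = [
    "Mon Jul 27 16:32:44.469 [conn5987144] query local.oplog.rs query: { ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368021807027379 ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 keyUpdates:0 numYields:33130 locks(micros) r:38390680 nreturned:101 reslen:25310 1361497ms",
    "Mon Jul 27 16:32:45.037 [conn5987146] query local.oplog.rs query: { ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368020207769978 ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 keyUpdates:0 numYields:33131 locks(micros) r:38186447 nreturned:101 reslen:25310 1362020ms",
]

TIMEOUT_SECS = 30  # oplog query timeout described in this report

FIELD_RE = re.compile(r"\b(nscanned|nreturned|numYields):(\d+)")
DURATION_RE = re.compile(r"(\d+)ms$")


def parse_slow_query(line: str) -> dict:
    """Extract selected counters and the trailing duration from one log line."""
    fields = {key: int(value) for key, value in FIELD_RE.findall(line)}
    match = DURATION_RE.search(line.strip())
    if match is None:
        raise ValueError("log line does not end with a duration in ms")
    fields["duration_secs"] = int(match.group(1)) / 1000
    return fields


for line in LOG_LINES:
    q = parse_slow_query(line)
    print(
        f"scanned {q['nscanned']} docs, returned {q['nreturned']}, "
        f"took {q['duration_secs'] / 60:.1f} min "
        f"({q['duration_secs'] / TIMEOUT_SECS:.0f}x the timeout)"
    )
```

Both queries scan about 4.2 million oplog entries to return the first batch of 101 and run for over 22 minutes each, far beyond the 30-second limit.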
      

    People

    • Votes: 9
    • Watchers: 36
