  1. Core Server
  2. SERVER-28005

Oplog query network timeout is less than the maxTimeMs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.13, 3.4.3, 3.5.4
    • Component/s: Replication
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v3.4, v3.2
    • Sprint:
      Repl 2017-02-13, Repl 2017-03-06

      Description

      Currently, the initial find for the GTE query on the oplog has a 60-second maxTimeMS, and the subsequent getMores have a maxTimeMS equal to half the election timeout. Both the find and the getMores, however, also have a timeout from the networking subsystem equal to the election timeout. Because the default election timeout is 10 seconds, an initial find that takes more than 10 seconds to locate the common point in the oplog and return the first batch will time out at the network layer rather than being allowed the full 60 seconds of its maxTimeMS.

      This can make it hard for nodes with high replication lag to catch up: if the common point is far back in the oplog, the initial find could consistently take more than 10 seconds, leaving the node unable to start replicating.
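
      The interaction between the two limits can be sketched as follows. This is only an illustration of the arithmetic in the description, not the server's code: the class and function names are hypothetical, and it assumes that whichever limit expires first ends the command.

```python
from dataclasses import dataclass

# Values taken from the description above.
DEFAULT_ELECTION_TIMEOUT_MS = 10_000  # default election timeout
INITIAL_FIND_MAX_TIME_MS = 60_000     # maxTimeMS on the initial oplog find


@dataclass(frozen=True)
class OplogCommandTimeouts:
    """Hypothetical container for the two limits applied to one command."""
    max_time_ms: int         # limit attached to the command itself
    network_timeout_ms: int  # limit imposed by the networking subsystem

    @property
    def effective_ms(self) -> int:
        # Assumption: the command ends when the first of the two limits expires.
        return min(self.max_time_ms, self.network_timeout_ms)


def timeouts_as_described(election_timeout_ms: int = DEFAULT_ELECTION_TIMEOUT_MS):
    """Return the (find, getMore) limits as the description states them."""
    find = OplogCommandTimeouts(INITIAL_FIND_MAX_TIME_MS, election_timeout_ms)
    get_more = OplogCommandTimeouts(election_timeout_ms // 2, election_timeout_ms)
    return find, get_more


if __name__ == "__main__":
    find, get_more = timeouts_as_described()
    print("find effective limit (ms):", find.effective_ms)
    print("getMore effective limit (ms):", get_more.effective_ms)
```

      With the default election timeout, the getMore is bounded by its own maxTimeMS, while the initial find is bounded by the network timeout, so most of its 60-second maxTimeMS can never be used.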
