Core Server / SERVER-15799

Replication threads stuck in infinite connection loop when there is not enough remote oplog

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor - P4
    • None
    • Affects Version/s: 2.6.5
    • Component/s: Replication
    • Labels: None
    • ALL

      One of the secondary members of our replica set had to be taken offline for a prolonged period. Unfortunately, the oplog on the primary did not cover enough time for a proper recovery. Instead of logging the usual error (oplog too short to recover), the secondary started opening connections to the primary at a rate of about 300 per second until it exhausted its local ports (about 30K connections to the same remote port were hanging in TIME_WAIT). It then started losing heartbeats because it could not connect to the primary at all, and filled the log with these messages:

      2014-10-24T17:07:51.063+0400 [rsBackgroundSync] warning: Failed to connect to 10.3.1.12:27032, reason: errno:99 Cannot assign requested address
      2014-10-24T17:07:51.064+0400 [rsBackgroundSync] repl: couldn't connect to server d1.s2.fs-temp.drive.bru:27032 (10.3.1.12), connection attempt failed
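
      The TIME_WAIT pile-up described above can be confirmed on the affected host by counting sockets per remote endpoint. The following is a minimal sketch, not part of the original report: it assumes Linux `netstat -tan`-style text output (columns: proto, recv-q, send-q, local address, foreign address, state), and the function name `count_time_wait`, the sample text, and the local address 10.3.1.13 are hypothetical.

```python
from collections import Counter


def count_time_wait(netstat_output: str) -> Counter:
    """Count TIME_WAIT sockets per remote endpoint.

    Assumes `netstat -tan`-style lines with the columns:
    proto, recv-q, send-q, local address, foreign address, state.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP socket line.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] == "TIME_WAIT":
            counts[fields[4]] += 1  # key is "remote_ip:remote_port"
    return counts


# Hypothetical sample input; only the remote endpoint comes from the log above.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.3.1.13:51001         10.3.1.12:27032         TIME_WAIT
tcp        0      0 10.3.1.13:51002         10.3.1.12:27032         TIME_WAIT
tcp        0      0 10.3.1.13:27032         10.3.1.14:40000         ESTABLISHED
"""

if __name__ == "__main__":
    for remote, n in count_time_wait(SAMPLE).most_common():
        print(remote, n)
```

      On a live system the input would be the captured output of `netstat -tan`. A count for the primary's address that approaches the size of the local ephemeral port range would be consistent with the errno:99 failures shown in the log.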

        1. primary.log.bz2 (42.27 MB)
        2. secondary.log.bz2 (310 kB)

            Assignee: Matt Dannenberg (matt.dannenberg)
            Reporter: Aristarkh Zagorodnikov (onyxmaster)
            Votes: 0
            Watchers: 6
