Core Server / SERVER-15799

Replication threads stuck in infinite connection loop when there is not enough remote oplog

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor - P4
    • Fix Version/s: None
    • Affects Version/s: 2.6.5
    • Component/s: Replication
    • Labels: None
    • Operating System: ALL

    Description

      One of the secondary members of our replica set had to be taken offline for a prolonged period. Unfortunately, the oplog on the primary did not cover enough time for the secondary to catch up. Instead of logging the usual error about the oplog being too short to recover, the secondary began opening connections to the primary at a rate of about 300 per second until it exhausted its local ports (about 30K connections to the same remote port, lingering in TIME_WAIT). It then started losing heartbeats because it could not connect to the primary at all, and filled the log with these messages:

      2014-10-24T17:07:51.063+0400 [rsBackgroundSync] warning: Failed to connect to 10.3.1.12:27032, reason: errno:99 Cannot assign requested address
      2014-10-24T17:07:51.064+0400 [rsBackgroundSync] repl: couldn't connect to server d1.s2.fs-temp.drive.bru:27032 (10.3.1.12), connection attempt failed
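
      The port exhaustion described above could be confirmed by counting sockets per TCP state toward the primary's port. The following is a minimal sketch, not part of the original report: it assumes Linux `ss -tan` output (columns State, Recv-Q, Send-Q, local address, peer address), the function name `count_states` is hypothetical, and the local addresses in the sample are made up for illustration.

```python
from collections import Counter


def count_states(ss_output, peer):
    """Count TCP socket states for connections whose peer address equals `peer`.

    `ss_output` is text in the format printed by `ss -tan` (an assumption).
    """
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] == "State":
            continue  # skip the header row and malformed lines
        if fields[4] == peer:
            counts[fields[0]] += 1
    return counts


# Illustrative sample only; the local addresses and ports are invented.
sample = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
TIME-WAIT  0      0      10.3.1.11:40001     10.3.1.12:27032
TIME-WAIT  0      0      10.3.1.11:40002     10.3.1.12:27032
ESTAB      0      0      10.3.1.11:40003     10.3.1.12:27032
ESTAB      0      0      10.3.1.11:22        10.3.1.50:51515
"""

states = count_states(sample, "10.3.1.12:27032")
print(dict(states))
```

      On a live host the input would come from running `ss -tan` on the secondary. A TIME-WAIT count close to the size of the local ephemeral port range would be consistent with the errno:99 failures shown above.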

      Attachments

        1. primary.log.bz2
          42.27 MB
        2. secondary.log.bz2
          310 kB

          People

            Assignee: Matt Dannenberg (matt.dannenberg)
            Reporter: Aristarkh Zagorodnikov (onyxmaster)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: