  1. Core Server
  2. SERVER-16818

Add socket timeout to isSelf replication check


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor - P4
    • Resolution: Fixed
    • Affects Version/s: 2.6.6, 2.8.0-rc4
    • Fix Version/s: 3.0.0-rc6
    • Component/s: Replication
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Steps To Reproduce:
      1. Spin up a 2 node replica set
      2. Send SIGSTOP to one node
      3. Make sure the other one steps down to SECONDARY
      4. rs.status() works and should show 1 SECONDARY, 1 "(not reachable/healthy)"
      5. Shut down the node in SECONDARY and then restart the process
      6. Try to issue rs.status(); output is

        > rs.status()
        {
        	"startupStatus" : 1,
        	"ok" : 0,
        	"errmsg" : "loading local.system.replset config (LOADINGCONFIG)"
        }

      7. The socket seems to never time out (3 hours and counting)

      Description

      When a mongod starts with --replSet and finds a config in local.system.replset, it tries to establish connections to the other replica set members. These initial connection attempts do not appear to time out, so the node can hang indefinitely waiting for a response from a down replica set member.

      By contrast, when an existing, running replica set member discovers a new member (via rs.add) but the new member is unreachable, the existing member does time out the connection attempt. This ticket requests that the initial connection attempts be timed out in the same way.
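
      To illustrate the requested behavior, here is a minimal Python sketch of a peer check in which every blocking socket operation is bounded by a timeout. It is not the server's implementation (mongod is written in C++); the function name probe_peer, the payload, and the 10-second default are hypothetical and chosen only for illustration. A process stopped with SIGSTOP can typically still have its TCP connections accepted by the kernel, so the sketch bounds the wait for a reply as well as the connect.

```python
import socket


def probe_peer(host, port, payload, timeout_secs=10.0):
    """Send payload to (host, port) and wait for a reply.

    Hypothetical helper, not part of any MongoDB API. Each blocking
    operation (connect, send, recv) is individually bounded by
    timeout_secs. Returns the reply bytes, or None if the peer
    refused the connection, closed it, or did not answer in time.
    """
    try:
        # create_connection applies the timeout to the socket before
        # connecting, so it also covers the later sendall and recv.
        with socket.create_connection((host, port), timeout=timeout_secs) as sock:
            sock.sendall(payload)
            reply = sock.recv(4096)
            return reply if reply else None
    except OSError:
        # Covers socket.timeout, ConnectionRefusedError, and similar errors.
        return None
```

      Without the timeout argument, the recv call would block for as long as the peer stays silent, which matches the hang described in this ticket.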

      In the repro given, this node is in SECONDARY before the mongod is restarted. It should be able to return to SECONDARY after the restart.

      Note: Adding a third node fixes this problem; it seems we only need to contact a majority of members for the config load to succeed.
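
      The majority hypothesis in the note can be worked through with a small Python sketch. This only models the reporter's guess and is not taken from the server code; the function names are hypothetical, and counting the restarting node itself as reachable is an assumption.

```python
def majority(n_members):
    """Smallest number of members that forms a strict majority."""
    return n_members // 2 + 1


def config_load_would_succeed(n_members, n_unreachable):
    """Model of the hypothesis: loading succeeds once a majority is contacted.

    Assumption: the restarting node counts itself as reachable, so
    n_unreachable refers only to other members.
    """
    reachable = n_members - n_unreachable
    return reachable >= majority(n_members)


# Two-node set with one stopped node: 1 reachable, majority is 2.
print(config_load_would_succeed(2, 1))  # False
# Three-node set with one stopped node: 2 reachable, majority is 2.
print(config_load_would_succeed(3, 1))  # True
```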

              People

              Assignee:
              scotthernandez Scott Hernandez
              Reporter:
              joanna.cheng Joanna Cheng