  1. Core Server
  2. SERVER-62699

Replica set fails to restart after shutdown of all Nodes in a Dynamic DNS/network environment

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: 4.4.14, 4.4.7
    • Component/s: None
    • Cluster Scalability
    • ALL
    • Steps to Reproduce:
      1. Stop all nodes in a MongoDB replica set.
      2. Remove the DNS entries for the MongoDB nodes.
      3. Start all MongoDB nodes in the replica set.
      4. The nodes fail the isSelf test and enter the REMOVED state.
      5. Add the relevant DNS entries back to your network.
      6. The MongoDB nodes remain in the REMOVED state and do not retry the isSelf test.

      We recently moved from a single-node setup to a MongoDB cluster setup.

      We have a dynamic network setup in which DNS entries become active about 30 seconds after a service (MongoDB node) is started. Each service/node is allocated a unique IP address after every start or restart.

      Let's say my MongoDB setup has 3 nodes: rs1, rs2, and rs3. Here is what happens:

      1. When I shut down rs1, rs2, and rs3 together, the DNS records for them are removed.
      2. After restart, each MongoDB node runs an isSelf test to find its own hostname. The isSelf test is simply the MongoDB node connecting to each member in the replica set config to find itself.
      3. Because of the DNS delay described above, the isSelf test fails and the node enters the REMOVED state.

      {"t":{"$date":"2022-01-18T15:57:16.513+05:30"},"s":"I", "c":"NETWORK", "id":4834700, "ctx":"ReplCoord-0","msg":"isSelf could not connect via connectSocketOnly","attr":{"hostAndPort":"rs1.example.com:12000","error":{"code":6,"codeName":"HostUnreachable","errmsg":"couldn't connect to server rs1.example.com:12000, connection attempt failed: HostNotFound: Could not find address for rs1.example.com:12000: SocketException: Host not found (authoritative)"}}}
      {"t":{"$date":"2022-01-18T15:57:16.513+05:30"},"s":"I", "c":"REPL", "id":21394, "ctx":"ReplCoord-0","msg":"This node is not a member of the config"}
      {"t":{"$date":"2022-01-18T15:57:16.513+05:30"},"s":"I", "c":"REPL", "id":21358, "ctx":"ReplCoord-0","msg":"Replica set state transition","attr":{"newState":"REMOVED","oldState":"STARTUP"}}
      

      4. Once a node enters the REMOVED state, it does not try to find the other nodes or initiate cluster formation/election. As a result, MongoDB cluster formation fails after every restart of all nodes.
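
      The transition shown in the log excerpt above can be detected mechanically. Below is a minimal sketch in Python, assuming the structured JSON log format seen in this ticket (one JSON object per line) and using the log id 21358 from the excerpt to identify state-transition messages. The function name `find_removed_transitions` is hypothetical and not part of any MongoDB tooling.

```python
import json

# Log id of the "Replica set state transition" message, taken from the excerpt above.
STATE_TRANSITION_ID = 21358


def find_removed_transitions(lines):
    """Return (timestamp, old_state) for every logged transition into REMOVED."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured log entries
        if not isinstance(entry, dict) or entry.get("id") != STATE_TRANSITION_ID:
            continue
        attr = entry.get("attr", {})
        if attr.get("newState") == "REMOVED":
            hits.append((entry["t"]["$date"], attr.get("oldState")))
    return hits


# The two REPL lines from the excerpt above, used as sample input.
sample = [
    '{"t":{"$date":"2022-01-18T15:57:16.513+05:30"},"s":"I", "c":"REPL", "id":21394, "ctx":"ReplCoord-0","msg":"This node is not a member of the config"}',
    '{"t":{"$date":"2022-01-18T15:57:16.513+05:30"},"s":"I", "c":"REPL", "id":21358, "ctx":"ReplCoord-0","msg":"Replica set state transition","attr":{"newState":"REMOVED","oldState":"STARTUP"}}',
]
print(find_removed_transitions(sample))
```

      Running this over each node's log after a full restart would show which nodes went from STARTUP straight to REMOVED.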

      I think dynamic DNS environments like ours exist in other organizations as well. Wouldn't it be better to keep retrying the isSelf test even after the node enters the REMOVED state?
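
      The retry behavior suggested here could look roughly like the following. This is only an illustration of the proposal, not MongoDB's implementation: `find_self`, `retry_find_self`, and the `is_self` callback are hypothetical names, and the real isSelf check is replaced by an injected function so the sketch runs without a server or DNS.

```python
import time


def find_self(members, is_self):
    """Return the first member for which is_self(member) is true, else None."""
    for member in members:
        try:
            if is_self(member):
                return member
        except OSError:
            # e.g. a DNS lookup failure; treat the member as unreachable for now
            continue
    return None


def retry_find_self(members, is_self, attempts=5, delay_seconds=10.0, sleep=time.sleep):
    """Repeat the self-lookup instead of giving up after the first failure."""
    for attempt in range(1, attempts + 1):
        me = find_self(members, is_self)
        if me is not None:
            return me, attempt
        if attempt < attempts:
            sleep(delay_seconds)
    return None, attempts  # still not found: the node would stay REMOVED


# Simulated environment (assumption): the lookup for rs1 fails twice, then succeeds.
calls = {"n": 0}


def fake_is_self(member):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("Host not found")
    return member == "rs1.example.com:12000"


members = ["rs1.example.com:12000"]
# sleep is stubbed out so the example finishes immediately
result = retry_find_self(members, fake_is_self, sleep=lambda s: None)
print(result)
```

      With DNS entries becoming active about 30 seconds after start, a handful of retries spaced a few seconds apart would be enough in this sketch for the node to find itself.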

       

        1. mongo_replicaset_error_reproduce_steps.txt
          11 kB
        2. data.zip
          1.94 MB
        3. logs.zip
          69 kB

            Assignee:
            garaudy.etienne@mongodb.com Garaudy Etienne
            Reporter:
            g.sravan4u@gmail.com Sravan _
            Votes:
            1
            Watchers:
            10

              Created:
              Updated: