[SERVER-31580] Replica Set configuration Created: 16/Oct/17  Updated: 24/Feb/21  Resolved: 27/Oct/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Sungbeom Cho Assignee: Mark Agarunov
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

[Server log]

2017-10-16T06:30:30.988100711Z 2017-10-16T06:30:30.987+0000 I REPL     [conn1] replSetReconfig admin command received from client
2017-10-16T06:31:01.151444478Z 2017-10-16T06:31:01.151+0000 I NETWORK  [conn1] Socket recv() timeout  X.X.X.X:27017
2017-10-16T06:31:01.152286316Z 2017-10-16T06:31:01.151+0000 I NETWORK  [conn1] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [X.X.X.X:27017]
2017-10-16T06:31:01.152313290Z 2017-10-16T06:31:01.151+0000 I NETWORK  [conn1] can't authenticate to node2:27017 (X.X.X.X) failed as internal user, error: network error while attempting to run command 'saslStart' on host 'node2:27017'
2017-10-16T06:31:01.152324083Z 2017-10-16T06:31:01.151+0000 I REPL     [conn1] replSetReconfig config object with 2 members parses ok
2017-10-16T06:31:01.152332099Z 2017-10-16T06:31:01.151+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to node2:27017
2017-10-16T06:31:11.152173626Z 2017-10-16T06:31:11.151+0000 W REPL     [ReplicationExecutor] Failed to complete heartbeat request to node2:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2017-10-16T06:31:11.152294763Z 2017-10-16T06:31:11.152+0000 E REPL     [conn1] replSetReconfig failed; NodeNotFound: Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: node1:27017; the following nodes did not respond affirmatively: node2:27017 failed with Couldn't get a connection within the time limit
2017-10-16T06:31:11.153480974Z 2017-10-16T06:31:11.152+0000 I COMMAND  [conn1] command admin.$cmd appName: "MongoDB Shell" command: replSetReconfig { replSetReconfig: { _id: "my_replica", version: 2, protocolVersion: 1, members: [ { _id: 0, host: "node1:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1.0, host: "node2:27017" } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 2000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('59e446098e9cfb134480aa4f') } } } numYields:0 reslen:330 locks:{ Global: { acquireCount: { r: 1, W: 1 } } } protocol:op_command 40164ms
2017-10-16T06:31:21.152196585Z 2017-10-16T06:31:21.151+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to node2:27017 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 66 -- target:node2:27017 db:admin cmd:{ isMaster: 1 }
2017-10-16T06:31:21.152260029Z 2017-10-16T06:31:21.151+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to node2:27017 due to failed operation on a connection
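
The timestamps in the log above suggest where the 40164ms reported for the replSetReconfig command went. The following is a minimal sketch in plain JavaScript that subtracts the logged timestamps; the event labels are informal descriptions and the "+0000" offset is written as "Z". It appears that about 30 seconds pass before the socket recv() timeout, followed by 10 seconds that match the heartbeatTimeoutSecs: 10 value in the logged configuration.

```javascript
// Timestamps copied from the server log above ("+0000" written as "Z").
// The labels are informal descriptions, not MongoDB terminology.
const events = [
  ["replSetReconfig received", "2017-10-16T06:30:30.987Z"],
  ["socket recv() timeout talking to node2", "2017-10-16T06:31:01.151Z"],
  ["heartbeat request to node2 failed", "2017-10-16T06:31:11.151Z"],
  ["replSetReconfig failed (quorum check)", "2017-10-16T06:31:11.152Z"],
];

// Milliseconds elapsed between each consecutive pair of events.
function gaps(evts) {
  const out = [];
  for (let i = 1; i < evts.length; i++) {
    out.push({
      step: evts[i][0],
      ms: Date.parse(evts[i][1]) - Date.parse(evts[i - 1][1]),
    });
  }
  return out;
}

const phases = gaps(events);
const totalMs = phases.reduce((sum, p) => sum + p.ms, 0);

for (const p of phases) {
  console.log(`${p.ms} ms until: ${p.step}`);
}
console.log(`total: ${totalMs} ms`);
```

The total computed from the log lines is within a millisecond or two of the 40164ms that the COMMAND line reports, so the two waits account for essentially all of the command's duration.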

[Message]

my_replica:PRIMARY> rs.add("node2:27017")
{
        "ok" : 0,
        "errmsg" : "Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: node1:27017; the following nodes did not respond affirmatively: node2:27017 failed with Couldn't get a connection within the time limit",
        "code" : 74,
        "codeName" : "NodeNotFound"
}
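
The "required 2 but only the following 1 voting nodes responded" part of the error can be read as a majority check over the voting members of the proposed configuration. The sketch below assumes the required count is a strict majority of voting members, which is consistent with the numbers in the message; majority and quorumReached are hypothetical helper names, not MongoDB functions.

```javascript
// Smallest number of responses that forms a strict majority
// of the voting members (assumed rule, see the note above).
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// True when enough voting members responded to satisfy the majority.
function quorumReached(votingMembers, responded) {
  return responded >= majority(votingMembers);
}

// The proposed config has 2 voting members (node1 and node2),
// and only node1 responded.
console.log(majority(2));
console.log(quorumReached(2, 1));
```

With two voting members the majority is also two, so the reconfig cannot succeed unless node2 answers the heartbeat within the time limit.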



 Comments   
Comment by Sumit Kandoi [ 24/Feb/21 ]

Hello Mark,

In the comments above you mentioned that increasing the value of heartbeatTimeoutSecs can resolve this time limit issue.

I understand that the default value of heartbeatTimeoutSecs is 10 seconds. If I increase it to, say, 15 seconds, what would be the side effects and benefits of the new value?

Comment by Sungbeom Cho [ 23/Oct/17 ]

Hi Mark,
Thank you for your help!
My problem has been solved now after I changed that parameter.
You can close this ticket.

Thanks,
Sungbeom

Comment by Mark Agarunov [ 20/Oct/17 ]

Hello sungbeom,

Thank you for the additional information. You can adjust the heartbeat timeout by changing the heartbeatTimeoutSecs parameter in the replica set configuration. If the problem is still present after making this change, please provide the complete logs and we can continue investigating.

Thanks,
Mark
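
For readers looking for the concrete steps, here is a minimal sketch of the change described in the comment above. The ticket does not say which value the reporter used, so the 30 below is an arbitrary example, and withHeartbeatTimeout is a hypothetical helper written only for illustration. The sample configuration is abbreviated from the one in the server log.

```javascript
// Return a copy of a replica set config with a new heartbeat timeout.
// `withHeartbeatTimeout` is a hypothetical helper, not a MongoDB API.
function withHeartbeatTimeout(cfg, secs) {
  if (!Number.isInteger(secs) || secs <= 0) {
    throw new RangeError("heartbeatTimeoutSecs must be a positive integer");
  }
  return {
    ...cfg,
    settings: { ...(cfg.settings || {}), heartbeatTimeoutSecs: secs },
  };
}

// Abbreviated from the configuration shown in the server log.
const currentCfg = {
  _id: "my_replica",
  version: 1,
  members: [{ _id: 0, host: "node1:27017" }],
  settings: { heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10 },
};

const newCfg = withHeartbeatTimeout(currentCfg, 30); // 30 is an assumed value
console.log(newCfg.settings);

// In the mongo shell, connected to the primary, the equivalent would be:
//   cfg = rs.conf()
//   cfg.settings.heartbeatTimeoutSecs = 30
//   rs.reconfig(cfg)
//   rs.add("node2:27017")
```

The helper copies the configuration rather than mutating it so the original stays available for comparison; in the shell, editing the object returned by rs.conf() and passing it to rs.reconfig() is the usual pattern.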

Comment by Sungbeom Cho [ 17/Oct/17 ]

Hi Mark,
Thank you for your help.

1. When trying to connect to node2, is the connection attempt logged on node2?
-> No. I checked the log on node2 while node1 (the primary) was trying to connect to it, but nothing was logged on node2.

2. Are you able to connect to node2 from the machine the primary node is on from the command line? (e.g. mongo --host node2 --port 27017)
-> Yes, it worked well.

3. Could you please provide the complete logs from all affected mongod nodes?
-> I have already reset the servers, so I will provide the logs after restoring the DB server.

My guess is that this problem occurs because of the time limit. Could you let me know how to increase it?
I assume the setting belongs in the replica set configuration.

Thanks,
Sungbeom.

Comment by Mark Agarunov [ 17/Oct/17 ]

Hello sungbeom,

Thank you for the report. To get a better idea of what may be causing this issue, I'd like to ask a few questions.

  • When trying to connect to node2, is the connection attempt logged on node2?
  • Are you able to connect to node2 from the machine the primary node is on from the command line? (e.g. mongo --host node2 --port 27017)
  • Could you please provide the complete logs from all affected mongod nodes?

Thanks,
Mark

Comment by Sungbeom Cho [ 16/Oct/17 ]

Hi all,
I tried to create a replica set, but it failed.
I guess that is because of the distance between the primary (in the US) and the secondary (in Korea).
But I need to know the exact cause in order to fix this.

This is the log message I get when I run rs.add("node2") on the primary.
