[SERVER-16818] Add socket timeout to isSelf replication check Created: 13/Jan/15  Updated: 23/Jan/15  Resolved: 15/Jan/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.6, 2.8.0-rc4
Fix Version/s: 3.0.0-rc6

Type: Bug Priority: Minor - P4
Reporter: Joanna Cheng Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-16824 Run isSelf concurrently for all members Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:
  1. Spin up a 2 node replica set
  2. Send SIGSTOP to one node
  3. Make sure the other one steps down to SECONDARY
  4. rs.status works and should show 1 SECONDARY, 1 "(not reachable/healthy)"
  5. Shut down the node in SECONDARY and then restart the process
  6. Try to issue rs.status(); output is

    > rs.status()
    {
    	"startupStatus" : 1,
    	"ok" : 0,
    	"errmsg" : "loading local.system.replset config (LOADINGCONFIG)"
    }

  7. The socket seems to never time out (3 hours and counting)
Participants:

 Description   

When a mongod starts with --replSet and finds a config in local.system.replset, it tries to establish connections to the other replica set members. These initial connection attempts do not appear to have a timeout, so the node can hang indefinitely waiting for a response from a member that is down or unresponsive.

By contrast, when a running replica set member discovers a new member (via rs.add) that turns out to be unreachable, the running member times out the connection attempt. This ticket requests that the initial connection attempts at startup time out in the same way.
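The requested behavior can be illustrated with a minimal Python sketch. This is not the server's isSelf code (which is C++); probe_member, its "ping" payload, and the 5 second default are hypothetical names and values chosen for illustration. It shows a connection attempt in which both the connect and the read are bounded, so a peer that accepts the TCP connection but never answers (as a SIGSTOPped process does) produces a failure instead of a hang.

```python
import socket


def probe_member(host, port, timeout_secs=5.0):
    """Return True if the peer sends any data within timeout_secs, else False.

    Hypothetical helper for illustration only; not a MongoDB API.
    """
    try:
        # The timeout bounds the connect() call itself.
        with socket.create_connection((host, port), timeout=timeout_secs) as sock:
            # Make the read deadline explicit as well: a stopped process can
            # still complete the TCP handshake through the kernel backlog.
            sock.settimeout(timeout_secs)
            sock.sendall(b"ping\n")
            return bool(sock.recv(1))
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts
        # (socket.timeout is a subclass of OSError).
        return False
```

Without the settimeout-style deadline on the read, the recv call would block for as long as the peer stays silent, which matches the symptom reported here.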

In the repro given, this node is in SECONDARY before the mongod is restarted. It should be able to return to SECONDARY after the restart.

Note: Adding a third node fixes this problem; it seems that only a majority of members need to be contacted for the config load to succeed.
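The arithmetic behind that note can be sketched as follows, assuming (as the reporter guesses, not as this ticket confirms) that the config load needs a strict majority of members to respond and that the restarting node counts itself. The function name has_config_majority is hypothetical.

```python
def has_config_majority(reachable_members, total_members):
    """Return True if reachable_members is a strict majority of total_members.

    Assumption for illustration: the restarting node counts itself as reachable.
    """
    majority = total_members // 2 + 1
    return reachable_members >= majority


# 2-node set, one node stopped: only the restarting node is reachable.
print(has_config_majority(1, 2))
# 3-node set, one node stopped: the restarting node plus one healthy peer.
print(has_config_majority(2, 3))
```

Under this assumption a two-member set needs both members, so a single stopped peer blocks the load, while a three-member set still reaches the threshold of two.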



 Comments   
Comment by Githook User [ 15/Jan/15 ]

Author: Scott Hernandez (scotthernandez) <scotthernandez@gmail.com>

Message: SERVER-16818: Add socket timeout to isSelf replication check
Branch: master
https://github.com/mongodb/mongo/commit/d28e5b9190ec12a6c70bee6d47eec605ea394862

Comment by Scott Hernandez (Inactive) [ 13/Jan/15 ]

This "repro" seems to be for the shell, not the server behavior. The shell does not have a timeout and it is expected to wait for the system to error or return data for the connection and reads. If that is what you want changed/improved then please open a new issue for the shell and remove that stuff from this issue.

Comment by Joanna Cheng [ 13/Jan/15 ]

Not reproducible in 2.4.12; the node comes back as SECONDARY

In 2.8.0-rc4 my mongo shell just hangs when trying to connect to the restarted node

$ mongo
MongoDB shell version: 2.8.0-rc4
connecting to: test
 

Verbose logs show we're getting stuck on isMaster

2015-01-13T17:31:04.625+1100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:51117 #1 (1 connection now open)
2015-01-13T17:31:04.626+1100 D COMMAND  [conn1] run command admin.$cmd { whatsmyuri: 1 }
2015-01-13T17:31:04.629+1100 I QUERY    [conn1] command admin.$cmd command: whatsmyuri { whatsmyuri: 1 } ntoreturn:1 keyUpdates:0 numYields:0  reslen:62 0ms
2015-01-13T17:31:04.639+1100 D COMMAND  [conn1] run command admin.$cmd { getLog: "startupWarnings" }
2015-01-13T17:31:04.640+1100 D COMMAND  [conn1] command: { getLog: "startupWarnings" }
2015-01-13T17:31:04.641+1100 I QUERY    [conn1] command admin.$cmd command: getLog { getLog: "startupWarnings" } keyUpdates:0 numYields:0  reslen:70 0ms
2015-01-13T17:31:04.646+1100 D COMMAND  [conn1] run command admin.$cmd { replSetGetStatus: 1.0, forShell: 1.0 }
2015-01-13T17:31:04.647+1100 D COMMAND  [conn1] command: { replSetGetStatus: 1.0, forShell: 1.0 }
2015-01-13T17:31:04.648+1100 I QUERY    [conn1] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0  reslen:154 0ms
2015-01-13T17:31:04.649+1100 D COMMAND  [conn1] run command test.$cmd { isMaster: 1.0, forShell: 1.0 }

$ mongo --eval "printjson(db.getSiblingDB('admin').runCommand({ replSetGetStatus: 1.0, forShell: 1.0 }))"
MongoDB shell version: 2.8.0-rc4
connecting to: test
{
	"info" : "run rs.initiate(...) if not yet done for the set",
	"ok" : 0,
	"errmsg" : "no replset config has been received",
	"code" : 94
}
 
$ mongo --eval "db.getSiblingDB('admin').runCommand( {isMaster: 1.0, forShell: 1.0 })"
MongoDB shell version: 2.8.0-rc4
connecting to: test
^C
do you want to kill the current op(s) on the server? (y/n): y
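Since the shell itself has no timeout (see the 13/Jan/15 comment above), a caller scripting these checks may want to bound the wait from the outside instead of pressing ^C. The following is a minimal sketch using Python's standard subprocess module; run_with_deadline is a hypothetical helper, and the 10 second deadline in the usage line is an arbitrary choice.

```python
import subprocess


def run_with_deadline(argv, deadline_secs):
    """Run argv and return its stdout, or None if it exceeds deadline_secs.

    Hypothetical wrapper for illustration; subprocess.run kills the child
    process when the timeout expires.
    """
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=deadline_secs
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout
```

For example, run_with_deadline(["mongo", "--eval", "printjson(db.isMaster())"], 10) would return None rather than blocking if the isMaster call hangs as shown above. This only guards the client; it does not change the server-side behavior this ticket is about.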

Generated at Thu Feb 08 03:42:22 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.