Details
Description
ReplicaSetMonitor::getHostOrRefresh will repeatably attempt to refresh the replica set connections until it finds a host that responds with a match to its ReadPreferenceSetting. It does this in a while loop, until timeout is reached. The logic ordering in the while loop is such that it will fail to find a matching host and then check for a timeout before searching again.
The problem is that it goes through the connections serially, and hangs for 5 seconds on a connection that doesn't respond.
The scenario is which this is an issue:
- Command is issued with read pref mode "primary"
- Config primary (20267) goes down
- getHostOrRefresh is called, but can't get host matching primary, keeps retrying the connections...
.... keeps retrying for a time, and then: - getHostOrRefresh checks first host 20266: not primary
- Config 20266 becomes primary.
- getHostOrRefresh checks next host 20267, gets stuck for 5 seconds because 20267 is still down.
- getHostOrRefresh checks next host 20268, not primary
- getHostOrRefresh checks for timeout, and exits,
- Command returns error/failure.
Command fails when there has been a config primary up and working for the past 5 seconds. Not desirable.
Relevant code in ReplicaSetMonitor::getHostOrRefresh:
...
|
while (true) {
|
// We might not have found any matching hosts due to the scan, which just completed may have
|
// seen stale data from before we joined. Therefore we should participate in a new scan to
|
// make sure all hosts are contacted at least once (possibly by other threads) before this
|
// function gives up.
|
Refresher refresher(startOrContinueRefresh());
|
|
|
HostAndPort out = refresher.refreshUntilMatches(criteria);
|
if (!out.empty())
|
return out;
|
|
|
if (!isSetUsable()) {
|
return Status(ErrorCodes::ReplicaSetNotFound,
|
str::stream() << "None of the hosts for replica set " << getName()
|
<< " could be contacted.");
|
}
|
|
|
const Milliseconds remaining = maxWait - (Date_t::now() - startTimeMs);
|
|
|
if (remaining < kFindHostMaxBackOffTime) {
|
break;
|
}
|
|
|
// Back-off so we don't spam the replica set hosts too much
|
sleepFor(kFindHostMaxBackOffTime);
|
}
|
...
|
Relevant failure logs
|
|
[js_test:recovering_slaveok] 2016-03-18T22:30:36.605+0000 s20269| 2016-03-18T22:30:30.699+0000 D ASIO [conn1] startCommand: RemoteCommand 169 -- target:ip-10-63-79-92:20267 db:config expDate:2016-03-18T22:31:00.699+0000 cmd:{ update: "collections", updates: [ { q: { _id: "test.foo" }, u: { _id: "test.foo", lastmodEpoch: ObjectId('56ec81841e56700a43aaf54d'), lastmod: new Date(4294967296), dropped: false, key: { _id: 1.0 }, unique: false }, multi: false, upsert: true } ], writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 }
|
....
|
[js_test:recovering_slaveok] 2016-03-18T22:30:48.839+0000 c20267| 2016-03-18T22:30:30.714+0000 I COMMAND [conn13] Attempting to step down in response to replSetStepDown command
|
....
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.313+0000 s20269| 2016-03-18T22:30:37.484+0000 I NETWORK [conn1] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.63.79.92:20267]
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.316+0000 s20269| 2016-03-18T22:30:37.484+0000 D - [conn1] User Assertion: 6:network error while attempting to run command 'ismaster' on host 'ip-10-63-79-92:20267'
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.317+0000 s20269| 2016-03-18T22:30:37.484+0000 I NETWORK [conn1] Detected bad connection created at 1458340230761664 microSec, clearing pool for ip-10-63-79-92:20267 of 0 connections
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.317+0000 s20269| 2016-03-18T22:30:37.484+0000 D NETWORK [conn1] Marking host ip-10-63-79-92:20267 as failed
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.317+0000 s20269| 2016-03-18T22:30:37.484+0000 W NETWORK [conn1] No primary detected for set recovering_slaveok-configRS
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.318+0000 s20269| 2016-03-18T22:30:37.984+0000 D NETWORK [conn1] Starting new refresh of replica set recovering_slaveok-configRS
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.318+0000 s20269| 2016-03-18T22:30:37.984+0000 D NETWORK [conn1] creating new connection to:ip-10-63-79-92:20267 <- starts contacting hosts and gets stuck
|
....
|
[js_test:recovering_slaveok] 2016-03-18T22:30:43.823+0000 c20266| 2016-03-18T22:30:37.985+0000 I REPL [rsSync] transition to primary complete; database writes are now permitted <- new primary is up
|
....
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.318+0000 s20269| 2016-03-18T22:30:37.984+0000 D NETWORK [Balancer] polling for status of connection to 10.63.79.92:20268, no events
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.319+0000 s20269| 2016-03-18T22:30:37.985+0000 D NETWORK [Balancer] polling for status of connection to 10.63.79.92:20266, no events
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.319+0000 s20269| 2016-03-18T22:30:38.018+0000 D COMMAND [ConnectBG] BackgroundJob starting: ConnectBG
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.319+0000 s20269| 2016-03-18T22:30:38.024+0000 D NETWORK [conn1] connected to server ip-10-63-79-92:20267 (10.63.79.92)
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.320+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK [ReplicaSetMonitorWatcher] checking replica set: recovering_slaveok-rs1
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.348+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK [ReplicaSetMonitorWatcher] Starting new refresh of replica set recovering_slaveok-rs1
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.348+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20263, no events
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.349+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20264, no events
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.349+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20265, no events
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.350+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK [ReplicaSetMonitorWatcher] checking replica set: recovering_slaveok-rs0
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.350+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK [ReplicaSetMonitorWatcher] Starting new refresh of replica set recovering_slaveok-rs0
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.351+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20260, no events
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.351+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20261, no events
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.351+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20262, no events
|
[js_test:recovering_slaveok] 2016-03-18T22:30:40.352+0000 s20269| 2016-03-18T22:30:38.788+0000 D NETWORK [ReplicaSetMonitorWatcher] checking replica set: recovering_slaveok-configRS
|
....
|
.... 5 seconds pass before hearing from s20269 again
|
....
|
[js_test:recovering_slaveok] 2016-03-18T22:30:44.664+0000 s20269| 2016-03-18T22:30:43.023+0000 I NETWORK [conn1] Socket recv() timeout 10.63.79.92:20267
|
[js_test:recovering_slaveok] 2016-03-18T22:30:44.665+0000 s20269| 2016-03-18T22:30:43.023+0000 I NETWORK [conn1] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.63.79.92:20267]
|
[js_test:recovering_slaveok] 2016-03-18T22:30:44.677+0000 s20269| 2016-03-18T22:30:43.024+0000 D - [conn1] User Assertion: 6:network error while attempting to run command 'isMaster' on host 'ip-10-63-79-92:20267'
|
[js_test:recovering_slaveok] 2016-03-18T22:30:44.677+0000 s20269| 2016-03-18T22:30:43.024+0000 D - [conn1] User Assertion: 13328:connection pool: connect failed ip-10-63-79-92:20267 : network error while attempting to run command 'isMaster' on host 'ip-10-63-79-92:20267'
|
[js_test:recovering_slaveok] 2016-03-18T22:30:44.678+0000 s20269| 2016-03-18T22:30:43.024+0000 D NETWORK [conn1] Marking host ip-10-63-79-92:20267 as failed
|
[js_test:recovering_slaveok] 2016-03-18T22:30:44.678+0000 s20269| 2016-03-18T22:30:43.024+0000 W NETWORK [conn1] No primary detected for set recovering_slaveok-configRS
|
[js_test:recovering_slaveok] 2016-03-18T22:30:44.679+0000 s20269| 2016-03-18T22:30:43.024+0000 D - [conn1] User Assertion: 133:collection metadata write failed :: caused by :: FailedToSatisfyReadPreference: could not find host matching read preference { mode: "primary" } for set recovering_slaveok-configRS
|
[js_test:recovering_slaveok] 2016-03-18T22:30:44.679+0000 s20269| 2016-03-18T22:30:43.024+0000 D NETWORK [conn1] Starting new refresh of replica set recovering_slaveok-configRS
|
[js_test:recovering_slaveok] 2016-03-18T22:30:44.680+0000 s20269| 2016-03-18T22:30:43.024+0000 D NETWORK [conn1] creating new connection to:ip-10-63-79-92:20267
|
....
|
....
|
....
|
[js_test:recovering_slaveok] 2016-03-18T22:30:45.247+0000 {
|
[js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000 "code" : 133,
|
[js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000 "ok" : 0,
|
[js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000 "errmsg" : "collection metadata write failed :: caused by :: FailedToSatisfyReadPreference: could not find host matching read preference { mode: \"primary\" } for set recovering_slaveok-configRS"
|
[js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000 }
|