Core Server / SERVER-23355

ReplicaSetMonitor: getHostOrRefresh hangs on non-responsive hosts, leading to timeout and inappropriate failures


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 4.1.11
    • Affects Version/s: None
    • Component/s: Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Service Arch 2019-01-28, Service Arch 2019-02-11, Service Arch 2019-02-25, Service Arch 2019-03-11, Service Arch 2019-04-08, Service Arch 2019-04-22

    Description

      ReplicaSetMonitor::getHostOrRefresh repeatedly attempts to refresh the replica set connections until it finds a host that matches its ReadPreferenceSetting. It does this in a while loop until the timeout is reached. The logic is ordered so that the loop fails to find a matching host, then checks for a timeout, and only then searches again.

      The problem is that the scan goes through the connections serially and hangs for 5 seconds on any connection that doesn't respond.
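
      To make the cost of that serial scan concrete, here is a small self-contained sketch of the behavior (illustrative only: Host, scanSerially, and the delays are hypothetical stand-ins, not the actual monitor internals; the 5-second stall matches the socket timeout in the logs below):

          // Illustrative sketch only -- not ReplicaSetMonitor code.
          #include <chrono>
          #include <iostream>
          #include <optional>
          #include <string>
          #include <thread>
          #include <vector>

          struct Host {
              std::string name;
              bool isPrimary;    // what the host would report via isMaster
              bool responsive;   // false => the check stalls until the socket timeout
          };

          // One serial pass, like the scan described above: every unresponsive
          // host costs the full socket timeout before the next host is tried.
          std::optional<std::string> scanSerially(const std::vector<Host>& hosts,
                                                  std::chrono::milliseconds socketTimeout) {
              for (const auto& h : hosts) {
                  if (!h.responsive) {
                      std::this_thread::sleep_for(socketTimeout);  // RECV_TIMEOUT
                      continue;
                  }
                  if (h.isPrimary)
                      return h.name;
              }
              return std::nullopt;  // no match this pass
          }

          int main() {
              using namespace std::chrono;
              // Mirrors the scenario below: 20266 is not yet primary when checked,
              // 20267 is down, 20268 is a secondary.
              std::vector<Host> hosts = {{"20266", false, true},
                                         {"20267", false, false},
                                         {"20268", false, true}};
              auto start = steady_clock::now();
              auto match = scanSerially(hosts, milliseconds(5000));
              auto elapsed = duration_cast<milliseconds>(steady_clock::now() - start);
              std::cout << "match: " << (match ? *match : std::string("none"))
                        << ", elapsed: " << elapsed.count() << "ms\n";  // ~5000ms, no match
          }

      A single pass over three hosts costs roughly the full 5 seconds, which is enough to exhaust the caller's remaining budget before the next pass could observe a new primary.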

      The scenario in which this is an issue:

      • A command is issued with read pref mode "primary".
      • The config primary (20267) goes down.
      • getHostOrRefresh is called but can't get a host matching "primary"; it keeps retrying the connections for a time, and then:
      • getHostOrRefresh checks the first host, 20266: not primary.
      • Config 20266 becomes primary.
      • getHostOrRefresh checks the next host, 20267, and gets stuck for 5 seconds because 20267 is still down.
      • getHostOrRefresh checks the next host, 20268: not primary.
      • getHostOrRefresh checks for a timeout and exits.
      • The command returns an error/failure.

      The command fails even though a config primary has been up and working for the past 5 seconds. Not desirable.

      Relevant code in ReplicaSetMonitor::getHostOrRefresh:

      ...
          while (true) {
              // We might not have found any matching hosts because the scan that just
              // completed may have seen stale data from before we joined. Therefore we
              // should participate in a new scan to make sure all hosts are contacted
              // at least once (possibly by other threads) before this function gives up.
              Refresher refresher(startOrContinueRefresh());
       
              HostAndPort out = refresher.refreshUntilMatches(criteria);
              if (!out.empty())
                  return out;
       
              if (!isSetUsable()) {
                  return Status(ErrorCodes::ReplicaSetNotFound,
                                str::stream() << "None of the hosts for replica set " << getName()
                                              << " could be contacted.");
              }
       
              const Milliseconds remaining = maxWait - (Date_t::now() - startTimeMs);
       
              if (remaining < kFindHostMaxBackOffTime) {
                  break;
              }
       
              // Back-off so we don't spam the replica set hosts too much
              sleepFor(kFindHostMaxBackOffTime);
          }
      ...
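
      For illustration, a hedged sketch of one possible mitigation (not necessarily the fix that shipped): probe the hosts concurrently so that a single unresponsive host cannot serialize the scan. Host, probeIsPrimary, and findPrimary below are hypothetical stand-ins, not ReplicaSetMonitor internals:

          // Illustrative sketch only -- concurrent probes under a shared deadline.
          #include <chrono>
          #include <future>
          #include <iostream>
          #include <optional>
          #include <string>
          #include <thread>
          #include <utility>
          #include <vector>

          using Ms = std::chrono::milliseconds;

          struct Host {
              std::string name;
              bool isPrimary;
              Ms replyDelay;  // 5000ms models a host stuck until RECV_TIMEOUT
          };

          // Hypothetical stand-in for the isMaster round trip.
          bool probeIsPrimary(const Host& h) {
              std::this_thread::sleep_for(h.replyDelay);
              return h.isPrimary;
          }

          std::optional<std::string> findPrimary(const std::vector<Host>& hosts, Ms maxWait) {
              // Launch one probe per host instead of walking the hosts serially.
              std::vector<std::pair<std::string, std::future<bool>>> probes;
              for (const auto& h : hosts)
                  probes.emplace_back(h.name, std::async(std::launch::async, probeIsPrimary, h));

              const auto deadline = std::chrono::steady_clock::now() + maxWait;
              for (auto& [name, fut] : probes) {
                  // Wait on each probe only as long as the overall budget allows.
                  if (fut.wait_until(deadline) == std::future_status::ready && fut.get())
                      return name;  // found within budget despite the dead host
              }
              return std::nullopt;
              // Caveat: std::async futures join in their destructor, so this toy
              // still waits for straggler probes on scope exit; a real monitor
              // would park them on a background thread.
          }

          int main() {
              // Same shape as the scenario above: 20266 is primary, 20267 is down.
              std::vector<Host> hosts = {{"20266", true, Ms(50)},
                                         {"20267", false, Ms(5000)},
                                         {"20268", false, Ms(50)}};
              auto primary = findPrimary(hosts, Ms(2000));
              std::cout << (primary ? *primary : std::string("none")) << "\n";  // 20266
          }

      With concurrent probes, the 5-second stall on 20267 overlaps the fast replies from 20266 and 20268, so the deadline check no longer races against a pass that finished on stale data.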
      

      Relevant failure logs:

       
      [js_test:recovering_slaveok] 2016-03-18T22:30:36.605+0000 s20269| 2016-03-18T22:30:30.699+0000 D ASIO     [conn1] startCommand: RemoteCommand 169 -- target:ip-10-63-79-92:20267 db:config expDate:2016-03-18T22:31:00.699+0000 cmd:{ update: "collections", updates: [ { q: { _id: "test.foo" }, u: { _id: "test.foo", lastmodEpoch: ObjectId('56ec81841e56700a43aaf54d'), lastmod: new Date(4294967296), dropped: false, key: { _id: 1.0 }, unique: false }, multi: false, upsert: true } ], writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 }
      ....
      [js_test:recovering_slaveok] 2016-03-18T22:30:48.839+0000 c20267| 2016-03-18T22:30:30.714+0000 I COMMAND  [conn13] Attempting to step down in response to replSetStepDown command
      ....
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.313+0000 s20269| 2016-03-18T22:30:37.484+0000 I NETWORK  [conn1] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.63.79.92:20267]
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.316+0000 s20269| 2016-03-18T22:30:37.484+0000 D -        [conn1] User Assertion: 6:network error while attempting to run command 'ismaster' on host 'ip-10-63-79-92:20267'
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.317+0000 s20269| 2016-03-18T22:30:37.484+0000 I NETWORK  [conn1] Detected bad connection created at 1458340230761664 microSec, clearing pool for ip-10-63-79-92:20267 of 0 connections
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.317+0000 s20269| 2016-03-18T22:30:37.484+0000 D NETWORK  [conn1] Marking host ip-10-63-79-92:20267 as failed
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.317+0000 s20269| 2016-03-18T22:30:37.484+0000 W NETWORK  [conn1] No primary detected for set recovering_slaveok-configRS
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.318+0000 s20269| 2016-03-18T22:30:37.984+0000 D NETWORK  [conn1] Starting new refresh of replica set recovering_slaveok-configRS
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.318+0000 s20269| 2016-03-18T22:30:37.984+0000 D NETWORK  [conn1] creating new connection to:ip-10-63-79-92:20267                                                     <- starts contacting hosts and gets stuck
      ....
      [js_test:recovering_slaveok] 2016-03-18T22:30:43.823+0000 c20266| 2016-03-18T22:30:37.985+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted                                  <- new primary is up
      ....
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.318+0000 s20269| 2016-03-18T22:30:37.984+0000 D NETWORK  [Balancer] polling for status of connection to 10.63.79.92:20268, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.319+0000 s20269| 2016-03-18T22:30:37.985+0000 D NETWORK  [Balancer] polling for status of connection to 10.63.79.92:20266, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.319+0000 s20269| 2016-03-18T22:30:38.018+0000 D COMMAND  [ConnectBG] BackgroundJob starting: ConnectBG
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.319+0000 s20269| 2016-03-18T22:30:38.024+0000 D NETWORK  [conn1] connected to server ip-10-63-79-92:20267 (10.63.79.92)
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.320+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: recovering_slaveok-rs1
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.348+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] Starting new refresh of replica set recovering_slaveok-rs1
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.348+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20263, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.349+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20264, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.349+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20265, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.350+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: recovering_slaveok-rs0
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.350+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] Starting new refresh of replica set recovering_slaveok-rs0
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.351+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20260, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.351+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20261, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.351+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20262, no events
      [js_test:recovering_slaveok] 2016-03-18T22:30:40.352+0000 s20269| 2016-03-18T22:30:38.788+0000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: recovering_slaveok-configRS
      ....
      .... 5 seconds pass before hearing from s20269 again
      ....
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.664+0000 s20269| 2016-03-18T22:30:43.023+0000 I NETWORK  [conn1] Socket recv() timeout  10.63.79.92:20267
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.665+0000 s20269| 2016-03-18T22:30:43.023+0000 I NETWORK  [conn1] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.63.79.92:20267]
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.677+0000 s20269| 2016-03-18T22:30:43.024+0000 D -        [conn1] User Assertion: 6:network error while attempting to run command 'isMaster' on host 'ip-10-63-79-92:20267'
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.677+0000 s20269| 2016-03-18T22:30:43.024+0000 D -        [conn1] User Assertion: 13328:connection pool: connect failed ip-10-63-79-92:20267 : network error while attempting to run command 'isMaster' on host 'ip-10-63-79-92:20267'
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.678+0000 s20269| 2016-03-18T22:30:43.024+0000 D NETWORK  [conn1] Marking host ip-10-63-79-92:20267 as failed
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.678+0000 s20269| 2016-03-18T22:30:43.024+0000 W NETWORK  [conn1] No primary detected for set recovering_slaveok-configRS
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.679+0000 s20269| 2016-03-18T22:30:43.024+0000 D -        [conn1] User Assertion: 133:collection metadata write failed :: caused by :: FailedToSatisfyReadPreference: could not find host matching read preference { mode: "primary" } for set recovering_slaveok-configRS
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.679+0000 s20269| 2016-03-18T22:30:43.024+0000 D NETWORK  [conn1] Starting new refresh of replica set recovering_slaveok-configRS
      [js_test:recovering_slaveok] 2016-03-18T22:30:44.680+0000 s20269| 2016-03-18T22:30:43.024+0000 D NETWORK  [conn1] creating new connection to:ip-10-63-79-92:20267
      ....
      ....
      ....
      [js_test:recovering_slaveok] 2016-03-18T22:30:45.247+0000 {
      [js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000   "code" : 133,
      [js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000   "ok" : 0,
      [js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000   "errmsg" : "collection metadata write failed :: caused by :: FailedToSatisfyReadPreference: could not find host matching read preference { mode: \"primary\" } for set recovering_slaveok-configRS"
      [js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000 }
      


          People

            Assignee: A. Jesse Jiryu Davis (jesse@mongodb.com)
            Reporter: Dianna Hohensee (dianna.hohensee@mongodb.com)
            Votes: 1
            Watchers: 8
