[SERVER-23355] ReplicaSetMonitor: getHostOrRefresh hangs on non-responsive hosts, leading to timeout and inappropriate failures Created: 25/Mar/16  Updated: 08/Jan/24  Resolved: 15/Apr/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.1.11

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: A. Jesse Jiryu Davis
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File recovering_slaveok_test_failure_logs.txt    
Issue Links:
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Service Arch 2019-01-28, Service Arch 2019-02-11, Service Arch 2019-02-25, Service Arch 2019-03-11, Service Arch 2019-04-08, Service Arch 2019-04-22
Participants:
Linked BF Score: 0

 Description   

ReplicaSetMonitor::getHostOrRefresh will repeatedly attempt to refresh the replica set connections until it finds a host that matches its ReadPreferenceSetting. It does this in a while loop until the timeout is reached. The ordering of the logic in the loop is such that it fails to find a matching host and then checks for a timeout before searching again.

The problem is that it goes through the connections serially, and hangs for 5 seconds on a connection that doesn't respond.

The scenario in which this is an issue:

  • Command is issued with read pref mode "primary"
  • Config primary (20267) goes down
  • getHostOrRefresh is called, but can't get a host matching "primary"; it keeps retrying the connections for a time, and then:
  • getHostOrRefresh checks first host 20266: not primary
  • Config 20266 becomes primary.
  • getHostOrRefresh checks next host 20267, gets stuck for 5 seconds because 20267 is still down.
  • getHostOrRefresh checks next host 20268, not primary
  • getHostOrRefresh checks for the timeout and exits
  • Command returns error/failure.

The command fails even though a config primary has been up and working for the past 5 seconds. Not desirable. (A minimal sketch of this timing follows the code excerpt below.)

Relevant code in ReplicaSetMonitor::getHostOrRefresh:

...
    while (true) {
        // We might not have found any matching hosts because the scan, which just completed, may
        // have seen stale data from before we joined. Therefore we should participate in a new scan to
        // make sure all hosts are contacted at least once (possibly by other threads) before this
        // function gives up.
        Refresher refresher(startOrContinueRefresh());
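        // NOTE (from this ticket's description): refreshUntilMatches() below walks the hosts
        // serially and can block for ~5 seconds on an unresponsive host before moving on.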
 
        HostAndPort out = refresher.refreshUntilMatches(criteria);
        if (!out.empty())
            return out;
 
        if (!isSetUsable()) {
            return Status(ErrorCodes::ReplicaSetNotFound,
                          str::stream() << "None of the hosts for replica set " << getName()
                                        << " could be contacted.");
        }
 
        const Milliseconds remaining = maxWait - (Date_t::now() - startTimeMs);
 
        if (remaining < kFindHostMaxBackOffTime) {
            break;
        }
 
        // Back-off so we don't spam the replica set hosts too much
        sleepFor(kFindHostMaxBackOffTime);
    }
...
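
To make the timing concrete, here is a minimal, self-contained sketch of the failure mode. It is illustrative only: FakeHost, the host list, and the exact numbers are invented for this sketch; only the 5-second per-host hang, the serial scan, and the shape of the loop above come from this ticket.

// Hypothetical illustration of the timing problem -- not server code. Compiles
// standalone (C++17): it mimics the loop above with a serial scan and a
// 5-second hang on the one dead host.
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

using namespace std::chrono;

struct FakeHost {
    std::string name;
    bool responsive;  // an unresponsive host costs the full socket timeout
    bool isPrimary;
};

int main() {
    const milliseconds perHostTimeout(5000);  // hang on the dead host (20267)
    const milliseconds quickReply(50);        // a live host answers quickly
    const milliseconds backOff(500);          // stands in for kFindHostMaxBackOffTime
    const milliseconds maxWait(5500);         // budget left when this scan starts

    // The moment described above: 20267 (old primary) is down, 20266 is about
    // to win the election, 20268 stays a secondary.
    std::vector<FakeHost> hosts = {
        {"host:20266", true, false},
        {"host:20267", false, false},
        {"host:20268", true, false},
    };

    milliseconds elapsed(0);
    while (true) {
        // One serial scan, like refreshUntilMatches walking the hosts in order.
        for (size_t i = 0; i < hosts.size(); ++i) {
            elapsed += hosts[i].responsive ? quickReply : perHostTimeout;
            if (hosts[i].isPrimary) {
                std::cout << "found primary " << hosts[i].name << "\n";
                return 0;
            }
            if (i == 0)
                hosts[0].isPrimary = true;  // 20266 is elected right after we checked it
        }
        if (maxWait - elapsed < backOff) {
            // Mirrors the "remaining < kFindHostMaxBackOffTime" exit above.
            std::cout << "giving up after " << elapsed.count() << "ms, even though "
                      << "host:20266 has been primary for ~5 seconds\n";
            return 1;  // the real code returns FailedToSatisfyReadPreference
        }
    }
}

Running this takes the "giving up" branch after 5100ms, which matches the logs below: the mongos reports FailedToSatisfyReadPreference roughly five seconds after 20266 transitioned to primary.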

Relevant failure logs

 
[js_test:recovering_slaveok] 2016-03-18T22:30:36.605+0000 s20269| 2016-03-18T22:30:30.699+0000 D ASIO     [conn1] startCommand: RemoteCommand 169 -- target:ip-10-63-79-92:20267 db:config expDate:2016-03-18T22:31:00.699+0000 cmd:{ update: "collections", updates: [ { q: { _id: "test.foo" }, u: { _id: "test.foo", lastmodEpoch: ObjectId('56ec81841e56700a43aaf54d'), lastmod: new Date(4294967296), dropped: false, key: { _id: 1.0 }, unique: false }, multi: false, upsert: true } ], writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 }
....
[js_test:recovering_slaveok] 2016-03-18T22:30:48.839+0000 c20267| 2016-03-18T22:30:30.714+0000 I COMMAND  [conn13] Attempting to step down in response to replSetStepDown command
....
[js_test:recovering_slaveok] 2016-03-18T22:30:40.313+0000 s20269| 2016-03-18T22:30:37.484+0000 I NETWORK  [conn1] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.63.79.92:20267]
[js_test:recovering_slaveok] 2016-03-18T22:30:40.316+0000 s20269| 2016-03-18T22:30:37.484+0000 D -        [conn1] User Assertion: 6:network error while attempting to run command 'ismaster' on host 'ip-10-63-79-92:20267'
[js_test:recovering_slaveok] 2016-03-18T22:30:40.317+0000 s20269| 2016-03-18T22:30:37.484+0000 I NETWORK  [conn1] Detected bad connection created at 1458340230761664 microSec, clearing pool for ip-10-63-79-92:20267 of 0 connections
[js_test:recovering_slaveok] 2016-03-18T22:30:40.317+0000 s20269| 2016-03-18T22:30:37.484+0000 D NETWORK  [conn1] Marking host ip-10-63-79-92:20267 as failed
[js_test:recovering_slaveok] 2016-03-18T22:30:40.317+0000 s20269| 2016-03-18T22:30:37.484+0000 W NETWORK  [conn1] No primary detected for set recovering_slaveok-configRS
[js_test:recovering_slaveok] 2016-03-18T22:30:40.318+0000 s20269| 2016-03-18T22:30:37.984+0000 D NETWORK  [conn1] Starting new refresh of replica set recovering_slaveok-configRS
[js_test:recovering_slaveok] 2016-03-18T22:30:40.318+0000 s20269| 2016-03-18T22:30:37.984+0000 D NETWORK  [conn1] creating new connection to:ip-10-63-79-92:20267                                                     <- starts contacting hosts and gets stuck
....
[js_test:recovering_slaveok] 2016-03-18T22:30:43.823+0000 c20266| 2016-03-18T22:30:37.985+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted                                  <- new primary is up
....
[js_test:recovering_slaveok] 2016-03-18T22:30:40.318+0000 s20269| 2016-03-18T22:30:37.984+0000 D NETWORK  [Balancer] polling for status of connection to 10.63.79.92:20268, no events
[js_test:recovering_slaveok] 2016-03-18T22:30:40.319+0000 s20269| 2016-03-18T22:30:37.985+0000 D NETWORK  [Balancer] polling for status of connection to 10.63.79.92:20266, no events
[js_test:recovering_slaveok] 2016-03-18T22:30:40.319+0000 s20269| 2016-03-18T22:30:38.018+0000 D COMMAND  [ConnectBG] BackgroundJob starting: ConnectBG
[js_test:recovering_slaveok] 2016-03-18T22:30:40.319+0000 s20269| 2016-03-18T22:30:38.024+0000 D NETWORK  [conn1] connected to server ip-10-63-79-92:20267 (10.63.79.92)
[js_test:recovering_slaveok] 2016-03-18T22:30:40.320+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: recovering_slaveok-rs1
[js_test:recovering_slaveok] 2016-03-18T22:30:40.348+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] Starting new refresh of replica set recovering_slaveok-rs1
[js_test:recovering_slaveok] 2016-03-18T22:30:40.348+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20263, no events
[js_test:recovering_slaveok] 2016-03-18T22:30:40.349+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20264, no events
[js_test:recovering_slaveok] 2016-03-18T22:30:40.349+0000 s20269| 2016-03-18T22:30:38.786+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20265, no events
[js_test:recovering_slaveok] 2016-03-18T22:30:40.350+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: recovering_slaveok-rs0
[js_test:recovering_slaveok] 2016-03-18T22:30:40.350+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] Starting new refresh of replica set recovering_slaveok-rs0
[js_test:recovering_slaveok] 2016-03-18T22:30:40.351+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20260, no events
[js_test:recovering_slaveok] 2016-03-18T22:30:40.351+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20261, no events
[js_test:recovering_slaveok] 2016-03-18T22:30:40.351+0000 s20269| 2016-03-18T22:30:38.787+0000 D NETWORK  [ReplicaSetMonitorWatcher] polling for status of connection to 10.63.79.92:20262, no events
[js_test:recovering_slaveok] 2016-03-18T22:30:40.352+0000 s20269| 2016-03-18T22:30:38.788+0000 D NETWORK  [ReplicaSetMonitorWatcher] checking replica set: recovering_slaveok-configRS
....
.... 5 seconds pass before hearing from s20269 again
....
[js_test:recovering_slaveok] 2016-03-18T22:30:44.664+0000 s20269| 2016-03-18T22:30:43.023+0000 I NETWORK  [conn1] Socket recv() timeout  10.63.79.92:20267
[js_test:recovering_slaveok] 2016-03-18T22:30:44.665+0000 s20269| 2016-03-18T22:30:43.023+0000 I NETWORK  [conn1] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.63.79.92:20267]
[js_test:recovering_slaveok] 2016-03-18T22:30:44.677+0000 s20269| 2016-03-18T22:30:43.024+0000 D -        [conn1] User Assertion: 6:network error while attempting to run command 'isMaster' on host 'ip-10-63-79-92:20267'
[js_test:recovering_slaveok] 2016-03-18T22:30:44.677+0000 s20269| 2016-03-18T22:30:43.024+0000 D -        [conn1] User Assertion: 13328:connection pool: connect failed ip-10-63-79-92:20267 : network error while attempting to run command 'isMaster' on host 'ip-10-63-79-92:20267'
[js_test:recovering_slaveok] 2016-03-18T22:30:44.678+0000 s20269| 2016-03-18T22:30:43.024+0000 D NETWORK  [conn1] Marking host ip-10-63-79-92:20267 as failed
[js_test:recovering_slaveok] 2016-03-18T22:30:44.678+0000 s20269| 2016-03-18T22:30:43.024+0000 W NETWORK  [conn1] No primary detected for set recovering_slaveok-configRS
[js_test:recovering_slaveok] 2016-03-18T22:30:44.679+0000 s20269| 2016-03-18T22:30:43.024+0000 D -        [conn1] User Assertion: 133:collection metadata write failed :: caused by :: FailedToSatisfyReadPreference: could not find host matching read preference { mode: "primary" } for set recovering_slaveok-configRS
[js_test:recovering_slaveok] 2016-03-18T22:30:44.679+0000 s20269| 2016-03-18T22:30:43.024+0000 D NETWORK  [conn1] Starting new refresh of replica set recovering_slaveok-configRS
[js_test:recovering_slaveok] 2016-03-18T22:30:44.680+0000 s20269| 2016-03-18T22:30:43.024+0000 D NETWORK  [conn1] creating new connection to:ip-10-63-79-92:20267
....
....
....
[js_test:recovering_slaveok] 2016-03-18T22:30:45.247+0000 {
[js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000   "code" : 133,
[js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000   "ok" : 0,
[js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000   "errmsg" : "collection metadata write failed :: caused by :: FailedToSatisfyReadPreference: could not find host matching read preference { mode: \"primary\" } for set recovering_slaveok-configRS"
[js_test:recovering_slaveok] 2016-03-18T22:30:45.248+0000 }



 Comments   
Comment by Githook User [ 15/Apr/19 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: SERVER-23355 Recheck replicas every .5 sec during scan
Branch: master
https://github.com/mongodb/mongo/commit/195271e1c8254f26636331b9e07067e2eab16374

Comment by Mathias Stearn [ 25/Jan/19 ]

greg.mckeon dianna.hohensee The steps are now different after the recent changes to make the RSM async, but I think the observed behavior is the same. The new sequence is:

  1. Launch a scan of all hosts in parallel
  2. Hosts 1 and 2 quickly report not master
  3. Every 500ms (the "expedited period") we will ensure that a scan is in progress, but not launch a new one because one is already in progress
  4. Host 1 becomes primary
  5. After 5 seconds, host 3's connection attempt times out and it is considered down
  6. At the next scan (at most 500ms later) we will launch the parallel scan again
  7. Host 1 quickly reports that it is primary
  8. Return Host 1 to the caller

If we want to fix this, we would need to re-contact, on the 500ms expedited period, the hosts that have already replied. This is probably doable, but would require care since the RSM is currently designed to contact each host exactly once. A rough sketch of the idea follows.
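
A rough sketch of that idea (illustrative only: HostEntry, Scan, and hostsToContact are invented names, not the real RSM data structures): on each 500ms expedited tick, hosts that already replied during the current scan become eligible to be contacted again, rather than being skipped until the next full scan.

// Illustrative only -- HostEntry, Scan, and hostsToContact are invented for this
// sketch; the real ReplicaSetMonitor scan state differs.
#include <chrono>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

struct HostEntry {
    std::string hostAndPort;
    bool replied = false;             // answered (e.g. "not master") in this scan
    Clock::time_point lastContact{};  // when we last heard from it
};

struct Scan {
    std::vector<HostEntry> hosts;
};

// Called on each expedited tick (every 500ms while a waiter needs a host).
// Returns hosts to (re-)contact: anything not yet heard from, plus anything that
// replied more than recheckInterval ago -- so a node elected primary after its
// first "not master" reply is noticed without waiting out the 5-second timeout
// on a dead host or waiting for the next full scan.
std::vector<std::string> hostsToContact(const Scan& scan,
                                        Clock::time_point now,
                                        std::chrono::milliseconds recheckInterval) {
    std::vector<std::string> due;
    for (const auto& h : scan.hosts) {
        if (!h.replied || now - h.lastContact >= recheckInterval)
            due.push_back(h.hostAndPort);
    }
    return due;
}

int main() {
    Scan scan;
    scan.hosts.push_back({"host:20266", true, Clock::now() - std::chrono::seconds(2)});
    scan.hosts.push_back({"host:20267", false, {}});  // still hanging, never replied
    scan.hosts.push_back({"host:20268", true, Clock::now()});
    // With a 500ms recheck interval, 20266 (replied 2s ago) and 20267 are due to
    // be contacted again; 20268 (just replied) is not.
    return static_cast<int>(hostsToContact(scan, Clock::now(),
                                           std::chrono::milliseconds(500)).size());  // 2
}

Judging by its message ("Recheck replicas every .5 sec during scan"), the fix commit above appears to take this general direction, though the actual implementation lives in the RSM scan machinery.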

Generated at Thu Feb 08 04:03:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.