[SERVER-17555] The heartbeat attempts connection multiple times (when the other node is down) in a way that is not intended Created: 11/Mar/15  Updated: 29/Apr/15  Resolved: 29/Apr/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.5
Fix Version/s: 3.0.0

Type: Improvement Priority: Minor - P4
Reporter: Anil Kumar Assignee: Sam Kleinman (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Backwards Compatibility: Minor Change
Participants:

 Description   

When the other side of the heartbeat channel goes down, the replica set heartbeat attempts a connection multiple times in a way that may not be intended, and as a result floods the log with connection failures. This may be acceptable when a node is down only briefly, but it does not seem appropriate when a longer maintenance task needs to be performed on the replica-set nodes.

This is evident from the following code instrumented with additional log statements (src/mongo/db/repl/heartbeat.cpp):

364         bool _requestHeartbeat(HeartbeatInfo& mem, BSONObj& info, int& theirConfigVersion) {
365             log() << "_requestHeartbeat - entering function";
366             {
367                 log() << "_requestHeartbeat - initial scoped connection check " << __FILE__ << ":" << __LINE__;
368                 ScopedConn conn(h.toString());
369                 conn.setTimeout(_timeout);
370                 if (tries++ % threshold == (threshold - 1)) {
371                     log() << "_requestHeartbeat - reconnect attempt for scoped connection " << __FILE__ << ":" << __LINE__;
372                     conn.reconnect();
373                     log() << "_requestHeartbeat - after failed reconnect attempt for scoped connection " << __FILE__ << ":" << __LINE__;
374                 }
375             }
376 
377             Timer timer;
378             time_t before = curTimeMicros64() / 1000000;
379 
380             bool ok = tryHeartbeat(&info, &theirConfigVersion);
381 
382             mem.ping = static_cast<unsigned int>(timer.millis());
383             time_t totalSecs = mem.ping / 1000;
384 
385             // if that didn't work and we have more time, lower timeout and try again
386             if (!ok && totalSecs < _timeout) {
387                 log() << "replset info " << h.toString() << " heartbeat failed, retrying" << rsLog;
388 
389                 // lower timeout to remaining ping time
390                 {
391                     log() << "_requestHeartbeat - getting scoped connection again for setting timeout " << __FILE__ << ":" << __LINE__;
392                     ScopedConn conn(h.toString());
393                     conn.setTimeout(_timeout - totalSecs);
394                 }
395 
396                 int checkpoint = timer.millis();
397                 timer.reset();
398                 ok = tryHeartbeat(&info, &theirConfigVersion);
399                 mem.ping = static_cast<unsigned int>(timer.millis());
400                 totalSecs = (checkpoint + mem.ping)/1000;
401 
402                 // set timeout back to default
403                 {
404                     log() << "_requestHeartbeat - getting scoped connection again for resetting timeout " << __FILE__ << ":" << __LINE__;
405                     ScopedConn conn(h.toString());
406                     conn.setTimeout(_timeout);
407                 }
408             }
409 
410             // we set this on any response - we don't get this far if
411             // couldn't connect because exception is thrown
412             time_t after = mem.lastHeartbeat = before + totalSecs;
413 
414             if ( info["time"].isNumber() ) {
415                 long long t = info["time"].numberLong();
416                 if( t > after )
417                     mem.skew = (int) (t - after);
418                 else if( t < before )
419                     mem.skew = (int) (t - before); // negative
420             }
421             else {
422                 // it won't be there if remote hasn't initialized yet
423                 if( info.hasElement("time") )
424                     warning() << "heatbeat.time isn't a number: " << info << endl;
425                 mem.skew = INT_MIN;
426             }
427 
428             {
429                 be state = info["state"];
430                 if( state.ok() )
431                     mem.hbstate = MemberState(state.Int());
432             }
433 
434             if (info.hasField("stateDisagreement") && info["stateDisagreement"].trueValue()) {
435                 log() << "replset info " << h.toString() << " thinks that we are down" << endl;
436             }
437 
438             log() << "_requestHeartbeat - leaving function";
439             return ok;
440         }

The following is the log produced for each heartbeat attempt while the other node in the replica set is down:

2015-03-11T17:45:16.100-0400 [rsHealthPoll] _requestHeartbeat - entering function
2015-03-11T17:45:16.100-0400 [rsHealthPoll] _requestHeartbeat - initial scoped connection check src/mongo/db/repl/heartbeat.cpp:367
2015-03-11T17:45:16.101-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.102-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
2015-03-11T17:45:16.102-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.103-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
2015-03-11T17:45:16.173-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.173-0400 [rsHealthPoll] replset info aks-osx.local:27018 heartbeat failed, retrying
2015-03-11T17:45:16.173-0400 [rsHealthPoll] _requestHeartbeat - getting scoped connection again for setting timeout src/mongo/db/repl/heartbeat.cpp:391
2015-03-11T17:45:16.175-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.175-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
2015-03-11T17:45:16.176-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.176-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
2015-03-11T17:45:16.310-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.311-0400 [rsHealthPoll] _requestHeartbeat - getting scoped connection again for resetting timeout src/mongo/db/repl/heartbeat.cpp:404
2015-03-11T17:45:16.312-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.312-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
2015-03-11T17:45:16.313-0400 [rsHealthPoll] _requestHeartbeat - leaving function
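
Stripped of the MongoDB-specific plumbing, the control flow of _requestHeartbeat above reduces to roughly the following standalone sketch (hypothetical code, not from the MongoDB tree; the exact number of attempts per pass depends on ScopedConn's internal connection handling). It illustrates why a single pass against a downed peer produces several "Failed to connect" warnings: every block that touches the connection makes its own attempt.

#include <iostream>
#include <string>

// Stand-in for ScopedConn construction / tryHeartbeat: each call makes its
// own connection attempt and logs a warning when the peer is unreachable.
bool tryConnect(const std::string& host) {
    std::cerr << "warning: Failed to connect to " << host << "\n";
    return false;  // peer is down
}

bool requestHeartbeatOnce(const std::string& host) {
    tryConnect(host);            // initial scoped connection check
    bool ok = tryConnect(host);  // first heartbeat attempt
    if (!ok) {
        tryConnect(host);        // re-acquire connection to lower the timeout
        ok = tryConnect(host);   // retry the heartbeat
        tryConnect(host);        // re-acquire connection to restore the timeout
    }
    return ok;
}

int main() {
    // One heartbeat pass against a downed peer emits five failure warnings here.
    requestHeartbeatOnce("192.168.56.1:27018");
    return 0;
}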

Apart from the log flooding, this flow can also lead to unintended behaviour when a connection succeeds at an intermediate stage of the code.
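
One way to reduce this noise (purely illustrative; the actual resolution was the replication rewrite in 3.0, see the comment below) would be to remember that a peer is down and throttle further reconnect attempts, and the warnings that go with them, behind a backoff interval. The ReconnectThrottle class below is a hypothetical, self-contained sketch of that idea:

#include <chrono>
#include <iostream>

// Hypothetical helper (not part of the MongoDB codebase): once a peer is
// known to be down, allow another reconnect attempt -- and the warning that
// accompanies it -- only after a backoff interval, instead of several times
// per heartbeat pass.
class ReconnectThrottle {
public:
    explicit ReconnectThrottle(std::chrono::seconds backoff) : _backoff(backoff) {}

    // True if enough time has elapsed since the last attempt to justify
    // another connection attempt (and another warning in the log).
    bool shouldAttempt(std::chrono::steady_clock::time_point now) {
        if (_lastAttempt != std::chrono::steady_clock::time_point{} &&
            now - _lastAttempt < _backoff) {
            return false;
        }
        _lastAttempt = now;
        return true;
    }

    // Call after a successful connect so the next failure is logged promptly.
    void reset() { _lastAttempt = std::chrono::steady_clock::time_point{}; }

private:
    std::chrono::seconds _backoff;
    std::chrono::steady_clock::time_point _lastAttempt{};
};

int main() {
    ReconnectThrottle throttle(std::chrono::seconds(30));
    for (int pass = 0; pass < 5; ++pass) {  // five back-to-back heartbeat passes
        if (throttle.shouldAttempt(std::chrono::steady_clock::now()))
            std::cout << "pass " << pass << ": attempting reconnect\n";
        else
            std::cout << "pass " << pass << ": peer still marked down, skipping\n";
    }
    return 0;
}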



 Comments   
Comment by Ramon Fernandez Marina [ 29/Apr/15 ]

The replication subsystem was significantly rewritten for 3.0, so this should no longer be an issue in versions 3.0.0 and above.
