Details
-
Improvement
-
Resolution: Done
-
Minor - P4
-
2.6.5
-
None
-
Minor Change
Description
When the other side of the heartbeat channel goes down, the RS heartbeat attempts connection multiple time in a way that may not be intended and also causes log flooding for these failures as a result. This may be ok when a node goes out for a small duration but does not seem to be very intuitive when a longer maintenance task needs to be performed on the replica-set nodes.
This is evident from the following code instrumented with additional log statements (src/mongo/db/repl/heartbeat.cpp):
364 bool _requestHeartbeat(HeartbeatInfo& mem, BSONObj& info, int& theirConfigVersion) {
|
365 log() << "_requestHeartbeat - entering function";
|
366 {
|
367 log() << "_requestHeartbeat - initial scoped connection check " << __FILE__ << ":" << __LINE__;
|
368 ScopedConn conn(h.toString());
|
369 conn.setTimeout(_timeout);
|
370 if (tries++ % threshold == (threshold - 1)) {
|
371 log() << "_requestHeartbeat - reconnect attempt for scoped connection " << __FILE__ << ":" << __LINE__;
|
372 conn.reconnect();
|
373 log() << "_requestHeartbeat - after failed reconnect attempt for scoped connection " << __FILE__ << ":" << __LINE__;
|
374 }
|
375 }
|
376
|
377 Timer timer;
|
378 time_t before = curTimeMicros64() / 1000000;
|
379
|
380 bool ok = tryHeartbeat(&info, &theirConfigVersion);
|
381
|
382 mem.ping = static_cast<unsigned int>(timer.millis());
|
383 time_t totalSecs = mem.ping / 1000;
|
384
|
385 // if that didn't work and we have more time, lower timeout and try again
|
386 if (!ok && totalSecs < _timeout) {
|
387 log() << "replset info " << h.toString() << " heartbeat failed, retrying" << rsLog;
|
388
|
389 // lower timeout to remaining ping time
|
390 {
|
391 log() << "_requestHeartbeat - getting scoped connection again for setting timeout " << __FILE__ << ":" << __LINE__;
|
392 ScopedConn conn(h.toString());
|
393 conn.setTimeout(_timeout - totalSecs);
|
394 }
|
395
|
396 int checkpoint = timer.millis();
|
397 timer.reset();
|
398 ok = tryHeartbeat(&info, &theirConfigVersion);
|
399 mem.ping = static_cast<unsigned int>(timer.millis());
|
400 totalSecs = (checkpoint + mem.ping)/1000;
|
401
|
402 // set timeout back to default
|
403 {
|
404 log() << "_requestHeartbeat - getting scoped connection again for resetting timeout " << __FILE__ << ":" << __LINE__;
|
405 ScopedConn conn(h.toString());
|
406 conn.setTimeout(_timeout);
|
407 }
|
408 }
|
409
|
410 // we set this on any response - we don't get this far if
|
411 // couldn't connect because exception is thrown
|
412 time_t after = mem.lastHeartbeat = before + totalSecs;
|
413
|
414 if ( info["time"].isNumber() ) {
|
415 long long t = info["time"].numberLong();
|
416 if( t > after )
|
417 mem.skew = (int) (t - after);
|
418 else if( t < before )
|
419 mem.skew = (int) (t - before); // negative
|
420 }
|
421 else {
|
422 // it won't be there if remote hasn't initialized yet
|
423 if( info.hasElement("time") )
|
424 warning() << "heatbeat.time isn't a number: " << info << endl;
|
425 mem.skew = INT_MIN;
|
426 }
|
427
|
428 {
|
429 be state = info["state"];
|
430 if( state.ok() )
|
431 mem.hbstate = MemberState(state.Int());
|
432 }
|
433
|
434 if (info.hasField("stateDisagreement") && info["stateDisagreement"].trueValue()) {
|
435 log() << "replset info " << h.toString() << " thinks that we are down" << endl;
|
436 }
|
437
|
438 log() << "_requestHeartbeat - leaving function";
|
439 return ok;
|
440 }
|
The following is log for each heartbeat attempt when the other node in the replica-set goes down:
2015-03-11T17:45:16.100-0400 [rsHealthPoll] _requestHeartbeat - entering function
|
2015-03-11T17:45:16.100-0400 [rsHealthPoll] _requestHeartbeat - initial scoped connection check src/mongo/db/repl/heartbeat.cpp:367
|
2015-03-11T17:45:16.101-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
|
2015-03-11T17:45:16.102-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
|
2015-03-11T17:45:16.102-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
|
2015-03-11T17:45:16.103-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
|
2015-03-11T17:45:16.173-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
|
2015-03-11T17:45:16.173-0400 [rsHealthPoll] replset info aks-osx.local:27018 heartbeat failed, retrying
|
2015-03-11T17:45:16.173-0400 [rsHealthPoll] _requestHeartbeat - getting scoped connection again for setting timeout src/mongo/db/repl/heartbeat.cpp:391
|
2015-03-11T17:45:16.175-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
|
2015-03-11T17:45:16.175-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
|
2015-03-11T17:45:16.176-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
|
2015-03-11T17:45:16.176-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
|
2015-03-11T17:45:16.310-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
|
2015-03-11T17:45:16.311-0400 [rsHealthPoll] _requestHeartbeat - getting scoped connection again for resetting timeout src/mongo/db/repl/heartbeat.cpp:404
|
2015-03-11T17:45:16.312-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
|
2015-03-11T17:45:16.312-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
|
2015-03-11T17:45:16.313-0400 [rsHealthPoll] _requestHeartbeat - leaving function
|
Apart from logging, the flow can also lead to unintended behaviour in cases when the connection is successful during an in-between stage in the code.