[SERVER-15424] Check compatibility of connection behavior in new heartbeats Created: 26/Sep/14  Updated: 03/Mar/15  Resolved: 22/Dec/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 2.8.0-rc3

Type: Task Priority: Critical - P2
Reporter: Scott Hernandez (Inactive) Assignee: Andy Schwerin
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-16461 Setting socket timeouts less than 1.0... Closed
Tested
Backwards Compatibility: Fully Compatible
Participants:

 Description   

We have changed how heartbeats work: they now run via the executor, and the underlying connections behave differently than they did in 2.6. We should investigate how the new behavior compares with 2.6 and how it affects monitoring.

Plan:

  • Run the following tests with 2.6, 2.8, and mixed 2.6/2.8 sets
  • Test each configuration to find:
    • timing of heartbeats in healthy state
    • connection recycling in healthy and error/timeout states
    • retry behavior with timeouts (both network connection and response), failed host resolution, bad responses
    • replSetStatus output during each test


 Comments   
Comment by Githook User [ 09/Dec/14 ]

Author: Spencer T Brody (stbrody) <spencer@mongodb.com>

Message: SERVER-15424 Fix signed-unsigned integer comparison
Branch: master
https://github.com/mongodb/mongo/commit/bc38b5af5637edc8a9aaa9708fcec106b4bc4325

Comment by Githook User [ 09/Dec/14 ]

Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com>

Message: SERVER-15424 Eliminate unused ConnectionPool::Options structure in NetworkInterfaceImpl.
Branch: master
https://github.com/mongodb/mongo/commit/1be586d431da882e276cc8c05b43881cd706e88a

Comment by Githook User [ 08/Dec/14 ]

Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com>

Message: SERVER-15424 Fix socket timeout computation in replication network interface.
Branch: master
https://github.com/mongodb/mongo/commit/87468e812ea2669a462144391bd5aa6d9e782775

Comment by Githook User [ 08/Dec/14 ]

Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com>

Message: SERVER-15424 Dynamically size replication network thread pool.
Branch: master
https://github.com/mongodb/mongo/commit/4c8eab7305bddbf544c6c9c62fe871486d012af3

Comment by Githook User [ 08/Dec/14 ]

Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com>

Message: SERVER-15424 Age out old connections used by replication for communicating with other members.

By capping the lifetime of connections used for heartbeats at 30 seconds, we
ensure that DNS and firewall changes that only affect new connections eventually
impact nodes' belief about the reachability of other nodes in the replica set.
Branch: master
https://github.com/mongodb/mongo/commit/2d79068ff6c0a7059fc39e72db88fc6a4f674e33
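The connection aging described in the commit message above could be sketched roughly as follows. This is an illustrative Python model, not the server's C++ implementation: the names `AgingConnectionPool`, `PooledConnection`, `connect`, and `clock` are hypothetical, and only the 30-second cap comes from the commit message. The point it shows is that a pooled connection older than the cap is discarded on checkout, so the next heartbeat must open a new connection and will therefore hit any DNS or firewall change.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Lifetime cap taken from the commit message; everything else is assumed.
MAX_CONNECTION_AGE_SECS = 30.0


@dataclass
class PooledConnection:
    host: str
    created_at: float
    handle: Any = None  # whatever the (hypothetical) connect callable returns


class AgingConnectionPool:
    """Toy pool that refuses to reuse connections older than max_age_secs."""

    def __init__(self,
                 connect: Callable[[str], Any],
                 max_age_secs: float = MAX_CONNECTION_AGE_SECS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._connect = connect
        self._max_age = max_age_secs
        self._clock = clock
        self._idle: Dict[str, List[PooledConnection]] = {}

    def acquire(self, host: str) -> PooledConnection:
        now = self._clock()
        idle = self._idle.setdefault(host, [])
        while idle:
            conn = idle.pop()
            if now - conn.created_at < self._max_age:
                return conn  # still young enough to reuse
            # Too old: drop it and keep looking.
        # No reusable connection: open a new one. A real connect would
        # resolve the hostname here and could fail on a DNS change.
        return PooledConnection(host=host, created_at=now,
                                handle=self._connect(host))

    def release(self, conn: PooledConnection) -> None:
        self._idle.setdefault(conn.host, []).append(conn)
```

With a real `connect` callable, a hostname that no longer resolves would start failing within one cap interval of the DNS change, instead of being masked indefinitely by a long-lived connection.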

Comment by Amalia Hawkins [ 25/Nov/14 ]

No, a new primary is not elected. The "No address associated with hostname" message appears multiple times, and the replSet attempts to sync for 10 seconds repeatedly.

Comment by Andy Schwerin [ 24/Nov/14 ]

amalia.hawkins@10gen.com, the question is, do the remaining nodes ever elect another primary?

Comment by Amalia Hawkins [ 24/Nov/14 ]

Under case (1), the replica set realizes DNS issues when the primary's hostname changes.

2014-11-24T17:40:07.809-0500 I REPL     [ReplicationExecutor] syncing from: osprey:27017
2014-11-24T17:40:07.826-0500 I NETWORK  [rsBackgroundSync] getaddrinfo("osprey") failed: No address associated with hostname
2014-11-24T17:40:07.826-0500 I REPL     [rsBackgroundSync] repl: couldn't initialize connection to host osprey, address is invalid
2014-11-24T17:40:07.826-0500 I REPL     [ReplicationExecutor] could not find member to sync from

Comment by Andy Schwerin [ 24/Nov/14 ]

There are two major divergences from 2.6 behavior here, at least one of which might need to be remedied for 2.8.0:

  1. Functioning heartbeat connections are never retired, which means that DNS failures might never be recognized by the replica set.
  2. Only 8 threads are ever used for heartbeat network traffic. A replica set with 15 nodes, 7 of which vote and are responsive and 8 of which do not vote and have typically long network latencies, might cause spurious primary step downs.
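To make the second point concrete, here is a small back-of-the-envelope model in Python. It is only a sketch under stated assumptions: requests are dispatched first-in, first-out onto a fixed number of worker threads, the 8 non-voting peers are queued ahead of the 6 other voting peers, and the latency figures (5 seconds and 10 milliseconds) are invented for illustration. The function name `completion_times` is hypothetical, and the model does not claim to reproduce the server's actual scheduling.

```python
import heapq
from typing import List


def completion_times(latencies: List[float], workers: int) -> List[float]:
    """Finish time of each request when dispatched FIFO onto `workers` threads."""
    free_at = [0.0] * workers  # time at which each worker next becomes idle
    heapq.heapify(free_at)
    finished = []
    for latency in latencies:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + latency
        heapq.heappush(free_at, end)
        finished.append(end)
    return finished


# Assumed figures, not measurements from the ticket.
SLOW_LATENCY_SECS = 5.0   # the 8 non-voting, high-latency peers
FAST_LATENCY_SECS = 0.01  # the 6 other voting, responsive peers

round_of_heartbeats = [SLOW_LATENCY_SECS] * 8 + [FAST_LATENCY_SECS] * 6

with_8_threads = completion_times(round_of_heartbeats, workers=8)
with_14_threads = completion_times(round_of_heartbeats, workers=14)

# Worst-case time until a responsive voter's heartbeat completes.
worst_fast_8 = max(with_8_threads[8:])
worst_fast_14 = max(with_14_threads[8:])
```

In this model, with 8 threads every heartbeat to a responsive voter waits behind a slow request, so its completion is delayed by roughly the slow latency. With one thread per peer the delay disappears. If that delay exceeds the heartbeat timeout, a primary could wrongly conclude that it has lost contact with a majority.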