[GODRIVER-1658] connections should update topology faster for failed handshakes Created: 18/Jun/20  Updated: 28/Oct/23  Resolved: 29/Jun/20

Status: Closed
Project: Go Driver
Component/s: Server Discovery and Monitoring
Affects Version/s: None
Fix Version/s: 1.3.5

Type: Bug Priority: Major - P3
Reporter: David Bartley Assignee: Divjot Arora (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Updated Description

We should change two aspects of updating the topology for connection handshakes:

  1. Successful handshakes should not update the topology. This can be done by removing the withServerDescriptionCallback connection option. Having this functionality is actually harmful unless we also have the connection handshakes update the average RTT because it causes our RTT calculations to go askew. Also, it causes unnecessary lock contention because the description for each new connection handshake needs to be applied.
  2. Failed handshakes should update the topology more quickly. Currently, SDAM error handling for handshakes is done in Server.Connection. This should be done in connection.connect rather than waiting for the connection to be checked out because the error could be stale by that time.

Original Description

We have a few replsets that are configured to span multiple regions (A and B, about 100ms apart).  Even though we're using the latency read preference selector with a threshold of 0ms (i.e. select the closest node), we regularly observe requests from region A being routed to nodes in region B.  After adding some debug logging, I've noticed that AverageRTT is regularly reset.

 

I believe what's happening is that when a new connection to the server is needed, "connect" is called in topology/connection.go.  That calls "c.desc, err = handshaker.GetDescription(ctx, c.addr, handshakeConn)", and then "c.config.descCallback(c.desc)".  That callback is setup in "NewServer", via "s.pool, err = newPool(pc, withServerDescriptionCallback(callback, cfg.connectionOpts...)...)", with "callback := func(desc description.Server) { s.updateDescription(desc, false) }".

 

"updateDescription" calls into "updateTopologyCallback", which is set in "Connect" and "ConnectServer", which calls "t.apply(context.TODO(), desc)".  This then applies the update to the topology itself.

 

Unwinding, the issue is that the description returned by "handshaker.GetDescription" does not have AverageRTT set.  That implementation just calls "NewMaster(...).GetDescription()", which calls "IsMaster.Result(...)", and that does not set AverageRTT.

 

It seems like the fix should be to preserve AverageRTT in most cases, possibly in "replaceServer" or "updateDescription".  It's possible there are other fields in "description.Server" that should be preserved; as a start, it might make sense to reorder fields into "things copied from the result of isMaster", like Compression and WireVersion, and things updated out of band, like AverageRTT.



 Comments   
Comment by Divjot Arora (Inactive) [ 30/Jun/20 ]

bartle The planned release date for 1.3.5 is early next week. We generally do patch releases on a monthly cadence.

Comment by David Bartley [ 29/Jun/20 ]

ooc when do you plan to release 1.3.5?

Comment by Githook User [ 25/Jun/20 ]

Author:

{'name': 'Divjot Arora', 'email': 'divjot.arora@10gen.com', 'username': 'divjotarora'}

Message: GODRIVER-1658 Remove extraneous go.mod addition in backport
Branch: release/1.3
https://github.com/mongodb/mongo-go-driver/commit/6b450cf6e88d8a7e656c91751ae9c980f3a310bd

Comment by Githook User [ 25/Jun/20 ]

Author:

{'name': 'Divjot Arora', 'email': 'divjot.arora@10gen.com', 'username': 'divjotarora'}

Message: GODRIVER-1658 Fix connection updates to topology (#429)
Branch: release/1.3
https://github.com/mongodb/mongo-go-driver/commit/0eb881e680a929d592d6d1c5a2cd93f3f03c89e9

Comment by Githook User [ 25/Jun/20 ]

Author:

{'name': 'Divjot Arora', 'email': 'divjot.arora@10gen.com', 'username': 'divjotarora'}

Message: GODRIVER-1658 Fix connection updates to topology (#429)
Branch: master
https://github.com/mongodb/mongo-go-driver/commit/34d063f327ed6d3c01cc2aa4c55b0c075d61f8a0

Comment by Divjot Arora (Inactive) [ 22/Jun/20 ]

https://github.com/mongodb/mongo-go-driver/pull/429

Comment by Divjot Arora (Inactive) [ 18/Jun/20 ]

We should make two fixes for this ticket:

  1. Remove the functionality to update the topology during successful handshakes.
  2. Fix how we update the topology for failed handshakes. Currently, we implement the "before handshake completes" part of SDAM error handling (https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#network-error-when-reading-or-writing) in the Server.Connection method. However, if we encounter an error while creating connections for the MinPoolSize maintenance routine, we won't handle that error until the connection is actually checked out. We should fix this by having connection.connect pass the handshaking error to the associated Server.
Comment by Divjot Arora (Inactive) [ 18/Jun/20 ]

bartle I think that is the correct fix. The server discovery and monitoring specification says that drivers should use isMaster replies from application connection handshakes to update the topology, but most drivers don't do this because there's little gain. I was planning on removing this functionality anyway as part of GODRIVER-1489 and DRIVERS-1300 has been created to amend the specification as well. Because it's actually causing a bug, I'll do this and cherry-pick to the 1.3.x branch so it can go into the 1.3.5 release as well.

Comment by David Bartley [ 18/Jun/20 ]

I made a small change to remove the call to "withServerDescriptionCallback" and verified that almost no queries are being routed to the non-local region (I'm not sure that's the correct fix, I just wanted to validate my hypothesis).

Comment by David Bartley [ 18/Jun/20 ]

Forgot to mention this was on version 1.3.3 of the driver.

Generated at Thu Feb 08 08:36:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.