[GODRIVER-1658] connections should update topology faster for failed handshakes Created: 18/Jun/20 Updated: 28/Oct/23 Resolved: 29/Jun/20 |
|
| Status: | Closed |
| Project: | Go Driver |
| Component/s: | Server Discovery and Monitoring |
| Affects Version/s: | None |
| Fix Version/s: | 1.3.5 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Bartley | Assignee: | Divjot Arora (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
Updated Description We should change two aspects of updating the topology for connection handshakes:
Original Description We have a few replsets that are configured to span multiple regions (A and B, about 100ms apart). Even though we're using the latency read preference selector with a threshold of 0ms (i.e. select the closest node), we regularly observe requests from region A being routed to nodes in region B. After adding some debug logging, I've noticed that AverageRTT is regularly reset.
I believe what's happening is that when a new connection to the server is needed, "connect" is called in topology/connection.go. That calls "c.desc, err = handshaker.GetDescription(ctx, c.addr, handshakeConn)", and then "c.config.descCallback(c.desc)". That callback is setup in "NewServer", via "s.pool, err = newPool(pc, withServerDescriptionCallback(callback, cfg.connectionOpts...)...)", with "callback := func(desc description.Server) { s.updateDescription(desc, false) }".
"updateDescription" calls into "updateTopologyCallback", which is set in "Connect" and "ConnectServer", which calls "t.apply(context.TODO(), desc)". This then applies the update to the topology itself.
Unwinding, the issue is that the description returned by "handshaker.GetDescription" does not have AverageRTT set. That implementation just calls "NewMaster(...).GetDescription()", which calls "IsMaster.Result(...)", and that does not set AverageRTT.
It seems like the fix should be to preserve AverageRTT in most cases, possibly in "replaceServer" or "updateDescription". It's possible there are other fields in "description.Server" that should be preserved; as a start, it might make sense to reorder fields into "things copied from the result of isMaster", like Compression and WireVersion, and things updated out of band, like AverageRTT. |
| Comments |
| Comment by Divjot Arora (Inactive) [ 30/Jun/20 ] |
|
bartle The planned release date for 1.3.5 is early next week. We generally do patch releases on a monthly cadence. |
| Comment by David Bartley [ 29/Jun/20 ] |
|
ooc when do you plan to release 1.3.5? |
| Comment by Githook User [ 25/Jun/20 ] |
|
Author: {'name': 'Divjot Arora', 'email': 'divjot.arora@10gen.com', 'username': 'divjotarora'}Message: |
| Comment by Githook User [ 25/Jun/20 ] |
|
Author: {'name': 'Divjot Arora', 'email': 'divjot.arora@10gen.com', 'username': 'divjotarora'}Message: |
| Comment by Githook User [ 25/Jun/20 ] |
|
Author: {'name': 'Divjot Arora', 'email': 'divjot.arora@10gen.com', 'username': 'divjotarora'}Message: |
| Comment by Divjot Arora (Inactive) [ 22/Jun/20 ] |
| Comment by Divjot Arora (Inactive) [ 18/Jun/20 ] |
|
We should make two fixes for this ticket:
|
| Comment by Divjot Arora (Inactive) [ 18/Jun/20 ] |
|
bartle I think that is the correct fix. The server discovery and monitoring specification says that drivers should use isMaster replies from application connection handshakes to update the topology, but most drivers don't do this because there's little gain. I was planning on removing this functionality anyway as part of |
| Comment by David Bartley [ 18/Jun/20 ] |
|
I made a small change to remove the call to "withServerDescriptionCallback" and verified that almost no queries are being routed to the non-local region (I'm not sure that's the correct fix, I just wanted to validate my hypothesis). |
| Comment by David Bartley [ 18/Jun/20 ] |
|
Forgot to mention this was on version 1.3.3 of the driver. |