Uploaded image for project: 'Go Driver'
  1. Go Driver
  2. GODRIVER-1658

connections should update topology faster for failed handshakes




      Updated Description

      We should change two aspects of updating the topology for connection handshakes:

      1. Successful handshakes should not update the topology. This can be done by removing the withServerDescriptionCallback connection option. Having this functionality is actually harmful unless we also have the connection handshakes update the average RTT because it causes our RTT calculations to go askew. Also, it causes unnecessary lock contention because the description for each new connection handshake needs to be applied.
      2. Failed handshakes should update the topology more quickly. Currently, SDAM error handling for handshakes is done in Server.Connection. This should be done in connection.connect rather than waiting for the connection to be checked out because the error could be stale by that time.

      Original Description

      We have a few replsets that are configured to span multiple regions (A and B, about 100ms apart).  Even though we're using the latency read preference selector with a threshold of 0ms (i.e. select the closest node), we regularly observe requests from region A being routed to nodes in region B.  After adding some debug logging, I've noticed that AverageRTT is regularly reset.


      I believe what's happening is that when a new connection to the server is needed, "connect" is called in topology/connection.go.  That calls "c.desc, err = handshaker.GetDescription(ctx, c.addr, handshakeConn)", and then "c.config.descCallback(c.desc)".  That callback is setup in "NewServer", via "s.pool, err = newPool(pc, withServerDescriptionCallback(callback, cfg.connectionOpts...)...)", with "callback := func(desc description.Server) { s.updateDescription(desc, false) }".


      "updateDescription" calls into "updateTopologyCallback", which is set in "Connect" and "ConnectServer", which calls "t.apply(context.TODO(), desc)".  This then applies the update to the topology itself.


      Unwinding, the issue is that the description returned by "handshaker.GetDescription" does not have AverageRTT set.  That implementation just calls "NewMaster(...).GetDescription()", which calls "IsMaster.Result(...)", and that does not set AverageRTT.


      It seems like the fix should be to preserve AverageRTT in most cases, possibly in "replaceServer" or "updateDescription".  It's possible there are other fields in "description.Server" that should be preserved; as a start, it might make sense to reorder fields into "things copied from the result of isMaster", like Compression and WireVersion, and things updated out of band, like AverageRTT.




            divjot.arora Divjot Arora
            bartle David Bartley
            0 Vote for this issue
            2 Start watching this issue