  Java Driver / JAVA-3457

Gracefully handle mongos nodes exiting via mongodb+srv://

    • Type: New Feature
    • Resolution: Works as Designed
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Cluster Management
    • Labels: None

      We recently set up a shared cluster of mongos servers in Kubernetes using the fairly new mongodb+srv record support (https://www.mongodb.com/blog/post/mongodb-3-6-here-to-SRV-you-with-easier-replica-set-connections).

      In Kubernetes, when nodes enter a terminating state, they are removed from the SRV record, and DNS resolution for them no longer succeeds. Depending on configuration, they may still be able to handle connections for some time, until the pod has fully terminated.

      The MongoDB Java driver currently rescans SRV records every 60 seconds, and that interval is hardcoded.
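
      For readers unfamiliar with the lookup involved, here is a minimal sketch of resolving the SRV record behind a mongodb+srv:// host using the JDK's built-in JNDI DNS provider. The class and method names (SrvLookupSketch, toHostAndPort, resolve) are made up for illustration and this is not the driver's own code; the only assumptions are that the record is published under the _mongodb._tcp. prefix and that each value has the standard "priority weight port target" layout.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.InitialDirContext;

// Hypothetical illustration only; not part of the MongoDB Java driver.
final class SrvLookupSketch {

    // Converts one SRV value ("priority weight port target") into "host:port".
    static String toHostAndPort(String srvValue) {
        String[] parts = srvValue.trim().split("\\s+");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Unexpected SRV value: " + srvValue);
        }
        String target = parts[3];
        // DNS answers usually carry a trailing dot on the target name; drop it.
        if (target.endsWith(".")) {
            target = target.substring(0, target.length() - 1);
        }
        return target + ":" + parts[2];
    }

    // Looks up _mongodb._tcp.<clusterHost> and returns the advertised hosts.
    static List<String> resolve(String clusterHost) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        env.put(Context.PROVIDER_URL, "dns:");
        InitialDirContext context = new InitialDirContext(env);
        try {
            Attribute srv = context
                    .getAttributes("_mongodb._tcp." + clusterHost, new String[] {"SRV"})
                    .get("SRV");
            List<String> hosts = new ArrayList<>();
            if (srv != null) {
                NamingEnumeration<?> values = srv.getAll();
                while (values.hasMore()) {
                    hosts.add(toHostAndPort((String) values.next()));
                }
            }
            return hosts;
        } finally {
            context.close();
        }
    }

    public static void main(String[] args) {
        // Parses a sample value without touching the network.
        System.out.println(toHostAndPort("0 0 27017 mongos-0.example.com."));
    }
}
```

      A host that has dropped out of this answer stays in the driver's host list until the next rescan, which is the window discussed in this ticket.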

       

      When a mongos pod begins terminating, there is a gap of up to 60 seconds during which, to my understanding, the Java driver can hit failures through the following path.

       

      1. The MongoDB Java driver selects a random host from the known available hosts. Assume it has chosen a recently terminated host.
      2. If the connection pool needs to open a new connection, the driver does a DNS lookup on the host.
      3. The DNS lookup fails for the recently shut-down host. This throws an exception that invalidates all active connections to this host, including connections that are still working.
      4. Until the SrvRecordMonitor refreshes its host pool, every query has a 1/pool_size chance of failing, where pool_size is the number of known hosts, because server selection is random. Operation retries don't fully handle the failure, but they reduce the chance of query failure to (1/pool_size)^retry_count.
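
      The arithmetic in step 4 can be sketched as below. This is a back-of-the-envelope model, not driver code: it assumes every attempt picks a host uniformly at random and independently of earlier attempts, and it counts attempts (the first try plus any retries) rather than retries alone. StaleHostFailureOdds and failureProbability are hypothetical names.

```java
// Hypothetical illustration only; not part of the MongoDB Java driver.
final class StaleHostFailureOdds {

    // Probability that every one of `attempts` independent, uniformly random
    // host selections lands on one of the `staleHosts` terminated hosts.
    static double failureProbability(int hostCount, int staleHosts, int attempts) {
        if (hostCount <= 0 || staleHosts < 0 || staleHosts > hostCount || attempts < 1) {
            throw new IllegalArgumentException("invalid arguments");
        }
        return Math.pow((double) staleHosts / hostCount, attempts);
    }

    public static void main(String[] args) {
        // Four known mongos hosts, one of them already terminated.
        System.out.println(failureProbability(4, 1, 1)); // single attempt: 0.25
        System.out.println(failureProbability(4, 1, 2)); // one retry: 0.0625
    }
}
```

      Even with a retry, a small cluster still sees a noticeable share of failed operations during the window, which is why retries alone do not close the gap.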

       

      There seem to be a couple of potential mechanisms for improving this. When using mongodb+srv, I can imagine blacklisting hosts that have experienced DNS failures until the next SRV refresh, but there seem to be several reasonable options.
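
      To make the blacklisting idea concrete, here is a rough sketch of one possible shape for it. Nothing here exists in the driver: DnsFailureExclusionList and its methods are hypothetical, and the sketch assumes the caller reports DNS failures and SRV refreshes itself. Hosts that failed DNS resolution are skipped during selection, and the list is cleared on each SRV refresh because the fresh record is then the authority on which hosts exist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; not part of the MongoDB Java driver.
final class DnsFailureExclusionList {

    // Hosts whose DNS lookup failed since the last SRV refresh.
    private final Set<String> excluded = ConcurrentHashMap.newKeySet();

    // Called when opening a connection to `host` failed at the DNS lookup step.
    void recordDnsFailure(String host) {
        excluded.add(host);
    }

    // Called after each SRV rescan; the refreshed record supersedes the exclusions.
    void onSrvRefresh() {
        excluded.clear();
    }

    // Returns the hosts eligible for selection. If every host is excluded,
    // falls back to the full list rather than leaving nothing to select.
    List<String> selectableHosts(List<String> knownHosts) {
        List<String> candidates = new ArrayList<>();
        for (String host : knownHosts) {
            if (!excluded.contains(host)) {
                candidates.add(host);
            }
        }
        return candidates.isEmpty() ? new ArrayList<>(knownHosts) : candidates;
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList("mongos-0:27017", "mongos-1:27017", "mongos-2:27017");
        DnsFailureExclusionList exclusions = new DnsFailureExclusionList();

        exclusions.recordDnsFailure("mongos-1:27017");
        System.out.println(exclusions.selectableHosts(known)); // [mongos-0:27017, mongos-2:27017]

        exclusions.onSrvRefresh();
        System.out.println(exclusions.selectableHosts(known)); // all three hosts again
    }
}
```

      Falling back to the full list when everything is excluded is a deliberate choice in this sketch: a resolver outage on the client side should not leave the driver with no hosts to try.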

       

      We'd be happy to contribute a patch here if there's an agreed-upon handling strategy for us to pursue.

       

            Assignee:
            jeff.yemin@mongodb.com Jeffrey Yemin
            Reporter:
            bpicolo@squarespace.com Ben Picolo
            Votes:
            0
            Watchers:
            7

              Created:
              Updated:
              Resolved: