Java Driver / JAVA-481

Driver not retrying on Connection timed out SocketException

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Works as Designed
    • Affects Version/s: 2.6.5
    • Fix Version/s: None
    • Component/s: Performance
    • Operating System: Linux
    • # Replies: 7
    • Last comment by Customer: true

      Description

      I've got a MongoDB replica set across two data centers. In my second data center I have some servers that point back to the primary instance in data center 1.

      I ran into a connection timeout issue (this happens pretty consistently) on the server; here is the stack trace:

      com.mongodb.DBPortPool gotError
      WARNING: emptying DBPortPool to 10.240.110.42:27017 b/c of error
      java.net.SocketException: Connection timed out
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.read(SocketInputStream.java:129)
      at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
      at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
      at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
      at org.bson.io.Bits.readFully(Bits.java:35)
      at org.bson.io.Bits.readFully(Bits.java:28)
      at com.mongodb.Response.<init>(Response.java:39)
      at com.mongodb.DBPort.go(DBPort.java:123)
      at com.mongodb.DBPort.go(DBPort.java:82)
      at com.mongodb.DBPort.call(DBPort.java:72)
      at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:202)
      at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:303)
      at com.mongodb.DBCollection.findOne(DBCollection.java:565)
      at com.mongodb.DBCollection.findOne(DBCollection.java:554)

      Is it possible for the driver to attempt to recreate the connections and retry the query? It looks like the next query worked as expected.

      I do not see these errors on servers in the same data center as the primary MongoDB server.

      Note: latency between my app server and the primary MongoDB server is ~50 ms.

      THANKS!

        Activity

        Antoine Girbal added a comment -

        The driver never retries a read after a timeout exception.
        Retrying could lead to long delays and make the whole system unstable if the reason for the timeout is an overloaded db.
        Also, the retry feature only retries reads against a different server, so you have to use slaveOk to allow reads from a replica.

        Solutions for you:

        • use slaveOk so the driver will read from the closest server (though your app must accept eventual consistency)
        • retry manually in your app (a sketch of this follows below).
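
        A minimal sketch of the manual-retry approach, assuming the 2.x driver API seen in the stack trace (DBCollection.findOne, MongoException.Network); the class name, attempt count, and back-off are illustrative only, not part of the driver:

        import com.mongodb.DBCollection;
        import com.mongodb.DBObject;
        import com.mongodb.MongoException;

        public final class RetryingFind {

            // Hypothetical helper: retries findOne a few times when the driver
            // surfaces a network-level error such as the SocketException above.
            public static DBObject findOneWithRetry(DBCollection collection, DBObject query,
                                                    int maxAttempts, long backoffMillis) {
                MongoException.Network lastError = null;
                for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                    try {
                        return collection.findOne(query);
                    } catch (MongoException.Network e) {
                        lastError = e;  // the pool was already emptied; the next attempt gets a fresh connection
                        try {
                            Thread.sleep(backoffMillis * attempt);  // simple linear back-off
                        } catch (InterruptedException ie) {
                            Thread.currentThread().interrupt();
                            throw e;
                        }
                    }
                }
                throw lastError;  // give up after maxAttempts
            }
        }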
        Antoine Girbal added a comment -

        Feel free to follow up, though the mongodb-user group may be a better place to get quick answers.
        Thanks.

        John Danner added a comment -

        I believe the root cause of the issue I'm experiencing is a firewall closing down inactive connections. The driver isn't aware that a connection has gone bad until it tries to use it and the error is passed up to the application.

        I've updated the driver to periodically issue the ping command (on a configurable interval) to connections that have not been used in a configurable amount of time. I believe this will resolve the issue for me - would this patch be an acceptable addition to the default driver?

        The patch adds a monitoring thread that spins through the list of available DBPorts and checks their last use; if a port has been idle for over 15 minutes, the ping command is issued, which should keep the connection alive through a firewall or similar device.
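
        The patch itself is not attached to this ticket; the following is only a minimal application-level sketch of the same idea, assuming the 2.x DB.command API. The class name KeepAlivePinger and the scheduling details are hypothetical, and unlike the real patch it does not walk the driver's internal DBPort list - it simply issues a periodic ping so pooled connections see some traffic:

        import com.mongodb.BasicDBObject;
        import com.mongodb.DB;
        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;

        // Hypothetical application-level approximation of the patch described above.
        public final class KeepAlivePinger {

            private final ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();

            public void start(final DB db, long intervalMinutes) {
                scheduler.scheduleAtFixedRate(new Runnable() {
                    public void run() {
                        try {
                            db.command(new BasicDBObject("ping", 1));  // cheap server-side no-op
                        } catch (RuntimeException e) {
                            // A failed ping just means that connection was dead;
                            // the pool will replace it on the next use.
                        }
                    }
                }, intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
            }

            public void stop() {
                scheduler.shutdown();
            }
        }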

        John Danner added a comment -

        I should note that I'm already using the socketKeepAlive option within the driver, but the connection problem remains.
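
        For reference, a sketch of how that option is typically enabled with the 2.x driver; the class name and host are placeholders:

        import com.mongodb.Mongo;
        import com.mongodb.MongoOptions;
        import com.mongodb.ServerAddress;
        import java.net.UnknownHostException;

        public final class KeepAliveClient {
            public static Mongo create() throws UnknownHostException {
                MongoOptions options = new MongoOptions();
                options.socketKeepAlive = true;  // maps to Socket.setKeepAlive(true) on each connection
                // "primary.example.com" stands in for the replica-set primary's address.
                return new Mongo(new ServerAddress("primary.example.com", 27017), options);
            }
        }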

        Antoine Girbal added a comment -

        That should be the point of socketKeepAlive.
        It just does a socket.setKeepAlive(true) on the Java socket, which should then make TCP send heartbeats on the socket.
        It would be better if we could figure out why this is not fixing the firewall issue, rather than adding extra code / a thread that does it at the app layer.

        John Danner added a comment -

        I think the issue is that the default Linux keep-alive timeout is 2 hours before the first keep-alive packet is sent, while the default connection timeout on a Cisco ASA is 1 hour. This suggests that, despite setting the option, it won't really do anything in this situation/configuration.

        I'll adjust one of my system's keep-alive parameters to see if it fixes my issue - perhaps if someone else happens upon this ticket they will find it useful. If there is a desire for a more robust keep-alive that isn't reliant on system/firewall settings, I can submit this code as a patch.

        Brett Cave added a comment -

        This issue also occurs frequently in AWS EC2 (Amazon Web Services); we see it daily. When reviewing our MongoDB configuration for production, we came across the production checklist, which included a suggestion to drop the TCP_KEEPALIVE kernel setting to a lower value. When we dropped it to 5 minutes, the errors started occurring much more frequently. We have now set keepalive back to 7200 (at the OS level, not the driver level) to reduce the frequency of this occurring.


          People

          • Votes: 0
          • Watchers: 4

            Dates

            • Days since reply: 1 year, 3 days ago