[CXX-291] 5 second connection timeout not working Created: 15/May/13  Updated: 24/Sep/14  Resolved: 24/Sep/14

Status: Closed
Project: C++ Driver
Component/s: Implementation
Affects Version/s: None
Fix Version/s: legacy-1.0.0-rc1

Type: Bug Priority: Minor - P4
Reporter: Lee Dixon Assignee: Tyler Brock
Resolution: Done Votes: 0
Labels: legacy-cxx
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 12.04 LTS 64-bit


Attachments: File sock.cpp    

 Description   

The DBClientConnection::connect() call reports that it has a fixed 5-second timeout. On my system, this turns out to be 60 seconds.

Looking at the util/net/sock.cpp Socket::connect() call reveals the cause: the method used to enforce the 5 second timeout is a close() of the socket, which (on certain systems) causes the connect() to fail. At least on my system this is not the case. Instead, the connect() call continues and eventually times out with the system's timeout (60 seconds).

The appropriate way to do this timeout is using a select() call on the socket. Attached is a revised version of sock.cpp that implements the 5 second timeout appropriately, and works on my system.
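For readers without the attachment, the general technique can be sketched as follows. This is not the attached sock.cpp; it is a minimal illustration of a non-blocking connect() followed by select(), assuming an IPv4 literal address (no DNS lookup) and a hypothetical helper name, connect_with_timeout.

```cpp
#include <arpa/inet.h>
#include <cerrno>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper (not part of the driver): connects to an IPv4 address
// and gives up after timeout_secs. Returns a connected fd, or -1 on failure.
int connect_with_timeout(const char* ipv4, unsigned short port, int timeout_secs) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    // select() can only watch descriptors numbered below FD_SETSIZE.
    if (fd >= FD_SETSIZE) { close(fd); return -1; }

    sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1) { close(fd); return -1; }

    // Switch to non-blocking mode so connect() returns immediately.
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) { close(fd); return -1; }

    int rc = connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    if (rc < 0 && errno != EINPROGRESS) { close(fd); return -1; }

    if (rc < 0) {
        // Connection in progress: wait until the socket is writable
        // or the timeout expires.
        fd_set wfds;
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        timeval tv;
        tv.tv_sec = timeout_secs;
        tv.tv_usec = 0;
        int ready = select(fd + 1, NULL, &wfds, NULL, &tv);
        if (ready <= 0) { close(fd); return -1; }  // 0 = timed out, <0 = error

        // Writable does not imply connected; SO_ERROR holds the real outcome.
        int err = 0;
        socklen_t len = sizeof(err);
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0) {
            close(fd);
            return -1;
        }
    }

    // Restore the original (blocking) mode for the caller.
    fcntl(fd, F_SETFL, flags);
    return fd;
}
```

A caller would use something like connect_with_timeout("192.0.2.1", 27017, 5) and treat -1 as a failed or timed-out attempt. The sketch does not retry select() on EINTR, which production code would likely want.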



 Comments   
Comment by Tyler Brock [ 24/Sep/14 ]

Also note that against localhost it fails immediately so it must be run against a non-local host.

Comment by Lee Dixon [ 24/Sep/14 ]

Good to hear it. I'll try upgrading to the latest version as my schedule allows. Thanks for looking into this!

Comment by Tyler Brock [ 24/Sep/14 ]

I'm closing this out as "works as designed". If we can find evidence to the contrary, I'll re-open and address it, but it looks like the timeout works now and will continue to work in the future.

Comment by Tyler Brock [ 24/Sep/14 ]

Hey Lee,

The new C++ driver will work in the way you describe (using a combination of non-blocking sockets and select or poll) as it will wrap the new C driver that currently functions that way.

On the other hand, I wrote a test program and cannot reproduce the behavior you are seeing on Ubuntu 12.04:

#include <cstdlib>
#include <ctime>
#include <exception>
#include <iostream>
#include <string>

#include "mongo/client/dbclient.h"

using namespace mongo;
using namespace std;

int main() {
        string error_msg;
        time_t start = time(NULL);  // record when the attempt begins

        try {
                DBClientConnection conn(false, NULL, 5);
                conn.connect("fake.com:27017", error_msg);
        } catch (const exception &e) {
                // catch the connect failure
        }

        // report the error and how long the failed attempt took
        cout << "Error msg: " << error_msg << endl;
        cout << "Connection failure time: " << difftime(time(NULL), start) << " seconds" << endl;

        return EXIT_SUCCESS;
}

The program produces the following output when compiled and run on Ubuntu 12.04:

Error msg: couldn't connect to server fake.com:27017 (217.69.40.156), connection attempt failed
Connection failure time: 5 seconds

Comment by Lee Dixon [ 19/Aug/14 ]

Final note: they do suggest using poll() if you need to stay away from select().

Regardless of how it's done, if the newer versions of the MongoDB client code can do as advertised, namely timeout after 5 seconds when the server is not available, then I'd consider this issue resolved.

Comment by Tyler Brock [ 19/Aug/14 ]

Right, and the FD_SETSIZE limit applies to the highest file descriptor number, not to the count of open FDs. Often the two are the same, but sometimes they are not.

You are most likely correct that "frequently" is too strong a word for how often external clients create FDs, but keep in mind that this code is used inside mongos, which does frequently open tens of thousands of connections for large clusters. This code was created for that environment and, if changed upstream, would need to continue to work there.
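The distinction between descriptor number and descriptor count can be shown with a small sketch. Both helper names are hypothetical; the example relies only on fcntl(F_DUPFD), which returns the lowest free descriptor number at or above the requested minimum.

```cpp
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>

// Hypothetical helper: true if fd can safely be stored in an fd_set.
bool usable_with_select(int fd) {
    return fd >= 0 && fd < FD_SETSIZE;
}

// Hypothetical helper: duplicates fd onto the lowest free descriptor number
// that is >= min_number. A process with only a handful of open files can
// therefore still hold a high-numbered descriptor.
int dup_at_or_above(int fd, int min_number) {
    return fcntl(fd, F_DUPFD, min_number);
}
```

A process that calls dup_at_or_above(fd, FD_SETSIZE), with its open-file limit raised far enough to allow it, would hold a descriptor that usable_with_select() rejects even though only a few files are open.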

Comment by Lee Dixon [ 19/Aug/14 ]

Ahhh, I see what you're talking about now: FD_SETSIZE is usually 1024. But to hit that limit, the client code would need to have opened 1023 file descriptors before this happens. I doubt clients are "frequently" opening that many sockets. Going above that requires the system administrator to override the 1024 sockets-per-process limit.

https://docs.fedoraproject.org/en-US/Fedora_Security_Team//html/Defensive_Coding/sect-Defensive_Coding-Tasks-Descriptors-Limit.html

Comment by Lee Dixon [ 19/Aug/14 ]

This was affecting me by blocking for 60 seconds instead of the reported five seconds. My test is for when the database server is not available at all. Having the connection block for that long means that a user has to wait 60 seconds to figure out that perhaps he mistyped an IP address in my application.

And select() calls are definitely valid for file descriptors above 1024; if they weren't, hardly anyone could use select(). See my modified sock.cpp; it uses the select() call successfully. The select() call is a vital part of socket programming.

(Let me recant: I'm not sure about the 1024 limit, but I've never had an issue with using a select() call... I don't see that mentioned in the select() man page)

Comment by Tyler Brock [ 19/Aug/14 ]

Hey Lee,

I just got a chance to look at this again. I believe the reason we use a background thread + close here, as opposed to a non-blocking connect + select on the file descriptor, is the limitations of select. The host operating system frequently returns MongoDB clients socket file descriptors numbered above 1024, which makes select unsuitable for use.

For legacy versions of the driver, this code is unlikely to change in order to facilitate backports from the server.

Can you help me understand how this is affecting you? How is this surfacing in your application? Is the connection eventually succeeding (taking longer than 5 seconds) but being closed upon success? I want to be sure I understand the motivations behind a more involved fix before embarking on a likely platform-specific solution.

Comment by Lee Dixon [ 30/Jul/14 ]

Edited version of sock.cpp from version 2.4.3 which corrects the connection timeout from 60 seconds to 5.

Comment by Lee Dixon [ 30/Jul/14 ]

Tyler,
No, I don't have sample code, but it should be simple: just create a DBClientConnection and call connect(server-string, errmsg) and specify a server that isn't running.

I'll attach my sock.cpp (don't know why it's not here), but be sure to compare it to the 2.4.3 version of the linux CPP client (I got it from here: http://downloads.mongodb.org/cxx-driver/mongodb-linux-x86_64-2.4.3.tgz)

Maybe it's been fixed over the past year; I haven't tried new versions of the client yet.

Comment by Tyler Brock [ 30/Jul/14 ]

Hi Lee, I'm sorry you are having trouble. I don't see any attached code on this ticket.

Can you provide a sample application that demonstrates the bad behavior? I'm going to try to reproduce it myself and look at the code, but I wanted to see if you had already done so.
