[CSHARP-4417] Int overflow in connectionId Created: 16/Nov/22  Updated: 28/Oct/23  Resolved: 28/Nov/22

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: None
Fix Version/s: 2.19.0

Type: Bug Priority: Major - P3
Reporter: Kaio Henrique Assignee: James Kovacs
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
is caused by DRIVERS-2503 ConnectionId returned in heartbeats m... Implementing
Related
is related to CSHARP-4483 ConnectionId returned in heartbeats m... Closed
Case:
Backwards Compatibility: Minor Change

 Description   

Summary

Eventually the connectionId overflows and the driver can't connect
Driver version: 2.17.1

Driver error

MongoDB.Driver.MongoConnectionException: An exception occurred while opening a connection to the server.
 ---> System.ArgumentOutOfRangeException: Value is not greater than or equal to 0: -2147483648. (Parameter 'serverValue')
   at MongoDB.Driver.Core.Misc.Ensure.IsGreaterThanOrEqualToZero(Int32 value, String paramName)
   at MongoDB.Driver.Core.Connections.ConnectionId..ctor(ServerId serverId, Int32 localValue, Int32 serverValue)
   at MongoDB.Driver.Core.Connections.ConnectionId.WithServerValue(Int32 serverValue)

Server response

    Element: connectionId
                Type: Double (0x01)  <<<<<<<<<<<<--------- DOUBLE
                Value: 3228943842
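
The reported value 3228943842 is larger than Int32.MaxValue (2147483647). A minimal sketch of the failure mode, assuming the out-of-range numeric value is narrowed to Int32 with an unchecked cast (the result of such a cast is platform-dependent; int.MinValue is common on x64 and matches the -2147483648 in the exception above):

    using System;

    // connectionId as reported by the server: too large for a signed 32-bit int.
    double serverConnectionId = 3228943842;

    // An unchecked double-to-int cast of an out-of-range value is unspecified in
    // C#; common x64 runtimes produce int.MinValue (-2147483648), which then
    // trips the driver's "greater than or equal to 0" guard.
    int narrowed = unchecked((int)serverConnectionId);
    Console.WriteLine(narrowed); // typically -2147483648 on x64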

 Comments   
Comment by Githook User [ 28/Nov/22 ]

Author:

{'name': 'James Kovacs', 'email': 'jkovacs@post.harvard.edu', 'username': 'JamesKovacs'}

Message: CSHARP-4417: ConnectionId should use longs for LocalValue and ServerValue to match 64-bit integer connectionId returned from the server. (#974)
Branch: master
https://github.com/mongodb/mongo-csharp-driver/commit/4deb57af9370d1450d97ae947394f471fe85ffdb
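
Based on the commit message, the shape of the fix is roughly as follows (a simplified sketch, not the actual driver source; ServerId and other members are omitted):

    using System;

    // Sketch of the widened ConnectionId: LocalValue and ServerValue become
    // Int64 so that server-assigned ids above int.MaxValue no longer overflow.
    public sealed class ConnectionId
    {
        public ConnectionId(long localValue, long serverValue)
        {
            if (serverValue < 0)
            {
                throw new ArgumentOutOfRangeException(
                    nameof(serverValue),
                    $"Value is not greater than or equal to 0: {serverValue}.");
            }
            LocalValue = localValue;
            ServerValue = serverValue;
        }

        public long LocalValue { get; }   // was int before the fix
        public long ServerValue { get; }  // was int before the fix

        // A server value such as 3228943842 now round-trips without wrapping.
        public ConnectionId WithServerValue(long serverValue) =>
            new ConnectionId(LocalValue, serverValue);
    }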

Comment by James Kovacs [ 17/Nov/22 ]

You are correct that the ClusterRegistry currently shares the connection pools between MongoClient instances using the same ClusterKey (which is roughly equivalent to the same connection string). MongoClient instantiation does more work than simply establishing connection pools. So it is still a good idea to create the MongoClient once during application bootstrapping and cache it.
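
For illustration, a minimal caching pattern (a sketch; the connection string and class name are placeholders):

    using MongoDB.Driver;

    // Sketch: construct the MongoClient once at startup and reuse it everywhere.
    // MongoClient is thread-safe, so a single instance can serve the whole process.
    public static class MongoClientProvider
    {
        public static MongoClient Client { get; } =
            new MongoClient("mongodb://localhost:27017");
    }

    // Elsewhere in the application:
    // var collection = MongoClientProvider.Client
    //     .GetDatabase("orders")
    //     .GetCollection<Order>("orders");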

That said, you are correct that unless your service is restarting frequently, the same connections from the connection pools should be re-used. I would recommend investigating the root causes for connection churn including (but not limited to) restarts of microservices, high numbers of microservices, connection failures, connection timeouts, and more. This analysis is most easily done through server logs by looking at the number of connections established grouped by originating IP address. Once you know which applications are churning connections, you can then figure out why these applications are opening and closing connections so frequently.

Side Note: CSHARP-3431 will make the MongoClient disposable and likely change the ClusterRegistry behaviour, but that will not be released until the 3.0.0 driver. While caching your MongoClient instances may not have a large impact now (assuming the processes aren't restarting), we do recommend considering this change.

Comment by Kaio Henrique [ 17/Nov/22 ]

Hi, @James Kovacs
Thank you for the detailed answer.

Regarding the MongoClient, we actually see some microservices instantiating it constantly. But reviewing the SDK code, we saw that the connection pool is independent of the MongoClient instance, since it lives in the ClusterRegistry, which is static, and the Cluster object should not change as long as the connection string remains the same [reference]. Are our conclusions correct? We are a bit unsure whether this refactor would really be worth it.

Comment by James Kovacs [ 17/Nov/22 ]

Hi, kaio.henrique@agilecontent.com,

Thank you for reporting this issue. We have investigated and found an infrequent data type discrepancy between our Driver specifications and the server. Let me explain further...

Examining the server code base, we see that Client represents the remotely connected client (i.e., the driver). This class contains a ConnectionId, which is a C++ long long, a signed 64-bit integer.

typedef long long ConnectionId;

https://github.com/mongodb/mongo/blob/master/src/mongo/db/client.h#L60

hello.idl defines the hello response as containing connectionId as a safeInt64, further supporting your observation that the connectionId can be larger than an int32:

connectionId:
    type: safeInt64
    optional: true
    stability: stable

https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/hello.idl#L112-L115

Thus the server can return a signed 64-bit integer for the connectionId. However, the server will truncate the connectionId to a signed 32-bit integer when the value fits while writing the BSON response. If you start up a mongod (so that the connectionId will be a small integer) and run the hello command from a .NET/C# app (or any other driver), you will notice that the BSON response contains connectionId with a value of type BsonInt32.
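
You can observe this directly (a sketch assuming a freshly started local mongod on the default port):

    using System;
    using MongoDB.Bson;
    using MongoDB.Driver;

    var client = new MongoClient("mongodb://localhost:27017");
    var admin = client.GetDatabase("admin");

    // Run the hello command and inspect the wire type of connectionId.
    var reply = admin.RunCommand<BsonDocument>(new BsonDocument("hello", 1));
    var connectionId = reply["connectionId"];

    // On a freshly started server this prints Int32; once the server's counter
    // exceeds int.MaxValue, the value no longer fits and is sent as a wider type.
    Console.WriteLine($"connectionId = {connectionId} ({connectionId.BsonType})");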

In order to overflow a BsonInt32, you would need to create 100 new connections per second for 8 months at a sustained rate. If your connection churn rate is higher, then the time to overflow a BsonInt32 will be shorter. But churning 100 connections per second for months on end is a rather high churn rate. This could happen if you are not caching and reusing your MongoClient instance in your application or if you have a lot of short-lived microservices which are creating new connections at a very high rate. Note that this will have a detrimental impact on connection pooling and overall performance. Our first recommendation is that you ensure that you are caching your MongoClient instance on application startup and pooling connections effectively.
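
The arithmetic behind the eight-month figure (a back-of-the-envelope check):

    using System;

    // How long until int.MaxValue connections have been established at 100/s?
    const double connectionsPerSecond = 100;
    double seconds = (double)int.MaxValue / connectionsPerSecond; // ~21.5 million s
    double days = seconds / 86_400;                               // ~248.5 days
    Console.WriteLine($"{days:F0} days (~{days / 30:F1} months)");
    // => 249 days (~8.3 months), consistent with the estimate above.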

If you cannot avoid the apparently high connection churn, another option would be to restart the affected mongod instances. Because the server's connectionId counter resets on restart, restarting the affected mongod should mitigate the issue. If multiple servers in the cluster are affected, a rolling restart is recommended. In either case, the restart procedure can be repeated as often as needed until a fix is available and you have upgraded your driver.

Regarding your question about older driver versions, the Ensure.IsGreaterThanOrEqualToZero(serverValue, nameof(serverValue)) check has been present since .NET/C# Driver 2.0.0. Thus it is perplexing that your older components using Driver 2.7.3 are not affected. We encourage you to continue to work with our Technical Support Team to investigate further.
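
For reference, the guard in question behaves roughly like this (reconstructed from the exception message in the stack trace; not the actual driver source):

    using System;

    public static class Ensure
    {
        // Throws when value is negative; the .NET runtime appends
        // "(Parameter 'serverValue')" to the message, as seen in the stack trace.
        public static int IsGreaterThanOrEqualToZero(int value, string paramName)
        {
            if (value < 0)
            {
                throw new ArgumentOutOfRangeException(
                    paramName, $"Value is not greater than or equal to 0: {value}.");
            }
            return value;
        }
    }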

Sincerely,
James

Comment by Kaio Henrique [ 16/Nov/22 ]

In components using older versions of the driver (2.7.3), we are not detecting the issue. Is the SDK validating connectionId in recent versions?
