[CSHARP-225] Strange NullReferenceException on acquiring connection Created: 07/May/11  Updated: 02/Apr/15  Resolved: 08/Jun/11

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: 1.1
Fix Version/s: 1.1

Type: Bug Priority: Major - P3
Reporter: Aristarkh Zagorodnikov Assignee: Robert Stam
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

I'm using a version built from source (git commit 1091cad...).



 Description   

Today, I got a strange exception:
System.NullReferenceException: Object reference not set to an instance of an object.
at MongoDB.Driver.MongoServer.AcquireConnection(MongoDatabase database, MongoServerInstance serverInstance) in mongodb-mongo-csharp-driver-1091cad\Driver\Core\MongoServer.cs:line 826
at MongoDB.Driver.MongoServer.AcquireConnection(MongoDatabase database, Boolean slaveOk) in mongodb-mongo-csharp-driver-1091cad\Driver\Core\MongoServer.cs:line 819
at MongoDB.Driver.Internal.MongoCursorEnumerator`1.AcquireConnection() in mongodb-mongo-csharp-driver-1091cad\Driver\Internal\MongoCursorEnumerator.cs:line 137
at MongoDB.Driver.Internal.MongoCursorEnumerator`1.GetFirst() in mongodb-mongo-csharp-driver-1091cad\Driver\Internal\MongoCursorEnumerator.cs:line 147
at MongoDB.Driver.Internal.MongoCursorEnumerator`1.MoveNext() in mongodb-mongo-csharp-driver-1091cad\Driver\Internal\MongoCursorEnumerator.cs:line 83
at MongoDB.Driver.MongoDatabase.GetCollectionNames() in mongodb-mongo-csharp-driver-1091cad\Driver\Core\MongoDatabase.cs:line 560
at MongoDB.Driver.MongoDatabase.CollectionExists(String collectionName) in mongodb-mongo-csharp-driver-1091cad\Driver\Core\MongoDatabase.cs:line 277

My code:
var connectionString = "mongodb://server1,server2/databasename?replicaSet=replicaSetName"
var urlBuilder = new MongoUrlBuilder(connectionString);
urlBuilder.SafeMode = SafeMode.True;
var db = MongoDatabase.Create(urlBuilder.ToMongoUrl());
db.CollectionExists("collectionName"); // Exception was thrown here

The replica set consists of three machines. At the time of failure their roles were as follows:
server1: primary
server2: secondary
server3: arbiter (it was not specified in the connection string)

The exception only appeared once (my code is called from a lazy initialization code that is from a web page, so when I refreshed the page, the code that failed got called another one time and this time it was OK). This is a first operation on the database, slaveOk is set to false, so it appears that there's a problem in ReplicaSetConnector.Connect() so that it returns before it's able to acquire connection to primary, or instance list or metadata becomes damaged so that MongoServer.Primary returns null even if there is an instance, to me this looks like a race condition.



 Comments   
Comment by Aristarkh Zagorodnikov [ 09/Jun/11 ]

I confirm that the problem is no more with git version 15fb7bb. Thanks for the fix.

Comment by Aristarkh Zagorodnikov [ 09/Jun/11 ]

My bad, I used 1091cad as a reference, and you already fixed it in ed96fc1 when fixing CSHARP-218.

Comment by Robert Stam [ 08/Jun/11 ]

I think those two lines are OK. At that point it's a local variable called connectionPool, which isn't assigned to this.connectionPool until near the end of the method, and only if the connection is successful.

Comment by Aristarkh Zagorodnikov [ 08/Jun/11 ]

Thanks, I'll check the fix tomorrow.
As for the other NRE reported, it's not only in MongoServerInstance.Disconnect(), but also in two places in MongoServerInstance.Connect() (lines 223 and 226, connectionPool can be null there if connection failed and Disconnect was called on the other thread).

Comment by Robert Stam [ 08/Jun/11 ]

Fixed.

Comment by Robert Stam [ 08/Jun/11 ]

Not vaguely described at all! The information you provided about different DNS names for the same host provided the vital clue to fix this.

I was able reproduce this on a single machine by the way. I created a replica set using the hostname of my development laptop, but then used "localhost" in the connection string.

Can you take a look at the fix I just committed and if you have time test again with the latest?

Thanks so much for all the effort you have put in to helping resolve this. It's very much appreciated.

Comment by Aristarkh Zagorodnikov [ 08/Jun/11 ]

Yes, the reason lies in host list. The connector marks the connected hosts as invalid (since they aren't in the host list provided by the isMaster command) and removes them. Too bad that ReplicaSetConnector.Connect() already returned at this point since it already got its primary and delegated other connections to be made in the background.

So, the source of the problem lies in same machines named differently (and having different IP addresses) in connection strings and replica set configuration. I will fix my configuration, but I think that driver should have some checks to prevent this transient "look we have primary ... whoops, now we don't ... or do we?" condition.

Also, please take not of my previous comment, I'm not sure if it deserves a separate issue, but it still should be fixed also.

Thanks for your attention to my vaguely described problems =)

Comment by Aristarkh Zagorodnikov [ 08/Jun/11 ]

As a side note, while trying to reproduce the problem locally, I encountered that MongoServerInstance.Connect(bool) has an NRE when connection fails in the finally block, and then in a subsequent catch block.

} finally

{ connectionPool.ReleaseConnection(connection); }

} catch

{ connectionPool.Close(); connectionPool = null; throw; }

This happens because connectionPool is set to null in MongoServerInstance.Disconnect() which is called by MongoConnection.HandleException() which is triggered by the error in execution the isMaster command on initial connect.

Comment by Aristarkh Zagorodnikov [ 08/Jun/11 ]

Looks like I uncovered at least a part of a reason for the error, and maybe IsPrimary isn't the main reason of the problem.
In our production data center we use internal DNS records to facilitate easy naming of machines. Also, since each server we emply has at least two NICs, we dedicate one NIC for internal communications, slicing the traffic in different kinds and isolating them within different VLAN to allow for per-kind bandwidth management. So, one machine has several addresses and several records in our datacenter internal DNS records, which look like "l2-1.hostname.internal". It looks like I made a mistake and used a DNS record that doesn't match the hostname which was used to configure the replica set.
replicaSet:PRIMARY> db.isMaster()
{
"setName" : "replicaSet",
"ismaster" : true,
"secondary" : false,
"hosts" : [
"l1-1.host-1.zone",
"l1-1.host-2.zone"
],
"arbiters" : [
"l1-1.host-3.zone"
],
"maxBsonObjectSize" : 16777216,
"ok" : 1
}
While connecting, though, I use the different host names (c1.host-1.zone, etc) that have different IP addresses (l1-1.host-1 resolves to 10.1.1.23, while c1.host-1 resolves to 10.1.253.23). MongoDB is running on both of these interfaces (it's intended), but the host list that is returned from the db.isMaster() is different from the one I provide in the connection string.

Comment by Aristarkh Zagorodnikov [ 08/Jun/11 ]

Robert, I'll try a few things later this week.

Comment by Robert Stam [ 08/Jun/11 ]

I modified the test program to call CollectionExists instead of Insert and still can't reproduce. I'm running all processes on a single machine so latency is about as low as it can get.

Could it be that the problem has been fixed since the version you are running with?

Comment by Aristarkh Zagorodnikov [ 07/Jun/11 ]

Looks like link latency is important here. It fails often when I run it in a production environment which has fast client and servers (2x quad-core Xeons, 16Gb of RAM each, <20% load average) with a dedicated 1Gbps link between client and server, but fails rarely when I do call it from my work notebook (i5 2.2Ghz, loaded with running programs) over 802.11g link to a cluster of slow machines (old servers used for testing, I think most powerful of them is an 1.8Ghz AMD Athlon with 4Gb of RAM).

Comment by Robert Stam [ 07/Jun/11 ]

Later today (after MongoNYC) I will modify the test program to call CollectionExists first. Thanks for the additional information.

Comment by Aristarkh Zagorodnikov [ 07/Jun/11 ]

It fails roughly 80% of the time on the initial (after an AppDomain restart) connect to a stable replicaset consisiting of three machines (one arbiter), which is running without any elections for about a month or two. As a side note, I would like to add that this error occurs only when the first executed MongoDB operation is CollectionExists(). It didn't fail even once when I was issuing an EnsureIndex as a first operation after a connection.

Comment by Robert Stam [ 06/Jun/11 ]

If it's not explainable by an election, can you think of ways to reproduce this?

I created a test program with 10000 threads, each of which inserts one document into a collection. In addition, every 100th thread calls server.Disconnect before calling Insert, just to force lots and lots of reconnections.

So far have been unable to reproduce.

Comment by Robert Stam [ 06/Jun/11 ]

Made a change so that instead of a NullReferenceException it now throws a MongoConnectionException with the message "Primary server not found".

The condition you ran into would occur naturally whenever a replica set was in the process of electing a new primary (because until the election is over there is no primary).

Are you sure your exception can't be explained by an election? I haven't yet found any possible race conditions (which doesn't mean it doesn't exist, of course).

Comment by Aristarkh Zagorodnikov [ 07/May/11 ]

Adding arbiter (server3) to connection string doesn't help, it still fails occasionally.

Comment by Aristarkh Zagorodnikov [ 07/May/11 ]

I tested it a little bit more. It failed two more times, out of about 10, definitely looks like a race condition to me.

Generated at Wed Feb 07 21:36:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.