[CSHARP-225] Strange NullReferenceException on acquiring connection Created: 07/May/11 Updated: 02/Apr/15 Resolved: 08/Jun/11 |
|
| Status: | Closed |
| Project: | C# Driver |
| Component/s: | None |
| Affects Version/s: | 1.1 |
| Fix Version/s: | 1.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Aristarkh Zagorodnikov | Assignee: | Robert Stam |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
I'm using a version built from source (git commit 1091cad...). |
||
| Description |
|
Today, I got a strange exception: My code: The replica set consists of three machines. At the time of failure their roles were as follows: The exception only appeared once (my code is called from a lazy initialization code that is from a web page, so when I refreshed the page, the code that failed got called another one time and this time it was OK). This is a first operation on the database, slaveOk is set to false, so it appears that there's a problem in ReplicaSetConnector.Connect() so that it returns before it's able to acquire connection to primary, or instance list or metadata becomes damaged so that MongoServer.Primary returns null even if there is an instance, to me this looks like a race condition. |
| Comments |
| Comment by Aristarkh Zagorodnikov [ 09/Jun/11 ] |
|
I confirm that the problem is no more with git version 15fb7bb. Thanks for the fix. |
| Comment by Aristarkh Zagorodnikov [ 09/Jun/11 ] |
|
My bad, I used 1091cad as a reference, and you already fixed it in ed96fc1 when fixing |
| Comment by Robert Stam [ 08/Jun/11 ] |
|
I think those two lines are OK. At that point it's a local variable called connectionPool, which isn't assigned to this.connectionPool until near the end of the method, and only if the connection is successful. |
| Comment by Aristarkh Zagorodnikov [ 08/Jun/11 ] |
|
Thanks, I'll check the fix tomorrow. |
| Comment by Robert Stam [ 08/Jun/11 ] |
|
Fixed. |
| Comment by Robert Stam [ 08/Jun/11 ] |
|
Not vaguely described at all! The information you provided about different DNS names for the same host provided the vital clue to fix this. I was able reproduce this on a single machine by the way. I created a replica set using the hostname of my development laptop, but then used "localhost" in the connection string. Can you take a look at the fix I just committed and if you have time test again with the latest? Thanks so much for all the effort you have put in to helping resolve this. It's very much appreciated. |
| Comment by Aristarkh Zagorodnikov [ 08/Jun/11 ] |
|
Yes, the reason lies in host list. The connector marks the connected hosts as invalid (since they aren't in the host list provided by the isMaster command) and removes them. Too bad that ReplicaSetConnector.Connect() already returned at this point since it already got its primary and delegated other connections to be made in the background. So, the source of the problem lies in same machines named differently (and having different IP addresses) in connection strings and replica set configuration. I will fix my configuration, but I think that driver should have some checks to prevent this transient "look we have primary ... whoops, now we don't ... or do we?" condition. Also, please take not of my previous comment, I'm not sure if it deserves a separate issue, but it still should be fixed also. Thanks for your attention to my vaguely described problems =) |
| Comment by Aristarkh Zagorodnikov [ 08/Jun/11 ] |
|
As a side note, while trying to reproduce the problem locally, I encountered that MongoServerInstance.Connect(bool) has an NRE when connection fails in the finally block, and then in a subsequent catch block. } finally { connectionPool.ReleaseConnection(connection); }} catch { connectionPool.Close(); connectionPool = null; throw; }This happens because connectionPool is set to null in MongoServerInstance.Disconnect() which is called by MongoConnection.HandleException() which is triggered by the error in execution the isMaster command on initial connect. |
| Comment by Aristarkh Zagorodnikov [ 08/Jun/11 ] |
|
Looks like I uncovered at least a part of a reason for the error, and maybe IsPrimary isn't the main reason of the problem. |
| Comment by Aristarkh Zagorodnikov [ 08/Jun/11 ] |
|
Robert, I'll try a few things later this week. |
| Comment by Robert Stam [ 08/Jun/11 ] |
|
I modified the test program to call CollectionExists instead of Insert and still can't reproduce. I'm running all processes on a single machine so latency is about as low as it can get. Could it be that the problem has been fixed since the version you are running with? |
| Comment by Aristarkh Zagorodnikov [ 07/Jun/11 ] |
|
Looks like link latency is important here. It fails often when I run it in a production environment which has fast client and servers (2x quad-core Xeons, 16Gb of RAM each, <20% load average) with a dedicated 1Gbps link between client and server, but fails rarely when I do call it from my work notebook (i5 2.2Ghz, loaded with running programs) over 802.11g link to a cluster of slow machines (old servers used for testing, I think most powerful of them is an 1.8Ghz AMD Athlon with 4Gb of RAM). |
| Comment by Robert Stam [ 07/Jun/11 ] |
|
Later today (after MongoNYC) I will modify the test program to call CollectionExists first. Thanks for the additional information. |
| Comment by Aristarkh Zagorodnikov [ 07/Jun/11 ] |
|
It fails roughly 80% of the time on the initial (after an AppDomain restart) connect to a stable replicaset consisiting of three machines (one arbiter), which is running without any elections for about a month or two. As a side note, I would like to add that this error occurs only when the first executed MongoDB operation is CollectionExists(). It didn't fail even once when I was issuing an EnsureIndex as a first operation after a connection. |
| Comment by Robert Stam [ 06/Jun/11 ] |
|
If it's not explainable by an election, can you think of ways to reproduce this? I created a test program with 10000 threads, each of which inserts one document into a collection. In addition, every 100th thread calls server.Disconnect before calling Insert, just to force lots and lots of reconnections. So far have been unable to reproduce. |
| Comment by Robert Stam [ 06/Jun/11 ] |
|
Made a change so that instead of a NullReferenceException it now throws a MongoConnectionException with the message "Primary server not found". The condition you ran into would occur naturally whenever a replica set was in the process of electing a new primary (because until the election is over there is no primary). Are you sure your exception can't be explained by an election? I haven't yet found any possible race conditions (which doesn't mean it doesn't exist, of course). |
| Comment by Aristarkh Zagorodnikov [ 07/May/11 ] |
|
Adding arbiter (server3) to connection string doesn't help, it still fails occasionally. |
| Comment by Aristarkh Zagorodnikov [ 07/May/11 ] |
|
I tested it a little bit more. It failed two more times, out of about 10, definitely looks like a race condition to me. |