[CSHARP-268] NullReferenceException on lost primary Created: 13/Jul/11  Updated: 02/Apr/15  Resolved: 01/Aug/11

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: 1.2
Fix Version/s: 1.2

Type: Bug Priority: Major - P3
Reporter: Aristarkh Zagorodnikov Assignee: Robert Stam
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

git version d7ce7f2ee560183d8031


Issue Links:
Duplicate
is duplicated by CSHARP-288 mongo-csharp-driver 1.1 NullReference... Closed
Related
related to CSHARP-294 Deadlock when connecting to a replica... Closed

 Description   

3-server replica set (primary, slave, arbiter). When the primary steps down, the next query (in fact, every subsequent query) fails with a NullReferenceException with the following stack trace:
MongoDB.Driver.DLL!MongoDB.Driver.MongoServerInstance.AcquireConnection(MongoDB.Driver.MongoDatabase database) Line 183 + 0x8 bytes
MongoDB.Driver.DLL!MongoDB.Driver.MongoServer.AcquireConnection(MongoDB.Driver.MongoDatabase database, bool slaveOk) Line 893 + 0xf bytes
MongoDB.Driver.DLL!MongoDB.Driver.MongoCursorEnumerator<System.__Canon>.AcquireConnection() Line 184 + 0x42 bytes
MongoDB.Driver.DLL!MongoDB.Driver.MongoCursorEnumerator<MongoDB.Bson.BsonDocument>.GetFirst() Line 194 + 0xc bytes
MongoDB.Driver.DLL!MongoDB.Driver.MongoCursorEnumerator<System.__Canon>.MoveNext() Line 126 + 0x8 bytes

It appears that connectionPool is null, probably because MongoServerInstance.Disconnect was called due to the connection error. This might be related to CSHARP-217 and CSHARP-233. I'll post more information later.



 Comments   
Comment by Aristarkh Zagorodnikov [ 02/Aug/11 ]

My default RS setup: primary, secondary, arbiter; slaveOk is false. When the primary goes down, all subsequent queries fail with InvalidOperationException: Server instance <host>:<port> is no longer connected. (The host and port match those of the server that went down.) Stack trace follows:

MongoDB.Driver.DLL!MongoDB.Driver.MongoServerInstance.AcquireConnection(MongoDB.Driver.MongoDatabase database) Line 200 C#
MongoDB.Driver.DLL!MongoDB.Driver.MongoServer.AcquireConnection(MongoDB.Driver.MongoDatabase database, bool slaveOk) Line 946 + 0xe bytes C#
MongoDB.Driver.DLL!MongoDB.Driver.MongoCursorEnumerator<System.__Canon>.AcquireConnection() Line 184 + 0x42 bytes C#
MongoDB.Driver.DLL!MongoDB.Driver.MongoCursorEnumerator<MongoDB.Bson.BsonDocument>.GetFirst() Line 194 + 0xc bytes C#
MongoDB.Driver.DLL!MongoDB.Driver.MongoCursorEnumerator<System.__Canon>.MoveNext() Line 126 + 0x8 bytes C#

Comment by Robert Stam [ 01/Aug/11 ]

Committed a fix for the deadlock. Added an additional lock (stateLock) to synchronize lower-level state changes that happen in response to events on multiple threads. The remaining MongoServer operations are synchronized with serverLock.

P.S. Still need to do some more testing on replica sets (especially failover), so there may be further changes forthcoming.

Comment by Robert Stam [ 01/Aug/11 ]

Fix had problems with deadlocks. More work needed.

Comment by Robert Stam [ 01/Aug/11 ]

Added locking to MongoServerInstance to make it thread safe. Simplified locking in MongoServer by using just one lock (multiple locks were confusing and had the potential for deadlock).

Comment by Aristarkh Zagorodnikov [ 13/Jul/11 ]

It also appears that the MongoServer.AcquireConnection(MongoDatabase, MongoServerInstance) call hierarchy is not lock-protected at all (it looks like it's used for cursors, since they "bind" to a specific server instance), so its calls to MongoServerInstance.AcquireConnection are also vulnerable to a race condition with MongoServerInstance.Disconnect calls.

Comment by Aristarkh Zagorodnikov [ 13/Jul/11 ]

Yes, MongoServerInstance.Disconnect is called and the primary-before-the-step-down MongoServerInstance becomes disconnected, although MongoServer.GetServerInstance keeps returning it using the MongoServer.Primary property.
I'm not sure what protocol the replica set connector follows to reconnect to the servers, but I believe that MongoServer.GetServerInstance should check that the server returned by the Primary property is connected (it already checks that slaves are connected if slaveOk is enabled) and probably treat a disconnected primary as a "no primary" case, behaving as if Primary returned null (for reasons that I can't explain even to myself, I don't think that the Primary property itself should check whether it's connected).
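
A minimal sketch of the check suggested above, to make the idea concrete. All type and member names here (ServerSketch, ServerInstanceSketch, InstanceState) are hypothetical simplifications, not the driver's actual classes; throwing InvalidOperationException for the "no primary" case is also an assumption.

```csharp
using System;

// Hypothetical, simplified stand-ins for the driver types discussed above;
// these are NOT the real MongoDB.Driver classes.
public enum InstanceState { Disconnected, Connected }

public sealed class ServerInstanceSketch
{
    public string Address { get; private set; }
    public InstanceState State { get; set; }

    public ServerInstanceSketch(string address, InstanceState state)
    {
        Address = address;
        State = state;
    }
}

public sealed class ServerSketch
{
    private readonly object serverLock = new object();

    // Last known primary; it may still point at an instance that has
    // since been disconnected (the situation described in this comment).
    public ServerInstanceSketch Primary { get; set; }

    public ServerInstanceSketch GetServerInstance()
    {
        lock (serverLock)
        {
            var primary = Primary;

            // Treat a disconnected primary exactly like a missing one.
            // The exception type is an assumption made for this sketch.
            if (primary == null || primary.State != InstanceState.Connected)
            {
                throw new InvalidOperationException("No connected primary is available.");
            }
            return primary;
        }
    }
}
```

Note that the Primary property itself stays a plain accessor here; only GetServerInstance applies the connectivity check, matching the split proposed in the comment.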

Also, I would like to note that it appears that MongoServerInstance.Disconnect and MongoServerInstance.AcquireConnection can be called concurrently (there are no consistent locks around all of these calls), so this might lead to a nasty race condition.
For example, suppose there is a network failure, so MongoConnection.SendMessage fails and calls MongoConnection.Disconnect. This path is protected only by the connection lock (MongoConnection.connectionLock). Meanwhile, some other thread performs a cursor enumeration that needs to acquire a connection, so MongoServer.AcquireConnection is called, which in turn calls GetServerInstance. This can lead to the same NRE, because AcquireConnection locks on MongoServer.serverLock, which is a different lock.
So, it appears that MongoServer performs connection management (acquiring connections) under MongoServer.serverLock, while disconnection is handled under MongoConnection.connectionLock only, and not all Disconnect call hierarchies are protected by MongoServer.serverLock:
1. ReplicaSetConnector.ProcessAdditionalResponsesWorkItem is called from a thread pool thread that does not do any locking
2. MongoConnection.SendMessage calls HandleException under connection lock only
3. MongoConnection.ReceiveMessage calls HandleException under connection lock only
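
One way to close the race described above is to have the disconnect and acquire paths take the same lock on the server instance. The following is only a sketch under that assumption: GuardedInstanceSketch, instanceLock, and the string-based "connections" are hypothetical and do not reflect the driver's real implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for MongoServerInstance; NOT the driver's code.
public sealed class GuardedInstanceSketch
{
    // Single lock shared by every method that reads or writes connectionPool.
    private readonly object instanceLock = new object();

    // Null while the instance is disconnected.
    private Queue<string> connectionPool;

    public void Connect()
    {
        lock (instanceLock)
        {
            connectionPool = new Queue<string>();
            connectionPool.Enqueue("connection-1");
        }
    }

    // Every disconnect path (send/receive error handling, background work items)
    // is assumed to funnel through this method, so the pool is only ever
    // cleared while holding instanceLock.
    public void Disconnect()
    {
        lock (instanceLock)
        {
            connectionPool = null;
        }
    }

    public string AcquireConnection()
    {
        lock (instanceLock)
        {
            // Checked under the same lock as Disconnect, so the pool cannot
            // become null between this check and its use below.
            if (connectionPool == null)
            {
                throw new InvalidOperationException("Server instance is no longer connected.");
            }

            // Reuse an idle connection if there is one, otherwise hand out a placeholder.
            return connectionPool.Count > 0 ? connectionPool.Dequeue() : "new-connection";
        }
    }
}
```

With this arrangement a caller that loses the race gets a descriptive InvalidOperationException instead of a NullReferenceException; it still has to decide whether to retry against a newly elected primary.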

Generated at Wed Feb 07 21:36:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.