[CSHARP-724] MongoServer.IsPrimary Would Error When a Failover Occurred Until the Old Primary Came Back Online Created: 12/Apr/13  Updated: 20/Mar/14  Resolved: 28/Jun/13

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: 1.8.1
Fix Version/s: 1.8.2

Type: Bug Priority: Minor - P4
Reporter: Christian KLAT Assignee: Robert Stam
Resolution: Done Votes: 0
Labels: replicaset
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File 0001-CSHARP-724-Major-Issue-in-the-Failover-scenario-Once.patch     Text File CSharp724Tests.cs     Text File MongoServerInstance.cs    
Issue Links:
Duplicate
is duplicated by CSHARP-755 MongoServer.Primary is not thread-saf... Closed
Related
is related to CSHARP-722 Provide a method to inspect the state... Closed

 Description   

The ServerInformation of a disconnected primary MongoServerInstance is never reset, and thus MongoServer.Primary is never Single().

The issue is in MongoServerInstance.cs: many of your actions start with a Ping() => Ping(MongoConnection connection), which throws on an exception, and thus the state is never reset.



 Comments   
Comment by Robert Stam [ 28/Jun/13 ]

The race condition in the MongoServer Primary property has been resolved by explicitly keeping track of the current primary (see the new _primary field in ReplicaSetMongoServerProxy) instead of trying to figure out the current primary by looking at a list that at some points in time has stale data. The staleness is unavoidable because we refresh on a timer.
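The idea of tracking the primary explicitly, rather than deriving it from a list that may be stale, can be illustrated with a minimal sketch. This is not the driver's code: `Member` and `PrimaryTracker` are hypothetical stand-ins invented for illustration, and the real `_primary` field in ReplicaSetMongoServerProxy may be maintained quite differently.

```csharp
using System;

// Hypothetical stand-in for a replica set member; NOT the driver's MongoServerInstance.
public sealed class Member
{
    public Member(string address) { Address = address; }
    public string Address { get; private set; }
    public bool IsPrimary { get; set; } // may be stale between timer refreshes
}

// Hypothetical tracker that records the primary explicitly instead of scanning members.
public sealed class PrimaryTracker
{
    private readonly object _lock = new object();
    private Member _primary; // single source of truth for the current primary

    public Member Primary
    {
        get { lock (_lock) { return _primary; } }
    }

    // Called whenever a member reports its own state.
    public void OnStateReported(Member member, bool reportsPrimary)
    {
        lock (_lock)
        {
            member.IsPrimary = reportsPrimary;
            if (reportsPrimary)
            {
                _primary = member; // the newest report wins
            }
            else if (ReferenceEquals(_primary, member))
            {
                _primary = null; // the tracked primary stepped down
            }
        }
    }
}
```

In this sketch, an old primary that drops off without reporting keeps its stale `IsPrimary` flag, but `Primary` still returns the member that most recently reported itself as primary, so a scan that expects exactly one flagged member is no longer needed.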

Comment by auto [ 28/Jun/13 ]

Author: rstam (robert@10gen.com)

Message: CSHARP-724: Now that we have more robust primary tracking use that information in more places. Also, reset state verification timer on old primary whenever we discover a new primary so that the new state of the old primary will be determined as soon as possible.
Branch: master
https://github.com/mongodb/mongo-csharp-driver/commit/c8b7e01c8a9bf164b0a1539f76e61d08a4c02d3c

Comment by auto [ 28/Jun/13 ]

Author: rstam (robert@10gen.com)

Message: CSHARP-724: Make MongoServer Primary property robust.
Branch: master
https://github.com/mongodb/mongo-csharp-driver/commit/108be7e8cfbb2b7d2b607799679dff50ae8b1961

Comment by auto [ 29/Apr/13 ]

Author: Sridhar Nanjundeswaran (sridhar@10gen.com), 2013-04-19T23:58:51Z

Message: CSHARP-724. Unset primary flag when instance disconnects or a new primary is elected
Branch: master
https://github.com/mongodb/mongo-csharp-driver/commit/b5a8b3d12f2df6fac586299a6336dbbc11b4555e

Comment by Christian KLAT [ 15/Apr/13 ]

No worries, I'll monitor the resolution of the Jira and in the meantime we'll use a home-baked implementation.
Thanks for your time.

Comment by Craig Wilson [ 15/Apr/13 ]

I agree that it is a regression and did not mean to imply we weren't going to correct the issue. Rather, what I'm stating is that this isn't a "major" issue as you claim. Internally, we failover properly and reads and writes are routed accurately as they should be. From my perspective, the only people who would be affected by this issue are those who are monitoring the Primary, Secondary, and Arbiter properties on MongoServer. Because of the failure to remove the "primary" designation from a disconnected member, there are 2 members that could have that designation. However, only one of these members has a state of "Connected".

I have created a more comprehensive low-level fix for this such that the properties in MongoServer work correctly. However, it isn't likely that this fix will be immediate as we don't release for every bug fixed. I don't see us releasing a 1.8.2 for at least 2 - 3 weeks at the earliest. Unless we see something more major than this, it will probably not be until 1.9.

I have provided you with 2 alternatives in the meantime. Either use the workaround stated above (use the Instances property and filter out non-Connected instances) or, even better, use the replSetGetStatus command to do your monitoring of the replica set.

Please correct me if this is something much more fundamental and horrible than what I'm seeing, hearing, and acknowledging. If this truly is a "major" issue that is affecting more than people using the MongoServer.Primary property, then we'd really like to know.

Once again, I apologize for the inconvenience this has caused you.

Comment by Christian KLAT [ 15/Apr/13 ]

To answer your question, we could filter on MongoServer.Instances and watch for the primary that is connected, but why bother when we just need to ensure that LookupServerInformation gets called even if the ping fails?
LookupServerInformation properly handles "Unavailable instances" and resets the state.

Comment by Christian KLAT [ 15/Apr/13 ]

Hi Craig, sorry I didn't explain the problem properly. The problem comes from your implementation in MongoServerInstance. It is a major issue. If you look at the code I provided, you'll notice a major difference between the old implementation and the new one in:

public void VerifyState()

{...}

internal void Connect(...)

The main goal is to ensure that if a Ping fails we still recompute the ServerInformation, and if that also fails, the ServerInformation is reset to:

IsArbiter = false,
IsMasterResult = isMasterResult,
IsPassive = false,
IsPrimary = false,
IsSecondary = false,

Simply put, we always want to be in sync with what the ping sees and not throw immediately after a Ping() fails, because throwing there leaves the state inconsistent.
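A minimal sketch of that behavior, assuming a simplified model rather than the real MongoServerInstance: `InstanceState`, its properties, and the injected `ping` and `lookupIsPrimary` delegates are all hypothetical names introduced here so the example runs without a server. The point it illustrates is only that a failed ping does not short-circuit the lookup, and a failed lookup clears the role flags instead of leaving them stale.

```csharp
using System;

// Hypothetical, simplified stand-in for per-instance state; NOT the driver's class.
public sealed class InstanceState
{
    public bool IsPrimary { get; private set; }
    public bool IsSecondary { get; private set; }
    public bool IsConnected { get; private set; }

    // ping and lookupIsPrimary are injected so the sketch needs no server.
    public void VerifyState(Action ping, Func<bool> lookupIsPrimary)
    {
        bool reachable = true;

        // A failed ping is recorded, but it does not abort the verification.
        try { ping(); }
        catch (Exception) { reachable = false; }

        try
        {
            // Always recompute the role, even after a failed ping.
            bool primary = lookupIsPrimary();
            IsPrimary = primary;
            IsSecondary = !primary;
        }
        catch (Exception)
        {
            // The lookup failed too: clear the role flags so they cannot go stale.
            reachable = false;
            IsPrimary = false;
            IsSecondary = false;
        }

        IsConnected = reachable;
    }
}
```

With this shape, a primary that is killed stops being reported as primary on the next verification instead of keeping the flag until it comes back.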

Otherwise, if the primary that was killed never comes back up, another SINGLE UNIQUE one is set up in the Primary property of MongoServer.cs (which relies on MongoServer.Instances).

Moreover, we've been using MongoDB and the 1.4.2 driver for more than a year, and the behavior was fully tested in a complex integration test. That test fails with drivers newer than 1.4.2, so this is a regression.

Comment by Craig Wilson [ 15/Apr/13 ]

Hi Christian. I can confirm this problem.

Just as an FYI, this isn't a problem internally with our connection pooling. We don't actually recognize 2 primaries and failover happens correctly. Internally, only 1 of these 2 instances is marked as connected. Hence, to get the primary, we really should be checking for each instance to be connected and marked as primary. This is the problem with the MongoServer.Primary property.

On a side note, these properties exist to publish our internal view of the replica set. It is entirely possible that our view of the world is not accurate with the state of the replica set. For instance, if we are simply unable to talk to one of the machines, we would show it in the state of disconnected and you would likely interpret that to mean "down". That isn't necessarily true. A better way to monitor the state of the replica set is to run the command "replSetGetStatus". CSHARP-722 is a ticket created for this very reason.

Given this explanation, are you seeing a major issue with connectivity, or is it just the property that is causing you problems? You can work around the property issue by using the MongoServer.Instances property, filtering the instances on a "Connected" state, and then choosing the primary.
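A sketch of that workaround against the 1.x driver API might look like the following. The helper name `FindConnectedPrimary` is made up for illustration, and the code has not been run against a failing replica set.

```csharp
using System.Linq;
using MongoDB.Driver;

public static class PrimaryLookup
{
    // Hypothetical helper: returns the instance that is both connected and
    // flagged as primary, or null if there is none.
    public static MongoServerInstance FindConnectedPrimary(MongoServer server)
    {
        return server.Instances
            .Where(i => i.State == MongoServerState.Connected && i.IsPrimary)
            .FirstOrDefault();
    }
}
```

FirstOrDefault is used instead of Single so that a transient state with zero connected primaries, as can happen during an election, returns null instead of throwing.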

Thanks for the tests and the patch. It definitely fixes this problem, but at a much deeper level than the workaround I described above. We will likely do something similar as a permanent fix for this issue. I'd still like to recommend using "replSetGetStatus" as the preferred way of checking the state of the replica set as it will be more accurate and less affected by our internal changes to how we monitor replica set state.
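For reference, running "replSetGetStatus" from the 1.x driver could be sketched roughly as below. The class and method names are illustrative only. The sketch assumes the command is issued against the admin database and that each entry of the response's members array carries name and stateStr fields, as the server documents for this command.

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;

public static class ReplicaSetStatus
{
    // Hypothetical helper: prints each member's address and the state the server reports.
    public static void Print(MongoServer server)
    {
        MongoDatabase admin = server.GetDatabase("admin");
        CommandResult result = admin.RunCommand("replSetGetStatus");

        foreach (BsonValue value in result.Response["members"].AsBsonArray)
        {
            BsonDocument member = value.AsBsonDocument;
            Console.WriteLine("{0}: {1}", member["name"].AsString, member["stateStr"].AsString);
        }
    }
}
```

The output reflects the replica set's own view rather than the driver's internal bookkeeping, which is the reason it is recommended here for monitoring.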

Comment by Christian KLAT [ 15/Apr/13 ]

Correction

Comment by Christian KLAT [ 15/Apr/13 ]

Here is a Test to validate ReplicaSet connectivity and Failover

Generated at Wed Feb 07 21:37:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.