[CSHARP-724] MongoServer.IsPrimary Would Error When a Failover Occurred Until the Old Primary Came Back Online Created: 12/Apr/13 Updated: 20/Mar/14 Resolved: 28/Jun/13 |
|
| Status: | Closed |
| Project: | C# Driver |
| Component/s: | None |
| Affects Version/s: | 1.8.1 |
| Fix Version/s: | 1.8.2 |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | Christian KLAT | Assignee: | Robert Stam |
| Resolution: | Done | Votes: | 0 |
| Labels: | replicaset | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
A disconnected primary MongoServerInstance's ServerInformation is never reset, so MongoServer.Primary can never resolve to a single instance (its Single()-style lookup fails). The issue is in MongoServerInstance.cs: many of the actions start with Ping() => Ping(MongoConnection connection), and when that throws an exception the state is never reset. |
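To illustrate the symptom (a minimal sketch, not the driver's code; the InstanceSketch type and host names are made up): once a downed primary's stale information is never cleared, a Single()-style lookup over the instance list throws, because two instances still claim the primary role.

```csharp
using System;
using System.Linq;

// Minimal stand-in for MongoServerInstance; the real driver type has many more members.
class InstanceSketch
{
    public string Address { get; set; }
    public bool IsPrimary { get; set; } // stale "true" is never reset after the node goes down
}

class PrimaryLookupSketch
{
    static void Main()
    {
        var instances = new[]
        {
            new InstanceSketch { Address = "rs0-a:27017", IsPrimary = true }, // old primary, now down, flag never cleared
            new InstanceSketch { Address = "rs0-b:27017", IsPrimary = true }, // newly elected primary
        };

        // Throws InvalidOperationException: two instances report IsPrimary after the failover.
        var primary = instances.Single(i => i.IsPrimary);
        Console.WriteLine(primary.Address);
    }
}
```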
| Comments |
| Comment by Robert Stam [ 28/Jun/13 ] |
|
The race condition in the MongoServer Primary property has been resolved by explicitly keeping track of the current primary (see the new _primary field in ReplicaSetMongoServerProxy) instead of trying to figure out the current primary by looking at a list that at some points in time has stale data. The staleness is unavoidable because we refresh on a timer. |
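A rough sketch of that idea, assuming a state-changed callback per instance (only the _primary field name comes from the comment above; everything else is invented for illustration and is not the driver's actual code):

```csharp
// Sketch only: track the current primary explicitly instead of deriving it
// from a list that may briefly contain stale entries between refreshes.
class ReplicaSetProxySketch
{
    class Instance
    {
        public string Address { get; set; }
        public bool IsPrimary { get; set; }
        public bool IsConnected { get; set; }
    }

    readonly object _lock = new object();
    Instance _primary; // the single instance currently believed to be primary

    // Would be called from the per-instance state-changed handler (assumed name).
    void OnInstanceStateChanged(Instance instance)
    {
        lock (_lock)
        {
            if (instance.IsConnected && instance.IsPrimary)
            {
                _primary = instance; // a (possibly new) primary was observed
            }
            else if (ReferenceEquals(_primary, instance))
            {
                _primary = null; // the tracked primary stepped down or disconnected
            }
        }
    }

    // The property reads the tracked reference instead of calling Single() over a stale list.
    public Instance Primary
    {
        get { lock (_lock) { return _primary; } }
    }
}
```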
| Comment by auto [ 28/Jun/13 ] |
|
Author: rstam <robert@10gen.com>. Message: (empty) |
| Comment by auto [ 28/Jun/13 ] |
|
Author: rstam <robert@10gen.com>. Message: (empty) |
| Comment by auto [ 29/Apr/13 ] |
|
Author: Sridhar Nanjundeswaran <sridhar@10gen.com> (2013-04-19T23:58:51Z). Message: (empty) |
| Comment by Christian KLAT [ 15/Apr/13 ] |
|
No worries, I'll monitor the resolution of the Jira and in the meantime we'll use a home-baked implementation. |
| Comment by Craig Wilson [ 15/Apr/13 ] |
|
I agree that it is a regression and did not mean to imply we weren't going to correct the issue. Rather, what I'm stating is that this isn't a "major" issue as you claim. Internally, we fail over properly and reads and writes are routed accurately as they should be. From my perspective, the only people who would be affected by this issue are those who are monitoring the Primary, Secondary, and Arbiter properties on MongoServer. Because of the failure to remove the "primary" designation from a disconnected member, there are 2 members that could have that designation. However, only one of these members has a state of "Connected". I have created a more comprehensive low-level fix for this such that the properties in MongoServer work correctly. However, it isn't likely that this fix will be immediate, as we don't release for every bug fixed. I don't see us releasing a 1.8.2 for at least 2-3 weeks at the earliest. Unless we see something more major than this, it will probably not be until 1.9. I have provided you with 2 alternatives in the meantime. Either use the workaround stated above by using the Instances property and filtering out non-Connected instances, or, even better, use the replSetGetStatus command to do your monitoring of the replica set. Please correct me if this is something much more fundamental and horrible than what I'm seeing, hearing, and acknowledging. If this truly is a "major" issue that is affecting more than the people using the MongoServer.Primary property, then we'd really like to know. Once again, I apologize for the inconvenience this has caused you. |
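The first workaround could look roughly like this against the 1.8.x driver (the connection string is a placeholder and error handling is omitted):

```csharp
using System.Linq;
using MongoDB.Driver;

class PrimaryWorkaround
{
    // Filter out instances the driver currently sees as disconnected,
    // then pick the one (if any) that reports itself as primary.
    static MongoServerInstance FindPrimary(MongoServer server)
    {
        return server.Instances
            .Where(i => i.State == MongoServerState.Connected)
            .FirstOrDefault(i => i.IsPrimary);
    }

    static void Main()
    {
        // Placeholder connection string; replace with your replica set members.
        var client = new MongoClient("mongodb://rs0-a,rs0-b,rs0-c/?replicaSet=rs0");
        var server = client.GetServer();

        var primary = FindPrimary(server);
        System.Console.WriteLine(primary != null ? primary.Address.ToString() : "no connected primary");
    }
}
```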
| Comment by Christian KLAT [ 15/Apr/13 ] |
|
To answer your question, we could filter on MongoServer.Instances and watch for the primary that is connected, but why bother when we just need to ensure that LookupServerInformation gets called even if the ping fails... |
| Comment by Christian KLAT [ 15/Apr/13 ] |
|
Hi Craig, sorry I didn't explain the problem properly... The problem comes from your implementation in MongoServerInstance, and it is a major issue. If you look at the code I provided, you'll notice a major difference between the old implementation and the new one in public void VerifyState() {...} and internal void Connect(...). The main goal is to ensure that if a Ping fails we still recompute the ServerInformation, and if that fails as well we reset the ServerInformation (IsArbiter = false, and so on). Simply put, we always want to stay in sync with what the ping sees and not throw immediately after a Ping() fails ==> it is inconsistent. Otherwise, if the primary that has been killed never comes back up, the MongoServer.cs Primary property (which relies on MongoServer.Instances) can never resolve a SINGLE UNIQUE primary. Moreover, we've been using MongoDB and the 1.4.2 driver for more than a year, and the behavior was fully tested in a complex integration test. It fails with the drivers newer than 1.4.2 ==> it is a regression. |
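A schematic sketch of the behavior being requested, with assumed member names (this is not the driver's code): when the ping fails, still try to recompute the server information, and if that also fails, reset it instead of leaving a dead node marked as primary.

```csharp
using System;

// Schematic stand-in for MongoServerInstance state handling (assumed names).
class InstanceStateSketch
{
    public bool IsPrimary { get; private set; }
    public bool IsArbiter { get; private set; }
    public bool IsConnected { get; private set; }

    public void VerifyState()
    {
        try
        {
            Ping();                    // may throw when the node is down
            LookupServerInformation(); // recompute the role from the node's response
        }
        catch (Exception)
        {
            // Even when the ping fails, still try to recompute the server information...
            try
            {
                LookupServerInformation();
            }
            catch (Exception)
            {
                // ...and if that fails too, reset it instead of keeping stale data,
                // so a dead node no longer claims to be primary.
                IsPrimary = false;
                IsArbiter = false;
                IsConnected = false;
            }
        }
    }

    void Ping() { /* placeholder: would ping the node */ }
    void LookupServerInformation() { /* placeholder: would re-read the node's role */ }
}
```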
| Comment by Craig Wilson [ 15/Apr/13 ] |
|
Hi Christian. I can confirm this problem. Just as an FYI, this isn't a problem internally with our connection pooling. We don't actually recognize 2 primaries, and failover happens correctly. Internally, only 1 of these 2 instances is marked as connected. Hence, to get the primary, we really should be checking that an instance is both connected and marked as primary. This is the problem with the MongoServer.Primary property. On a side note, these properties exist to publish our internal view of the replica set. It is entirely possible that our view of the world does not match the actual state of the replica set. For instance, if we are simply unable to talk to one of the machines, we would show it in the state of disconnected, and you would likely interpret that to mean "down". That isn't necessarily true. A better way to monitor the state of the replica set is to run the command "replSetGetStatus". Given this explanation, are you seeing a major issue with connectivity, or is it just the property that is causing you problems? You can work around the property issue by using the MongoServer.Instances property, filtering them based on a "Connected" state, and then choosing the primary. Thanks for the tests and the patch. It definitely fixes this problem, but at a much deeper level than what I described above to work around this issue. We will likely do something similar as a permanent fix for this issue. I'd still like to recommend using "replSetGetStatus" as the preferred way of checking the state of the replica set, as it will be more accurate and less affected by our internal changes to how we monitor replica set state. |
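Monitoring via replSetGetStatus with the 1.8.x driver could look roughly like this (host names are placeholders):

```csharp
using MongoDB.Bson;
using MongoDB.Driver;

class ReplSetStatusCheck
{
    static void Main()
    {
        // Placeholder connection string; replace with your replica set members.
        var client = new MongoClient("mongodb://rs0-a,rs0-b,rs0-c/?replicaSet=rs0");
        var server = client.GetServer();

        // replSetGetStatus is run against the admin database.
        var adminDb = server.GetDatabase("admin");
        CommandResult result = adminDb.RunCommand("replSetGetStatus");

        // Each member reports its own state (e.g. PRIMARY, SECONDARY, ARBITER).
        foreach (BsonDocument member in result.Response["members"].AsBsonArray)
        {
            System.Console.WriteLine("{0}: {1}",
                member["name"].AsString,
                member["stateStr"].AsString);
        }
    }
}
```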
| Comment by Christian KLAT [ 15/Apr/13 ] |
|
Correction |
| Comment by Christian KLAT [ 15/Apr/13 ] |
|
Here is a Test to validate ReplicaSet connectivity and Failover |