[CSHARP-826] Driver could not reconnect to the server after socket exception, had to recycle application pool (still open) Created: 18/Sep/13 Updated: 04/Apr/15 Resolved: 04/Apr/15 |
|
| Status: | Closed |
| Project: | C# Driver |
| Component/s: | None |
| Affects Version/s: | 1.8.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Stefano Ricciardi | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 12 |
| Labels: | driver, replicaset | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Client on Windows 2008 R2. .NET environment: 4.5 |
||
| Issue Links: |
|
||||
| Description |
|
This is similar to Connection string like the following: The following 2 stack traces alternate: Stack trace 1: Stack trace 2: |
| Comments |
| Comment by Pietro Perugino [ 28/Jan/15 ] | |
|
I am having the same issues. Sporadically i get the following error: Unable to connect to a member of the replica set matching the read preference Primary I have only had my replicaset running for a few days, it will work fine sometimes, however randomly it will come up with this error and i'm unable to connect using my C# Driver. I'm sure my connections settings are ok because i can connect sometimes. Can someone point me in the right direction with this? Is there anything i need to check? All 3 of my nodes are on seperate ubuntue server machines, and they are all using the same port 27000 (because they are on different machines). Thanks for reading | |
| Comment by Victor Fedianine [ 08/Jan/15 ] | |
|
I seem to have a same problem. Around once every two weeks I start getting "Unable to connect to a member of the replica set matching the read preference Primary" error. This is happens on the "if (!db.GetCollection(tableName).Exists())" from following code: String connectionString = "mongodb://userName:password@FQDN1.domain.com,FQDN2.domain.com:27017/db221?replicaSet=rs0&readPreference=primary"; My setup is as follows - I have two MongoDB servers + an arbiter server (version 2.6.4) installed on Windows machines, all of which are hosted in a local data center. I run the code from the NServiceBus worker node (a service running on windows box) with MongoDB C# driver version 1.9.2.235. Due to the nature of the data that is coming in, every time I connect, I connect to a new database with a new set of credentials (old databases are being discarded and never used again) on average 30 times per day. The only way I've found to stop "Unable to connect to a member of the replica set matching the read preference Primary" errors is to restart the service that runs the worker node. What I am thinking, is that all of these unused connections are being piled up in my service (thinking they'll be reused) but in reality they seem to cause this problem. I also think it piles up the connections or does some kind of caching, because in DEV environment, where the setup is exactly the same and worker node is being redeployed/restarted daily, this problem never occurs. In order to get rid of these piled connections, I even tried implementing the code "MongoServer.UnregisterServer(db.Server);", but it didn't seem to help much. Any help? | |
| Comment by Mikey Reppy [ 09/Dec/14 ] | |
|
Same team as Bryan Grimes & Ben Miller. It sounds like fixing the replica set to use fqdns has fixed the IIS/C# application startup (using the replica set instead of the parallel single instance), so it sounds like we had more of a general misconfiguration of the replica set names throughout, rather than this timing bug. (I don't think this is related to C#, but in case it matters...) I restored host 2 by dropping and re-adding it. I could have sworn the first time I did this, even though I used rs.add('<FQDN>'), the replica member reported in rs.conf() used the internal amazon ip-name. However the second time I did a drop and add, with the FQDN, it stuck. I then used the rs.reconfig() method to rename the other secondary, stepped down the primary, connected to the new primary and fixed the remaining ip-based name with rs.reconfig() from the new primary. At that point rs.status() reported all 3 healthy, and using the FQDN's. Ben changed the app back to the 3 servers with FQDN connection string and it's now working, whereas even one user couldn't load the page previously. | |
| Comment by Craig Wilson [ 09/Dec/14 ] | |
|
Thanks Ben. First, I see that you have IP addresses inside your replica set config. We normally recommend using host names. I assume that these are static IP's and don't change. Second, you are using hostnames inside your connection string, which are different from the ip addresses. The driver monitors a replica set initially based on your connection string, but once it has discovered the servers, it uses the hosts field that is being reported to it. Your hosts fields are using ip addresses. So whether you are using ip addresses or hostnames in your connection string, the replica set is being monitored with ip addresses. So, there will obviously be problems if the ip addresses change. In a cloud like Amazon, this sometimes happens. For instance, I noticed that your 3rd member is DOWN. In some configurations of EC2, if this machine was stopped and brought back up, it might have a different IP address. Questions: 1) You mentioned you've tried this using both IP addresses and hostnames. What exactly is the error that occurs? When you use hostnames, is the error different or delayed slightly as if it kinda works and then gives up? You can test this by immediately doing a server.GetDatabaseNames() after creating your client. If it comes back, and then things stop working, it means that the ip addresses are able to be reached. 2) Is your client machine also in AWS? Are the IP addresses accessible from the client machine? Can you connect with a mongo shell or telnet into the mongo server using the ip addresses? If you telnet, you should see Escape character is '^]'. | |
| Comment by Ben Miller [ 09/Dec/14 ] | |
|
craiggwilson, I'm on the team with Bryan that encountered this issue-- I can give a follow up. We are able to connect to this replicaset with other systems and applications not using the C# Driver, without error. The application is connecting to a three-member Mongo replicaset using AWS with fully qualified DNS names, in the format of: We've debugged this application with the above connection string, trying both IP addresses and domain names. We also tried specifying only the primary member, again with both domain names and IP addresses. In all configurations, the error occurred. We're using C# Driver version 1.9.2.235, the latest release version from Nuget. Output of rs.config() { , , { "_id" : 4, "host" : "ip-<ip-address-2>.ec2.internal:27017" } ] Output of rs.status() { , , { "_id" : 4, "name" : "ip-<ip-address-2>.ec2.internal:27017", "health" : 1, "state" : 8, "stateStr" : "DOWN", "uptime" : 1618701, "optime" : Timestamp(0, 0), "optimeDate" : ISODate("1970-01-01T00:00:00Z"), "lastHeartbeat" : ISODate("2014-12-09T11:53:37Z"), "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"), "pingMs" : 1, "lastHeartbeatMessage" : "still initializing" } ], | |
| Comment by Craig Wilson [ 08/Dec/14 ] | |
|
bryangrimes, could you expand on the "this just bit us" problem your seeing? It sounds like you're stating the problem as "the driver is attempting to resolve the hostname to an IP address." Could you confirm this is what you're stating and I'll address that. If not, please restate the issue. Also, some of the below information would be - useful.
| |
| Comment by Bryan Grimes [ 08/Dec/14 ] | |
|
This just bit us big time in moving and updating some older applications. Latest driver and MongoDB build, repl set with three members running in AWS (Amazon AMI to be specific). If more than a single connection was made to the repl set the driver would try to then resolve via hostname | |
| Comment by Vincent [ 19/Aug/14 ] | |
|
No, I'm not using AWS (and wasn't when I had the bug). Real dedicated servers for MongoDB and Windows Azure for hosting. | |
| Comment by Matt B [ 19/Aug/14 ] | |
|
Is it possible that those of us experiencing these issues are all working with AWS? There is a thread on the AWS forum apparently resolved in June involving opening ephemeral ports in security groups, but I'm not sure how many of those affected are on AWS. I know at least myself and the penultimate commenter were both using AWS. | |
| Comment by Matt B [ 16/Aug/14 ] | |
|
Confirming bug still exists. Our setup also uses Windows/Linux host entries, so there should not be any external DNS lookup. Full event error follows: 2014-08-15T22:15:28.000Z Warning 3:Web Event ASP.NET 4.0.30319.0 - Event code: 3005 Event message: An unhandled exception has occurred. Event time: 8/15/2014 10:15:28 PM Event time (UTC): 8/15/2014 10:15:28 PM Event ID: 5b1875c0508e4b01b671d70e498ea2e5 Event sequence: 608 Event occurrence: 28 Event detail code: 0 Application information: Application domain: /LM/W3SVC/1/ROOT-2-130525973702696742 Trust level: Full Application Virtual Path: / Application Path: C:\inetpub\wwwroot\ Machine name: AMAZONA-MBA838N Process information: Process ID: 1220 Process name: w3wp.exe Account name: IIS APPPOOL\DefaultAppPool Exception information: Exception type: MongoConnectionException Exception message: Unable to connect to a member of the replica set matching the read preference SecondaryPreferred: No such host is known. No such host is known | |
| Comment by Ron Rejwan [ 02/Jul/14 ] | |
|
I can confirm this bug still exists (C# driver version 1.9.2, DB version: 2.6.1, running on AWS). We ran a stress test on our dev environment on AWS, testing the auto scaling configuration, when a new instance that has opened started spewing thousands of these error messages: Unable to connect to a member of the replica set matching the read preference Primary MongoDB logs were clear, the other instances worked perfectly well, simply recycling the instance "fixed" the issue. | |
| Comment by Sean Chambers [ 14/May/14 ] | |
|
Was there any other progress or workarounds found with this? I'm experiencing this problem in our production environment for one of our applications. Mongo driver version: 1.8.2 server os: windows server 2008r2 | |
| Comment by Vincent [ 08/Dec/13 ] | |
|
News: even with IP addresses every where (in both en connection strings and the replicaset configuration), my website still goes down and I've to "restart" it (ie restart the app pool) to make it available again (but it recovers by itself after some time, let's say 30 minutes or so). It's like the driver is "hanging". | |
| Comment by Pavel Peker [ 06/Dec/13 ] | |
|
Got it. Thank you for your help. | |
| Comment by Vincent [ 06/Dec/13 ] | |
|
Even with priority 0 they still have their "voting rights", and if the master don't have the majority from the 2 others secondaries it won't be the master (and no server will be in fact, as the secondary have a 0 priority). What you can do: either give the master 3 votes so he can elect himself, or add an arbiter in the same DC as the master and give him 2 votes, so the master get the quorum when the other DC is disconnected. | |
| Comment by Pavel Peker [ 06/Dec/13 ] | |
|
I understand that. My primary datacenter has 2 mongo servers and secondary only one. We failed over to secondary datacenter and forced secondary datacenter mongo server to become master. I also changed priority on other 2 servers to be 0 to prevent them from becoming masters while we do network troubleshooting in primary datacenter. And every time we disconnect 2 datacenters we got error above. Web servers and primary mongo server are in same secondary datacenter. Primary mongo server always on. Just secondary mongo servers got disconnected. | |
| Comment by Craig Wilson [ 06/Dec/13 ] | |
|
Pavel, However, barring that change, you can always change your readPreference to primaryPreferrred. This would mean that it's ok to read from a secondary when no primary exists. However, without a primary, you still cannot write. Craig | |
| Comment by Pavel Peker [ 06/Dec/13 ] | |
|
It is correct. Error went away when connection got restored. As I understand in this situation when all slave servers got disconnected. Do I have to modify master replica set configuration and remove slave servers from it or somehow can be handled in connection string? | |
| Comment by Vincent [ 06/Dec/13 ] | |
|
@Craig, that's exactly what I was about to say, until I realize he was probably saying that the error remained even after the connexion between the datacenters went up again (thus the master was really the master again ^^). | |
| Comment by Craig Wilson [ 06/Dec/13 ] | |
|
Pavel, | |
| Comment by Pavel Peker [ 06/Dec/13 ] | |
|
I got slightly different call stack when web server lost connection to slave mongo servers. Restart of application pool fixed the problem. web server: windows 2008 R2 Mongo replica set is running on 3 mongo servers. mdb2pln0, mdb2pln2 (slaves) are located in one datacenter, mdb2dfw0 (master) is located in other datacenter. Same datacenter as web server. When connection went down between datacenters, my web servers started to report following error, even master server was still online and available for web server. MongoDB.Driver.MongoConnectionException: Unable to connect to a member of the replica set matching the read preference Primary: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.. ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host | |
| Comment by Craig Wilson [ 06/Dec/13 ] | |
|
This is why we are leaving the ticket open. There certainly seems to be something going on, but we can't reproduce it and without being able to do that, it is unlikely that we can fix it. If you believe that the connection string is the difference, then perhaps you should make your app connection string look like the log4net connection string. Specifically, remove the replica set option. This will force the driver to discover that it is a replica set and then the two applications will be running through the same code paths. | |
| Comment by Stefano Ricciardi [ 06/Dec/13 ] | |
|
Craig, this difference between log4net and the rest of the application happens all All of the incidents that has happened recently have been at night, when We are investigating about enabling tracing, but since we have disabled Thanks, – | |
| Comment by Craig Wilson [ 06/Dec/13 ] | |
|
The only difference at the driver level is that one is running through a replica set and the log4net one had to discover that it is a replica set. I don't believe this is relevant, but there is likely a timing issue involved. I can't necessarily explain it and I'm sorry about that. I did request that you (or whomever) follow two items that would help us diagnose this issue. I'll repeat them below:
| |
| Comment by Stefano Ricciardi [ 06/Dec/13 ] | |
|
Craig, please read again my message from Sep 23 to which you didn't fully reply. We have configured (by mistake) log4net to write to Mongo with a slightly different connection string than the one use by rest of our application. And when the rest of the application reports DNS issues, log4net still continues to be able to write to the primary (ironically writing logs from the application saying that the primary cannot be reached). What's the difference between those 2 connection strings that make one work and not the other? Or is it something related to log4net MongoDB appender, the way it creates server and client objects? | |
| Comment by Vincent [ 06/Dec/13 ] | |
|
I'll try my best to get a reproducible failure case. FYI, I'm using Windows Azure (Websites), and when I've multiple instances launched they all fall down almost simultaneously (so the bug is definitively related to something out of the instance itself, like a network failure or something). I also have a Worker that goes down also. | |
| Comment by Craig Wilson [ 05/Dec/13 ] | |
|
Vincent, Craig | |
| Comment by Vincent [ 05/Dec/13 ] | |
|
Hi Craig, | |
| Comment by Craig Wilson [ 05/Dec/13 ] | |
|
Hi Vincent, As this exception emanates from the System.Net.Dns class (not our code), searching around for this particular exception (System.Net.Sockets.SocketException: The requested name is valid, but no data of the requested type was found) shows that it isn't just us with this problem. It truly is related to a DNS system which is outside of our control. Craig | |
| Comment by Vincent [ 05/Dec/13 ] | |
|
I'm not able to reproduce the bug "on demand" either, but this bug is REALLY critical and REALLY needs more consideration, as it's making entire websites down for very long times, and very hard to diagnostic by the developers. | |
| Comment by Stefano Ricciardi [ 11/Oct/13 ] | |
|
Karl, could you please share the steps you go through to reproduce the issue? It only happens once in a while in our production environment, but we did not manage to reproduce in our test environment yet. Stefano | |
| Comment by Karl Stoney [ 07/Oct/13 ] | |
|
Unfortunately not.
| |
| Comment by Craig Wilson [ 07/Oct/13 ] | |
|
The attempt to resolve a hostname is probably due to the fact that you are using a replica set. I'll bet that your replSetConfig (from mongo shell: > rs.config()), contains the canonical hostnames and not the ip addresses. As such, even if you initially specify an ip address, we are going to use whatever is in the replica set config to contact the members of the replica set. This is simply how it works and all drivers do this. So, what will happen is this: 1) We start connect using the ip addresses you provide on the seed list. and discover that none of the ip addresses on the seed list are in the isMaster response. If the hostnames can't resolve, then you might have been able to connect and even do a couple of operations before things go wonky. | |
| Comment by Karl Stoney [ 07/Oct/13 ] | |
|
Irritatingly i'm now unable to reproduce it.... yay for intermittent errors | |
| Comment by Karl Stoney [ 07/Oct/13 ] | |
|
What i find interesting is that it's trying to resolve a hostname, even though my connection string is specifying an IP. | |
| Comment by Karl Stoney [ 07/Oct/13 ] | |
|
Hi mate, [SocketException (0x2afc): The requested name is valid, but no data of the requested type was found] [MongoConnectionException: Unable to connect to server IP:PORT: The requested name is valid, but no data of the requested type was found.] | |
| Comment by Craig Wilson [ 07/Oct/13 ] | |
|
Hi Karl, Thanks. | |
| Comment by Karl Stoney [ 05/Oct/13 ] | |
|
Hi, To try and rule out dns issues i've also used an ip in my rs configuration but to no avail. If there's you'd like from me to help you get to the bottom of this issue, just let me know (happy to even share my screen over skype or something if it helps) - just drop me an email. Karl | |
| Comment by Craig Wilson [ 04/Oct/13 ] | |
|
Sorry, but no. We've attempted to reproduce this and are thus far unable to. There seems to, as I mentioned prior, to be something going on with DNS. As I mentioned before, a couple of things would help us out:
| |
| Comment by Stefano Ricciardi [ 04/Oct/13 ] | |
|
Craig, any news on this issue? | |
| Comment by Stefano Ricciardi [ 23/Sep/13 ] | |
|
Craig, the deployment is composed by 4 web applications running on IIS hitting a Mongo replica-set comprising 2 Linux boxes on CentOs plus 1arbiter on Windows. All boxes run on Rackspace managed hosting. The guys from operations confirmed we use A records. We have switched to C# driver 1.8.1 more one month ago, and since then we had to restart the application pool on each machine at least once since they would not connect to the replica set anymore after some network glitch (sometimes it took us many hours to realize that we had a problem with one of the machines, since our monitoring service reported both Mongo and the web applications as running). Sometimes we have issues on more than one machine at the same time, most of the times the issue is isolated to one machine only. Ironically enough we do have tons of application logs on Mongo itseif - written by the log4net Mongo appender - saying that the application cannot connect to Mongo: that is, log4net is somehow able to connect to Mongo while the rest of our application cannot. The connection string we have used for log4net (probably due to a copy and paste error) points to the same members but hasn't got the "replicaSet" parameter in it. Does the C# driver treat the two connection strings any different (other than the timeout and pool size parameters)? Just to make the difference more clear:
Another difference might also be in the way we create database instances in the application code as opposed to log4net:
| |
| Comment by Craig Wilson [ 20/Sep/13 ] | |
|
The alternating makes sense. Basically, what has happened is that we have attempted to open a new connection to the Primary, but Dns has failed to resolve the primaries ip address by the hostname we have on file. At that point, you'll get the SocketException. Then, in the period between our heartbeats (attempts to reconnect), other threads won't be able to get a connection to the primary. So, the alternating makes sense. I have some follow-up questions:
The problem here (and in looking back at
Also, if you wish, you can enable network tracing and see if that points out anything going on in the networking stack. |