[CSHARP-826] Driver could not reconnect to the server after socket exception, had to recycle application pool (still open) Created: 18/Sep/13  Updated: 04/Apr/15  Resolved: 04/Apr/15

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Stefano Ricciardi Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 12
Labels: driver, replicaset
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Client on Windows 2008 R2. .NET environment: 4.5
Server on CentOS.


Issue Links:
Related

 Description   

This is similar to CSHARP-754, but with a different call stack.
Like in that bug, the only way we have found to fix the issue is to recycle the application pool of the web service hitting Mongo.

Connection string like the following:
mongodb://host1:27017,host2:27017,arbiterOnWindows:27017/?replicaSet=my_replica_set;socketTimeout=120000ms;maxPoolSize=200

The following 2 stack traces alternate:

Stack trace 1:
2013-09-18 13:14:39,945 ERROR - Unable to connect to a member of the replica set matching the read preference Primary: The requested name is valid, but no data of the requested type was found.
MongoDB.Driver.MongoConnectionException: Unable to connect to a member of the replica set matching the read preference Primary: The requested name is valid, but no data of the requested type was found. ---> System.Net.Sockets.SocketException: The requested name is valid, but no data of the requested type was found
at System.Net.Dns.GetAddrInfo(String name)
at System.Net.Dns.InternalGetHostByName(String hostName, Boolean includeIPv6)
at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)
at MongoDB.Driver.MongoServerAddress.ToIPEndPoint(AddressFamily addressFamily)
at MongoDB.Driver.MongoServerInstance.GetIPEndPoint()
at MongoDB.Driver.Internal.MongoConnection.Open()
at MongoDB.Driver.Internal.MongoConnection.SendMessage(BsonBuffer buffer, MongoRequestMessage message, WriteConcern writeConcern, String databaseName)
at MongoDB.Driver.Internal.MongoConnection.SendMessage(MongoRequestMessage message, WriteConcern writeConcern, String databaseName)
at MongoDB.Driver.Internal.MongoConnection.RunCommandAs[TCommandResult](String databaseName, QueryFlags queryFlags, CommandDocument command, IBsonSerializer commandResultSerializer, IBsonSerializationOptions commandResultSerializationOptions, Boolean throwOnError)
at MongoDB.Driver.Internal.MongoConnection.RunCommandAs[TCommandResult](String databaseName, QueryFlags queryFlags, CommandDocument command, Boolean throwOnError)
at MongoDB.Driver.MongoServerInstance.Ping(MongoConnection connection)
at MongoDB.Driver.MongoServerInstance.Connect()
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.<>c_DisplayClassd.<ConnectInstance>b_c(Object _)
— End of inner exception stack trace —
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ThrowConnectionException(ReadPreference readPreference)
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.Connect(TimeSpan timeout, ReadPreference readPreference)
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ChooseServerInstance(ReadPreference readPreference)
at MongoDB.Driver.MongoServer.AcquireConnection(ReadPreference readPreference)
at MongoDB.Driver.MongoCursorEnumerator`1.AcquireConnection()
at MongoDB.Driver.MongoCursorEnumerator`1.GetFirst()
at MongoDB.Driver.MongoCursorEnumerator`1.MoveNext()
at System.Linq.Enumerable.WhereSelectEnumerableIterator`2.MoveNext()
at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)

Stack trace 2:
2013-09-18 13:13:44,735 ERROR - Unable to connect to a member of the replica set matching the read preference Primary
MongoDB.Driver.MongoConnectionException: Unable to connect to a member of the replica set matching the read preference Primary
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ThrowConnectionException(ReadPreference readPreference)
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.Connect(TimeSpan timeout, ReadPreference readPreference)
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ChooseServerInstance(ReadPreference readPreference)
at MongoDB.Driver.MongoServer.AcquireConnection(ReadPreference readPreference)
at MongoDB.Driver.MongoCursorEnumerator`1.AcquireConnection()
at MongoDB.Driver.MongoCursorEnumerator`1.GetFirst()
at MongoDB.Driver.MongoCursorEnumerator`1.MoveNext()
at System.Linq.Enumerable.WhereSelectEnumerableIterator`2.MoveNext()
at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)



 Comments   
Comment by Pietro Perugino [ 28/Jan/15 ]

I am having the same issues. Sporadically i get the following error:

Unable to connect to a member of the replica set matching the read preference Primary

I have only had my replicaset running for a few days, it will work fine sometimes, however randomly it will come up with this error and i'm unable to connect using my C# Driver. I'm sure my connections settings are ok because i can connect sometimes.

Can someone point me in the right direction with this? Is there anything i need to check?

All 3 of my nodes are on seperate ubuntue server machines, and they are all using the same port 27000 (because they are on different machines).

Thanks for reading

Comment by Victor Fedianine [ 08/Jan/15 ]

I seem to have a same problem. Around once every two weeks I start getting "Unable to connect to a member of the replica set matching the read preference Primary" error. This is happens on the "if (!db.GetCollection(tableName).Exists())" from following code:

String connectionString = "mongodb://userName:password@FQDN1.domain.com,FQDN2.domain.com:27017/db221?replicaSet=rs0&readPreference=primary";
MongoDatabase db = new MongoClient(connectionString).GetServer().GetDatabase(databaseName, WriteConcern.Unacknowledged);
if (!db.GetCollection(tableName).Exists())

{ ... }

My setup is as follows - I have two MongoDB servers + an arbiter server (version 2.6.4) installed on Windows machines, all of which are hosted in a local data center. I run the code from the NServiceBus worker node (a service running on windows box) with MongoDB C# driver version 1.9.2.235. Due to the nature of the data that is coming in, every time I connect, I connect to a new database with a new set of credentials (old databases are being discarded and never used again) on average 30 times per day.

The only way I've found to stop "Unable to connect to a member of the replica set matching the read preference Primary" errors is to restart the service that runs the worker node. What I am thinking, is that all of these unused connections are being piled up in my service (thinking they'll be reused) but in reality they seem to cause this problem. I also think it piles up the connections or does some kind of caching, because in DEV environment, where the setup is exactly the same and worker node is being redeployed/restarted daily, this problem never occurs.

In order to get rid of these piled connections, I even tried implementing the code "MongoServer.UnregisterServer(db.Server);", but it didn't seem to help much.

Any help?

Comment by Mikey Reppy [ 09/Dec/14 ]

Same team as Bryan Grimes & Ben Miller. It sounds like fixing the replica set to use fqdns has fixed the IIS/C# application startup (using the replica set instead of the parallel single instance), so it sounds like we had more of a general misconfiguration of the replica set names throughout, rather than this timing bug.

(I don't think this is related to C#, but in case it matters...) I restored host 2 by dropping and re-adding it. I could have sworn the first time I did this, even though I used rs.add('<FQDN>'), the replica member reported in rs.conf() used the internal amazon ip-name. However the second time I did a drop and add, with the FQDN, it stuck. I then used the rs.reconfig() method to rename the other secondary, stepped down the primary, connected to the new primary and fixed the remaining ip-based name with rs.reconfig() from the new primary.

At that point rs.status() reported all 3 healthy, and using the FQDN's. Ben changed the app back to the 3 servers with FQDN connection string and it's now working, whereas even one user couldn't load the page previously.
Thanks for the help!

Comment by Craig Wilson [ 09/Dec/14 ]

Thanks Ben.

First, I see that you have IP addresses inside your replica set config. We normally recommend using host names. I assume that these are static IP's and don't change. Second, you are using hostnames inside your connection string, which are different from the ip addresses. The driver monitors a replica set initially based on your connection string, but once it has discovered the servers, it uses the hosts field that is being reported to it. Your hosts fields are using ip addresses. So whether you are using ip addresses or hostnames in your connection string, the replica set is being monitored with ip addresses.

So, there will obviously be problems if the ip addresses change. In a cloud like Amazon, this sometimes happens. For instance, I noticed that your 3rd member is DOWN. In some configurations of EC2, if this machine was stopped and brought back up, it might have a different IP address.

Questions:

1) You mentioned you've tried this using both IP addresses and hostnames. What exactly is the error that occurs? When you use hostnames, is the error different or delayed slightly as if it kinda works and then gives up? You can test this by immediately doing a server.GetDatabaseNames() after creating your client. If it comes back, and then things stop working, it means that the ip addresses are able to be reached.

2) Is your client machine also in AWS? Are the IP addresses accessible from the client machine? Can you connect with a mongo shell or telnet into the mongo server using the ip addresses? If you telnet, you should see Escape character is '^]'.

Comment by Ben Miller [ 09/Dec/14 ]

craiggwilson, I'm on the team with Bryan that encountered this issue-- I can give a follow up.

We are able to connect to this replicaset with other systems and applications not using the C# Driver, without error.

The application is connecting to a three-member Mongo replicaset using AWS with fully qualified DNS names, in the format of:
mongodb://<fqdn1>:27017,<fqdn2>:27017,<fqdn3>:27017/<dbname>?connect=replicaset;replicaset=<rsname>;readPreference=primaryPreferred

We've debugged this application with the above connection string, trying both IP addresses and domain names. We also tried specifying only the primary member, again with both domain names and IP addresses. In all configurations, the error occurred.

We're using C# Driver version 1.9.2.235, the latest release version from Nuget.

Output of rs.config()

{
"_id" : "<rsname>",
"version" : 81853,
"members" : [

{ "_id" : 0, "host" : "ip-<ip-address-0>.ec2.internal:27017" }

,

{ "_id" : 3, "host" : "ip-<ip-address-1>.ec2.internal:27017" }

,

{ "_id" : 4, "host" : "ip-<ip-address-2>.ec2.internal:27017" }

]
}

Output of rs.status()

{
"set" : "<rsname>",
"date" : ISODate("2014-12-09T11:53:38Z"),
"myState" : 1,
"members" : [

{ "_id" : 0, "name" : "ip-<ip-address-0>.ec2.internal:27017", "health" : 1, "state" : 1, "stateStr" : "PRIMARY", "uptime" : 1675012, "optime" : Timestamp(1418119585, 1), "optimeDate" : ISODate("2014-12-09T10:06:25Z"), "electionTime" : Timestamp(1416493340, 1), "electionDate" : ISODate("2014-11-20T14:22:20Z"), "self" : true }

,

{ "_id" : 3, "name" : "ip-<ip-address-1>.ec2.internal:27017", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 1619338, "optime" : Timestamp(1418119585, 1), "optimeDate" : ISODate("2014-12-09T10:06:25Z"), "lastHeartbeat" : ISODate("2014-12-09T11:53:36Z"), "lastHeartbeatRecv" : ISODate("2014-12-09T11:53:36Z"), "pingMs" : 1, "syncingTo" : "ip-<ip-address-0>.ec2.internal:27017" }

,

{ "_id" : 4, "name" : "ip-<ip-address-2>.ec2.internal:27017", "health" : 1, "state" : 8, "stateStr" : "DOWN", "uptime" : 1618701, "optime" : Timestamp(0, 0), "optimeDate" : ISODate("1970-01-01T00:00:00Z"), "lastHeartbeat" : ISODate("2014-12-09T11:53:37Z"), "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"), "pingMs" : 1, "lastHeartbeatMessage" : "still initializing" }

],
"ok" : 1
}

Comment by Craig Wilson [ 08/Dec/14 ]

bryangrimes, could you expand on the "this just bit us" problem your seeing? It sounds like you're stating the problem as "the driver is attempting to resolve the hostname to an IP address." Could you confirm this is what you're stating and I'll address that. If not, please restate the issue. Also, some of the below information would be - useful.

  • What version of the driver were you using?
  • What is your connection string?
  • Could you provide the output of rs.config() from a shell connected to one of the members?
  • Could you provide the output of rs.status() from a shell connected to one of the members?
Comment by Bryan Grimes [ 08/Dec/14 ]

This just bit us big time in moving and updating some older applications. Latest driver and MongoDB build, repl set with three members running in AWS (Amazon AMI to be specific). If more than a single connection was made to the repl set the driver would try to then resolve via hostname and would fail with error. We just pushed up a parallel setup of the same (latest) MongoDB version and driver version, without replication which completely works. So now we're stuck, we obviously need replication for a production system but must either downgrade one of both components (the driver and/or MongoDB version) or figure out a Plan B. This is a huge problem for any of our applications that are sitting on older versions.

Comment by Vincent [ 19/Aug/14 ]

No, I'm not using AWS (and wasn't when I had the bug). Real dedicated servers for MongoDB and Windows Azure for hosting.
However I was using a cheap provider (Kimsufi.com), so maybe they had some random connectivity issues as well. Bug the bug stopped for me before I switch.

Comment by Matt B [ 19/Aug/14 ]

Is it possible that those of us experiencing these issues are all working with AWS?

There is a thread on the AWS forum apparently resolved in June involving opening ephemeral ports in security groups, but I'm not sure how many of those affected are on AWS. I know at least myself and the penultimate commenter were both using AWS.

https://forums.aws.amazon.com/thread.jspa?messageID=454155

Comment by Matt B [ 16/Aug/14 ]

Confirming bug still exists. Our setup also uses Windows/Linux host entries, so there should not be any external DNS lookup.

Full event error follows:

2014-08-15T22:15:28.000Z Warning 3:Web Event ASP.NET 4.0.30319.0 - Event code: 3005

Event message: An unhandled exception has occurred.

Event time: 8/15/2014 10:15:28 PM

Event time (UTC): 8/15/2014 10:15:28 PM

Event ID: 5b1875c0508e4b01b671d70e498ea2e5

Event sequence: 608

Event occurrence: 28

Event detail code: 0

Application information:

Application domain: /LM/W3SVC/1/ROOT-2-130525973702696742

Trust level: Full

Application Virtual Path: /

Application Path: C:\inetpub\wwwroot\

Machine name: AMAZONA-MBA838N

Process information:

Process ID: 1220

Process name: w3wp.exe

Account name: IIS APPPOOL\DefaultAppPool

Exception information:

Exception type: MongoConnectionException

Exception message: Unable to connect to a member of the replica set matching the read preference SecondaryPreferred: No such host is known.
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ThrowConnectionException(ReadPreference readPreference)
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.Connect(TimeSpan timeout, ReadPreference readPreference)
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ChooseServerInstance(ReadPreference readPreference)
at MongoDB.Driver.MongoServer.AcquireConnection(ReadPreference readPreference)
at MongoDB.Driver.MongoCursor`1.MongoCursorConnectionProvider.AcquireConnection()
at MongoDB.Driver.Operations.QueryOperation`1.GetFirstBatch(IConnectionProvider connectionProvider)
at MongoDB.Driver.Operations.QueryOperation`1.Execute(IConnectionProvider connectionProvider)
at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
at Wellness.Framework.Persistence.MongoDb.EagerCategoryRepository..ctor(MongoDatabase mongo, CategoryFactory categoryFactory, ICacheRepository cacheRepository) in c:\BuildAgent\work\cdc039e439bff215\Framework\Src\Persistence\MongoDb\Categories\EagerCategoryRepository.cs:line 37
at lambda_method(Closure , IBuilderContext )
at Microsoft.Practices.ObjectBuilder2.DynamicBuildPlanGenerationContext.<>c_DisplayClass1.<GetBuildMethod>b_0(IBuilderContext context)
at Microsoft.Practices.ObjectBuilder2.BuildPlanStrategy.PreBuildUp(IBuilderContext context)
at Microsoft.Practices.ObjectBuilder2.StrategyChain.ExecuteBuildUp(IBuilderContext context)
at Microsoft.Practices.ObjectBuilder2.BuilderContext.NewBuildUp(NamedTypeBuildKey newBuildKey)
at lambda_method(Closure , IBuilderContext )
at Microsoft.Practices.ObjectBuilder2.DynamicBuildPlanGenerationContext.<>c_DisplayClass1.<GetBuildMethod>b_0(IBuilderContext context)
at Microsoft.Practices.ObjectBuilder2.BuildPlanStrategy.PreBuildUp(IBuilderContext context)
at Microsoft.Practices.ObjectBuilder2.StrategyChain.ExecuteBuildUp(IBuilderContext context)
at Microsoft.Practices.Unity.UnityContainer.DoBuildUp(Type t, Object existing, String name, IEnumerable`1 resolverOverrides)

No such host is known
at System.Net.Dns.InternalGetHostByName(String hostName, Boolean includeIPv6)
at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)
at MongoDB.Driver.MongoServerAddress.ToIPEndPoint(AddressFamily addressFamily)
at MongoDB.Driver.Internal.MongoConnection.Open()
at MongoDB.Driver.Internal.MongoConnection.SendMessage(BsonBuffer buffer, Int32 requestId)
at MongoDB.Driver.Internal.MongoConnection.SendMessage(MongoRequestMessage message)
at MongoDB.Driver.Operations.CommandOperation`1.Execute(MongoConnection connection)
at MongoDB.Driver.MongoServerInstance.Ping(MongoConnection connection)
at MongoDB.Driver.MongoServerInstance.Connect()
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.<>c_DisplayClassd.<ConnectInstance>b_c(Object _)

Comment by Ron Rejwan [ 02/Jul/14 ]

I can confirm this bug still exists (C# driver version 1.9.2, DB version: 2.6.1, running on AWS).

We ran a stress test on our dev environment on AWS, testing the auto scaling configuration, when a new instance that has opened started spewing thousands of these error messages:

Unable to connect to a member of the replica set matching the read preference Primary
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ThrowConnectionException(ReadPreference readPreference)
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.Connect(TimeSpan timeout, ReadPreference readPreference)
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ChooseServerInstance(ReadPreference readPreference)
at MongoDB.Driver.MongoServer.AcquireConnection(ReadPreference readPreference)
at MongoDB.Driver.MongoCursor`1.MongoCursorConnectionProvider.AcquireConnection()
at MongoDB.Driver.Operations.QueryOperation`1.GetFirstBatch(IConnectionProvider connectionProvider)
at MongoDB.Driver.Operations.QueryOperation`1.Execute(IConnectionProvider connectionProvider)
at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
at MongoDB.Driver.MongoCollection.FindOneAs[TDocument](FindOneArgs args)

MongoDB logs were clear, the other instances worked perfectly well, simply recycling the instance "fixed" the issue.

Comment by Sean Chambers [ 14/May/14 ]

Was there any other progress or workarounds found with this?

I'm experiencing this problem in our production environment for one of our applications.

Mongo driver version: 1.8.2
Mongo server version: 2.2.2

server os: windows server 2008r2
application: .Net 4.0, Asp.net mvc 3

Comment by Vincent [ 08/Dec/13 ]

News: even with IP addresses every where (in both en connection strings and the replicaset configuration), my website still goes down and I've to "restart" it (ie restart the app pool) to make it available again (but it recovers by itself after some time, let's say 30 minutes or so). It's like the driver is "hanging".
I tried to reproduce the problem using iptables rules, but it works as expected (ie not crashing the website). So annoying...
Here is the rule I tried to "crash" the system, on both the master and secondary:
iptables -A INPUT -p tcp -m multiport --dports 27017,27018,27019,28017,28018,28019 -j DROP

Comment by Pavel Peker [ 06/Dec/13 ]

Got it. Thank you for your help.

Comment by Vincent [ 06/Dec/13 ]

Even with priority 0 they still have their "voting rights", and if the master don't have the majority from the 2 others secondaries it won't be the master (and no server will be in fact, as the secondary have a 0 priority). What you can do: either give the master 3 votes so he can elect himself, or add an arbiter in the same DC as the master and give him 2 votes, so the master get the quorum when the other DC is disconnected.

Comment by Pavel Peker [ 06/Dec/13 ]

I understand that. My primary datacenter has 2 mongo servers and secondary only one. We failed over to secondary datacenter and forced secondary datacenter mongo server to become master. I also changed priority on other 2 servers to be 0 to prevent them from becoming masters while we do network troubleshooting in primary datacenter. And every time we disconnect 2 datacenters we got error above. Web servers and primary mongo server are in same secondary datacenter. Primary mongo server always on. Just secondary mongo servers got disconnected.

Comment by Craig Wilson [ 06/Dec/13 ]

Pavel,
This is just how MongoDB works. It requires a strict majority of replica set members to be available in order to have a primary. Since you have put a majority of members in a different data center than your web application, this is going to happen. I'd suggest you move one of your secondaries into the same data center as your web application so this doesn't happen. Then, when the other DC goes down, you still have a strict majority of members in your active DC.

However, barring that change, you can always change your readPreference to primaryPreferrred. This would mean that it's ok to read from a secondary when no primary exists. However, without a primary, you still cannot write.

Craig

Comment by Pavel Peker [ 06/Dec/13 ]

It is correct. Error went away when connection got restored. As I understand in this situation when all slave servers got disconnected. Do I have to modify master replica set configuration and remove slave servers from it or somehow can be handled in connection string?

Comment by Vincent [ 06/Dec/13 ]

@Craig, that's exactly what I was about to say, until I realize he was probably saying that the error remained even after the connexion between the datacenters went up again (thus the master was really the master again ^^).

Comment by Craig Wilson [ 06/Dec/13 ]

Pavel,
Your situation is very different and is completely expected. In your case, you lost both your secondaries which means a primary can't exist and therefore don't have a primary to talk to at all. The error is a bit misleading as it is reporting the first error it sees when it can't satisfy your read preference and this error happens to be from your secondaries. Regardless, this is exactly what I would expect to see in your case.

Comment by Pavel Peker [ 06/Dec/13 ]

I got slightly different call stack when web server lost connection to slave mongo servers. Restart of application pool fixed the problem.

web server: windows 2008 R2
.NET: 4.0
mongo driver: 1.8.2.34
mongo server: 2.2.4
mongo connection string is : mongodb://mdb2pln0,mdb2dfw0/?safe=true;slaveOk=false

Mongo replica set is running on 3 mongo servers. mdb2pln0, mdb2pln2 (slaves) are located in one datacenter, mdb2dfw0 (master) is located in other datacenter. Same datacenter as web server. When connection went down between datacenters, my web servers started to report following error, even master server was still online and available for web server.

MongoDB.Driver.MongoConnectionException: Unable to connect to a member of the replica set matching the read preference Primary: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.. ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
— End of inner exception stack trace —
at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
at MongoDB.Bson.IO.ByteBufferFactory.LoadFrom(Stream stream)
at MongoDB.Driver.Internal.MongoConnection.ReceiveMessage[TDocument](BsonBinaryReaderSettings readerSettings, IBsonSerializer serializer, IBsonSerializationOptions serializationOptions)
at MongoDB.Driver.Operations.CommandOperation`1.Execute(MongoConnection connection)
at MongoDB.Driver.MongoServerInstance.Ping(MongoConnection connection)
at MongoDB.Driver.MongoServerInstance.Connect()
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.<>c_DisplayClassd.<ConnectInstance>b_c(Object _)
— End of inner exception stack trace —
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ThrowConnectionException(ReadPreference readPreference)
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.Connect(TimeSpan timeout, ReadPreference readPreference)
at MongoDB.Driver.Internal.MultipleInstanceMongoServerProxy.ChooseServerInstance(ReadPreference readPreference)
at MongoDB.Driver.MongoServer.AcquireConnection(ReadPreference readPreference)
at MongoDB.Driver.MongoCollection.RunCommandAs[TCommandResult](IMongoCommand command, IBsonSerializer resultSerializer, IBsonSerializationOptions resultSerializationOptions)
at MongoDB.Driver.MongoCollection.RunCommandAs[TCommandResult](IMongoCommand command)
at MongoDB.Driver.MongoCollection.Count(IMongoQuery query)
at MongoDB.Driver.GridFS.MongoGridFS.Exists(IMongoQuery query)

Comment by Craig Wilson [ 06/Dec/13 ]

This is why we are leaving the ticket open. There certainly seems to be something going on, but we can't reproduce it and without being able to do that, it is unlikely that we can fix it.

If you believe that the connection string is the difference, then perhaps you should make your app connection string look like the log4net connection string. Specifically, remove the replica set option. This will force the driver to discover that it is a replica set and then the two applications will be running through the same code paths.

Comment by Stefano Ricciardi [ 06/Dec/13 ]

Craig,

this difference between log4net and the rest of the application happens all
the times we have incurred in the problem, so it doesn't seem to be by
chance.

All of the incidents that has happened recently have been at night, when
our colleagues from operations got paged by the monitoring system. Since
nobody from the application team was around, they decided to bring the web
site to proper function as soon as possible by recycling the app pool.

We are investigating about enabling tracing, but since we have disabled
tracing in our build that would require to deploy a new version to
production, which can only happen during specific windows this year.

Thanks,
Stefano


Stefano Ricciardi
http://www.stefanoricciardi.com

Comment by Craig Wilson [ 06/Dec/13 ]

The only difference at the driver level is that one is running through a replica set and the log4net one had to discover that it is a replica set. I don't believe this is relevant, but there is likely a timing issue involved. I can't necessarily explain it and I'm sorry about that. I did request that you (or whomever) follow two items that would help us diagnose this issue. I'll repeat them below:

  1. When this happens the next time, instead of recycling the IIS app pool, attempt to do an ipconfig /flushdns. In theory, this will cause the system DNS to refresh itself and maybe get fix the "requested name is valid..." issue.
  2. Enable network tracing and see if that points out anything going on in the networking stack. As this error ultimately emanates from outside the driver and then flows through, it'd be extremely helpful to see what is going on down there.
Comment by Stefano Ricciardi [ 06/Dec/13 ]

Craig,

please read again my message from Sep 23 to which you didn't fully reply.

We have configured (by mistake) log4net to write to Mongo with a slightly different connection string than the one use by rest of our application. And when the rest of the application reports DNS issues, log4net still continues to be able to write to the primary (ironically writing logs from the application saying that the primary cannot be reached).

What's the difference between those 2 connection strings that make one work and not the other? Or is it something related to log4net MongoDB appender, the way it creates server and client objects?

Comment by Vincent [ 06/Dec/13 ]

I'll try my best to get a reproducible failure case. FYI, I'm using Windows Azure (Websites), and when I've multiple instances launched they all fall down almost simultaneously (so the bug is definitively related to something out of the instance itself, like a network failure or something). I also have a Worker that goes down also.

Comment by Craig Wilson [ 05/Dec/13 ]

Vincent,
We currently don't believe that there is anything we can actually do in the driver to prevent this. Caching an IP (which we already do and are going to remove soon - CSHARP-869) wouldn't fix problem. We certainly could implement Dns resolution manually and not run through the .NET framework, but that seems pretty extreme and certainly not guaranteed to solve the problem. We are leaving this ticket open to promote discussion and are certainly open to extreme solutions if they solve the issue, but we'd need more proof that there is indeed a bug in the framework we need to work around.

Craig

Comment by Vincent [ 05/Dec/13 ]

Hi Craig,
Maybe you could imagine something that deals with the System.Net.Dns issues? By converting/caching hosts into IP at the driver level for example?
This looks like a bug at the .net framework level, so we shouldn't expect having it corrected soon, isn't? So I think the driver have to deal with it, the developer can't really do anything at his level (for him, the exception is thrown by a problem at the driver level) except using IP instead of hostnames (maybe the c# driver should state this somewhere) – and only if his is aware of this bug.

Comment by Craig Wilson [ 05/Dec/13 ]

Hi Vincent,
We understand that this is a critical problem and have treated it as such. I have spent a bit of time attempting to reproduce this to no avail, and no reporters are able to reproduce in a manner that helps us identify a problem in our code.

As this exception emanates from the System.Net.Dns class (not our code), searching around for this particular exception (System.Net.Sockets.SocketException: The requested name is valid, but no data of the requested type was found) shows that it isn't just us with this problem. It truly is related to a DNS system which is outside of our control.

Craig

Comment by Vincent [ 05/Dec/13 ]

I'm not able to reproduce the bug "on demand" either, but this bug is REALLY critical and REALLY needs more consideration, as it's making entire websites down for very long times, and very hard to diagnostic by the developers.
I had this problem multiple times on my Windows Azure Website connecting to my OVH hosted MongoDB replica set (Debian x64, primary is kernel 3.8.13-xxxx-grs-ipv6-64 and secondary 3.11-2-amd64).
Azure <-> OVH connection often drops and was leaving the full site unavailable until I "restart" the website. It was driving me crazy until I realize the problem was caused by the MongoDB driver and found this bug.
I replaced all hostnames by IP yesterday (both in rs.conf() and connection strings) and it looks like it solved the issue, but it's still "in probation".
I'll keep you posted if this fix really walk over the issue or not.

Comment by Stefano Ricciardi [ 11/Oct/13 ]

Karl,

could you please share the steps you go through to reproduce the issue? It only happens once in a while in our production environment, but we did not manage to reproduce in our test environment yet.

Stefano

Comment by Karl Stoney [ 07/Oct/13 ]

Unfortunately not.
My thoughts were originally down the same lines, as a result I have updated the replicaset to be IP, not hostname.
I only run a single server (the replicaset is to support ElasticSearch + MongoDB river).

Comment by Craig Wilson [ 07/Oct/13 ]

The attempt to resolve a hostname is probably due to the fact that you are using a replica set. I'll bet that your replSetConfig (from mongo shell: > rs.config()), contains the canonical hostnames and not the ip addresses. As such, even if you initially specify an ip address, we are going to use whatever is in the replica set config to contact the members of the replica set. This is simply how it works and all drivers do this.

So, what will happen is this:

1) We start connect using the ip addresses you provide on the seed list.
2) We have at least 1 connection open to each server that was created this way.
3) We send command

{ isMaster : 1 }

and discover that none of the ip addresses on the seed list are in the isMaster response.
4) We remove all the servers and replace them with the ones specified in the connection string
4a) The one caveat is that we change the address of the server who just received the reply because isMaster contains a "me" field, so we are able to transparently correlate the ip with the hostname. We are planning on stopping doing this one in future versions of the driver as it is just confusing and doesn't really provide much value.
5) All new connections use the hostnames.

If the hostnames can't resolve, then you might have been able to connect and even do a couple of operations before things go wonky.

Comment by Karl Stoney [ 07/Oct/13 ]

Irritatingly i'm now unable to reproduce it.... yay for intermittent errors

Comment by Karl Stoney [ 07/Oct/13 ]

What i find interesting is that it's trying to resolve a hostname, even though my connection string is specifying an IP.

Comment by Karl Stoney [ 07/Oct/13 ]

Hi mate,
Version 1.8.3
Here's an example call stack:
Add me on Skype some time it's karl.stoney

[SocketException (0x2afc): The requested name is valid, but no data of the requested type was found]
System.Net.Dns.InternalGetHostByName(String hostName, Boolean includeIPv6) +8777373
System.Net.Dns.GetHostAddresses(String hostNameOrAddress) +366
MongoDB.Driver.MongoServerAddress.ToIPEndPoint(AddressFamily addressFamily) +21
MongoDB.Driver.MongoServerInstance.GetIPEndPoint() +70
MongoDB.Driver.Internal.MongoConnection.Open() +65
MongoDB.Driver.Internal.MongoConnection.SendMessage(BsonBuffer buffer, Int32 requestId) +406
MongoDB.Driver.Internal.MongoConnection.SendMessage(MongoRequestMessage message) +310
MongoDB.Driver.Operations.CommandOperation`1.Execute(MongoConnection connection) +228
MongoDB.Driver.MongoServerInstance.Ping(MongoConnection connection) +384
MongoDB.Driver.MongoServerInstance.Connect() +681
MongoDB.Driver.Internal.DirectMongoServerProxy.Connect(TimeSpan timeout, ReadPreference readPreference) +330

[MongoConnectionException: Unable to connect to server IP:PORT: The requested name is valid, but no data of the requested type was found.]
MongoDB.Driver.Internal.DirectMongoServerProxy.Connect(TimeSpan timeout, ReadPreference readPreference) +887
MongoDB.Driver.Internal.DirectMongoServerProxy.ChooseServerInstance(ReadPreference readPreference) +85
MongoDB.Driver.MongoServer.AcquireConnection(ReadPreference readPreference) +299
MongoDB.Driver.MongoDatabase.RunCommandAs(IMongoCommand command, IBsonSerializer resultSerializer, IBsonSerializationOptions resultSerializationOptions) +603
MongoDB.Driver.MongoDatabase.RunCommandAs(IMongoCommand command) +122
MongoDB.Driver.MongoCursor.Size() +339
MongoDB.Driver.Linq.SelectQuery.<TranslateCount>b__2(IEnumerable source) +50
MongoDB.Driver.Linq.SelectQuery.Execute() +1529
MongoDB.Driver.Linq.MongoQueryProvider.Execute(Expression expression) +101

Comment by Craig Wilson [ 07/Oct/13 ]

Hi Karl,
Having a video session would probably be a great help. Just to confirm, are you seeing this problem with the message "The requested name is valid, but no data of the requested type was found." Also, what version of the driver are you using.

Thanks.

Comment by Karl Stoney [ 05/Oct/13 ]

Hi,
If it helps - I can pretty much reproduce this on demand.
I have a job (on azure) that does a large amount of writes (maybe, 15-20 per second) to mongo. If I have a replicate set enabled, i'm guaranteed to get this error within a minute or two, turning the replicaset off gets rid of that. This process is a command line console application - so no IIS app pool malarky going on.

To try and rule out dns issues i've also used an ip in my rs configuration but to no avail.

If there's you'd like from me to help you get to the bottom of this issue, just let me know (happy to even share my screen over skype or something if it helps) - just drop me an email.

Karl

Comment by Craig Wilson [ 04/Oct/13 ]

Sorry, but no. We've attempted to reproduce this and are thus far unable to. There seems to, as I mentioned prior, to be something going on with DNS. As I mentioned before, a couple of things would help us out:

  1. When this happens the next time, instead of recycling the IIS app pool, attempt to do an ipconfig /flushdns. In theory, this will cause the system DNS to refresh itself and maybe get fix the "requested name is valid..." issue.
  2. Enable network tracing and see if that points out anything going on in the networking stack. As this error ultimately emanates from outside the driver and then flows through, it'd be extremely helpful to see what is going on down there.
Comment by Stefano Ricciardi [ 04/Oct/13 ]

Craig,

any news on this issue?

Comment by Stefano Ricciardi [ 23/Sep/13 ]

Craig,

the deployment is composed by 4 web applications running on IIS hitting a Mongo replica-set comprising 2 Linux boxes on CentOs plus 1arbiter on Windows. All boxes run on Rackspace managed hosting. The guys from operations confirmed we use A records.

We have switched to C# driver 1.8.1 more one month ago, and since then we had to restart the application pool on each machine at least once since they would not connect to the replica set anymore after some network glitch (sometimes it took us many hours to realize that we had a problem with one of the machines, since our monitoring service reported both Mongo and the web applications as running). Sometimes we have issues on more than one machine at the same time, most of the times the issue is isolated to one machine only.

Ironically enough we do have tons of application logs on Mongo itseif - written by the log4net Mongo appender - saying that the application cannot connect to Mongo: that is, log4net is somehow able to connect to Mongo while the rest of our application cannot. The connection string we have used for log4net (probably due to a copy and paste error) points to the same members but hasn't got the "replicaSet" parameter in it. Does the C# driver treat the two connection strings any different (other than the timeout and pool size parameters)?

Just to make the difference more clear:

  • Application data connection string: mongodb://host1:27017,host2:27017,arbiter:27017/?replicaSet=ump_replSet;socketTimeout=120000ms;maxPoolSize=200
  • Log4net Mongo appender connection string: mongodb://host1:27017,host2:27017,arbiter:27017/logs

Another difference might also be in the way we create database instances in the application code as opposed to log4net:

  • Application: here we have a singleton MongoDatabase object ("DB") that we pass around in the application to the repositories that need to interact with Mongo.

    var mongoUrlCnString = new MongoUrl(connectionString);
    var client = new MongoClient(mongoUrlCnString);
    MongoServer server = client.GetServer();
    DB = server.GetDatabase(dbName);

  • log4net: the appender seems to create a new MongoDatabase instance via the following function that is invoked every time a new log line is ready to be written:

    private MongoDatabase GetDatabase()
    {
    MongoUrl url = MongoUrl.Create(ConnectionString);
    MongoServer conn = MongoServer.Create(url);
    MongoDatabase db = conn.GetDatabase(url.DatabaseName ?? "log4net");
    return db;
    }

Comment by Craig Wilson [ 20/Sep/13 ]

The alternating makes sense. Basically, what has happened is that we have attempted to open a new connection to the Primary, but Dns has failed to resolve the primaries ip address by the hostname we have on file. At that point, you'll get the SocketException. Then, in the period between our heartbeats (attempts to reconnect), other threads won't be able to get a connection to the primary. So, the alternating makes sense.

I have some follow-up questions:

  1. Are the hosts in Amazon or Azure, or are they local?
  2. When this happens on one box, does it happen on others as well?

The problem here (and in looking back at CSHARP-754 as well), there seems to be an issue with DNS. Have you guys done anything special with your DNS server? Are you using A records or CNAMEs for these hosts? The .NET DNS classes are simply going to call into the Win32 API, and windows has a TTL setting for DNS lookups. When/if this happens again, instead of restarting the application pool, try to flush the dns cache instead. You can do this with

ipconfig /flushdns

.

Also, if you wish, you can enable network tracing and see if that points out anything going on in the networking stack.

Generated at Wed Feb 07 21:37:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.