[CSHARP-187] BsonBuffer waiting indefinitely on broken connection Created: 31/Mar/11  Updated: 02/Apr/15  Resolved: 01/Apr/11

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: 1.0
Fix Version/s: 1.1

Type: Bug Priority: Major - P3
Reporter: Aristarkh Zagorodnikov Assignee: Robert Stam
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

BsonBuffer.LoadFrom contains the following loop:

while (bytesPending > 0) {
var bytesRead = stream.Read(localChunk, localChunkOffset, bytesPending);
if (bytesRead == 0)

{ // TODO: timeout? Thread.Sleep(5); // just enough to not be busy waiting }

else

{ localChunkOffset += bytesRead; bytesPending -= bytesRead; }

}

The "timeout" is a necessity here, since when the connection is gracefully closed from the MongoDB side (by stopping the server), the client connection remains active:
netstat -a -n | find ":27017"
TCP client:5171 mongodb:27017 CLOSE_WAIT

This leads to an endless loop that just hangs the client thread when the server is gone.



 Comments   
Comment by Aristarkh Zagorodnikov [ 05/Apr/11 ]

Thanks for the fix, I'll put it through some heavy testing (multiple occasionally crashing servers in a cluster) in a few days.

Comment by Robert Stam [ 05/Apr/11 ]

Pushed a new fix to this bug based on Aristarkh's research. We are now treating a return value of zero from NetworkStream.Read as end of stream with no timeout needed.

Comment by Aristarkh Zagorodnikov [ 02/Apr/11 ]

I checked the Mono project docs for NetworkStream.Read() also (http://www.go-mono.com/docs/monodoc.ashx?link=M%3aSystem.Net.Sockets.NetworkStream.Read(System.Byte%5b%5d%2cSystem.Int32%2cSystem.Int32))
"When no incoming data is available, this method blocks and waits for data to arrive.
If the remote socket was shut down gracefully (Socket.Shutdown(SocketShutdown) was called on the socket or the SocketOptionName.Linger option was enabled and Socket.Close was called on the socket) and all data was received, this method immediately returns zero."

So it appears that Mono handles it in the same way I described above:

Comment by Aristarkh Zagorodnikov [ 02/Apr/11 ]

I did a small research and it appears that timeout might be unnecessary altogether.

The documentation on NetworkStream.Read (http://msdn.microsoft.com/en-us/library/xxst1299.aspx) says:
"If no data is available for reading, the Read method returns 0. The Read operation reads as much data as is available, up to the number of bytes specified by the size parameter. If the remote host shuts down the connection, and all available data has been received, the Read method completes immediately and return zero bytes."

It looks like "If no data is available for reading, the Read method returns 0" part is plainly wrong, check this link: http://social.msdn.microsoft.com/forums/en-US/netfxnetcom/thread/6dbd1467-6a90-43bb-abaa-df5ca5a1cd85

I used .NET Reflector to check out how NetworkStream.Read() is implemented:
public override int NetworkStream.Read([In, Out] byte[] buffer, int offset, int size)
{
... various parameter checks go here, only exceptions are thrown ...
try

{ num2 = streamSocket.Receive(buffer, offset, size, SocketFlags.None); }

catch (Exception exception)
{
if (((exception is ThreadAbortException) || (exception is StackOverflowException)) || (exception is OutOfMemoryException))

{ throw; }

throw new IOException(SR.GetString("net_io_readfailure", new object[]

{ exception.Message }

), exception);
}
return num2;
}

So it appears that it's just a wrapper around Socket.Receive(), and its documentation (http://msdn.microsoft.com/en-us/library/w3xtz6a5.aspx) says:
"If you are using a connection-oriented Socket, the Receive method will read as much data as is available, up to the number of bytes specified by the size parameter. If the remote host shuts down the Socket connection with the Shutdown method, and all available data has been received, the Receive method will complete immediately and return zero bytes."
"If no data is available for reading, the Receive method will block until data is available, unless a time-out value was set by using Socket.ReceiveTimeout. If the time-out value was exceeded, the Receive call will throw a SocketException. If you are in non-blocking mode, and there is no data available in the in the protocol stack buffer, the Receive method will complete immediately and throw a SocketException."

Take note that the Socket.Receive() documentation explicitly mentions that it blocks if it can't read at least some of the data, and returns zero only if the remote end is shut down.

So, it appears that there is an old, wrong text in documentation (I believe it wasn't wrong around .NET 1.1, but it changed with 2.0) and Mike Flasko (I think he was System.Net sub-library PM at that time) acknowledges that. It looks like that fix never got in for the last five years though.

I agree that this might need additional testing (implementing this wrong might break existing applications completely, I volunteer to help if you need to repro something), but I think that getting immediate disconnection detection (not mentioning the warming sense of "done right") would be a very good thing to have =)

Comment by Robert Stam [ 01/Apr/11 ]

Fixed. Added timeout to BsonBuffer.LoadFrom.

Comment by Aristarkh Zagorodnikov [ 01/Apr/11 ]

Good to know you got it. I look forward to C# driver being a very robust one since I'm going to put it to heavy use in the near future in an environment where server going down is a norm =)

Comment by Robert Stam [ 01/Apr/11 ]

When running the server on Ubuntu I can sometimes reproduce this. Depends on the exact moment that the server is killed.

1. If the server is killed and the next operation on the client is SendMessage an exception is thrown
2. If the server is killed after the client calls SendMessage and before it calls ReceiveMessage the client hangs

So I guess the timeout is necessary after all. Will put it in.

Thanks for reporting this!

Comment by Robert Stam [ 01/Apr/11 ]

When I attempt to reproduce this I always get an IOException instead of a hung client.

The only difference is that I am running the server on another Windows machine, not Ubuntu. Will setup an Ubuntu environment to test with.

Unhandled exception:
System.IO.IOException: Unable to write data to the transport connection: An exis
ting connection was forcibly closed by the remote host. ---> System.Net.Sockets.
SocketException: An existing connection was forcibly closed by the remote host
at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, So
cketFlags socketFlags)
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32
size)
— End of inner exception stack trace —
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32
size)
at MongoDB.Bson.IO.BsonBuffer.WriteTo(Stream stream) in C:\work\10gen\mongodb
\mongo-csharp-driver\Bson\IO\BsonBuffer.cs:line 761
at MongoDB.Driver.Internal.MongoConnection.SendMessage(MongoRequestMessage me
ssage, SafeMode safeMode) in C:\work\10gen\mongodb\mongo-csharp-driver\Driver\In
ternal\MongoConnection.cs:line 381
at MongoDB.Driver.Internal.MongoCursorEnumerator`1.GetReply(MongoRequestMessa
ge message) in C:\work\10gen\mongodb\mongo-csharp-driver\Driver\Internal\MongoCu
rsorEnumerator.cs:line 201
at MongoDB.Driver.Internal.MongoCursorEnumerator`1.GetMore() in C:\work\10gen
\mongodb\mongo-csharp-driver\Driver\Internal\MongoCursorEnumerator.cs:line 194
at MongoDB.Driver.Internal.MongoCursorEnumerator`1.MoveNext() in C:\work\10ge
n\mongodb\mongo-csharp-driver\Driver\Internal\MongoCursorEnumerator.cs:line 103
at TestBrokenConnection.Program.Main(String[] args) in c:\users\robert stam\d
ocuments\visual studio 2010\Projects\TestBrokenConnection\TestBrokenConnection\P
rogram.cs:line 37
Press Enter to continue

Comment by Aristarkh Zagorodnikov [ 31/Mar/11 ]

Here I allow one test to finish and then stop the database when the 2nd test is in progress.
Sorry for the wall of text, but I think this might be helpful – it's an extract of the log:

Thu Mar 31 12:07:32 [initandlisten] MongoDB starting : pid=2013 port=27017 dbpath=/data/mongodb 64-bit
Thu Mar 31 12:07:32 [initandlisten] db version v1.8.0, pdfile version 4.5
Thu Mar 31 12:07:32 [initandlisten] git version: 9c28b1d608df0ed6ebe791f63682370082da41c0
Thu Mar 31 12:07:32 [initandlisten] build sys info: Linux bs-linux64.10gen.cc 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28
Thu Mar 31 12:07:32 [initandlisten] journal dir=/data/mongodb/journal
Thu Mar 31 12:07:32 [initandlisten] recover : no journal files present, no recovery needed
Thu Mar 31 12:07:32 [initandlisten] waiting for connections on port 27017
Thu Mar 31 12:07:32 [websvr] web admin interface listening on port 28017
Thu Mar 31 12:07:32 [initandlisten] connection accepted from 127.0.0.1:55279 #1
Thu Mar 31 12:07:32 [initandlisten] connection accepted from 192.168.7.243:38367 #2
^^^^ self (celestine-3)

Thu Mar 31 12:07:32 [startReplSets] replSet STARTUP2
Thu Mar 31 12:07:32 [rs Manager] replSet can't see a majority, will not try to elect self
Thu Mar 31 12:07:32 [ReplSetHealthPollTask] replSet info celestine-2 is up
Thu Mar 31 12:07:32 [ReplSetHealthPollTask] replSet member celestine-2 SECONDARY
Thu Mar 31 12:07:32 [replica set sync] replSet SECONDARY
Thu Mar 31 12:07:33 [rs Manager] replSet info electSelf 3
Thu Mar 31 12:07:33 [MultiCommandJob] DBClientCursor::init call() failed
Thu Mar 31 12:07:33 [rs Manager] replSet couldn't elect self, only received -9999 votes
Thu Mar 31 12:07:34 [ReplSetHealthPollTask] replSet info celestine-1 is up
Thu Mar 31 12:07:34 [ReplSetHealthPollTask] replSet member celestine-1 SECONDARY
Thu Mar 31 12:07:35 [initandlisten] connection accepted from 192.168.7.242:47302 #3
^^^^ replication partner (celestine-2)

Thu Mar 31 12:07:35 [initandlisten] connection accepted from 192.168.7.241:55912 #4
^^^^ replication partner (celestine-1)

Thu Mar 31 12:07:35 [rs Manager] replSet info electSelf 3
Thu Mar 31 12:07:35 [rs Manager] replSet PRIMARY
Thu Mar 31 12:07:37 [initandlisten] connection accepted from 192.168.7.241:55914 #5
^^^^ replication partner (celestine-1)

Thu Mar 31 12:07:43 [initandlisten] connection accepted from 192.168.7.44:5395 #6
^^^^ client connected

Thu Mar 31 12:07:43 [conn6] getmore test1.items cid:2776051345393985494 getMore: {} bytes:4194290 nreturned:144630 130ms
^^^^ cursor continuation

Thu Mar 31 12:07:44 [initandlisten] connection accepted from 192.168.7.242:47303 #7
^^^^ replication partner (celestine-2)

Thu Mar 31 12:07:46 [conn6] getmore test1.items cid:2776051345393985494 getMore: {} bytes:3976616 nreturned:137124 129ms
^^^^ cursor continuation

Thu Mar 31 12:07:50 [conn6] end connection 192.168.7.44:5395
^^^^ client disconnected

Thu Mar 31 12:07:51 [initandlisten] connection accepted from 192.168.7.44:5396 #8
^^^^ client connected

Thu Mar 31 12:07:52 [conn8] getmore test1.items cid:5322570835678218172 getMore: {} bytes:4194290 nreturned:144630 109ms
^^^^ cursor continuation

Thu Mar 31 12:07:53 got kill or ctrl c or hup signal 15 (Terminated), will terminate after current cmd ends
^^^ got SIGHUP

Thu Mar 31 12:07:53 [interruptThread] now exiting
Thu Mar 31 12:07:53 dbexit:
Thu Mar 31 12:07:53 [interruptThread] shutdown: going to close listening sockets...
Thu Mar 31 12:07:53 [interruptThread] closing listening socket: 6
Thu Mar 31 12:07:53 [interruptThread] closing listening socket: 7
Thu Mar 31 12:07:53 [interruptThread] closing listening socket: 8
Thu Mar 31 12:07:53 [interruptThread] closing listening socket: 9
Thu Mar 31 12:07:53 [interruptThread] removing socket file: /tmp/mongodb-27017.sock
Thu Mar 31 12:07:53 [interruptThread] removing socket file: /tmp/mongodb-28017.sock
Thu Mar 31 12:07:53 [interruptThread] shutdown: going to flush diaglog...
Thu Mar 31 12:07:53 [interruptThread] shutdown: going to close sockets...
Thu Mar 31 12:07:53 [interruptThread] shutdown: waiting for fs preallocator...
Thu Mar 31 12:07:53 [interruptThread] shutdown: lock for final commit...
Thu Mar 31 12:07:53 [interruptThread] shutdown: final commit...
Thu Mar 31 12:07:53 [interruptThread] shutdown: closing all files...
Thu Mar 31 12:07:53 [conn2] end connection 192.168.7.243:38367
^^^^ self (celestine-3)

Thu Mar 31 12:07:53 [conn1] end connection 127.0.0.1:55279
Thu Mar 31 12:07:53 closeAllFiles() finished
Thu Mar 31 12:07:53 [interruptThread] shutdown: journalCleanup...
Thu Mar 31 12:07:53 [interruptThread] removeJournalFiles
Thu Mar 31 12:07:53 [interruptThread] shutdown: removing fs lock...
Thu Mar 31 12:07:53 dbexit: really exiting now

Comment by Aristarkh Zagorodnikov [ 31/Mar/11 ]

Nope, I used graceful stop using the script that comes with the Ubuntu package: sudo service stop mongodb
I think the script just sends a SIGHUP to the MongoDB process.

Comment by Robert Stam [ 31/Mar/11 ]

OK. So looks like part of the key to reproducing is to make sure the "items" collection has enough items in it that this loop take many seconds to execute, giving time for you to stop the server.

How are you stopping the server? CTL-C?

Thanks.

Comment by Aristarkh Zagorodnikov [ 31/Mar/11 ]

I still use the same test code as before:
static void ReadTest()
{
try

{ const string connectionString = "mongodb://celestine-3/test1?slaveOk=true"; var database = MongoDatabase.Create(MongoUrl.Create(connectionString)); var items = database["items"]; var cursor = items.FindAll(); long sum = 0; foreach (var item in cursor) sum += item["v"].ToInt64(); }

catch (Exception ex)

{ Console.WriteLine(ex.ToString()); }

Console.ReadKey();
}

celestine-3 is running Linux x86-64 (Ubuntu 10.04), with mongodb 1.8.0 release (binary from 10gen ubuntu repo) as a primary (doesn't matter, secondaries fail in the same way, I guess that non-replicated machines would do the same) member of a 3-member replica set.
The client is running Windows 7 x64 SP1, .NET 4.0, build is done under VS 2010 SP1.
The "test1" database contains "items" collection which has several hundred thousand items like {_id: ObjectId, v: some_number}.
I run the ReadTest() method, wait about 3 seconds, then stop the mongodb service on celestine-3. I check the process to be stopped, it's gone, but the connection hangs in the "CLOSE_WAIT" state indefinitely and stream.Read() just returns zero. I left the client machine in this state for about half an hour – when I returned the client was still looping.

Comment by Robert Stam [ 31/Mar/11 ]

Is there an easy way to reproduce this?

Generated at Wed Feb 07 21:36:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.