-
Type: Bug
-
Resolution: Fixed
-
Priority: Critical - P2
-
Affects Version/s: 2.2.30
-
Component/s: MongoDB 3.4
-
Environment:Tested against Node driver 2.2.31 with Node 8.5.0 and both MongoDB 3.0.11 and 3.4.9.
-
Empty show more show less
This is basically a continuation of NODE-1039.
We've been investigating user reports that Meteor's oplog tailing operation can
stall. I've tracked it down to a race condition in the Node Mongo driver.
There honestly might be more than one underlying bug, but this one appears to be
the culprit at least some of the time.
My understanding of the issue: It is possible for Pool's availableConnections
list to accidentally contain duplicates. This then leads to the socketCount() in
connectionFailureHandler to stay at 1 even after the last connection has been
removed, because of its duplicate. This means that the server and pool do not
got destroyed.
This has a couple of effects.
One of them is that, if messages end up in the 'queue' of the pool, they will
never trigger the NODE-1039 fix (which is itself buggy due to referencing a
nonexistent variable).
Another effect is that the `self.topology.isConnected` check in cursor.js's
`nextFunction` can return true even when we actually closed our last connection
to a server. That means that the cursor fails to take advantage of the
disconnectHandler and instead tries to send the query immediately on the server
which is actually closed but whose state is somewhat corrupted.
In practice the latter seems to be the most directly applicable bug, because
fixing this issue means that the messages in question never end up on the Pool's
`queue` in the first place.
How does this availableConnections duplicate situation occur? Honestly there may
be multiple ways, as the list is manipulated in a bunch of places with no
duplicate checks.
The one I found is: if there is an in-flight replset `pingServer` ismaster
command when MongoClient calls Pool.prototype.auth, then its connection will be in
`inUseConnections` rather than `availableConnections`.
`authenticateLiveConnections` will *not* clear that list, and so when its
response comes back, `authenticateStragglers` will move the connection from
`inUseConnections` to `availableConnections`. Then when
`authenticateLiveConnections` has successfully authed the connection, its concat
will get the connection in there twice.
I have a complete reproduction available at https://github.com/glasser/mongo-driver-race-condition
I will send in a PR with the fix for this particular issue. In general, though, the management of the connection lists in pool.js could use extra error handling around avoiding duplicates. I didn't do that, primarily because I'm not sure what the error handling should look like — crash the app so it's easier to find other bugs? Just avoid adding the duplicate in the first place silently? The latter is simpler but it might mask deeper logic bugs if the expectation was that duplicates weren't possible.