Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Fixed
Priority: Critical - P2
Fix Version/s: 3.0.0
Affects Version/s: 2.2.30
Component/s: MongoDB 3.4
Labels:
- external-user
Environment:
Tested against Node driver 2.2.31 with Node 8.5.0 and both MongoDB 3.0.11 and 3.4.9.

Confidence Status:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Link:
None
Goal Name(s):
None

This is basically a continuation of ~~NODE-1039~~.

We've been investigating user reports that Meteor's oplog tailing operation can
stall. I've tracked it down to a race condition in the Node Mongo driver.

There honestly might be more than one underlying bug, but this one appears to be
the culprit at least some of the time.

My understanding of the issue: It is possible for Pool's availableConnections
list to accidentally contain duplicates. This then leads to the socketCount() in
connectionFailureHandler to stay at 1 even after the last connection has been
removed, because of its duplicate. This means that the server and pool do not
got destroyed.

This has a couple of effects.

One of them is that, if messages end up in the 'queue' of the pool, they will
never trigger the ~~NODE-1039~~ fix (which is itself buggy due to referencing a
nonexistent variable).

Another effect is that the `self.topology.isConnected` check in cursor.js's
`nextFunction` can return true even when we actually closed our last connection
to a server. That means that the cursor fails to take advantage of the
disconnectHandler and instead tries to send the query immediately on the server
which is actually closed but whose state is somewhat corrupted.

In practice the latter seems to be the most directly applicable bug, because
fixing this issue means that the messages in question never end up on the Pool's
`queue` in the first place.

How does this availableConnections duplicate situation occur? Honestly there may
be multiple ways, as the list is manipulated in a bunch of places with no
duplicate checks.

The one I found is: if there is an in-flight replset `pingServer` ismaster
command when MongoClient calls Pool.prototype.auth, then its connection will be in
`inUseConnections` rather than `availableConnections`.
`authenticateLiveConnections` will *not* clear that list, and so when its
response comes back, `authenticateStragglers` will move the connection from
`inUseConnections` to `availableConnections`. Then when
`authenticateLiveConnections` has successfully authed the connection, its concat
will get the connection in there twice.

I have a complete reproduction available at https://github.com/glasser/mongo-driver-race-condition

I will send in a PR with the fix for this particular issue. In general, though, the management of the connection lists in pool.js could use extra error handling around avoiding duplicates. I didn't do that, primarily because I'm not sure what the error handling should look like — crash the app so it's easier to find other bugs? Just avoid adding the duplicate in the first place silently? The latter is simpler but it might mask deeper logic bugs if the expectation was that duplicates weren't possible.

Assignee:: Matt Broadstone
Reporter:: David Glasser
Reviewers:: None
Votes:: 3 Vote for this issue
Watchers:: 6 Start watching this issue

Created:: Oct 05 2017 01:26:26 AM UTC
Updated:: Oct 29 2023 01:44:20 PM UTC
Resolved:: Jan 31 2019 07:59:31 PM UTC

Details

Description

Attachments

Activity

People

Dates