Uploaded image for project: 'Node.js Driver'
  1. Node.js Driver
  2. NODE-1153

Pool connection state can be corrupted, leading to query callbacks that never get called

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Critical - P2 Critical - P2
    • 3.0.0
    • Affects Version/s: 2.2.30
    • Component/s: MongoDB 3.4
    • Environment:
      Tested against Node driver 2.2.31 with Node 8.5.0 and both MongoDB 3.0.11 and 3.4.9.

      This is basically a continuation of NODE-1039.

      We've been investigating user reports that Meteor's oplog tailing operation can
      stall. I've tracked it down to a race condition in the Node Mongo driver.

      There honestly might be more than one underlying bug, but this one appears to be
      the culprit at least some of the time.

      My understanding of the issue: It is possible for Pool's availableConnections
      list to accidentally contain duplicates. This then leads to the socketCount() in
      connectionFailureHandler to stay at 1 even after the last connection has been
      removed, because of its duplicate. This means that the server and pool do not
      got destroyed.

      This has a couple of effects.

      One of them is that, if messages end up in the 'queue' of the pool, they will
      never trigger the NODE-1039 fix (which is itself buggy due to referencing a
      nonexistent variable).

      Another effect is that the `self.topology.isConnected` check in cursor.js's
      `nextFunction` can return true even when we actually closed our last connection
      to a server. That means that the cursor fails to take advantage of the
      disconnectHandler and instead tries to send the query immediately on the server
      which is actually closed but whose state is somewhat corrupted.

      In practice the latter seems to be the most directly applicable bug, because
      fixing this issue means that the messages in question never end up on the Pool's
      `queue` in the first place.

      How does this availableConnections duplicate situation occur? Honestly there may
      be multiple ways, as the list is manipulated in a bunch of places with no
      duplicate checks.

      The one I found is: if there is an in-flight replset `pingServer` ismaster
      command when MongoClient calls Pool.prototype.auth, then its connection will be in
      `inUseConnections` rather than `availableConnections`.
      `authenticateLiveConnections` will *not* clear that list, and so when its
      response comes back, `authenticateStragglers` will move the connection from
      `inUseConnections` to `availableConnections`. Then when
      `authenticateLiveConnections` has successfully authed the connection, its concat
      will get the connection in there twice.

      I have a complete reproduction available at https://github.com/glasser/mongo-driver-race-condition

      I will send in a PR with the fix for this particular issue. In general, though, the management of the connection lists in pool.js could use extra error handling around avoiding duplicates. I didn't do that, primarily because I'm not sure what the error handling should look like — crash the app so it's easier to find other bugs? Just avoid adding the duplicate in the first place silently? The latter is simpler but it might mask deeper logic bugs if the expectation was that duplicates weren't possible.

            Assignee:
            matt.broadstone@mongodb.com Matt Broadstone
            Reporter:
            glasser David Glasser
            Votes:
            3 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated:
              Resolved: