Uploaded image for project: 'Node.js Driver'
  1. Node.js Driver
  2. NODE-1363

Unable to establish a connection to a MongoDB cluster after a connection between Primary and its peers was disrupted and then fixed

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.2
    • Component/s: None
    • Labels:
      None

      Description

      Hi folks,

      I was testing how a 3-nodes cluster of MongoDB behaves (from a client perspective) when one of the nodes is isolated from the cluster and observed an anomaly.

      TL;DR: When a node is isolated, all clients experience a temporary glitch caused by the timeouts and re-election (expected behaviour). After some time all clients continue successfully working with a cluster except the one located on the isolated node (expected behaviour) but when the connection has restored the client on the formerly isolated node is still unable to do progress (it hangs trying to establish a connection).

      Repo with Docker+JavaScript scripts to reproduce the problem: https://github.com/rystsov/perseus/tree/master/mongodb

      Testing environment
      OS: Ubuntu 17.10
      MongoDB: 3.6.1
      NodeJS: v8.9.4
      MongoDB driver "mongodb": 3.0.1

      Three nodes (mongo1, mongo2, mongo3) form a MongoDB cluster running mongod in a single replica set. Besides mongod, each node has a NodeJS app which:

      • in a loop reads a value from the cluster, increments it and writes back
      • exposes API to fetch number of successful and failed iterations

      Also there is a fourth node (client1) which fetches the stat and dumps it every second in the following format:

      #legend: time|mongo1|mongo2|mongo3|mongo1:err|mongo2:err|mongo3:err
      1 14 22 15 0 0 0 2018/01/14 10:53:42
      2 42 43 32 0 0 0 2018/01/14 10:53:43
      3 44 43 31 0 0 0 2018/01/14 10:53:44

      The first column is the number of passed seconds since the beginning of the experiment, the next three columns contain counter of successful iterations of the clients located on each node per last second, the next three columns are counter of errors and the last one is time.

      Client (db/remote-tester)

      Client is a NodeJS app using the following connection URL:

      mongodb://mongo1,mongo2,mongo3/?replicaSet=rs0&autoReconnect=false&socketTimeoutMS=10000&connectTimeoutMS=10000

      Each write operation is performed with writeConcern "majority" while each read operation is with readConcern "linearizable".

      When a client runs into a problem, it closes the current connection and opens a new one.

      Each client uses a different key to avoid collisions.

      Isolation

      The network isolation is simulated with iptables:

      iptables -A INPUT -s p.ee.r.ip -j DROP
      iptables -A OUTPUT -d p.ee.r.ip -j DROP

      During the isolation, the traffic is blocked between mongo1 and (mongo2 & mongo3), but client1 can see all nodes.

      Results

      Let's take a look on the logs of the aggregating client (client1.log):

      #legend: time|mongo1|mongo2|mongo3|mongo1:err|mongo2:err|mongo3:err
      1 14 22 15 0 0 0 2018/01/14 10:53:42
      2 42 43 32 0 0 0 2018/01/14 10:53:43
      3 44 43 31 0 0 0 2018/01/14 10:53:44
      ...
      32 46 40 32 0 0 0 2018/01/14 10:54:13
      33 38 40 33 0 0 0 2018/01/14 10:54:14
      #isolating mongo1
      #isolated mongo1
      34 27 31 29 0 0 0 2018/01/14 10:54:15
      35 0 0 0 0 0 0 2018/01/14 10:54:16
      ...
      62 0 0 0 0 0 0 2018/01/14 10:54:43
      63 0 0 0 0 0 0 2018/01/14 10:54:44
      64 0 6 23 1 1 1 2018/01/14 10:54:45
      65 0 71 76 0 0 0 2018/01/14 10:54:46
      ...
      93 0 69 72 0 0 0 2018/01/14 10:55:14
      94 0 87 70 0 0 0 2018/01/14 10:55:15
      #rejoining mongo1
      #rejoined mongo1
      95 0 71 68 0 0 0 2018/01/14 10:55:16
      96 0 77 84 0 0 0 2018/01/14 10:55:17
      ...
      200 0 73 39 0 0 0 2018/01/14 10:57:02
      201 0 74 47 0 0 0 2018/01/14 10:57:03

      As you can see after the isolation clients on mongo1, mongo2, mongo3 stopped doing progress. 29 seconds later mongo2 and mongo3 recovered, but mongo1 remained silent which is expectable since it wasn't connected to mongo2 and mongo3.

      But even after the connection was restored the mongo1 client still wasn't able to make any progress . At least for the next 106 seconds (after this point I aborted the experiment).

      Logs

      Please see the attached "part-full.tar.gz" archive for logs.

      client1.log

      The log of the aggregating client. Its format was described above.

      mongo(1|2|3).tester.log

      Log of the actual client. In happy case it looks like:

      2018/01/14 10:53:41 reading(0)
      2018/01/14 10:53:41 connecting
      2018/01/14 10:53:41 ok
      2018/01/14 10:53:41 creating(0)
      2018/01/14 10:53:42 ok(0)
      2018/01/14 10:53:42 reading(1)
      2018/01/14 10:53:42 stat
      2018/01/14 10:53:42 updating(1)
      2018/01/14 10:53:42 ok(1)

      Each successful iteration is represented by a sequence of lines (reading, creating|updating, ok). During a connection you'll see "connecting" and "ok" records. "stat" record corresponds to the fetch request from the aggregating client and helps to understand how long the previous operation hangs.

      If there are errors during the execution, then they also are dumped into this log like:

      • "MongoNetworkError: connection 0 to mongo1:27017 timed out"
      • "MongoNetworkError: failed to connect to server mongo2:27017 on first connect"

      mongo(1|2|3).log

      Log of the mongod processes.

      Why do I think it's a bug rather than a misconfiguration?

      The bug isn't always reproducible and sometimes it after the connection is restored the mongo1's client starts working, see the attached "good-full.tar.gz".

      Also, it looks more like the MongoDB's issue than a driver's problem. After I restored the connection between mongo1 and (mongo2 & mongo3) nodes and didn't see any mongo1's improvements, I killed restarted the client (killed -9 it, and let supervisord to revive it) and it still wasn't able to do progress.

      I hope this report will help you!

      Please see the description and the doc here: https://github.com/rystsov/perseus/tree/master/mongodb

        Attachments

        1. good-full.tar.gz
          105 kB
        2. part-full.data.tar.gz
          29.63 MB
        3. part-full.tar.gz
          194 kB

          Issue Links

            Activity

              People

              Assignee:
              matt.broadstone Matt Broadstone
              Reporter:
              rystsov Denis Rystsov
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: