  1. C Driver
  2. CDRIVER-2552

Race condition causes assert failure after secondary removed from replica set

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 1.10.0
    • Affects Version/s: 1.5.0
    • Component/s: libmongoc, replset
    • None

      I made a mistake implementing CDRIVER-562, which can result in an assert failure when connected to a replica set. I'm not certain what race condition, or perhaps replica set misconfiguration, can cause it, but I can describe the sequence of events and a fix anyway.

      1. Driver discovers a replica set with a primary A and at least one secondary B
      2. Driver opens a new connection to A (either because of a prior disconnect in single mode, or while expanding the pool in pooled mode)
      3. Driver handshakes the new connection by calling isMaster on A
      4. A, for some reason, suddenly does not include B in its host list, perhaps because B was removed from the configuration, or because of some persistent misconfiguration
      5. The driver marks B as "retired" so it can be removed from the topology description soon (this is my implementation of updateRSFromPrimary in the Server Discovery and Monitoring Spec, with a simplification to avoid a crash, CDRIVER-789).
      6. The driver's connection to B is disconnected somehow.
      7. During the next scan, in mongoc_topology_scanner_node_setup, the driver sees that B is disconnected, and before it begins reconnecting, it asserts that B is not marked "retired"

      The point of the assert was to validate my understanding of the "retired" field. I thought it was always cleared after a scan; I didn't realize that CDRIVER-562, which I implemented recently, had the side effect of allowing "retired" to be set before the next scan began.

      The solution is probably to remove, at the beginning of each scan, any nodes that have been retired by handshakes. With that in place, the assert can remain.
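
      As a rough illustration of the proposed fix, here is a minimal, self-contained sketch. It is not libmongoc code: the toy_* types and functions are hypothetical stand-ins invented for this example, and the real scanner tracks far more state. It only models the "retired" flag, a handshake that retires hosts missing from the primary's host list, and a scan that purges retired nodes before checking the assert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_NODES 8

/* Hypothetical, simplified stand-in for a scanner node. */
typedef struct {
   const char *host;
   bool connected;
   bool retired; /* set when a primary's host list omits this host */
} toy_node_t;

typedef struct {
   toy_node_t nodes[TOY_MAX_NODES];
   size_t n_nodes;
} toy_scanner_t;

void
toy_scanner_add (toy_scanner_t *ts, const char *host)
{
   assert (ts->n_nodes < TOY_MAX_NODES);
   ts->nodes[ts->n_nodes++] = (toy_node_t){host, true, false};
}

/* Steps 3-5: a handshake with the primary returns a host list;
 * any known node missing from that list is marked retired. */
void
toy_retire_missing (toy_scanner_t *ts, const char *const *hosts, size_t n_hosts)
{
   size_t i, j;

   for (i = 0; i < ts->n_nodes; i++) {
      bool found = false;
      for (j = 0; j < n_hosts; j++) {
         if (strcmp (ts->nodes[i].host, hosts[j]) == 0) {
            found = true;
         }
      }
      if (!found) {
         ts->nodes[i].retired = true;
      }
   }
}

/* Proposed fix: drop nodes retired by handshakes. Returns how many were removed. */
size_t
toy_purge_retired (toy_scanner_t *ts)
{
   size_t i, kept = 0, removed;

   for (i = 0; i < ts->n_nodes; i++) {
      if (!ts->nodes[i].retired) {
         ts->nodes[kept++] = ts->nodes[i];
      }
   }
   removed = ts->n_nodes - kept;
   ts->n_nodes = kept;
   return removed;
}

/* Step 7: begin a scan. Returns the number of nodes that were reconnected. */
size_t
toy_scan_start (toy_scanner_t *ts)
{
   size_t i, reconnects = 0;

   toy_purge_retired (ts); /* without this call, the assert below can fail */

   for (i = 0; i < ts->n_nodes; i++) {
      if (!ts->nodes[i].connected) {
         assert (!ts->nodes[i].retired); /* the original assert stays */
         ts->nodes[i].connected = true;
         reconnects++;
      }
   }
   return reconnects;
}
```

      To replay the sequence above with this sketch: add "A" and "B" with toy_scanner_add, call toy_retire_missing with a host list containing only "A", set B's connected flag to false, then call toy_scan_start. B is purged before the loop runs, so the assert is never evaluated against a retired node and only A remains.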

      In addition, there was a bug not only when _mongoc_topology_update_from_handshake retires a node, but also when it adds one. The newly added node has a mongoc_async_cmd_t created for it, even though it is outside the scanner loop. Therefore, in mongoc_topology_scanner_start, the assert (!node->cmd) fails. This was fixed as a side effect of CDRIVER-1972 during 1.10 development.

      This fix will not be backported.

            Assignee:
            jesse@mongodb.com A. Jesse Jiryu Davis
            Reporter:
            jesse@mongodb.com A. Jesse Jiryu Davis