  1. C Driver
  2. CDRIVER-3625

Multi-threaded scanning should scan servers independently

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 1.17.0-beta2, 1.17.0
    • Affects Version/s: None
    • Component/s: SDAM
    • Labels:
      None

      For a client pool, the background topology scanner runs a complete scan of all servers after heartbeatFrequencyMS has passed (or sooner, if a scan is requested).

      The background scan uses mongoc_topology_scan_once. This fans out "ismaster" commands and waits for all responses before another scan can be scheduled.

      The main problem is that one slow server can block the next scheduled scan of every other server. The timeout of an "ismaster" during scanning is connectTimeoutMS, which may exceed heartbeatFrequencyMS. This scenario can easily happen:

      1. Scan requested
      2. "ismaster" is sent to servers X and Y
      3. X responds quickly, but Y hangs for connectTimeoutMS.
      4. The "ismaster" to Y times out.
      5. The background thread sees that more than heartbeatFrequencyMS has passed and starts a new complete scan.
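
      The scenario above can be sketched as a small timing model. This is a simplified illustration, not the driver's implementation: the function names and the idea of passing per-server reply times as an array are assumptions made for the example. It only captures the point that a fan-out scan lasts as long as its slowest server, capped at connectTimeoutMS.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a blocking fan-out scan finishes only when the
 * slowest server has replied, or when its "ismaster" hits the timeout. */
int64_t
scan_duration_ms (const int64_t *reply_ms, size_t n, int64_t connect_timeout_ms)
{
   int64_t slowest = 0;
   for (size_t i = 0; i < n; i++) {
      /* A server that takes longer than the timeout is cut off at the timeout. */
      int64_t t = reply_ms[i] < connect_timeout_ms ? reply_ms[i] : connect_timeout_ms;
      if (t > slowest) {
         slowest = t;
      }
   }
   return slowest;
}

/* Earliest time, in ms after the scan started, at which the next complete
 * scan can begin: no sooner than the heartbeat interval, and no sooner
 * than the end of the current scan. */
int64_t
earliest_next_scan_ms (const int64_t *reply_ms,
                       size_t n,
                       int64_t connect_timeout_ms,
                       int64_t heartbeat_ms)
{
   int64_t d = scan_duration_ms (reply_ms, n, connect_timeout_ms);
   return d > heartbeat_ms ? d : heartbeat_ms;
}
```

      With connectTimeoutMS=20000, heartbeatFrequencyMS=1000, one server replying in 5 ms and one hung server, earliest_next_scan_ms returns 20000 in this model, so the fast server also waits 20 seconds for its next check. If both servers reply quickly, it returns 1000.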

      I've reproduced this behavior by modifying example-sdam-monitoring.c. It overrides the stream initializer to simulate a slow connection to one server.
      https://gist.github.com/kevinAlbs/1eb3fd42a2b17d71f99e4d9389661069

      Running it against a two node replica set shows the behavior:

      $ example-sdam-monitoring "mongodb://localhost:27017,localhost:27018/?connectTimeoutMS=20000&heartbeatFrequencyMS=1000"
      ...
      2020/04/15 13:37:54.0521: [78700]:    DEBUG:       mongoc: localhost:27017 heartbeat started
      2020/04/15 13:37:54.0524: [78700]:    DEBUG:       mongoc: localhost:27018 heartbeat started
      ...
      2020/04/15 13:38:14.0633: [78700]:    DEBUG:       mongoc: localhost:27018 heartbeat failed: socket timeout calling ismaster on 'localhost:27018'
      2020/04/15 13:38:15.0137: [78700]:    DEBUG:       mongoc: localhost:27017 heartbeat started
      

      The second heartbeat to localhost:27017 does not start until the 20 second connection timeout on localhost:27018 has expired.

      This behavior is unavoidable for single-threaded scans, but should not be the case for multi-threaded scans. Servers should be scanned at their own intervals (which also better aligns with the server monitoring spec).
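
      To illustrate what scanning servers at their own intervals would change, here is a rough virtual-time sketch. It is not the driver's scheduler; heartbeats_started is a hypothetical helper, and it assumes a server's next check starts heartbeat_ms after the previous one started, or as soon as the previous one finishes if that is later.

```c
#include <stdint.h>

/* Hypothetical model: count how many checks of ONE server start within
 * window_ms when that server is scheduled independently of the others.
 * check_ms is how long each check of this server takes (reply or timeout). */
int64_t
heartbeats_started (int64_t window_ms, int64_t heartbeat_ms, int64_t check_ms)
{
   /* Time between the starts of two consecutive checks of this server. */
   int64_t step = check_ms > heartbeat_ms ? check_ms : heartbeat_ms;
   int64_t count = 0;

   if (step <= 0) {
      return 0; /* guard against a non-advancing clock */
   }
   for (int64_t t = 0; t < window_ms; t += step) {
      count++;
   }
   return count;
}
```

      Over a 20000 ms window with heartbeatFrequencyMS=1000, a server that replies in 5 ms gets 20 checks in this model, while a server that hangs for the full 20000 ms timeout gets 1. Under the coupled scan described in this ticket, the fast server is limited by the slow one, which corresponds to passing check_ms=20000 for both.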

            Assignee:
            kevin.albertson@mongodb.com Kevin Albertson
            Reporter:
            kevin.albertson@mongodb.com Kevin Albertson