CDRIVER-3625 (C Driver)

Multi-threaded scanning should scan servers independently

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.17.0-beta2, 1.17.0
    • Component/s: sdam
    • Labels: None

      Description

      For a client pool, the background topology scanner runs a complete scan of all servers after heartbeatFrequencyMS has passed (or sooner, if a scan is requested).

      The background scan uses mongoc_topology_scan_once. This fans out "ismaster" commands and waits for all responses before another scan can be scheduled.

      A major problem with this approach is that a slow server can block the next scheduled scan of all other servers. The timeout for an "ismaster" during scanning is connectTimeoutMS, which may exceed heartbeatFrequencyMS. The following scenario can easily occur:

      1. Scan requested
      2. "ismaster" is sent to servers X and Y
      3. X responds quickly, but Y hangs for connectTimeoutMS.
      4. The "ismaster" to Y times out.
      5. The background thread sees that more than heartbeatFrequencyMS has passed and starts a new complete scan.
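      The timing in the scenario above can be approximated with a small model. This is only a sketch and not libmongoc code: the function names are hypothetical, all times are in milliseconds, and it assumes the heartbeat interval is measured from the start of a check. It also ignores any minimum delay between scans, so the real driver's numbers will differ somewhat.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timing model, not libmongoc code. All times are in
 * milliseconds. Assumption: the heartbeat interval is measured from the
 * moment a check (or scan) starts. */

static int64_t
max_i64 (int64_t a, int64_t b)
{
   return a > b ? a : b;
}

/* Complete-scan model: no server can be checked again until the slowest
 * check of the current scan has finished, and not before heartbeat_ms has
 * elapsed. Returns the earliest start time of the next scan. */
int64_t
next_check_blocking (int64_t scan_start_ms,
                     const int64_t *check_durations_ms,
                     size_t n,
                     int64_t heartbeat_ms)
{
   int64_t slowest = 0;
   for (size_t i = 0; i < n; i++) {
      slowest = max_i64 (slowest, check_durations_ms[i]);
   }
   return scan_start_ms + max_i64 (slowest, heartbeat_ms);
}

/* Independent model: a server's next check depends only on how long its
 * own previous check took. */
int64_t
next_check_independent (int64_t check_start_ms,
                        int64_t own_duration_ms,
                        int64_t heartbeat_ms)
{
   return check_start_ms + max_i64 (own_duration_ms, heartbeat_ms);
}
```

      With X answering in 5 ms, Y hanging for a connectTimeoutMS of 20000, and heartbeatFrequencyMS=1000, next_check_blocking puts the next check of X at 20000 ms, while next_check_independent puts it at 1000 ms.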

      I've reproduced this behavior by modifying example-sdam-monitoring.c. It overrides the stream initializer to simulate a slow connection to one server.
      https://gist.github.com/kevinAlbs/1eb3fd42a2b17d71f99e4d9389661069

      Running it against a two node replica set shows the behavior:

      $ example-sdam-monitoring "mongodb://localhost:27017,localhost:27018/?connectTimeoutMS=20000&heartbeatFrequencyMS=1000"
      ...
      2020/04/15 13:37:54.0521: [78700]:    DEBUG:       mongoc: localhost:27017 heartbeat started
      2020/04/15 13:37:54.0524: [78700]:    DEBUG:       mongoc: localhost:27018 heartbeat started
      ...
      2020/04/15 13:38:14.0633: [78700]:    DEBUG:       mongoc: localhost:27018 heartbeat failed: socket timeout calling ismaster on 'localhost:27018'
      2020/04/15 13:38:15.0137: [78700]:    DEBUG:       mongoc: localhost:27017 heartbeat started
      

      The second heartbeat to localhost:27017 does not start until the "ismaster" to localhost:27018 fails after the 20 second connectTimeoutMS, even though heartbeatFrequencyMS is 1 second.

      This behavior is unavoidable for single-threaded scans, but should not be the case for multi-threaded scans. Servers should be scanned at their own intervals (which also better aligns with the server monitoring spec).
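      One way to picture independent scanning is one monitor thread per server, each running its own check-then-wait loop. The sketch below is illustrative only and is not how libmongoc implements the fix: monitor_t, monitor_run, and run_monitors are hypothetical names, the "ismaster" round trip is simulated with a sleep, and POSIX threads are assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define MAX_MONITORS 16

/* Hypothetical per-server monitor state; none of these names are libmongoc's. */
typedef struct {
   const char *host;
   long check_ms;     /* simulated duration of one "ismaster" round trip */
   long heartbeat_ms; /* wait between this server's checks */
   atomic_int checks; /* number of completed checks */
   atomic_bool *stop; /* shared shutdown flag, set by run_monitors */
} monitor_t;

static void
sleep_ms (long ms)
{
   struct timespec ts = {ms / 1000, (ms % 1000) * 1000000L};
   nanosleep (&ts, NULL);
}

/* Each thread checks only its own server, so a slow server delays nobody else. */
static void *
monitor_run (void *arg)
{
   monitor_t *m = arg;
   while (!atomic_load (m->stop)) {
      sleep_ms (m->check_ms); /* stand-in for sending "ismaster" and waiting */
      atomic_fetch_add (&m->checks, 1);
      sleep_ms (m->heartbeat_ms);
   }
   return NULL;
}

/* Runs one monitor thread per server for about run_ms, then stops them and
 * waits for in-flight checks to finish. Returns 0 on success, -1 on error. */
int
run_monitors (monitor_t *monitors, size_t n, long run_ms)
{
   pthread_t threads[MAX_MONITORS];
   atomic_bool stop;
   size_t started = 0;
   int rc = 0;

   if (n > MAX_MONITORS) {
      return -1;
   }
   atomic_init (&stop, false);
   for (size_t i = 0; i < n; i++) {
      atomic_init (&monitors[i].checks, 0);
      monitors[i].stop = &stop;
      if (pthread_create (&threads[i], NULL, monitor_run, &monitors[i]) != 0) {
         rc = -1;
         break;
      }
      started++;
   }
   if (rc == 0) {
      sleep_ms (run_ms);
   }
   atomic_store (&stop, true);
   for (size_t i = 0; i < started; i++) {
      pthread_join (threads[i], NULL);
   }
   return rc;
}
```

      Running two monitors for 300 ms, one with check_ms=1 and one with check_ms=200 (both with heartbeat_ms=10), the fast server completes many checks while the slow one completes only a couple, which is the independence the issue asks for.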


              People

              Assignee:
              kevin.albertson Kevin Albertson
              Reporter:
              kevin.albertson Kevin Albertson