[CDRIVER-3625] Multi-threaded scanning should scan servers independently Created: 15/Apr/20  Updated: 28/Oct/23  Resolved: 29/May/20

Status: Closed
Project: C Driver
Component/s: SDAM
Affects Version/s: None
Fix Version/s: 1.17.0-beta2, 1.17.0

Type: Bug Priority: Major - P3
Reporter: Kevin Albertson Assignee: Kevin Albertson
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by CDRIVER-3678 /Topology/request_scan_on_error failing Closed
is depended on by CDRIVER-3535 Reduce Client Time To Recovery On Top... Closed
Related
related to CDRIVER-3701 Calling topology TRACE macro with no ... Closed
related to CDRIVER-3682 Follow-up to thread-per-server monito... Backlog
related to CDRIVER-3722 Update documentation for multi-thread... Closed

 Description   

For a client pool, the background topology scanner runs a complete scan of all servers after heartbeatFrequencyMS has passed (or sooner, if a scan is requested).

The background scan uses mongoc_topology_scan_once. This fans out "ismaster" commands and waits for all responses before another scan can be scheduled.

The problem is that one slow server can block the next scheduled scan of every other server. The timeout of an "ismaster" during scanning is connectTimeoutMS, which may exceed heartbeatFrequencyMS. This scenario can easily happen:

1. Scan requested
2. "ismaster" is sent to servers X and Y
3. X responds quickly, but Y hangs for connectTimeoutMS.
4. The "ismaster" to Y times out.
5. The background thread sees that more than heartbeatFrequencyMS has passed and starts a new complete scan.

I've reproduced this behavior by modifying example-sdam-monitoring.c. It overrides the stream initializer to simulate a slow connection to one server.
https://gist.github.com/kevinAlbs/1eb3fd42a2b17d71f99e4d9389661069

Running it against a two node replica set shows the behavior:

$ example-sdam-monitoring "mongodb://localhost:27017,localhost:27018/?connectTimeoutMS=20000&heartbeatFrequencyMS=1000"
...
2020/04/15 13:37:54.0521: [78700]:    DEBUG:       mongoc: localhost:27017 heartbeat started
2020/04/15 13:37:54.0524: [78700]:    DEBUG:       mongoc: localhost:27018 heartbeat started
...
2020/04/15 13:38:14.0633: [78700]:    DEBUG:       mongoc: localhost:27018 heartbeat failed: socket timeout calling ismaster on 'localhost:27018'
2020/04/15 13:38:15.0137: [78700]:    DEBUG:       mongoc: localhost:27017 heartbeat started

The second heartbeat to localhost:27017 is blocked for the full 20 second connection timeout while the heartbeat to localhost:27018 times out.
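The timing above can be sketched as a small model. This is only an illustration, not driver code: both function names are hypothetical, times are measured in milliseconds from the start of a scan, and the model assumes a check becomes due heartbeatFrequencyMS after the previous one started (it ignores any minimum heartbeat cooldown).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the blocking scan: the next complete scan cannot
 * start until every server has replied (or timed out) AND the heartbeat
 * interval has elapsed. Returns milliseconds since the scan started. */
static int64_t
blocking_next_scan_ms (const int64_t *reply_ms, size_t n, int64_t heartbeat_ms)
{
   int64_t slowest = 0;
   size_t i;

   for (i = 0; i < n; i++) {
      if (reply_ms[i] > slowest) {
         slowest = reply_ms[i];
      }
   }
   return slowest > heartbeat_ms ? slowest : heartbeat_ms;
}

/* Hypothetical model of independent scanning: a server's next check
 * depends only on that server's own reply time. */
static int64_t
independent_next_check_ms (int64_t reply_ms, int64_t heartbeat_ms)
{
   return reply_ms > heartbeat_ms ? reply_ms : heartbeat_ms;
}
```

With the URI values above (connectTimeoutMS=20000, heartbeatFrequencyMS=1000) and an assumed 5 ms reply from localhost:27017, the blocking model yields 20000 ms before either server is checked again. The independent model yields 1000 ms for localhost:27017 and 20000 ms for localhost:27018.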

This behavior is unavoidable for single-threaded scans, but should not be the case for multi-threaded scans. Servers should be scanned at their own intervals (which also better aligns with the server monitoring spec).
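A minimal sketch of what scanning servers at their own intervals could look like, using one POSIX thread per server. This is not the driver's implementation: monitor_t, monitor_run, and run_monitors are hypothetical names, the "ismaster" round trip is simulated with a sleep, and a real monitor would presumably also need a way to be woken early when a scan is requested.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical per-server monitor state. */
typedef struct {
   const char *host;
   long check_ms;     /* simulated duration of one "ismaster" check */
   long heartbeat_ms; /* wait between checks of this server */
   atomic_bool stop;  /* set by run_monitors to end the loop */
   int checks_done;   /* read only after the thread is joined */
} monitor_t;

static void
sleep_ms (long ms)
{
   struct timespec ts;

   ts.tv_sec = ms / 1000;
   ts.tv_nsec = (ms % 1000) * 1000000L;
   nanosleep (&ts, NULL);
}

/* Each server is checked in its own loop, so a slow server only delays
 * its own next check. */
static void *
monitor_run (void *arg)
{
   monitor_t *m = arg;

   while (!atomic_load (&m->stop)) {
      sleep_ms (m->check_ms); /* stand-in for sending "ismaster" */
      m->checks_done++;
      sleep_ms (m->heartbeat_ms);
   }
   return NULL;
}

/* Starts one thread per monitor, lets them run for run_ms, then stops and
 * joins them. Returns 0 on success, -1 if a thread could not be started. */
static int
run_monitors (monitor_t *monitors, size_t n, long run_ms)
{
   pthread_t *threads = malloc (n * sizeof *threads);
   size_t started = 0;
   size_t i;

   if (!threads) {
      return -1;
   }
   for (i = 0; i < n; i++) {
      atomic_init (&monitors[i].stop, false);
      monitors[i].checks_done = 0;
      if (pthread_create (&threads[i], NULL, monitor_run, &monitors[i]) != 0) {
         break;
      }
      started++;
   }
   sleep_ms (run_ms);
   for (i = 0; i < started; i++) {
      atomic_store (&monitors[i].stop, true);
   }
   for (i = 0; i < started; i++) {
      pthread_join (threads[i], NULL);
   }
   free (threads);
   return started == n ? 0 : -1;
}
```

Running two monitors, one with a short check_ms and one with a long check_ms, should show the fast server completing many checks while the slow server is still waiting on its first.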



 Comments   
Comment by Kevin Albertson [ 04/Jun/20 ]

PR: https://github.com/mongodb/mongo-c-driver/pull/607

Comment by Githook User [ 29/May/20 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-3625 monitor with a thread-per-server

For a client pool, servers are now scanned independently using a
separate thread per server.

SRV polling is now done on a separate thread.
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/370fc1855145ea06b1586286c99cdfc5d073a6ea

Comment by Githook User [ 29/May/20 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-3625 monitor with a thread-per-server

For a client pool, servers are now scanned independently using a
separate thread per server.

SRV polling is now done on a separate thread.
Branch: r1.17
https://github.com/mongodb/mongo-c-driver/commit/c62098bb3ad25a1c2dbd199c0137eb293703c876

Comment by Kevin Albertson [ 02/May/20 ]

The following should be done as a part of this ticket or as a separate follow-on ticket:

  • Once CDRIVER-3614 is resolved, use a separate thread for SRV polling.
  • Remove functionality from the topology scanner that is specific to multi-threaded scanning.
Comment by April Schoffer [ 20/Apr/20 ]

CDRIVER-3535 will resolve this bug.
