Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 1.17.0-beta2, 1.17.0
Affects Version/s: None
Component/s: SDAM
Labels: None


For a client pool, the background topology scanner runs a complete scan of all servers after heartbeatFrequencyMS has passed (or sooner, if a scan is requested).

The background scan uses mongoc_topology_scan_once. This fans out "ismaster" commands and waits for all responses before another scan can be scheduled.

A major problem with this approach is that one slow server can block the next scheduled scan of all other servers. The timeout for an "ismaster" during scanning is connectTimeoutMS, which may exceed heartbeatFrequencyMS. The following scenario can easily occur:

1. Scan requested
2. "ismaster" is sent to servers X and Y
3. X responds quickly, but Y hangs for connectTimeoutMS.
4. The "ismaster" to Y times out.
5. The background thread sees that more than heartbeatFrequencyMS has passed and starts a new complete scan.
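The timing in the steps above can be sketched with a small model. This is not libmongoc code: the function names are hypothetical, and the model assumes only what the description states (a scan completes when its slowest check finishes, each check is bounded by connectTimeoutMS, and the next scan is due heartbeatFrequencyMS after the current one starts). It ignores any minimum delay the driver may enforce between scans.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a fan-out scan: every server is checked in parallel,
 * and the scan completes only once the slowest check has finished, either
 * by responding or by hitting connectTimeoutMS. Names are hypothetical. */
int64_t
model_scan_duration_ms (const int64_t *response_ms,
                        size_t n_servers,
                        int64_t connect_timeout_ms)
{
   int64_t slowest = 0;
   size_t i;

   assert (connect_timeout_ms >= 0);
   for (i = 0; i < n_servers; i++) {
      /* A single check never lasts longer than connectTimeoutMS. */
      int64_t d = response_ms[i] < connect_timeout_ms ? response_ms[i]
                                                      : connect_timeout_ms;
      if (d > slowest) {
         slowest = d;
      }
   }
   return slowest;
}

/* Earliest start of the next complete scan, relative to the start of the
 * current scan: not before heartbeatFrequencyMS has elapsed, and not
 * before the current scan has completed. */
int64_t
model_next_scan_start_ms (int64_t scan_duration_ms,
                          int64_t heartbeat_frequency_ms)
{
   return scan_duration_ms > heartbeat_frequency_ms ? scan_duration_ms
                                                    : heartbeat_frequency_ms;
}
```

With connectTimeoutMS=20000 and heartbeatFrequencyMS=1000, a server X that answers in an assumed 5 ms and a server Y that never answers (modeled as INT64_MAX), the model puts the next check of X at 20000 ms instead of 1000 ms.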

I've reproduced this behavior by modifying example-sdam-monitoring.c. It overrides the stream initializer to simulate a slow connection to one server.
https://gist.github.com/kevinAlbs/1eb3fd42a2b17d71f99e4d9389661069

Running it against a two node replica set shows the behavior:

$ example-sdam-monitoring "mongodb://localhost:27017,localhost:27018/?connectTimeoutMS=20000&heartbeatFrequencyMS=1000"
...
2020/04/15 13:37:54.0521: [78700]:    DEBUG:       mongoc: localhost:27017 heartbeat started
2020/04/15 13:37:54.0524: [78700]:    DEBUG:       mongoc: localhost:27018 heartbeat started
...
2020/04/15 13:38:14.0633: [78700]:    DEBUG:       mongoc: localhost:27018 heartbeat failed: socket timeout calling ismaster on 'localhost:27018'
2020/04/15 13:38:15.0137: [78700]:    DEBUG:       mongoc: localhost:27017 heartbeat started

The second heartbeat to localhost:27017 does not start until the heartbeat to localhost:27018 hits the 20 second connection timeout.

This behavior is unavoidable for single-threaded scans, but should not be the case for multi-threaded scans. Servers should be scanned at their own intervals (which also better aligns with the server monitoring spec).
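As a rough illustration of scanning servers at their own intervals, the sketch below computes each server's next check from that server's state alone. The struct and function are hypothetical and are not libmongoc types. The scheduling rule (next check at heartbeatFrequencyMS after the last check started, or when that check finishes if it ran longer) is an assumption chosen to mirror the description above, not the exact rule from the server monitoring spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-server monitor state; not a libmongoc type. */
typedef struct {
   int64_t last_check_start_ms;    /* when this server's last check began */
   int64_t last_check_duration_ms; /* how long that check took */
} model_server_monitor_t;

/* Next check time for one server. It depends only on this server's own
 * last check, so a slow server cannot delay the checks of other servers. */
int64_t
model_next_check_ms (const model_server_monitor_t *m,
                     int64_t heartbeat_frequency_ms)
{
   int64_t by_interval;
   int64_t by_completion;

   assert (m != NULL);
   by_interval = m->last_check_start_ms + heartbeat_frequency_ms;
   by_completion = m->last_check_start_ms + m->last_check_duration_ms;
   return by_interval > by_completion ? by_interval : by_completion;
}
```

Under this model, with heartbeatFrequencyMS=1000, a server whose check took an assumed 5 ms is checked again at 1000 ms, while a server whose check ran into a 20000 ms timeout is checked again at 20000 ms. Neither value depends on the other server.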

is depended on by
CDRIVER-3678 /Topology/request_scan_on_error failing (Closed)
CDRIVER-3535 Reduce Client Time To Recovery On Topology Changes (Closed)

related to
CDRIVER-3701 Calling topology TRACE macro with no formatted args emits compiler warning (Closed)
CDRIVER-3682 Follow-up to thread-per-server monitoring (Backlog)
CDRIVER-3722 Update documentation for multi-threaded scanning behavior (Closed)

Assignee: Kevin Albertson
Reporter: Kevin Albertson
Votes: 0
Watchers: 2

Created: Apr 15 2020 05:53:40 PM UTC
Updated: Oct 28 2023 11:28:48 AM UTC
Resolved: May 29 2020 07:56:50 PM UTC
