Type: Bug
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Internal Code
Labels:
- sa-backlog

Assigned Teams: Server Programmability
Operating System: ALL
Steps To Reproduce:
Start a single shard cluster, with 3 config servers (1 primary and 2 secondaries).

Run fsyncLock against one of the config secondaries.

Connect to the mongos and run a simple find command (e.g. db.test.find({})).

Keep running the command until it blocks for a few seconds and then times out.

I have been able to reliably reproduce this using v5.0.20.

Our internal client implementation uses SDAM, in particular its server selection logic, to target members of replica sets in a sharded cluster. Given a set of requirements (e.g. read concern and read preference), the selection algorithm provides a list of eligible servers that can be targeted, excluding servers that are in an Unknown state or whose RTT exceeds a configurable window.
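The eligibility rule described above can be sketched as a small filter. This is only an illustrative model, not the server's implementation: `ServerDescription`, `select_eligible`, and the 15 ms default are hypothetical, `math.inf` stands in for `HelloRTT::max()`, and measuring the window relative to the fastest known server is an assumption, since the ticket only says the window is configurable.

```python
import math
from dataclasses import dataclass
from enum import Enum


class ServerType(Enum):
    UNKNOWN = "Unknown"
    PRIMARY = "Primary"
    SECONDARY = "Secondary"


@dataclass
class ServerDescription:
    host: str
    type: ServerType
    rtt_ms: float  # math.inf models HelloRTT::max() (assumption)


def select_eligible(servers, window_ms=15.0):
    """Return servers that are not Unknown and whose RTT is inside the window.

    Assumption: the window is measured from the lowest finite RTT observed.
    """
    known = [
        s for s in servers
        if s.type is not ServerType.UNKNOWN and math.isfinite(s.rtt_ms)
    ]
    if not known:
        return []
    fastest = min(s.rtt_ms for s in known)
    return [s for s in known if s.rtt_ms - fastest <= window_ms]
```

Note that nothing in this filter asks whether a server can actually execute CRUD operations; it only looks at the state and RTT that hello monitoring produces.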

Once SDAM is notified about a failed remote operation, it tags the corresponding server by updating its state to Unknown, so that it cannot be targeted until further notice. This further notice is provided by observing a successful hello response from the tagged server. That response updates the state of the server (e.g. from Unknown to Secondary), but its RTT is set to HelloRTT::max() to make sure it is not targeted until subsequent hello responses are received. RSM (the replica set monitor) also periodically queries the remote servers and updates their RTTs with the round-trip times observed for hello commands run against those servers.

So far, everything works as expected. However, if the remote server cannot run any CRUD operations (e.g. because it has received fsyncLock) but can still run hello commands, server selection will still treat that server as an eligible target. Consider the following example:

A mongos server queries the server selection algorithm and receives S1 as an eligible target. S1, however, has received fsyncLock and is not capable of running CRUD operations.
Once the query times out, mongos notifies SDAM, which tags S1 as Unknown. For now, the server selection algorithm no longer returns S1 as a candidate for running remote commands.
SDAM continues to monitor this server and is able to run a successful hello against it, updating its state to Secondary but keeping its RTT at HelloRTT::max().
Meanwhile, RSM is monitoring S1 and updates its RTT to a valid value, so it becomes eligible for selection again.
mongos tries to run another CRUD operation and is again given S1 as an eligible target.
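
The sequence above can be replayed with a toy model to show where eligibility flips back. This is a simplified sketch under stated assumptions, not the actual SDAM or RSM code: the `Server` class and the `on_*` handlers are hypothetical names, `math.inf` stands in for `HelloRTT::max()`, and eligibility is reduced to "state is not Unknown and RTT is not the max sentinel".

```python
import math
from dataclasses import dataclass

RTT_MAX = math.inf  # stand-in for HelloRTT::max() (assumption)


@dataclass
class Server:
    name: str
    state: str = "Secondary"
    rtt_ms: float = 1.0
    fsync_locked: bool = False  # hello still succeeds, CRUD does not

    def is_eligible(self):
        return self.state != "Unknown" and self.rtt_ms != RTT_MAX


def on_operation_failed(server):
    # A failed remote operation tags the server as Unknown.
    server.state = "Unknown"


def on_hello_success(server):
    # The first hello after a failure restores the state but pins RTT to max.
    if server.state == "Unknown":
        server.state = "Secondary"
        server.rtt_ms = RTT_MAX


def on_rtt_measured(server, rtt_ms):
    # Periodic RTT monitoring overwrites the sentinel with a real measurement.
    server.rtt_ms = rtt_ms


def run_crud(server):
    # Models the find timing out while the server is fsync-locked.
    return not server.fsync_locked


def replay():
    s1 = Server("S1", fsync_locked=True)
    trace = [("initial selection", s1.is_eligible())]
    if not run_crud(s1):
        on_operation_failed(s1)
    trace.append(("after failed query", s1.is_eligible()))
    on_hello_success(s1)
    trace.append(("after successful hello", s1.is_eligible()))
    on_rtt_measured(s1, 1.2)
    trace.append(("after RTT update", s1.is_eligible()))
    return trace
```

In this model `replay()` reports S1 as eligible, then ineligible twice, then eligible again, although `fsync_locked` never changed. The model only mirrors the behavior described in this ticket; it does not suggest a particular fix.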

This ticket is a placeholder for investigating this issue and proposing a fix.

Assignee: Unassigned
Reporter: Amirsaman Memaripour
Participants: Amirsaman Memaripour
Votes: 0
Watchers: 9
Created: Jan 11 2024 04:39:05 PM UTC
Updated: Oct 23 2024 03:48:45 PM UTC
