[SERVER-48506] Throw MaxTimeMSExpired instead of FailedToSatisfyReadPreference when RSM deadline is less than max Created: 29/May/20  Updated: 29/Oct/23  Resolved: 09/Jul/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.4.1, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Haley Connelly Assignee: Janna Golden
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Sharding 2020-07-13, Sharding 2020-06-29
Participants:
Linked BF Score: 16

Description

In server_status_with_timeout_cursors.js, maxTimeMS is set small enough that the cursor is likely to time out over its lifetime. The test executes find() operations that are expected to fail due to the maxTimeMS timeout.

We should change RemoteCommandTargeterRS so that, when the remaining maxTimeMS is less than the RSM's deadline for finding a host, it throws MaxTimeMSExpired rather than FailedToSatisfyReadPreference. This exposes a more accurate error to the user.
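The proposed rule can be sketched as a small decision function. This is a minimal illustration, not the actual RemoteCommandTargeterRS code. It assumes the targeter knows both the operation's remaining maxTimeMS budget and the RSM's host-selection deadline. `choose_error`, `remaining_max_time_ms`, `rsm_deadline_ms`, and the `ErrorCode` enum are hypothetical names.

```python
from enum import Enum


class ErrorCode(Enum):
    # Hypothetical stand-ins for the server error codes of the same names.
    MAX_TIME_MS_EXPIRED = "MaxTimeMSExpired"
    FAILED_TO_SATISFY_READ_PREFERENCE = "FailedToSatisfyReadPreference"


def choose_error(remaining_max_time_ms: int, rsm_deadline_ms: int) -> ErrorCode:
    """Pick the error to report when no host could be found in time.

    If the operation's remaining maxTimeMS budget is shorter than the
    deadline the RSM would have used to find a host, the search was cut
    short by the user's own time limit, so MaxTimeMSExpired is the more
    accurate error. Otherwise the RSM itself ran out of time searching,
    and FailedToSatisfyReadPreference is reported as before.
    """
    if remaining_max_time_ms < rsm_deadline_ms:
        return ErrorCode.MAX_TIME_MS_EXPIRED
    return ErrorCode.FAILED_TO_SATISFY_READ_PREFERENCE


# A 5 ms maxTimeMS budget against a hypothetical 20 s host-selection deadline.
print(choose_error(5, 20_000).value)  # MaxTimeMSExpired
```

The comparison is strict, so a budget exactly equal to the RSM deadline still reports FailedToSatisfyReadPreference. That matches the "less than" wording of the proposal.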



Comments
Comment by Githook User [ 04/Aug/20 ]

Author: jannaerin (golden.janna@gmail.com)

Message: SERVER-48506 Throw MaxTimeMSExpired instead of FailedToSatisfyReadPreference when RSM deadline is less than max

(cherry picked from commit c8e9a31cd3c21b7b40864e39323fbf0823f79f61)
Branch: v4.4
https://github.com/mongodb/mongo/commit/c44a899c302311196261a4196ba4a53beae1e89c

Comment by Githook User [ 09/Jul/20 ]

Author: jannaerin (golden.janna@gmail.com)

Message: SERVER-48506 Throw MaxTimeMSExpired instead of FailedToSatisfyReadPreference when RSM deadline is less than max
Branch: master
https://github.com/mongodb/mongo/commit/c8e9a31cd3c21b7b40864e39323fbf0823f79f61

Comment by James Wahlin [ 15/Jun/20 ]

The error code added under SERVER-46225 was a temporary measure that has since been removed. That said, if it makes sense, we could definitely add "FailedToSatisfyReadPreference" to this list. I wonder, however, whether we should be returning "MaxTimeMSExpired" for this use case. If I understand correctly, the timeout occurs because an isMaster call from mongos to the CSRS takes longer than the maxTimeMS allotted to the query. As a user, receiving "FailedToSatisfyReadPreference" could be misleading and lead to the expectation that there is an unhealthy replica set in the cluster. Receiving "MaxTimeMSExpired" instead allows the user to decide whether to retry the operation, potentially with a larger maxTimeMS.
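To illustrate why the distinction matters to a caller, here is a client-side sketch of the retry decision described above. It retries only on MaxTimeMSExpired and doubles the budget each time, up to a cap. Any other error, such as FailedToSatisfyReadPreference, is surfaced immediately. `run_with_retry`, `OperationError`, the doubling factor, and the cap are hypothetical choices for illustration and are not a MongoDB driver API.

```python
class OperationError(Exception):
    """Hypothetical error carrying a server error code name."""

    def __init__(self, code_name: str):
        super().__init__(code_name)
        self.code_name = code_name


def run_with_retry(operation, max_time_ms: int, max_attempts: int = 3,
                   cap_ms: int = 10_000):
    """Call operation(max_time_ms), retrying with a larger budget on timeout.

    MaxTimeMSExpired means the operation simply ran out of time, so retrying
    with a larger maxTimeMS is reasonable. Any other error (for example
    FailedToSatisfyReadPreference) is re-raised without retrying.
    """
    for attempt in range(max_attempts):
        try:
            return operation(max_time_ms)
        except OperationError as err:
            last_attempt = attempt == max_attempts - 1
            if err.code_name != "MaxTimeMSExpired" or last_attempt:
                raise
            max_time_ms = min(max_time_ms * 2, cap_ms)


# Usage: a fake operation that only succeeds once given at least 40 ms.
budgets = []


def flaky_find(max_time_ms):
    budgets.append(max_time_ms)
    if max_time_ms < 40:
        raise OperationError("MaxTimeMSExpired")
    return "ok"


print(run_with_retry(flaky_find, 10), budgets)  # ok [10, 20, 40]
```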

Comment by Janna Golden [ 11/Jun/20 ]

When Haley and I discussed possible solutions, I didn't realize that our concurrency suites don't allow passing startup parameters for an individual test; we can only do so in the suite's yml file. This means we would have to override the RSM default refresh period to < 10ms for every suite this test runs in, which I don't think we want to do (we'd spam servers with too many isMaster requests). The failpoint mentioned needs to be set at startup, when the client starts up its RSM. james.wahlin, it looks like you added another acceptable error code as part of SERVER-46225. Do you think it would be okay to add "FailedToSatisfyReadPreference" as an acceptable error as well? I think that might be our best bet to fix this BF.

Comment by Haley Connelly [ 29/May/20 ]

One possible solution is to use the failpoint modifyReplicaSetMonitorDefaultRefreshPeriod to override the isMaster frequency for this test.

Possible scenario:
By setting maxTimeMS much smaller than the default, the RSM gets NetworkInterfaceExceededTimeLimit from a non-monitoring connection (the query). It marks the node as unknown and changes the replica set's topology to ReplicaSetNoPrimary. However, since the timeout came from a non-monitoring connection, the connection to the node that timed out doesn't get closed. Thus, there is still an outstanding isMaster request scheduled up to heartbeatFrequency in the future, and the RSM's view of the replica set and its primary is not updated until then.

Meanwhile, with the small maxTimeMS and the RSM believing there is no primary for the replica set, no host can be found in time for the next query, and the RSM reports FailedToSatisfyReadPreference.
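The timing in this scenario can be captured by a simplified model. It assumes the RSM only rediscovers the primary at its next scheduled isMaster, one refresh period after the last one, and that the query can only be targeted once that happens. Network latency and the real RSM scheduling details are ignored. All function names and numbers are hypothetical. The model also shows why lowering the refresh period below maxTimeMS, as suggested with the failpoint, would let the query find a host.

```python
def primary_rediscovered_at(last_is_master_ms: float,
                            refresh_period_ms: float) -> float:
    """Time at which the next scheduled isMaster refreshes the RSM's view."""
    return last_is_master_ms + refresh_period_ms


def host_found_in_time(query_start_ms: float, max_time_ms: float,
                       last_is_master_ms: float,
                       refresh_period_ms: float) -> bool:
    """Return True if the primary is rediscovered before maxTimeMS expires.

    The RSM has marked the primary unknown (ReplicaSetNoPrimary), so no host
    can be selected until the next isMaster refresh.
    """
    deadline_ms = query_start_ms + max_time_ms
    rediscovery_ms = primary_rediscovered_at(last_is_master_ms,
                                             refresh_period_ms)
    return rediscovery_ms <= deadline_ms


# Hypothetical numbers: the last isMaster ran at t=0 and the query starts
# at t=1 ms with maxTimeMS=10, so its deadline is t=11 ms.
print(host_found_in_time(1, 10, 0, 10_000))  # False: next refresh is at t=10000 ms
print(host_found_in_time(1, 10, 0, 5))       # True: next refresh is at t=5 ms
```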

Generated at Thu Feb 08 05:17:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.