[SERVER-43155] Queries which exceed maxTimeMS may return NetworkInterfaceExceededTimeLimit Created: 04/Sep/19  Updated: 29/Oct/23  Resolved: 18/Jul/22

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: 5.0.11, 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Ian Boros Assignee: Backlog - Service Architecture
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Issue split
split to SERVER-66260 Modify executor::RemoteCommand object... Closed
Related
related to SERVER-66162 ConnectionPool getConnection may shor... Open
related to SERVER-79888 find/getMore timeouts triggered by Fe... In Progress
related to SERVER-63181 Make ConnectionPool::SpecificPool::up... Closed
related to SERVER-65472 Relax time out error code in timeseri... Closed
related to SERVER-67722 Shard cursor is not killed on MaxTime... Closed
is related to SERVER-63181 Make ConnectionPool::SpecificPool::up... Closed
Assigned Teams:
Service Arch
Backwards Compatibility: Fully Compatible
Backport Requested:
v5.0
Sprint: Service Arch 2022-04-18, Service Arch 2022-05-02, Service Arch 2022-05-16, Service Arch 2022-05-30, Service Arch 2022-06-13, Service Arch 2022-06-27, Service Arch 2022-07-11
Participants:
Linked BF Score: 17

 Description   

In shard_remote, the maxTimeMS of a query is used as a request timeout. If the time limit is exceeded by the network layer, "NetworkInterfaceExeededTimeLimit" will be returned instead of MaxTimeMSExpired. (e.g. here, here).

 



 Comments   
Comment by Githook User [ 01/Aug/22 ]

Author:

{'name': 'Daniel Morilha', 'email': 'daniel.morilha@mongodb.com', 'username': 'daniel-mdb'}

Message: SERVER-43155 Queries which exceed `maxTimeMS` may return `NetworkInterfaceExceededTimeLimit`

After careful evaluation and research the service-arch team settled into
adding a new argument into connection pool's interface allowing their
users to specify a custom timeout status in case the pool fails to
deliver a connection within a specified time.

This timeout status now competes with `NetworkInterfaceExceededTimeLimit`
which is now returned either when the Connection Pool Controller is rather
used to compute the timeout or when no custom timeout status code is
specified.

This commit also:

  • Refactors portions of `ConnectionPoolTest` to reflect changes in the
    API.
  • Adds `jstests/sharding/max_time_ms_connection_pool.js` integration
    test to ensure different timeouts return different statuses.
  • Fixes small glitch in `jstests/core/tnxs/many_txns.js` where
    `NetworkInterfaceExceededTimeLimit` might be validly returned under rare
    resource intense conditions and fail the test, reported as BF-20554.

(cherry picked from commit 3fd2a61085f8c72a13d1bc00ad95c460c69eb8a8)
Branch: v5.0
https://github.com/mongodb/mongo/commit/81cd7f7f3d69937a7c7b2a9ee58ac326ce059fe3

Comment by Githook User [ 27/Jun/22 ]

Author:

{'name': 'Varun Ravichandran', 'email': 'varun.ravichandran@mongodb.com', 'username': 'varunravi98'}

Message: SERVER-43155 Queries which exceed `maxTimeMS` may return `NetworkInterfaceExceededTimeLimit`

After careful evaluation and research the service-arch team settled into
adding a new argument into connection pool's interface allowing their
users to specify a custom timeout status in case the pool fails to
deliver a connection within a specified time.

This timeout status now competes with `NetworkInterfaceExceededTimeLimit`
which is now returned either when the Connection Pool Controller is rather
used to compute the timeout or when no custom timeout status code is
specified.

This commit also:

  • Refactors portions of `ConnectionPoolTest` to reflect changes in the
    API.
  • Adds `jstests/sharding/max_time_ms_connection_pool.js` integration
    test to ensure different timeouts return different statuses.
  • Fixes small glitch in `jstests/core/tnxs/many_txns.js` where
    `NetworkInterfaceExceededTimeLimit` might be validly returned under rare
    resource intense conditions and fail the test, reported as BF-20554.
    Branch: master
    https://github.com/mongodb/mongo/commit/3fd2a61085f8c72a13d1bc00ad95c460c69eb8a8
Comment by Daniel Morilha (Inactive) [ 20/May/22 ]

After an hour-long discussion with different stakeholders (reviewers to the proposed change) requirements change and so did my understanding of what is expected from this ticket and where to draw the boundary line in terms of responsibilities.

George pointed out for the existence of RemoteCommandRequest::timeoutCode and how that is used in order to override the expected user-defined timeout. While I've noticed that it never occurred to me that that would be the case for this given instance. It was then agreed ConnectionPool is ambiguous as it may override the user-defined timeout duration based on internal heuristics still returning the same error code, namely "NetworkInterfaceTimeLimitException" and how that should be addressed by both ConnectionPool and NetworkInterfaceTL appropriately, the user-specified exception should be returned iff the user-specified deadline expired.

It was also reiterated that given errors are returned to the original remote command requester, the requester might be at a better place to treat those errors as they please as moving this sort of logic into parts that are supposed to be generic defeats the very same genericity.

It was also evaluated that the errors occurring in BF-20554 while seem to overlap with this request have nothing to deal with what has been asked here. There, the problem seems to be brittle tests which rely on a specific error being reported when transactions conflict may fail while running into a resource constrained environment. Another specific change is expected for that.

We also had a chance to evaluate the feseability to retire OperationContext --> storeMaxTimeMS, restoreMaxTimeMS and {}maxTime{_} as it may not being used any longer.

Code changes will follow...

Comment by Shane Harvey [ 05/May/22 ]

max.hirschhorn@mongodb.com brought to my attention some subtlety that I missed in the previous question. If the client sends maxTimeMS 60 and the server only waits 30 seconds before returning MaxTimeMSExpired, that behavior seems unexpected. In this case it should be fine to return a non-MaxTimeMSExpired error code since the time limit hasn’t actually expired. Ideally MaxTimeMSExpired would only be returned when maxTimeMS has actually expired.

Note that the opposite (ie "when maxTimeMS expires the server always returns MaxTimeMSExpired") isn't necessarily true. It's expected that the server may return other error codes. For example the server could be in the process of returning a shutdown error when maxTimeMS expires in which case returning ShutdownInProgress is fine.

Comment by Shane Harvey [ 29/Apr/22 ]

It would be great for the server to return MaxTimeMSExpired instead of NetworkInterfaceExceededTimeLimit. Clients expect the MaxTimeMSExpired error to be returned when a request exceeds MaxTimeMS.

Comment by Daniel Morilha (Inactive) [ 29/Apr/22 ]

Hello shane.harvey@mongodb.com, after discussing this SERVER ticket with both Max and Sam we've identified two scenarios which might concern drivers, they are:

  • if a command sets MaxTimeMS of 60 seconds, the connection pool might return NetworkInterfaceExceededTimeLimit
  • if a command sets MaxTimeMS of 60 seconds, but the connection pool has a refreshTimeout of 30, it might return the same NetworkInterfaceExceededTimeLimit.

This exception only tells a connection wasn't available within the requested timeframe. The underlying reason gets hidden away from the requester and we believe it should be re-tried.

The easiest approach would be to have it rewritten to MaxTimeMSExpired at either the client or the sharding layer.

I hope you may answer in terms of the client expectation, if it would make sense to rewrite one exception into the other, and if so, if that could be done even if the specified timeout was shorter than what the command specified or if the executor should add a retry logic itself.

Please let me know if all of that makes sense,

Comment by Daniel Morilha (Inactive) [ 26/Apr/22 ]

After discussing this with amirsaman.memaripour@mongodb.com, MaxTimeMSExpired only relates with client operations.

  • executor
  • mongo::executor::ConnectionPool::SpecificPool::getConnection
  • mongo::executor::ConnectionPool::get
  • mongo::executor::NetworkInterfaceTL::startCommand
  • mongo::executor::ThreadPoolTaskExecutor::scheduleRemoteCommandOnAny
  • mongo::executor::ShardingTaskExecutor::scheduleRemoteCommandOnAny
  • mongo::executor::TaskExecutor::scheduleRemoteCommand
  • client
  • mongo::RemoteCommandRetryScheduler::_schedule_inlock
  • mongo::RemoteCommandRetryScheduler::startup
  • mongo::Fetcher::schedule
  • shard/client
  • mongo::ShardRemote::_runExhaustiveCursorCommand

Based on the above stack trace I think the exception should go on RemoteCommandRetryScheduler which is a client's concern.

Comment by Daniel Morilha (Inactive) [ 12/Apr/22 ]

PR : https://github.com/10gen/mongo/pull/4597

Comment by Rui Liu [ 12/Apr/22 ]

I have recently seen BF-24814 that seems to be related to this. It happens on the

{taskExecutorPoolSize: 4}

build variant. I will temporarily allow the NetworkInterfaceExceededTimeLimit code and leave a TODO for this ticket.

Comment by Daniel Morilha (Inactive) [ 11/Apr/22 ]

SERVER-43155 tells a compelling story for BF-20554. The BF which stills
occurs for both 5.0 and at the tip of the repository, triggers
`deferredStateUpdateFunc`, a lambda inside
`src/mongo/executor/connection_pool.cpp` which sweeps through `requests`
checking their `Date_t` expiration time and possibly assigning the
`NetworkInterfaceExceededTimeLimit` error to its associated promise.

The error surfaces while running `jstests/core/txns/many_txns.js` on a
`rhel80-small` instance. The test, which runs concurrently with other tests,
stresses the host by committing many parallel machines, and then querying
for them.

Giving the load, query commands may exceed their `maxTimeMS` and may get
`NetworkInterfaceExceedTimeLimit` mistakenly assigned as a result.

It is worth mentioning this could not be reproduced locally.

Comment by Max Hirschhorn [ 08/Apr/22 ]

Reopening this ticket because the MaxTimeMSExpired error code (50) is semantically meaningful to drivers (e.g. ExecutionTimeout exception type, behavior of adding label UnknownTransactionCommitResult to commitTransaction with maxTimeMS) and must be used for errors related to using maxTimeMS. See also SERVER-35031.

Comment by Lauren Lewis (Inactive) [ 24/Feb/22 ]

We haven’t heard back from you for at least one calendar year, so this issue is being closed. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Comment by David Storch [ 09/Sep/19 ]

The query team believes that this should be triaged by the service architecture team.

Generated at Thu Feb 08 05:02:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.