[SERVER-43155] Queries which exceed maxTimeMS may return NetworkInterfaceExceededTimeLimit Created: 04/Sep/19 Updated: 29/Oct/23 Resolved: 18/Jul/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | None |
| Fix Version/s: | 5.0.11, 6.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ian Boros | Assignee: | Backlog - Service Architecture |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||
| Assigned Teams: |
Service Arch
|
||||||||||||||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||||||||||||||
| Backport Requested: |
v5.0
|
||||||||||||||||||||||||||||||||||||||||||||
| Sprint: | Service Arch 2022-04-18, Service Arch 2022-05-02, Service Arch 2022-05-16, Service Arch 2022-05-30, Service Arch 2022-06-13, Service Arch 2022-06-27, Service Arch 2022-07-11 | ||||||||||||||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||||||||||||||
| Linked BF Score: | 17 | ||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
In shard_remote, the maxTimeMS of a query is used as a request timeout. If the time limit is exceeded by the network layer, "NetworkInterfaceExeededTimeLimit" will be returned instead of MaxTimeMSExpired. (e.g. here, here).
|
| Comments |
| Comment by Githook User [ 01/Aug/22 ] |
|
Author: {'name': 'Daniel Morilha', 'email': 'daniel.morilha@mongodb.com', 'username': 'daniel-mdb'}Message: After careful evaluation and research the service-arch team settled into This timeout status now competes with `NetworkInterfaceExceededTimeLimit` This commit also:
(cherry picked from commit 3fd2a61085f8c72a13d1bc00ad95c460c69eb8a8) |
| Comment by Githook User [ 27/Jun/22 ] |
|
Author: {'name': 'Varun Ravichandran', 'email': 'varun.ravichandran@mongodb.com', 'username': 'varunravi98'}Message: After careful evaluation and research the service-arch team settled into This timeout status now competes with `NetworkInterfaceExceededTimeLimit` This commit also:
|
| Comment by Daniel Morilha (Inactive) [ 20/May/22 ] |
|
After an hour-long discussion with different stakeholders (reviewers to the proposed change) requirements change and so did my understanding of what is expected from this ticket and where to draw the boundary line in terms of responsibilities. George pointed out for the existence of RemoteCommandRequest::timeoutCode and how that is used in order to override the expected user-defined timeout. While I've noticed that it never occurred to me that that would be the case for this given instance. It was then agreed ConnectionPool is ambiguous as it may override the user-defined timeout duration based on internal heuristics still returning the same error code, namely "NetworkInterfaceTimeLimitException" and how that should be addressed by both ConnectionPool and NetworkInterfaceTL appropriately, the user-specified exception should be returned iff the user-specified deadline expired. It was also reiterated that given errors are returned to the original remote command requester, the requester might be at a better place to treat those errors as they please as moving this sort of logic into parts that are supposed to be generic defeats the very same genericity. It was also evaluated that the errors occurring in BF-20554 while seem to overlap with this request have nothing to deal with what has been asked here. There, the problem seems to be brittle tests which rely on a specific error being reported when transactions conflict may fail while running into a resource constrained environment. Another specific change is expected for that. We also had a chance to evaluate the feseability to retire OperationContext --> storeMaxTimeMS, restoreMaxTimeMS and {}maxTime{_} as it may not being used any longer. Code changes will follow... |
| Comment by Shane Harvey [ 05/May/22 ] |
|
max.hirschhorn@mongodb.com brought to my attention some subtlety that I missed in the previous question. If the client sends maxTimeMS 60 and the server only waits 30 seconds before returning MaxTimeMSExpired, that behavior seems unexpected. In this case it should be fine to return a non-MaxTimeMSExpired error code since the time limit hasn’t actually expired. Ideally MaxTimeMSExpired would only be returned when maxTimeMS has actually expired. Note that the opposite (ie "when maxTimeMS expires the server always returns MaxTimeMSExpired") isn't necessarily true. It's expected that the server may return other error codes. For example the server could be in the process of returning a shutdown error when maxTimeMS expires in which case returning ShutdownInProgress is fine. |
| Comment by Shane Harvey [ 29/Apr/22 ] |
|
It would be great for the server to return MaxTimeMSExpired instead of NetworkInterfaceExceededTimeLimit. Clients expect the MaxTimeMSExpired error to be returned when a request exceeds MaxTimeMS. |
| Comment by Daniel Morilha (Inactive) [ 29/Apr/22 ] |
|
Hello shane.harvey@mongodb.com, after discussing this SERVER ticket with both Max and Sam we've identified two scenarios which might concern drivers, they are:
This exception only tells a connection wasn't available within the requested timeframe. The underlying reason gets hidden away from the requester and we believe it should be re-tried. The easiest approach would be to have it rewritten to MaxTimeMSExpired at either the client or the sharding layer. I hope you may answer in terms of the client expectation, if it would make sense to rewrite one exception into the other, and if so, if that could be done even if the specified timeout was shorter than what the command specified or if the executor should add a retry logic itself. Please let me know if all of that makes sense, |
| Comment by Daniel Morilha (Inactive) [ 26/Apr/22 ] |
|
After discussing this with amirsaman.memaripour@mongodb.com, MaxTimeMSExpired only relates with client operations.
Based on the above stack trace I think the exception should go on RemoteCommandRetryScheduler which is a client's concern. |
| Comment by Daniel Morilha (Inactive) [ 12/Apr/22 ] |
| Comment by Rui Liu [ 12/Apr/22 ] |
|
I have recently seen BF-24814 that seems to be related to this. It happens on the {taskExecutorPoolSize: 4}build variant. I will temporarily allow the NetworkInterfaceExceededTimeLimit code and leave a TODO for this ticket. |
| Comment by Daniel Morilha (Inactive) [ 11/Apr/22 ] |
|
The error surfaces while running `jstests/core/txns/many_txns.js` on a Giving the load, query commands may exceed their `maxTimeMS` and may get It is worth mentioning this could not be reproduced locally. |
| Comment by Max Hirschhorn [ 08/Apr/22 ] |
|
Reopening this ticket because the MaxTimeMSExpired error code (50) is semantically meaningful to drivers (e.g. ExecutionTimeout exception type, behavior of adding label UnknownTransactionCommitResult to commitTransaction with maxTimeMS) and must be used for errors related to using maxTimeMS. See also |
| Comment by Lauren Lewis (Inactive) [ 24/Feb/22 ] |
|
We haven’t heard back from you for at least one calendar year, so this issue is being closed. If this is still an issue for you, please provide additional information and we will reopen the ticket. |
| Comment by David Storch [ 09/Sep/19 ] |
|
The query team believes that this should be triaged by the service architecture team. |