[SERVER-30768] Primary queries using maxTimeMS cause temporary shard write unavailability if ExceededTimeLimit Created: 21/Aug/17 Updated: 30/Oct/23 Resolved: 12/Dec/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.4.6, 3.6.0-rc4 |
| Fix Version/s: | 3.4.19, 3.6.1, 3.7.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Oleg Rekutin | Assignee: | Jack Mulrow |
| Resolution: | Fixed | Votes: | 3 |
| Labels: | bkp |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v3.6, v3.4 |
| Steps To Reproduce: |
| Sprint: | Sharding 2017-12-04, Sharding 2017-12-18 |
| Participants: | |
| Description |
Setup: Sharded cluster with replica set shards. MongoDB v3.4.6, WiredTiger with snappy compression.

Problem: When a query fails due to a maxTimeMS timeout (which happens now and again, since we are using a fairly tight limit), mongos marks the node as failed. This is incorrect: the node is NOT failed.

Result: Since the query was against the primary, and the primary is marked as failed, subsequent write operations fail because no primary is available. This lasts for a second or a few, presumably until the mongos heartbeat monitor detects the primary as up again.

This makes maxTimeMS dangerous to use for queries against the primary: any timed-out query will temporarily make the shard write-unavailable.

Furthermore, it seems architecturally wrong for mongos to mark the host as failed without triggering a failover. The mongos "failed primary" logic is completely disconnected from the actual primary/replica failover and election logic, so when mongos reports "no primary found" for a shard, it is not because that replica set actually has no primary; there is a primary, and it is healthy.

(I think this problem also applies to queries that hit secondaries, where the secondary would be marked as failed, but I haven't specifically tested that.)
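To make the failure pattern above concrete, here is a minimal repro sketch. It assumes the mongocxx C++ driver and placeholder names (the mongos address, database, collection, and filter are illustrative; the original report does not say which client was used):

```cpp
#include <chrono>
#include <iostream>

#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/exception/operation_exception.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/options/find.hpp>
#include <mongocxx/uri.hpp>

int main() {
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_document;

    mongocxx::instance inst{};
    // Placeholder mongos address for a sharded cluster with replica set shards.
    mongocxx::client conn{mongocxx::uri{"mongodb://mongos.example.net:27017"}};
    auto coll = conn["test"]["events"];

    // A deliberately tight time limit that the query is expected to exceed.
    mongocxx::options::find opts;
    opts.max_time(std::chrono::milliseconds{50});

    try {
        for (auto&& doc : coll.find(make_document(kvp("status", "active")), opts)) {
            (void)doc;
        }
    } catch (const mongocxx::operation_exception& e) {
        // Expected: the shard reports ExceededTimeLimit once maxTimeMS elapses.
        std::cerr << "find timed out: " << e.what() << std::endl;
    }

    // On affected versions, this write can fail for a second or a few with a
    // "could not find host matching read preference { mode: \"primary\" }"-style
    // error, because mongos marked the shard's healthy primary as failed.
    try {
        coll.insert_one(make_document(kvp("x", 1)));
    } catch (const mongocxx::operation_exception& e) {
        std::cerr << "insert failed: " << e.what() << std::endl;
    }
}
```

The first operation's timeout is expected; the point of the report is that the subsequent insert should not fail.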
| Comments |
| Comment by Githook User [ 20/Dec/18 ] |
Author: Jack Mulrow (jsmulrow) <jack.mulrow@mongodb.com>
Message: (cherry picked from commit 06b3488461d820c9081e5d7fca0fc028470922ea)
| Comment by Ramon Fernandez Marina [ 20/Dec/18 ] |
paulw@bitheads.com, there's a pull request for the backport that is currently being assessed for quality and safety. We'll post updates to this ticket as they happen. Thanks for your continued patience.
| Comment by Paul Winterhalder [ 08/Dec/18 ] |
Question: is there a technical reason why this fix wasn't also applied to the v3.4 branch? We just upgraded from 3.2 -> 3.4, and our production servers are now hitting this issue. We're looking at the possibility of applying the same fix to the 3.4 branch and rebuilding, but maybe those more intimate with the code have a reason why it wouldn't work (and thus wasn't done)? Any thoughts would be appreciated. Thanks! Paul.
| Comment by Githook User [ 11/Dec/17 ] |
Author: Jack Mulrow (jsmulrow) <jack.mulrow@mongodb.com>
Message: (cherry picked from commit 06b3488461d820c9081e5d7fca0fc028470922ea)
| Comment by Githook User [ 11/Dec/17 ] |
Author: Jack Mulrow (jsmulrow) <jack.mulrow@mongodb.com>
Message:
| Comment by Kaloian Manassiev [ 21/Nov/17 ] |
From looking at the code, I do not see a good reason to notify the replica set monitor on ExceededTimeLimit errors, because these indicate that a user-requested time limit was exceeded, not necessarily a problem with the system. As part of this fix we should inspect all the places that return ExceededTimeLimit, and also possibly add NetworkInterfaceExceededTimeLimit to the list of errors that mark shards as failed.
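A minimal sketch of that distinction, assuming a hypothetical helper (shouldMarkHostUnreachable does not exist in the server; only the error-code names are real ErrorCodes identifiers):

```cpp
#include "mongo/base/error_codes.h"
#include "mongo/base/status.h"

namespace mongo {

// Hypothetical helper: decide whether a remote command failure should count
// against the host's health in the replica set monitor.
bool shouldMarkHostUnreachable(const Status& remoteCommandStatus) {
    // Genuine network errors mean we could not talk to the host at all.
    if (ErrorCodes::isNetworkError(remoteCommandStatus.code())) {
        return true;
    }

    // NetworkInterfaceExceededTimeLimit is raised locally when mongos's own
    // RPC deadline expires, so it may indicate an unreachable or unhealthy host.
    if (remoteCommandStatus == ErrorCodes::NetworkInterfaceExceededTimeLimit) {
        return true;
    }

    // ExceededTimeLimit means the shard responded and reported that a
    // user-requested maxTimeMS expired; the host is healthy, so it should
    // not be marked as failed.
    return false;
}

}  // namespace mongo
```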
| Comment by Oleg Rekutin [ 24/Aug/17 ] |
Thank you Mark! I've been paying attention to the log message for marking hosts as failed, and I noticed another circumstance that led to a primary being marked as unavailable:

Should this be a separate issue? It feels related to this one. It seems incorrect to flag the entire primary as down just because a stale config data refresh is happening for one collection (out of many).
| Comment by Mark Agarunov [ 22/Aug/17 ] |
Hello oleg@evergage.com, thank you for the detailed report. I've managed to reproduce this behavior using the steps you outlined and have set the fixVersion to "Needs Triage" for this issue to be scheduled against our currently planned work. Updates will be posted on this ticket as they happen. Thanks.
| Comment by Oleg Rekutin [ 21/Aug/17 ] |
It looks like max_time_ms_sharded.js asserts only that the timeout was correctly sent back, but does not assert that the queried shard server is still reachable afterwards (and a primary is present).
| Comment by Oleg Rekutin [ 21/Aug/17 ] |
I believe the problem lies within shard_remote.cpp, in ShardRemote::updateReplSetMonitor: https://github.com/mongodb/mongo/blob/r3.4.6/src/mongo/s/client/shard_remote.cpp#L135

It considers ExceededTimeLimit to be an unreachable-host situation. That might have been OK if maxTimeMS were used for internal system queries only, but it is not OK when an application query specifies maxTimeMS. I recommend removing the clause that marks the host as failed on ExceededTimeLimit. Failing that, we have to detect whether the time limit was exceeded on a system/internal command or on an application find.
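For reference, a paraphrased (not verbatim) sketch of the branch structure at the linked line, with the clause in question last; the exact conditions in r3.4.6 may differ slightly:

```cpp
// Simplified paraphrase of ShardRemote::updateReplSetMonitor as of r3.4.6;
// see the linked source for the exact branches.
void ShardRemote::updateReplSetMonitor(const HostAndPort& remoteHost,
                                       const Status& remoteCommandStatus) {
    if (remoteCommandStatus.isOK()) {
        return;
    }

    if (ErrorCodes::isNotMasterError(remoteCommandStatus.code())) {
        // The host answered but is no longer primary: retarget writes.
        _targeter->markHostNotMaster(remoteHost, remoteCommandStatus);
    } else if (ErrorCodes::isNetworkError(remoteCommandStatus.code())) {
        // We could not reach the host at all.
        _targeter->markHostUnreachable(remoteHost, remoteCommandStatus);
    } else if (remoteCommandStatus == ErrorCodes::ExceededTimeLimit) {
        // The clause at issue: a maxTimeMS expiry from an application query
        // lands here and marks a healthy primary as unreachable, which is
        // what makes the shard temporarily write-unavailable.
        _targeter->markHostUnreachable(remoteHost, remoteCommandStatus);
    }
}
```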