[SERVER-30768] Primary queries using maxTimeMS cause temporary shard write unavailability if ExceededTimeLimit Created: 21/Aug/17  Updated: 30/Oct/23  Resolved: 12/Dec/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.6, 3.6.0-rc4
Fix Version/s: 3.4.19, 3.6.1, 3.7.1

Type: Bug Priority: Major - P3
Reporter: Oleg Rekutin Assignee: Jack Mulrow
Resolution: Fixed Votes: 3
Labels: bkp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.6, v3.4
Steps To Reproduce:
  1. Set up a sharded cluster with replica sets
  2. Issue a find(query).maxTimeMS(1) command to cause a timeout, such that the Mongo shell prints:

    E QUERY    [thread1] Error: error: {
    	"ok" : 0,
    	"errmsg" : "operation exceeded time limit",
    	"code" : 50,
    	"codeName" : "ExceededTimeLimit"
    }
    

  3. At this point, MongoS logs that it marked the primary as failed:

    # 2017-08-21T21:11:05.717+0000 I NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-1-0] Marking host shard3.xyz:27017 as failed :: caused by :: ExceededTimeLimit: operation exceeded time limit

  4. Immediately issue a write command, which will fail with:

    error 133: Write failed with error code 133 and error message 'could not find host matching read preference { mode: "primary", tags: [ {} ] } for set xyz_shard3
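
Put together, a condensed shell sketch of the same sequence (not from the original report: the host, database, and collection names are placeholders, and the collection needs enough data for a 1 ms limit to actually trip):

    // Repro sketch only -- "mongos.xyz", "somedb", and "coll" are placeholder names.
    var conn = new Mongo("mongos.xyz:27017");
    var db = conn.getDB("somedb");

    // Step 2: force the timeout; expect { ok: 0, code: 50 } (ExceededTimeLimit on 3.4.x).
    printjson(db.runCommand({find: "coll", filter: {}, maxTimeMS: 1}));

    // Step 4: on affected versions the very next write through this mongos fails with
    // "could not find host matching read preference { mode: \"primary\" ... }".
    printjson(db.getCollection("coll").insert({x: 1}));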

Sprint: Sharding 2017-12-04, Sharding 2017-12-18
Participants:

 Description   

Setup:

Sharded cluster with replica set shards. MongoDB v3.4.6. WiredTiger with snappy.
Collection X exists only on 1 shard (not sharded, probably not relevant).

Problem:

When a query fails due to a maxTimeMS timeout (which happens now and again, since we are using a fairly tight limit), MongoS marks the node as failed. (This is incorrect: the node is NOT failed.)

Result:

Since the query was against the primary, and the primary is marked as failed, subsequent write operations fail due to the apparent unavailability of the primary. This lasts for one to a few seconds, presumably until the MongoS heartbeat monitor detects the primary as up again.

This renders $maxTimeMS dangerous to use for queries against the primary: any timed-out query will temporarily make the shard write-unavailable.

Furthermore, it seems architecturally wrong for MongoS to mark the host as failed without then triggering a failover. The MongoS "failed primary" logic is completely disconnected from the actual primary/replica failover/election logic, so when MongoS reports "no primary found" for a shard, it is not because the replica set actually lacks a primary: there is a primary and it's healthy.
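
To illustrate that claim (not part of the original report), one can connect directly to the very node mongos just marked as failed; a minimal shell sketch, with the host name as a placeholder:

    // Sketch: talk directly to the shard member that mongos marked as failed.
    // "shard3.xyz:27017" is a placeholder for the shard primary's address.
    var primary = new Mongo("shard3.xyz:27017");
    var res = primary.getDB("admin").runCommand({isMaster: 1});
    assert(res.ismaster);  // the node still reports itself as a healthy primary
    assert.eq(1, primary.getDB("admin").runCommand({replSetGetStatus: 1}).myState);  // 1 == PRIMARY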

(I think that this problem applies to queries that hit replicas as well, where the replica is marked as failed, but I haven't specifically tested that.)



 Comments   
Comment by Githook User [ 20/Dec/18 ]

Author:

{'username': 'jsmulrow', 'email': 'jack.mulrow@mongodb.com', 'name': 'Jack Mulrow'}

Message: SERVER-30768 Mark hosts as down on NetworkInterfaceExceededTimeLimit instead of ExceededTimeLimit

(cherry picked from commit 06b3488461d820c9081e5d7fca0fc028470922ea)
Branch: v3.4
https://github.com/mongodb/mongo/commit/9aafd59034c627b4a4cbda492bce3cc6cbef7b9a

Comment by Ramon Fernandez Marina [ 20/Dec/18 ]

paulw@bitheads.com, there's a pull request for the backport that is currently being assessed for quality and safety. We'll post updates to this ticket as they happen.

Thanks for your continued patience,
Ramón.

Comment by Paul Winterhalder [ 08/Dec/18 ]

Question - is there a technical reason why this fix wasn't also applied to the v3.4 branch?

We just upgraded from 3.2 -> 3.4, and our production servers are now hitting this issue...  We're looking at the possibility of applying the same fix to the 3.4 branch and rebuilding - but maybe those more intimate with the code have a reason why it wouldn't work (and thus wasn't done)?

Any thoughts would be appreciated. Thanks!

Paul.

Comment by Githook User [ 11/Dec/17 ]

Author:

{'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'}

Message: SERVER-30768 Mark hosts as down on NetworkInterfaceExceededTimeLimit instead of ExceededTimeLimit

(cherry picked from commit 06b3488461d820c9081e5d7fca0fc028470922ea)
Branch: v3.6
https://github.com/mongodb/mongo/commit/5708ae71e0334874e63951a2868ec30e09287806

Comment by Githook User [ 11/Dec/17 ]

Author:

{'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'}

Message: SERVER-30768 Mark hosts as down on NetworkInterfaceExceededTimeLimit instead of ExceededTimeLimit
Branch: master
https://github.com/mongodb/mongo/commit/06b3488461d820c9081e5d7fca0fc028470922ea

Comment by Kaloian Manassiev [ 21/Nov/17 ]

From looking at the code, I do not see a good reason to notify the replica set monitor on ExceededTimeLimit errors, because these indicate that a user-requested time limit was exceeded, not necessarily a problem with the system.

As part of this fix we should inspect all the places that return ExceededTimeLimit, and possibly also add NetworkInterfaceExceededTimeLimit to the list of errors that mark shards as failed.

Comment by Oleg Rekutin [ 24/Aug/17 ]

Thank you Mark!

I've been keeping an eye on the messages about hosts being marked as failed. I noticed that there's another circumstance that leads to a primary being marked as unavailable:

2017-08-23T15:38:27.153+0000 I NETWORK  [conn28770] Marking host m17a.xyz:27017 as failed :: caused by :: RecvStaleConfig: can't query replica set node m17a.xyz:27017: stale config in DBClientCursor::dataReceived() ( ns : somedb.some_collection, received : 0|0||000000000000000000000000, wanted : 1|0||597ba0655167407959b03891, recv )

Should this be a separate issue? It feels related to this one. It seems incorrect to flag the entire primary as down just because a stale-config refresh is happening for one collection (out of many).

Comment by Mark Agarunov [ 22/Aug/17 ]

Hello oleg@evergage.com,

Thank you for the detailed report. I've managed to reproduce this behavior using the steps you outlined and have set the fixVersion to "Needs Triage" for this issue to be scheduled against our currently planned work. Updates will be posted on this ticket as they happen.

Thanks,
Mark

Comment by Oleg Rekutin [ 21/Aug/17 ]

It looks like max_time_ms_sharded.js asserts only that the timeout was correctly sent back, but does not assert that the queried shard server is still reachable afterwards (and a primary is present).
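
For example, the test could additionally assert something along these lines (a sketch only, not the actual contents of max_time_ms_sharded.js; it assumes the test's existing mongos-backed collection handle, here called coll, with at least one document in it):

    // Force a maxTimeMS timeout through mongos against the shard primary.
    var res = coll.getDB().runCommand({
        find: coll.getName(),
        filter: {$where: "sleep(100); return true;"},
        maxTimeMS: 10
    });
    assert.commandFailedWithCode(res, 50);  // 50 == ExceededTimeLimit on 3.4.x

    // The shard should still be write-available through the same mongos afterwards.
    assert.writeOK(coll.insert({_id: "still-writable-after-timeout"}));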

Comment by Oleg Rekutin [ 21/Aug/17 ]

I believe the problem lies within shard_remote.cpp in ShardRemote::updateReplSetMonitor:

https://github.com/mongodb/mongo/blob/r3.4.6/src/mongo/s/client/shard_remote.cpp#L135

It considers ExceededTimeLimit to be an unreachable-host situation. That might have been OK if maxTimeMS were used for internal system queries only, but it's not OK when an application query specifies maxTimeMS.

I recommend removing the clause that marks the host as failed on ExceededTimeLimit. Failing that, we would have to detect whether the time limit was exceeded on an internal system command or on an application find.
