[SERVER-23625] Some read-only operations (e.g. count, aggregate) hang indefinitely if the primary for the shard is unreachable from mongos Created: 08/Apr/16  Updated: 06/Feb/19  Resolved: 06/Feb/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Blake Oler
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File primary_shard_server_blackholed_from_mongos.js    
Issue Links:
Duplicate
duplicates SERVER-35679 General Interruption Facility Closed
Related
related to SERVER-23427 Add test to ensure that read-only ope... Closed
related to SERVER-17825 Remove setShardVersion from shard ver... Closed
related to SERVER-21005 Consistent maxTimeMs / timeout / inte... Closed
is related to SERVER-24457 Some commands fail when a shard they ... Closed
Operating System: ALL
Sprint: Sharding 2019-02-25
Participants:

 Description   

Even if you specify a 'secondary' read preference, we still try to call setShardVersion on the primary when running count, aggregate, mapReduce, etc. If the replica set monitor has already detected that the primary is unreachable, we skip the setShardVersion call and the operation works. If the monitor has not yet detected that the node we last knew as primary has become unreachable, we try to send setShardVersion to it, and that call hangs forever.
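The two cases in the description can be pictured with a small toy model. This is only a sketch for illustration, not mongos code: the functions `makeMonitorView`, `planSecondaryRead`, and `outcome`, and the host names used with them, are all hypothetical.

```javascript
// Toy model of the behavior described in this ticket. Not mongos code:
// every name here is hypothetical.

// What the replica set monitor currently believes about the shard.
function makeMonitorView(primaryHost, knownUnreachable) {
  return { primaryHost, knownUnreachable: new Set(knownUnreachable) };
}

// Lists the steps taken for a read with a 'secondary' read preference,
// following the description above.
function planSecondaryRead(view, secondaryHost) {
  const steps = [];
  if (view.knownUnreachable.has(view.primaryHost)) {
    // The monitor already marked the primary as down: the call is skipped.
    steps.push("skip setShardVersion");
  } else {
    // The monitor still believes the primary is reachable: the call is sent.
    steps.push(`setShardVersion -> ${view.primaryHost}`);
  }
  steps.push(`read -> ${secondaryHost}`);
  return steps;
}

// The result depends on which hosts are really unreachable,
// not on what the monitor believes.
function outcome(steps, actuallyUnreachable) {
  for (const step of steps) {
    const target = step.split(" -> ")[1];
    if (target && actuallyUnreachable.has(target)) {
      return `hangs at: ${step}`;
    }
  }
  return "completes";
}
```

With the primary blackholed, a stale view (`makeMonitorView("p:27017", [])`) leads to a plan that hangs at the setShardVersion step, while a view that already lists the primary as unreachable leads to a plan that completes against the secondary.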



 Comments   
Comment by Blake Oler [ 04/Feb/19 ]

This has been fixed on the current master branch. Jason Carey's interruptibility patch in SERVER-35679 allows the AsyncRequestSender (the mechanism used to send commands from mongos to shards) to be interruptible. Spencer Brody's ShardRemote patch in SERVER-37329 allows ShardRemote to be interruptible as well. These mechanisms are used for count, find, etc.

Both of these changes are only in the current working branch, meaning that maxTimeMS support is incomplete on previous releases. Do we want to backport this behavior to previous releases as part of this ticket, kaloian.manassiev?
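As a rough analogy for why interruptibility matters to maxTimeMS, the sketch below races a request that never completes against a deadline. It is plain JavaScript and does not reflect how AsyncRequestSender or ShardRemote are implemented; `sendToBlackholedHost` and `withDeadline` are hypothetical names chosen for this illustration.

```javascript
// Hypothetical sketch: none of these names come from the MongoDB code base.

// Models a request to a blackholed host: the promise never settles.
function sendToBlackholedHost() {
  return new Promise(() => {});
}

// Races `request` against a deadline so the caller always gets an answer.
function withDeadline(request, maxTimeMs) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`operation exceeded time limit of ${maxTimeMs} ms`)),
      maxTimeMs
    );
  });
  // Clear the timer either way so a finished request leaves nothing pending.
  return Promise.race([request, deadline]).finally(() => clearTimeout(timer));
}
```

Without the deadline, awaiting `sendToBlackholedHost()` would wait forever, which mirrors the hang in this ticket. With `withDeadline(sendToBlackholedHost(), 5000)`, the caller gets a rejection after roughly five seconds instead.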

Comment by Spencer Brody (Inactive) [ 08/Apr/16 ]

Attaching a jstest that reproduces the problem(s).

Generated at Thu Feb 08 04:03:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.