[SERVER-85145] ShardNotFound error should not be bubbled up when concurrently removing a shard and running operations Created: 12/Jan/24  Updated: 13/Jan/24

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Pol Pinol Assignee: Backlog - Catalog and Routing
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: repro.patch
Issue Links:
Related
Assigned Teams:
Catalog and Routing
Operating System: ALL
Participants:

 Description   

In HELP-54194, we discovered that some commands may fail while a removeShard is taking place, i.e. while a shard is draining. One example is “listIndexes”. Such a command can hit ShardNotFound as a transient error, triggered by specific timing within a specific window of time, and the error bubbles up to the user application. The failed command can be safely retried and succeeds afterwards. The exact reproduction test is attached to the comments.

The problem is that the user can see ShardNotFound bubble up when that may not be necessary, i.e. the mongos or the driver (an implementation decision) should retry the operation instead.

In summary, the goal of this ticket is to list all the commands that hit this error in the reproduction test, and to investigate and work on a feasible way to retry ShardNotFound without bubbling it up to the user when that is not necessary, as we do with other transient errors.
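
For illustration only, the retry behavior described above could look roughly like the sketch below, seen from the caller's side. CommandError, run_with_retry and its parameters are hypothetical names, not part of the server or of any driver. Whether the retry belongs in the mongos or in the driver is still the open implementation decision of this ticket.

```python
import time


class CommandError(Exception):
    """Hypothetical stand-in for a command failure carrying the server's codeName."""

    def __init__(self, code_name):
        super().__init__(code_name)
        self.code_name = code_name


def run_with_retry(command, max_attempts=3, delay_secs=0.0,
                   retryable=("ShardNotFound",)):
    """Call command() and retry it when it fails with a retryable codeName.

    Assumption: the failed command is safe to run again, as described for
    listIndexes in this ticket.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return command()
        except CommandError as err:
            # Re-raise errors we do not treat as transient, and give up
            # once the attempt budget is exhausted.
            if err.code_name not in retryable or attempt == max_attempts:
                raise
            time.sleep(delay_secs)
```

With this sketch, a command that fails once with ShardNotFound during the draining window and then succeeds returns its result to the caller without the error being surfaced. Any other error, or a ShardNotFound that persists past max_attempts, is still raised.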

