[SERVER-71618] Investigate non auto-retry commands on StaleConfig error Created: 25/Nov/22  Updated: 07/Mar/23  Resolved: 07/Mar/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Silvia Surroca
Assignee: Silvia Surroca
Resolution: Done
Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Sharding EMEA
Sprint: Sharding EMEA 2022-12-26, Sharding EMEA 2023-01-09, Sharding EMEA 2023-01-23
Participants:

 Description   

We have recently received some complaints that StaleConfig errors are not retried when running a count command.

When mongos receives a count command, it sends a scatter-gather request to all shards owning chunks. Since the scatter-gather utility does not retry on StaleConfig errors, the StaleConfig error is reported to the user.

After browsing the code, we have seen that several commands use the same scatterGatherVersionedTargetByRoutingTable utility, so they are likely not to retry on StaleConfig either.

The aim of this ticket is to investigate in more depth which commands do not retry on StaleConfig and to figure out how to solve this.



 Comments   
Comment by Silvia Surroca [ 07/Mar/23 ]

The assumption in the ticket description is wrong: the retry on StaleConfig errors is handled in the service_entry_point, which sits at a higher level than the scatter-gather machinery.

So it is correct that scatter-gather does not retry on StaleConfig errors, since that exception must bubble up until it is handled by the service_entry_point.
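
The layering can be illustrated with a minimal Python sketch. This is not the server code: the Router class, its methods, the version bookkeeping and the retry budget are all hypothetical names and values chosen for illustration. It only shows the shape of the design, where the lower layer lets StaleConfig propagate and a single higher layer catches it, refreshes the routing information and retries.

```python
class StaleConfig(Exception):
    """Raised when the router's cached routing information is out of date."""


class Router:
    """Toy stand-in for a router; every name here is hypothetical."""

    MAX_STALE_RETRIES = 3  # assumed retry budget, chosen only for this sketch

    def __init__(self, shard_versions):
        # Versions the shards actually hold vs. the router's cached view.
        self.shard_versions = dict(shard_versions)
        self.cached_versions = {shard: 0 for shard in shard_versions}
        self.attempts = 0

    def scatter_gather(self, command):
        # Lower layer: target every shard and deliberately do NOT catch
        # StaleConfig, so the exception bubbles up to the caller.
        results = []
        for shard, actual in self.shard_versions.items():
            if self.cached_versions[shard] != actual:
                raise StaleConfig(shard)
            results.append((shard, command))
        return results

    def refresh_routing_table(self):
        # Bring the cached view in line with what the shards hold.
        self.cached_versions = dict(self.shard_versions)

    def service_entry_point(self, command):
        # Higher layer: the single place where StaleConfig is caught,
        # routing information is refreshed, and the command is retried.
        for _ in range(self.MAX_STALE_RETRIES):
            self.attempts += 1
            try:
                return self.scatter_gather(command)
            except StaleConfig:
                self.refresh_routing_table()
        raise StaleConfig("retries exhausted")


router = Router({"shard0": 1, "shard1": 2})
replies = router.service_entry_point("count")
```

In this sketch the first attempt fails because the cached versions are stale, the entry point refreshes them, and the second attempt succeeds without the caller ever seeing the error. Keeping the retry in one outer layer means the retry policy lives in a single place and applies to every command that goes through it.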

Closing this ticket as Done since the investigation has concluded.

 

PS: thanks tommaso.tocci@mongodb.com for reaching out to me

Generated at Thu Feb 08 06:19:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.