[SERVER-58116] StaleShardVersion error not triggering a refresh in moveChunk Created: 28/Jun/21  Updated: 29/Oct/23  Resolved: 14/Mar/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.0.0, 5.1.0
Fix Version/s: 6.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Simon Gratzer (Inactive) Assignee: Antonio Fuschetto
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-63327 Remove usages of the StaleShardVersio... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

https://jira.mongodb.org/browse/BF-21676?filter=-1

Rerun

Sprint: Sharding EMEA 2021-12-27, Sharding EMEA 2022-01-10, Sharding EMEA 2022-01-24, Sharding EMEA 2022-02-07, Sharding EMEA 2022-02-21, Sharding EMEA 2022-03-07, Sharding EMEA 2022-03-21
Participants:
Linked BF Score: 32

 Description   

When a stale mongos gets a moveChunk command, it first does a refresh. However, the mongos might have refreshed from a stale configsvr secondary that has not yet seen the latest split / merge operation.

The mongos may not yet know of a clusterTime that includes the split, because another mongos performed it, so there is no causal consistency guarantee.
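The stale refresh described above can be illustrated with a toy model. This is only a sketch: `ConfigNode`, `refresh`, and the integer versions are hypothetical stand-ins, not the server's actual types or refresh path.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigNode:
    """Hypothetical stand-in for one config server replica set member."""
    collection_version: int = 1
    chunks: list = field(default_factory=lambda: [(0, 100)])


def refresh(router_cache: dict, node: ConfigNode) -> None:
    # The router copies whatever routing data the node it read from has.
    router_cache["version"] = node.collection_version
    router_cache["chunks"] = list(node.chunks)


primary = ConfigNode()
secondary = ConfigNode()

# Another router splits [0, 100) at 50; only the primary has applied it so far.
primary.chunks = [(0, 50), (50, 100)]
primary.collection_version = 2

cache = {}
refresh(cache, secondary)  # the refresh is served by the lagging secondary

# The router refreshed, yet it still holds the pre-split routing table.
stale_after_refresh = cache["version"] < primary.collection_version
```

Because this router did not perform the split itself, nothing in the model forces its read to wait for the secondary to catch up.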

For a moveChunk operation the shard will later throw a StaleShardVersion error here.

However, the mongos will not retry the operation because this error code does not carry the StaleConfigInfo extra information, which causes the code in strategy.cpp to abort the retry attempt.
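A minimal sketch of that retry decision follows, assuming a simplified error object. `CommandError`, `should_retarget`, and the set of category members are illustrative names only and do not mirror the real code in strategy.cpp.

```python
# Assumed membership of the retargeting category, for illustration only.
NEED_RETARGETTING = {"StaleConfig", "StaleShardVersion"}


class CommandError(Exception):
    """Hypothetical error returned by a shard to the router."""

    def __init__(self, code_name, extra_info=None):
        super().__init__(code_name)
        self.code_name = code_name
        self.extra_info = extra_info  # stands in for StaleConfigInfo


def should_retarget(err: CommandError) -> bool:
    # Errors outside the retargeting category are never retried.
    if err.code_name not in NEED_RETARGETTING:
        return False
    # Without the extra info the router cannot tell what to refresh,
    # so the retry attempt is abandoned.
    return err.extra_info is not None


bare_error = CommandError("StaleShardVersion")
rich_error = CommandError("StaleConfig", {"wanted": 2, "received": 1})
```

In this model, `should_retarget(bare_error)` is false even though the error is in the retargeting category, which is the behavior reported in this ticket.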

Possible solutions:

  • Attach StaleConfigInfo to the exceptions on the shard
  • Perform a version check on the configsvr
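The first option can be sketched as follows, assuming the shard attaches the version it wants to the error so the router can refresh and retry. All names here (`StaleConfigError`, `Shard`, `route_move_chunk`, the dictionary fields) are hypothetical simplifications, not the server's API.

```python
class StaleConfigError(Exception):
    """Hypothetical error that carries the extra info."""

    def __init__(self, info):
        super().__init__("StaleConfig")
        self.info = info  # stands in for StaleConfigInfo


class Shard:
    def __init__(self):
        self.version = 2
        self.chunks = {(0, 50), (50, 100)}  # state after the split

    def move_chunk(self, bounds, router_version):
        if bounds not in self.chunks:
            # Attach the versions so the router knows a refresh is needed.
            raise StaleConfigError(
                {"received": router_version, "wanted": self.version})
        return "ok"


def find_chunk(chunks, key):
    # Return the cached chunk whose half-open range contains the key.
    return next(c for c in chunks if c[0] <= key < c[1])


def route_move_chunk(shard, cache, up_to_date, key, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        bounds = find_chunk(cache["chunks"], key)
        try:
            return shard.move_chunk(bounds, cache["version"]), attempt
        except StaleConfigError as err:
            # Refresh only if the source satisfies the wanted version.
            if up_to_date["version"] >= err.info["wanted"]:
                cache.update(up_to_date)
    raise RuntimeError("gave up after repeated stale errors")


cache = {"version": 1, "chunks": [(0, 100)]}  # stale router view
up_to_date = {"version": 2, "chunks": [(0, 50), (50, 100)]}
result, attempts = route_move_chunk(Shard(), cache, up_to_date, key=10)
```

With the stale cache, the first attempt targets the pre-split bounds and fails; the attached info triggers a refresh, and the second attempt succeeds.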

 



 Comments   
Comment by Antonio Fuschetto [ 14/Mar/22 ]

kaloian.manassiev, that's correct. Throwing a StaleConfigInfo here resolves the problem reported in the ticket. Thanks.

Comment by Kaloian Manassiev [ 11/Mar/22 ]

antonio.fuschetto, I have finally committed SERVER-63327. Let me know if that unblocks you now.

Comment by Kaloian Manassiev [ 11/Feb/22 ]

I think you are right, antonio.fuschetto. You can block it on SERVER-63327.

Comment by Antonio Fuschetto [ 10/Feb/22 ]

The current logic, which retargets a failed command when the error is in the NeedRetargettingError category, needs the StaleConfigInfo as extra info to proceed. When it is missing, the command is not retargeted, and this is what happens in the case that gave rise to this ticket.

The StaleShardVersion error is raised when the shard is unable to find a chunk with the requested bounds, and no extra info is passed. As a consequence, the mongos won't retarget the moveChunk command.

Kal is currently working on SERVER-63327, with the goal of getting rid of the StaleShardVersion error and replacing it with the StaleConfig error. The latter is in the same NeedRetargettingError category and also provides extra info (i.e., StaleConfigInfo). This implies that the current logic on mongos will be able to retarget the moveChunk command when the shard is unable to find the chunk with the requested bounds.

kaloian.manassiev, any thoughts?
