[SERVER-58116] StaleShardVersion error not triggering a refresh in moveChunk Created: 28/Jun/21 Updated: 29/Oct/23 Resolved: 14/Mar/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 5.0.0, 5.1.0 |
| Fix Version/s: | 6.0.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Simon Gratzer (Inactive) | Assignee: | Antonio Fuschetto |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-wfbf-day | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Steps To Reproduce: | https://jira.mongodb.org/browse/BF-21676?filter=-1 Rerun |
||||||||
| Sprint: | Sharding EMEA 2021-12-27, Sharding EMEA 2022-01-10, Sharding EMEA 2022-01-24, Sharding EMEA 2022-02-07, Sharding EMEA 2022-02-21, Sharding EMEA 2022-03-07, Sharding EMEA 2022-03-21 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 32 | ||||||||
| Description |
|
When a stale mongos receives a moveChunk command, it first performs a refresh. However, the mongos might have refreshed from a stale configsvr secondary that has not yet seen the latest split / merge operation. Because another mongos performed the split, this mongos may not yet know of a clusterTime inclusive of it, so there is no causal consistency guarantee. For a moveChunk operation the shard will later throw a StaleShardVersion error here. However, the mongos will not retry the operation because this code is missing the StaleConfigInfo extra information, which causes the code in strategy.cpp to abort a retry attempt. Possible solutions:
|
| Comments |
| Comment by Antonio Fuschetto [ 14/Mar/22 ] |
|
kaloian.manassiev, that's correct. Throwing a StaleConfigInfo here resolves the problem reported in the ticket. Thanks. |
| Comment by Kaloian Manassiev [ 11/Mar/22 ] |
|
antonio.fuschetto, I have finally committed |
| Comment by Kaloian Manassiev [ 11/Feb/22 ] |
|
I think you are right, antonio.fuschetto. You can block it on |
| Comment by Antonio Fuschetto [ 10/Feb/22 ] |
|
The current logic, which retargets a failed command with an error in the NeedRetargettingError category, needs the StaleConfigInfo as extra info to proceed. When it is missing, the command is not retargeted, and this is what happens in the case of the problem that gave rise to this ticket. A StaleShardVersion error is raised when the shard is unable to find a chunk with the requested bounds, and no extra info is passed. As a consequence, the mongos won't retarget the moveChunk command. Kal is currently working on kaloian.manassiev, any thoughts? |
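The retargeting behavior described in this comment can be illustrated with a small model. This is a sketch only, not the server code: the class `StaleConfigInfo` below is a simplified stand-in for the real extra info, and `NeedRetargetingError`, `run_with_retarget`, and `make_shard` are hypothetical names invented for the illustration. It assumes the router retries only when the error carries the extra info that tells it which namespace to refresh.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StaleConfigInfo:
    """Simplified stand-in for the extra info attached to a stale-routing error."""
    namespace: str


class NeedRetargetingError(Exception):
    """Hypothetical error in the retargeting category; `extra` may be absent."""

    def __init__(self, message: str, extra: Optional[StaleConfigInfo] = None):
        super().__init__(message)
        self.extra = extra


def run_with_retarget(command: Callable[[int], str],
                      refresh: Callable[[str], None],
                      max_attempts: int = 3) -> str:
    """Run `command`, refreshing and retrying only when extra info is present."""
    for attempt in range(max_attempts):
        try:
            return command(attempt)
        except NeedRetargetingError as err:
            if err.extra is None:
                # Without extra info the router cannot tell what to refresh,
                # so the error propagates and the command is not retargeted.
                raise
            refresh(err.extra.namespace)
    raise RuntimeError("retry attempts exhausted")


def make_shard(with_extra: bool) -> Callable[[int], str]:
    """Build a fake shard that fails once, with or without the extra info."""
    def move_chunk(attempt: int) -> str:
        if attempt == 0:
            extra = StaleConfigInfo("test.coll") if with_extra else None
            raise NeedRetargetingError("chunk with requested bounds not found", extra)
        return "ok"
    return move_chunk
```

In this model, `run_with_retarget(make_shard(True), refresh)` refreshes once and succeeds on the second attempt, while `run_with_retarget(make_shard(False), refresh)` re-raises the error immediately, which mirrors the failure mode reported in the ticket.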