[SERVER-53973] Migration manager recovery should handle failed findIntersectingChunk during refineShardKey Created: 22/Jan/21 Updated: 29/Oct/23 Resolved: 30/Mar/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.4.4 |
| Fix Version/s: | 4.4.6, 5.0.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Alexander Taskov (Inactive) | Assignee: | Tommaso Tocci |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | Sharding-EMEA, sharding-wfbf-day |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4 |
| Sprint: | Sharding 2021-02-22, Sharding 2021-03-08, Sharding 2021-03-22, Sharding 2021-04-05 |
| Participants: | |
| Linked BF Score: | 26 |
| Description |
|
findIntersectingChunkWithSimpleCollation is called with the original shard key after a refineShardKey operation. This throws a ShardKeyNotFound exception, which goes unhandled. The changes in https://jira.mongodb.org/browse/SERVER-48365 should have addressed this by ensuring that the migration recovery entries are written while holding the dist lock. However, there is an interleaving where this is not the case. |
| Comments |
| Comment by Githook User [ 07/Apr/21 ] |
|
Author: {'name': 'Tommaso Tocci', 'email': 'tommaso.tocci@mongodb.com', 'username': 'toto-dev'}Message: |
| Comment by Githook User [ 30/Mar/21 ] |
|
Author: {'name': 'Tommaso Tocci', 'email': 'tommaso.tocci@mongodb.com', 'username': 'toto-dev'}Message: |
| Comment by Tommaso Tocci [ 27/Mar/21 ] |
|
kaloian.manassiev was right about the race scenario; I was able to reproduce it by adding a couple of sleeps. This is the patch to reproduce the error |
| Comment by Kaloian Manassiev [ 09/Mar/21 ] |
|
I think the problem is just that the following race condition can happen:
I think it will be extremely difficult to fix the distributed lock unlocking issue, so instead we should just address the crash by try/catching the lookup of the chunk in the ChunkManager and abandoning that migration recovery in that case. |
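For illustration only, the approach suggested in the comment above can be sketched in a few lines of Python. This is a toy model, not the server's C++ code and not the committed fix: `ToyChunkManager`, `find_intersecting_chunk`, `recover_migrations`, and the shape of the recovery documents are all hypothetical names invented for this sketch. It assumes that a lookup with a pre-refine key (one missing the newly added shard key field) raises a `ShardKeyNotFound`-like error, and that recovery catches it and abandons that one migration instead of crashing.

```python
from dataclasses import dataclass


class ShardKeyNotFound(Exception):
    """Toy stand-in for the server's ShardKeyNotFound error."""


@dataclass
class ToyChunkManager:
    """Hypothetical, heavily simplified routing table for one collection."""
    shard_key_fields: tuple  # current shard key pattern, e.g. ("a", "b")
    chunks: list             # (min_key, max_key, shard) entries; keys are tuples

    def find_intersecting_chunk(self, key: dict) -> str:
        # A key lacking any field of the current pattern cannot be routed.
        missing = [f for f in self.shard_key_fields if f not in key]
        if missing:
            raise ShardKeyNotFound(f"key {key} is missing fields {missing}")
        point = tuple(key[f] for f in self.shard_key_fields)
        for lo, hi, shard in self.chunks:
            if lo <= point < hi:  # chunk ranges are [min, max)
                return shard
        raise ShardKeyNotFound(f"no chunk contains {key}")


def recover_migrations(chunk_manager, recovery_docs):
    """Resume recoverable migrations; abandon those whose chunk lookup fails."""
    resumed, abandoned = [], []
    for doc in recovery_docs:
        try:
            shard = chunk_manager.find_intersecting_chunk(doc["min"])
        except ShardKeyNotFound:
            # The recovery entry was written against the pre-refine key:
            # skip this migration rather than letting the error propagate.
            abandoned.append(doc["id"])
            continue
        resumed.append((doc["id"], shard))
    return resumed, abandoned
```

In this sketch, a recovery document whose `min` bound only carries the original key field (say `{"a": 10}` after the pattern was refined to `("a", "b")`) ends up in `abandoned`, while documents written with the refined key are resumed as usual.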
| Comment by Tommaso Tocci [ 04/Mar/21 ] |
|
I'll bounce this back to alex.taskov to add more context. |