[SERVER-53973] Migration manager recovery should handle failed findIntersectingChunk during refineShardKey Created: 22/Jan/21  Updated: 29/Oct/23  Resolved: 30/Mar/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.4.4
Fix Version/s: 4.4.6, 5.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Alexander Taskov (Inactive) Assignee: Tommaso Tocci
Resolution: Fixed Votes: 0
Labels: Sharding-EMEA, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-61755 Migration recovery should handle refi... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Sharding 2021-02-22, Sharding 2021-03-08, Sharding 2021-03-22, Sharding 2021-04-05
Participants:
Linked BF Score: 26

 Description   

During migration recovery, findIntersectingChunkWithSimpleCollation is called with the original shard key after a refineShardKey operation has run. This causes a ShardKeyNotFound exception to be thrown, which goes unhandled.

This should have been addressed by the changes in SERVER-48365 (https://jira.mongodb.org/browse/SERVER-48365), which ensure that migration recovery entries are written while the dist lock is held. However, there is an interleaving in which this guarantee does not hold.
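
As a minimal, self-contained illustration of the failure mode (stand-in C++ types only; this is not the actual server API, and ChunkManager here merely models the real class's behavior):

    #include <stdexcept>
    #include <string>

    // Stand-in for the server's ShardKeyNotFound error.
    struct ShardKeyNotFoundException : std::runtime_error {
        using std::runtime_error::runtime_error;
    };

    // Toy model of ChunkManager: the lookup throws when the supplied key
    // does not match the collection's current (possibly refined) pattern.
    struct ChunkManager {
        std::string keyPattern;  // "{a: 1, b: 1}" after refineShardKey

        void findIntersectingChunkWithSimpleCollation(const std::string& key) const {
            if (key != keyPattern)
                throw ShardKeyNotFoundException(
                    "key " + key + " does not match pattern " + keyPattern);
        }
    };

    int main() {
        // The recovery document was persisted before refineShardKey ran, so
        // it still carries bounds under the original pattern {a: 1}.
        ChunkManager cm{"{a: 1, b: 1}"};
        cm.findIntersectingChunkWithSimpleCollation("{a: 1}");  // throws; unhandled
    }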



 Comments   
Comment by Githook User [ 07/Apr/21 ]

Author: Tommaso Tocci <tommaso.tocci@mongodb.com> (toto-dev)

Message: SERVER-53973 Migration manager recovery should handle failed findIntersectingChunk during refineShardKey
Branch: v4.4
https://github.com/mongodb/mongo/commit/469b1101cea8cf0c5b34437cdeb391b7002f7d19

Comment by Githook User [ 30/Mar/21 ]

Author: Tommaso Tocci <tommaso.tocci@mongodb.com> (toto-dev)

Message: SERVER-53973 Migration manager recovery should handle failed findIntersectingChunk during refineShardKey
Branch: master
https://github.com/mongodb/mongo/commit/1256d8e67a6b4bb88875cbc112987bbc8fffe76b

Comment by Tommaso Tocci [ 27/Mar/21 ]

kaloian.manassiev was right about the race scenario; I've been able to reproduce it by adding a couple of sleeps. Here is a patch to reproduce the error.

Comment by Kaloian Manassiev [ 09/Mar/21 ]

I think the problem is just that the following race condition can happen:

  1. A migration starts: it takes the dist lock, then writes a recovery document with shard key {a: 1}, which majority-commits (so there is actually no ordering issue)
  2. That same node steps down and then steps up again, but on the config server the dist locks are thrown out unconditionally
  3. Before the Balancer recovery has managed to run and re-acquire the dist locks, a refineShardKey sneaks in
  4. Now the Balancer recovery is working with the wrong shard key (see the sketch below)
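
A toy sequential model of that interleaving (all names and state here are invented for illustration; the real flow involves replica set step-down/step-up and the Balancer recovery thread):

    #include <iostream>
    #include <string>

    int main() {
        std::string currentKeyPattern = "{a: 1}";

        // 1. Migration takes the dist lock and majority-commits a recovery
        //    document carrying bounds under the original pattern.
        bool distLockHeld = true;
        std::string recoveryDocPattern = currentKeyPattern;

        // 2. Step-down/step-up: the config server throws out dist locks
        //    unconditionally, even though recovery has not run yet.
        distLockHeld = false;

        // 3. Window: the lock is free, so a refineShardKey sneaks in before
        //    the Balancer recovery can re-acquire it.
        if (!distLockHeld)
            currentKeyPattern = "{a: 1, b: 1}";

        // 4. Recovery now holds a document whose key no longer matches.
        std::cout << "recovery doc pattern: " << recoveryDocPattern
                  << ", current pattern: " << currentKeyPattern << '\n';
    }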


I think it will be extremely difficult to fix the distributed lock unlocking issue, so instead we should just address the crash by try/catching the chunk lookup in the ChunkManager and abandoning that migration recovery in that case.
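
A minimal sketch of that approach, again with stand-in types rather than the real server API (recoverMigration and its cleanup are hypothetical placeholders for the actual recovery path):

    #include <iostream>
    #include <stdexcept>
    #include <string>

    struct ShardKeyNotFoundException : std::runtime_error {
        using std::runtime_error::runtime_error;
    };

    struct Chunk {};

    struct ChunkManager {
        std::string keyPattern;

        Chunk findIntersectingChunkWithSimpleCollation(const std::string& key) const {
            if (key != keyPattern)
                throw ShardKeyNotFoundException("stale shard key: " + key);
            return Chunk{};
        }
    };

    // Hypothetical recovery step: look up the chunk, but treat a stale shard
    // key as "this migration can no longer be recovered" instead of crashing.
    void recoverMigration(const ChunkManager& cm, const std::string& docKey) {
        try {
            Chunk chunk = cm.findIntersectingChunkWithSimpleCollation(docKey);
            (void)chunk;  // ... reschedule the migration using the chunk ...
        } catch (const ShardKeyNotFoundException& ex) {
            // The recovery document predates a refineShardKey; abandon this
            // recovery rather than letting the exception escape.
            std::cerr << "abandoning migration recovery: " << ex.what() << '\n';
        }
    }

    int main() {
        recoverMigration(ChunkManager{"{a: 1, b: 1}"}, "{a: 1}");
    }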

Comment by Tommaso Tocci [ 04/Mar/21 ]

I'll bounce this back to alex.taskov to add more context.
