[SERVER-61879] Refreshes to recover migrations must never join ongoing refreshes Created: 03/Dec/21 Updated: 29/Oct/23 Resolved: 10/Jan/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 5.2.0, 5.0.5, 5.1.1 |
| Fix Version/s: | 5.0.8, 5.3.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Pierlauro Sciarelli | Assignee: | Jordi Serra Torrens |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | ALL | ||||||||||||
| Backport Requested: | v5.0 | ||||||||||||
| Sprint: | Sharding EMEA 2021-12-27, Sharding EMEA 2022-01-10 | ||||||||||||
| Participants: | |||||||||||||
| Case: | (copied to CRM) | ||||||||||||
| Linked BF Score: | 134 | ||||||||||||
| Description |
|
On step-up, more specifically during drain mode, a thread calling into onShardVersionMismatch is spawned to recover potentially outstanding migrations. The implementation of onShardVersionMismatch assumes that no other refresh can be running during drain mode, because user requests are not served. This assumption is incorrect: previously spawned refreshes are not killed on step-down/step-up, because they run on a different thread than the command that spawned them. The recovery on a primary node can therefore join a refresh that started while the node was a secondary, which skips the recovery. |
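The race described above can be illustrated with a deliberately simplified, single-threaded model. This is only a sketch and not the server's code: `Refresh`, `CollectionState`, `refreshJoining` and `refreshForRecovery` are hypothetical names invented for this illustration, and the assumption that a refresh started on a secondary performs no migration recovery is taken from the description.

```cpp
#include <cassert>
#include <memory>

// Hypothetical model of an in-flight refresh. None of these names are
// MongoDB's real API; they exist only to illustrate the ticket.
struct Refresh {
    explicit Refresh(bool recovers) : recoversMigrations(recovers) {}
    bool recoversMigrations;  // assumed true only when started on a primary
    bool done = false;        // stays false here: the refresh is still running
};

class CollectionState {
public:
    // Problematic policy: join whatever refresh is already in flight.
    std::shared_ptr<Refresh> refreshJoining(bool isPrimary) {
        if (_inFlight && !_inFlight->done) {
            return _inFlight;  // may be a refresh started while secondary
        }
        _inFlight = std::make_shared<Refresh>(isPrimary);
        return _inFlight;
    }

    // Policy from the ticket title: a recovery refresh never joins an
    // ongoing refresh and always starts its own.
    std::shared_ptr<Refresh> refreshForRecovery() {
        _inFlight = std::make_shared<Refresh>(true);
        return _inFlight;
    }

private:
    std::shared_ptr<Refresh> _inFlight;
};

// A refresh starts while the node is secondary, then step-up joins it.
bool recoveryRunsWhenJoining() {
    CollectionState css;
    css.refreshJoining(false);          // started as secondary, still running
    auto r = css.refreshJoining(true);  // drain-mode recovery joins it
    return r->recoversMigrations;
}

// Same scenario, but the recovery refresh refuses to join.
bool recoveryRunsWhenNotJoining() {
    CollectionState css;
    css.refreshJoining(false);
    auto r = css.refreshForRecovery();
    return r->recoversMigrations;
}
```

In this model `recoveryRunsWhenJoining()` returns false, because the step-up caller is handed the refresh that began on the secondary, while `recoveryRunsWhenNotJoining()` returns true.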
| Comments |
| Comment by Githook User [ 08/Apr/22 ] |
|
Author: Jordi Serra Torrens <jordi.serra-torrens@mongodb.com> (username: jordist) Message: |
| Comment by Githook User [ 10/Jan/22 ] |
|
Author: Jordi Serra Torrens <jordi.serra-torrens@mongodb.com> (username: jordist) Message: |
| Comment by Kaloian Manassiev [ 16/Dec/21 ] |
|
I think the best way to address this would be to make the semantics of CSS::clearFilteringMetadata() match those of ReadThroughCache::invalidate. In other words, once clearFilteringMetadata is called, it is guaranteed that the effects of any concurrently running onShardVersionMismatch/recoverRefreshShardVersion will not become visible. |
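One common way to obtain the guarantee proposed above is a generation counter: each refresh records the generation it started under, and its result is discarded if a clear happened in the meantime. The sketch below assumes that approach for illustration only. It is not how ReadThroughCache or CSS is actually implemented, and `FilteringMetadataSlot`, `beginRefresh` and `tryInstall` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical holder for filtering metadata (modelled here as an int).
class FilteringMetadataSlot {
public:
    // A refresh calls this when it starts and keeps the returned token.
    std::uint64_t beginRefresh() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _generation;
    }

    // A refresh calls this when it finishes. The result is installed only
    // if no clear() happened since the token was taken.
    bool tryInstall(std::uint64_t token, int metadata) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (token != _generation) {
            return false;  // refresh raced with clear(); drop its result
        }
        _metadata = metadata;
        _hasMetadata = true;
        return true;
    }

    // Analogue of clearFilteringMetadata(): invalidates every refresh
    // that is currently in flight by bumping the generation.
    void clear() {
        std::lock_guard<std::mutex> lk(_mutex);
        ++_generation;
        _hasMetadata = false;
    }

    bool hasMetadata() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _hasMetadata;
    }

private:
    mutable std::mutex _mutex;
    std::uint64_t _generation = 0;
    bool _hasMetadata = false;
    int _metadata = 0;
};

// A refresh that started before clear() must not become visible.
bool staleRefreshIsDiscarded() {
    FilteringMetadataSlot slot;
    std::uint64_t token = slot.beginRefresh();  // e.g. started as secondary
    slot.clear();                               // e.g. cleared on step-up
    bool installed = slot.tryInstall(token, 42);
    return !installed && !slot.hasMetadata();
}

// A refresh that started after clear() installs normally.
bool freshRefreshIsInstalled() {
    FilteringMetadataSlot slot;
    slot.clear();
    std::uint64_t token = slot.beginRefresh();
    return slot.tryInstall(token, 42) && slot.hasMetadata();
}
```

With this shape, a recovery started after the clear never observes the outcome of an older refresh, whichever thread that older refresh happens to finish on.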