[SERVER-61879] Refreshes to recover migrations must never join ongoing refreshes Created: 03/Dec/21  Updated: 29/Oct/23  Resolved: 10/Jan/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.2.0, 5.0.5, 5.1.1
Fix Version/s: 5.0.8, 5.3.0

Type: Bug Priority: Major - P3
Reporter: Pierlauro Sciarelli Assignee: Jordi Serra Torrens
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Sprint: Sharding EMEA 2021-12-27, Sharding EMEA 2022-01-10
Participants:
Case:
Linked BF Score: 134

 Description   

On step-up (more specifically, during drain mode), a thread calling into onShardVersionMismatch is spawned to recover any outstanding migrations.

The implementation of onShardVersionMismatch assumes that, during drain mode, no other refresh can be running because user requests are not served. However, this assumption is incorrect: previously spawned refreshes are not killed on step-down/step-up, because they run on a different thread than the command that spawned them.

As a result, the recovery on a primary node can join a refresh that started while the node was still a secondary, in which case the migration recovery is skipped.
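
The scenario above can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyShard, Refresh, may_join_ongoing and recovery_ran_after_step_up are hypothetical names, not the server's classes or parameters, and the model assumes that only a refresh started on a primary performs migration recovery.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Refresh:
    """Toy stand-in for one in-flight filtering metadata refresh."""
    started_as_primary: bool

    @property
    def recovers_migrations(self) -> bool:
        # Toy-model assumption: only a refresh started on a primary
        # performs migration recovery.
        return self.started_as_primary


class ToyShard:
    """Hypothetical model of a shard node; not the server's actual code."""

    def __init__(self) -> None:
        self.is_primary = False
        self.ongoing: Optional[Refresh] = None

    def start_refresh(self) -> Refresh:
        self.ongoing = Refresh(started_as_primary=self.is_primary)
        return self.ongoing

    def on_shard_version_mismatch(self, may_join_ongoing: bool) -> Refresh:
        if self.ongoing is not None:
            if may_join_ongoing:
                # Buggy path: piggyback on whatever refresh is in flight.
                return self.ongoing
            # Intended path: let the old refresh finish (simulated here by
            # dropping it), then start a refresh of our own.
            self.ongoing = None
        return self.start_refresh()


def recovery_ran_after_step_up(may_join_ongoing: bool) -> bool:
    shard = ToyShard()
    shard.start_refresh()    # spawned while secondary, still running
    shard.is_primary = True  # step-up; the old refresh is not killed
    refresh = shard.on_shard_version_mismatch(may_join_ongoing)
    return refresh.recovers_migrations
```

In this model, recovery_ran_after_step_up(True) reproduces the skipped recovery, while recovery_ran_after_step_up(False) corresponds to the behaviour named in the ticket title: the recovery refresh never joins an ongoing one.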



 Comments   
Comment by Githook User [ 08/Apr/22 ]

Author: Jordi Serra Torrens <jordi.serra-torrens@mongodb.com> (jordist)

Message: SERVER-61879 Refreshes to recover migrations must never join ongoing refreshes
Branch: v5.0
https://github.com/mongodb/mongo/commit/c4032e9512447e029fdddbc770826229e871f810

Comment by Githook User [ 10/Jan/22 ]

Author: Jordi Serra Torrens <jordi.serra-torrens@mongodb.com> (jordist)

Message: SERVER-61879 Refreshes to recover migrations must never join ongoing refreshes
Branch: master
https://github.com/mongodb/mongo/commit/a4abad42efb8a44b4b3fa40831e50060530b9938

Comment by Kaloian Manassiev [ 16/Dec/21 ]

I think the best way to address this would be to make the semantics of CSS::clearFilteringMetadata() match those of ReadThroughCache::invalidate. In other words, it would be guaranteed that once clearFilteringMetadata is called, the effects of any concurrently running onShardVersionMismatch/recoverRefreshShardVersion will not become visible.
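
One common way to obtain that kind of guarantee is a generation counter: a refresh records the generation when it starts and may only publish its result if the generation is unchanged when it finishes. The sketch below illustrates that general technique only. FilteringMetadataSlot and its methods are hypothetical names, and nothing here is taken from the actual ReadThroughCache or CollectionShardingState implementation. It is single-threaded for clarity; a real version would need to guard the counter and the value with a lock.

```python
from typing import Optional


class FilteringMetadataSlot:
    """Hypothetical holder for cached metadata; not the server's code."""

    def __init__(self) -> None:
        self._generation = 0
        self._value: Optional[str] = None

    def begin_refresh(self) -> int:
        # A refresh remembers the generation it started under.
        return self._generation

    def complete_refresh(self, token: int, value: str) -> bool:
        # Publish only if no clear() happened since begin_refresh().
        if token != self._generation:
            return False
        self._value = value
        return True

    def clear(self) -> None:
        # Invalidate the cached value and every refresh still in flight.
        self._generation += 1
        self._value = None

    def get(self) -> Optional[str]:
        return self._value
```

With this shape, a refresh that began before clear() finds its token stale in complete_refresh() and its result is discarded, whereas a refresh started after clear() publishes normally.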
