[SERVER-77309] An interleaving might cause a migration to continue when it shouldn't
Created: 19/May/23  Updated: 29/Oct/23  Resolved: 26/May/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 6.0.0, 6.0.1, 6.0.2, 6.0.3, 6.0.4, 6.3.0, 7.0 Required, 6.0.5, 6.0.6, 6.3.1
Fix Version/s: 7.1.0-rc0, 6.0.7, 7.0.0-rc3

Type: Bug
Priority: Major - P3
Reporter: Marcos José Grillo Ramirez
Assignee: Marcos José Grillo Ramirez
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v7.0, v6.0
Sprint: Sharding EMEA 2023-05-29
Participants:
Linked BF Score: 135

 Description   

Currently we take an exclusive CSR lock at the beginning of the migration to atomically check the allowMigrations metadata flag and then set the ScopedRegisterer for the migration. Additionally, the refresh code takes a shared CSR lock to abort any ongoing migration registered in the migration's decoration. However, that shared lock goes out of scope before the lock is taken again in exclusive mode to install the new metadata, which makes the following interleaving possible:

Suppose we have two threads, thread1 and thread2. thread1 starts executing a migration command, and thread2 starts a refresh triggered as part of the setAllowMigrations code (which could be the result of a DDL that used the stopMigration helper).

1. thread1 executes the migration's refresh, but does not see the setAllowMigrations commit.
2. A race for the CSR lock happens: thread1 goes for the migration CSR lock and thread2 goes for the refresh CSR lock. thread2 wins.
3. In the refresh, thread2 checks the migration decoration but does not find any migration to abort.
4. A second race for the CSR lock happens: thread1 goes again for the migration CSR lock and thread2 goes for the metadata installation CSR lock. thread1 wins and, because of step 1, the allowMigrations check passes, allowing the migration to continue.

The condition described in step 4 could cause a migration to acquire the critical section while a DDL requires it (for example, a rename participant might try to acquire the critical section when the migration already holds it).
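
The interleaving above can be replayed deterministically in a small model. This is only an illustrative sketch, not server code: the `Csr` class, its fields, and `replay_interleaving` are hypothetical names, and the lock acquisitions are represented by comments because the steps run in a fixed order on a single thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Csr:
    """Toy stand-in for the per-collection sharding state (hypothetical model)."""
    allow_migrations: bool = True               # flag in the installed metadata
    registered_migration: Optional[str] = None  # stand-in for the migration decoration
    aborted: bool = False


def replay_interleaving() -> Csr:
    csr = Csr()

    # Step 1: thread1's refresh ran before the setAllowMigrations commit,
    # so the installed metadata still allows migrations.

    # Steps 2-3: thread2 wins the first race (shared lock) and looks for
    # a migration to abort. None is registered yet.
    if csr.registered_migration is not None:
        csr.aborted = True
    # The shared lock goes out of scope here.

    # Step 4: thread1 wins the second race (exclusive lock). The flag check
    # passes against the stale metadata, so the migration is registered.
    if csr.allow_migrations:
        csr.registered_migration = "thread1-migration"

    # thread2 then takes the exclusive lock and installs the new metadata
    # without looking at the decoration again.
    csr.allow_migrations = False
    return csr


result = replay_interleaving()
```

In this model `result` ends up with a registered, non-aborted migration even though the installed metadata now disallows migrations, which is the inconsistent state the description refers to.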

We could leave the initial migration check in the refresh as an optimistic verification, but we need to re-check for migrations while holding the exclusive lock and before installing the new metadata.
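
A minimal sketch of that idea, under stated assumptions: a single `threading.Lock` stands in for both the shared and exclusive CSR lock modes, the abort is modeled as clearing a field, and every name here (`Csr`, `try_register_migration`, `install_refreshed_metadata`, and the `between_locks` hook used only to force the racy ordering) is hypothetical rather than taken from the server code.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Csr:
    """Toy stand-in for the per-collection sharding state (hypothetical model)."""
    allow_migrations: bool = True
    registered_migration: Optional[str] = None
    aborted_migrations: List[str] = field(default_factory=list)
    lock: threading.Lock = field(default_factory=threading.Lock)


def try_register_migration(csr: Csr, name: str) -> bool:
    """Migration side: check the flag and register atomically under the lock."""
    with csr.lock:
        if not csr.allow_migrations or csr.registered_migration is not None:
            return False
        csr.registered_migration = name
        return True


def _abort_if_disallowed(csr: Csr, allow_migrations: bool) -> None:
    # Caller must hold csr.lock.
    if not allow_migrations and csr.registered_migration is not None:
        csr.aborted_migrations.append(csr.registered_migration)
        csr.registered_migration = None


def install_refreshed_metadata(
    csr: Csr,
    allow_migrations: bool,
    between_locks: Callable[[], object] = lambda: None,
) -> None:
    """Refresh side: optimistic check, then re-check while installing."""
    with csr.lock:  # optimistic verification (shared lock in the description)
        _abort_if_disallowed(csr, allow_migrations)

    between_locks()  # window in which a migration can still register

    with csr.lock:  # exclusive lock in the description
        _abort_if_disallowed(csr, allow_migrations)  # the added re-check
        csr.allow_migrations = allow_migrations


# Force the problematic ordering: the migration registers in the window.
csr = Csr()
registered = []
install_refreshed_metadata(
    csr,
    allow_migrations=False,
    between_locks=lambda: registered.append(try_register_migration(csr, "m1")),
)
late_attempt = try_register_migration(csr, "m2")
```

Because the re-check and the metadata installation happen under the same lock acquisition, a migration that slipped in during the window is aborted before the new flag becomes visible, and any later registration attempt sees the updated flag.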



 Comments   
Comment by Githook User [ 06/Jun/23 ]

Author:

{'name': 'Marcos José Grillo Ramirez', 'email': 'marcos.grillo@mongodb.com', 'username': 'm4nti5'}

Message: SERVER-77309 Change db and collection locks to IX before waiting for migrations to be aborted during refresh
Branch: v6.0
https://github.com/mongodb/mongo/commit/1e289796f33690c55227bbc7a657612d97b80ed3

Comment by Githook User [ 06/Jun/23 ]

Author:

{'name': 'Marcos José Grillo Ramirez', 'email': 'marcos.grillo@mongodb.com', 'username': 'm4nti5'}

Message: SERVER-77309 Add check to abort ongoing migration inside refresh's exclusive CSR lock

(cherry picked from commit 533eae2e276298a287a47458ee48d4c481d01788)
Branch: v6.0
https://github.com/mongodb/mongo/commit/74fadf1f2156d94fe497047e0b2ca1740fd1fd4a

Comment by Githook User [ 29/May/23 ]

Code review:
SERVER-77309 Add check to abort ongoing migration inside refresh's exclusive CSR lock
https://github.com/mongodb/mongo/pull/1550
Base branch: v6.0

Comment by Githook User [ 26/May/23 ]

Author:

{'name': 'Marcos José Grillo Ramirez', 'email': 'marcos.grillo@mongodb.com', 'username': 'm4nti5'}

Message: SERVER-77309 Add check to abort ongoing migration inside refresh's exclusive CSR lock

(cherry picked from commit 533eae2e276298a287a47458ee48d4c481d01788)
Branch: v7.0
https://github.com/mongodb/mongo/commit/e02c4a0f9b604a88973cbd900be04d7380e9bb89

Comment by Githook User [ 26/May/23 ]

Author:

{'name': 'Marcos José Grillo Ramirez', 'email': 'marcos.grillo@mongodb.com', 'username': 'm4nti5'}

Message: SERVER-77309 Add check to abort ongoing migration inside refresh's exclusive CSR lock
Branch: master
https://github.com/mongodb/mongo/commit/533eae2e276298a287a47458ee48d4c481d01788
