[SERVER-62340] Tenant Migration can lead to leakage of "TenantMigrationBlockerAsync" threads. Created: 04/Jan/22 Updated: 29/Oct/23 Resolved: 14/Jan/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 5.3.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Suganthi Mani | Assignee: | Didier Nadeau |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Server Serverless 2022-01-24 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 15 | ||||||||
| Description |
|
Investigating the BF revealed that the tenant migration donor code can leak "TenantMigrationBlockerAsync" threads. Consider the scenario below: 1) Donor starts migration for tenant foo. This results in a call to the TenantMigrationDonorAccessBlocker destructor, which in turn calls the destructor of _asyncBlockingOperationsExecutor (this thread pool executor is shared by all donor access blockers and is destroyed when no access blockers exist). That destructor shuts down the executor and waits for it to join. But the executor's join() is blocked waiting for the current "TenantMigrationBlockerAsync-X" thread to exit, and that same thread is the one waiting for the executor's _join() to complete, leading to a self-deadlock and leaked "TenantMigrationBlockerAsync" threads. Note: The same problem exists on the recipient side as well. |
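The self-join described above can be reproduced in miniature. The sketch below is not the server code: `TinyExecutor`, `selfJoinSeen`, `orphan` and `lastReferenceDroppedOnWorker` are hypothetical names, and a single `std::thread` stands in for the thread pool. It only illustrates how releasing the last owner of an executor from one of the executor's own threads makes the destructor try to join the calling thread. Instead of deadlocking, the sketch detects the case and hands the thread handle to the caller.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical one-thread executor; a stand-in, not MongoDB's ThreadPool.
struct TinyExecutor {
    std::thread worker;
    bool* selfJoinSeen = nullptr;   // set when the destructor runs on `worker`
    std::thread* orphan = nullptr;  // receives the handle so another thread can join it

    ~TinyExecutor() {
        if (!worker.joinable()) return;
        assert(selfJoinSeen && orphan);
        if (worker.get_id() == std::this_thread::get_id()) {
            // A plain worker.join() here would be the self-deadlock:
            // the thread would wait for its own exit.
            *selfJoinSeen = true;
            *orphan = std::move(worker);
        } else {
            worker.join();
        }
    }
};

// Returns true if the executor was destroyed on its own worker thread.
bool lastReferenceDroppedOnWorker() {
    bool selfJoinSeen = false;
    std::thread orphan;
    std::promise<std::shared_ptr<TinyExecutor>> handoff;
    std::promise<void> done;
    std::future<void> finished = done.get_future();

    auto exec = std::make_shared<TinyExecutor>();
    exec->selfJoinSeen = &selfJoinSeen;
    exec->orphan = &orphan;
    exec->worker = std::thread(
        [owner = handoff.get_future(), done = std::move(done)]() mutable {
            {
                // Plays the role of the access blocker: sole owner of the executor.
                std::shared_ptr<TinyExecutor> blocker = owner.get();
            }  // last reference released here, on the worker thread
            done.set_value();
        });
    handoff.set_value(std::move(exec));  // the calling thread gives up its reference

    finished.wait();
    if (orphan.joinable()) orphan.join();  // join from a thread that is allowed to
    return selfJoinSeen;
}
```

Calling `lastReferenceDroppedOnWorker()` is expected to report that the destructor ran on the worker thread, which is the situation in which an unconditional join can never return.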
| Comments |
| Comment by Githook User [ 14/Jan/22 ] | |||||||
|
Author: Didier Nadeau <didier.nadeau@mongodb.com> (nadeaudi) Message: | |||||||
| Comment by Esha Maharishi (Inactive) [ 12/Jan/22 ] | |||||||
|
Sounds good. Just a note that I'm not sure the first bullet would be an issue in practice. I didn't get to mention this on Zoom, but there are existing places where we block user threads and unblock them all at once, such as on exiting the sharding migration critical section. | |||||||
| Comment by Didier Nadeau [ 12/Jan/22 ] | |||||||
|
Following a discussion with esha.maharishi and suganthi.mani, we decided to go ahead with Suganthi's idea to move the ownership of the ThreadPool to `TenantMigrationAccessBlockerRegistry` and make it a `shared_ptr`. As this means the pool will always exist, we will set `minThread` to 0 so that no threads exist when there is no migration, which avoids any impact on non-serverless instances. We decided not to integrate the future returned by `checkIfCanReadOrBlock` into the caller's future chain, for two reasons:
| |||||||
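The ownership change described in this comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual patch: `OnDemandPool`, `Registry`, `Blocker`, `runAsync`, `waitIdle` and `liveThreads` are hypothetical names, and the pool is simplified to at most one worker that is started on demand and exits when the queue is empty (the effect of a minimum thread count of 0). The point is that a long-lived registry holds the `shared_ptr`, so destroying a blocker never destroys the pool.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical pool with a minimum of 0 threads and a maximum of 1.
class OnDemandPool {
public:
    // Assumed to run only on the registry's owner thread, never on the worker.
    ~OnDemandPool() { waitIdle(); }

    void schedule(std::function<void()> task) {
        std::lock_guard<std::mutex> lk(_mutex);
        _tasks.push_back(std::move(task));
        if (_running) return;                    // the live worker will pick it up
        if (_worker.joinable()) _worker.join();  // reap the previous, finished worker
        _running = true;
        _worker = std::thread([this] { _drain(); });
    }

    // Blocks until the current worker (if any) has drained the queue and exited.
    void waitIdle() {
        std::thread w;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            w = std::move(_worker);
        }
        if (w.joinable()) w.join();
    }

    int liveThreads() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _running ? 1 : 0;
    }

private:
    void _drain() {
        std::unique_lock<std::mutex> lk(_mutex);
        while (!_tasks.empty()) {
            auto task = std::move(_tasks.front());
            _tasks.pop_front();
            lk.unlock();
            task();  // run the task without holding the lock
            lk.lock();
        }
        _running = false;  // idle: the worker exits, leaving zero threads
    }

    std::mutex _mutex;
    std::deque<std::function<void()>> _tasks;
    std::thread _worker;
    bool _running = false;
};

// Hypothetical registry: the long-lived owner of the shared executor.
class Registry {
public:
    std::shared_ptr<OnDemandPool> executor() const { return _executor; }
    long executorUseCount() const { return _executor.use_count(); }

private:
    std::shared_ptr<OnDemandPool> _executor = std::make_shared<OnDemandPool>();
};

// Hypothetical access blocker: borrows the executor and never owns it alone.
class Blocker {
public:
    explicit Blocker(std::shared_ptr<OnDemandPool> executor)
        : _executor(std::move(executor)) {
        assert(_executor);
    }
    void runAsync(std::function<void()> task) { _executor->schedule(std::move(task)); }

private:
    std::shared_ptr<OnDemandPool> _executor;
};
```

With this layout, a blocker going out of scope only decrements the reference count, so the pool destructor cannot run on a pool thread as long as the registry outlives all blockers and is itself destroyed elsewhere.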
| Comment by Suganthi Mani [ 11/Jan/22 ] | |||||||
|
didier.nadeau To my understanding, there will be an active POS executor only on the primary. But we maintain the access blocker and its executor on both the primary and the secondaries. So I think idea #2 (retrieve the access blocker's executor and run a simple lambda on the POS's executor) probably won't work. | |||||||
| Comment by Didier Nadeau [ 11/Jan/22 ] | |||||||
|
Some ideas :
| |||||||
| Comment by Esha Maharishi (Inactive) [ 10/Jan/22 ] | |||||||
|
didier.nadeau and matt.broadstone to triage along with Implement Split work tomorrow. |