[SERVER-53168] Support 50 concurrent migrations on a single recipient Created: 01/Dec/20  Updated: 29/Oct/23  Resolved: 17/Feb/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Task Priority: Major - P3
Reporter: Suganthi Mani Assignee: Lingzhi Deng
Resolution: Fixed Votes: 0
Labels: pm-1791_milestone-B
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-54090 SSLConfiguration use after free when ... Closed
depends on SERVER-54328 Refactor creation of transient SSLCon... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2021-02-08, Repl 2021-02-22
Participants:

 Description   

Currently the tenant migration recipient thread pool has a default size of 8 (it is tunable through a server startup parameter). For each migration, recipient-side components such as the oplog fetcher and the cloner do synchronous work (fetching data from the remote donor node) on a tenant migration recipient thread without yielding it. With the default pool size of 8, at most 3 concurrent migrations can safely be initiated on the recipient side (per migration, 2 threads for synchronous jobs + 1 thread for the asynchronous job). Beyond that, concurrent tenant migrations can completely stall all active tenant migrations on the recipient side.
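The "at most 3" figure can be reproduced with a small calculation. This is only a sketch of one reading of the numbers above: it assumes every migration pins 2 threads for synchronous work and that 1 thread must stay free for asynchronous jobs. The function name and its parameters are hypothetical and are not server code.

```python
def max_safe_migrations(pool_size, sync_threads_per_migration=2, reserved_async_threads=1):
    """Estimate how many migrations fit in the pool without starving async work.

    Assumption (not taken from the server code): each migration holds
    `sync_threads_per_migration` threads for its oplog fetcher and cloner,
    and `reserved_async_threads` threads must remain free for async jobs.
    """
    usable = pool_size - reserved_async_threads
    if usable < sync_threads_per_migration:
        return 0
    return usable // sync_threads_per_migration


# Default pool size from the description.
print(max_safe_migrations(8))
# Pool size used in the stall scenario.
print(max_safe_migrations(4))
```

Under these assumptions a pool of 8 threads yields 3 migrations and a pool of 4 yields only 1, which is why four concurrent migrations cannot make progress in the smaller pool.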

Consider the case where the tenant migration recipient thread pool size is 4.
1) Assume the recipient has received the recipientSyncData command for migration ids 1, 2 and 3, and all of them have started the oplog fetcher and are at runQuery(). At this point, only one free worker thread is left in the tenant migration recipient thread pool.
2) Now the recipient receives the recipientSyncData command for migration id 4, which is able to start its oplog fetcher on that last free thread.

So now there are no free worker threads left in the tenant migration recipient thread pool to start the cloner. All 4 tenant migrations hang on the recipient side until one migration is explicitly cancelled using the ForgetMigration command.
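The scenario can be replayed with a toy model. The sketch below uses no real threads and is not the server's scheduler: it assumes each migration first takes one pool thread for its oplog fetcher and never yields it, and only afterwards asks for a second thread for its cloner. The function name is hypothetical.

```python
def simulate_recipient_pool(pool_size, migrations):
    """Return (cloning, stalled) lists of migration ids.

    Toy model, assuming all oplog fetchers start before any cloner and
    that a fetcher never gives its thread back.
    """
    free = pool_size
    fetching = []
    for migration_id in range(1, migrations + 1):
        if free > 0:  # the oplog fetcher grabs a worker thread and keeps it
            free -= 1
            fetching.append(migration_id)

    cloning = []
    for migration_id in fetching:
        if free > 0:  # the cloner needs a second worker thread
            free -= 1
            cloning.append(migration_id)

    stalled = [m for m in range(1, migrations + 1) if m not in cloning]
    return cloning, stalled


cloning, stalled = simulate_recipient_pool(pool_size=4, migrations=4)
print("cloning:", cloning, "stalled:", stalled)
```

With a pool of 4 and 4 migrations, no migration obtains a cloner thread, so every migration is stalled. With a single migration the same pool has room for both its fetcher and its cloner.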



 Comments   
Comment by Githook User [ 17/Feb/21 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-53168: Set maxTenantMigrationDonorThreadPoolSize default to 128
Branch: master
https://github.com/mongodb/mongo/commit/8bf7eabbeeb393d30f6dfbbf941505a6d282de70

Comment by Andrew Shuvalov (Inactive) [ 05/Feb/21 ]

Added SERVER-54328 as a dependency. This is the real problem, and it became apparent only thanks to your stress test. It is pretty unfortunate that we understood the problem only at the testing phase. I have already started the refactoring.

Comment by Lingzhi Deng [ 18/Dec/20 ]

In our discussion with Cloud, we decided to support <= 50 concurrent migrations on a single recipient set (at least for private beta). We will revisit this and refactor the OplogFetcher and cloner code if needed. So I think we can change the thread pool size to 150 for now.

Comment by Suganthi Mani [ 01/Dec/20 ]

I think we need a solution such as throttling migrations at the command layer on the recipient side, i.e. make the recipientSyncData command wait if there are already 3 active concurrent migrations in progress (for the default thread pool size of 8) before asking POS to start a new migration.
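A minimal sketch of the throttling idea, assuming a counting semaphore guards admission at the command layer. The class and method names are hypothetical; the real server is C++ and this is not how recipientSyncData is implemented.

```python
import threading


class MigrationThrottle:
    """Admit at most `max_active` migrations; later callers wait for a slot."""

    def __init__(self, max_active):
        self._slots = threading.BoundedSemaphore(max_active)

    def try_start(self, timeout=None):
        # Blocks until a slot frees up, or returns False once `timeout` elapses.
        return self._slots.acquire(timeout=timeout)

    def finish(self):
        # Called when a migration completes or is forgotten.
        self._slots.release()


throttle = MigrationThrottle(max_active=3)  # 3 = limit quoted for a pool of 8
admitted = [throttle.try_start(timeout=0.01) for _ in range(4)]
print(admitted)  # the fourth request times out instead of taking a thread
throttle.finish()  # one migration ends, so a slot is free again
print(throttle.try_start(timeout=0.01))
```

The point of the design is that a migration beyond the limit waits before it consumes any pool thread, so the migrations already running can always obtain the threads they need to finish.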

Generated at Thu Feb 08 05:30:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.