[SERVER-76621] Thread pool task executor can cause memory leak when handling exhaust command. Created: 27/Apr/23  Updated: 29/Oct/23  Resolved: 02/Jun/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 6.0.7, 7.0.0-rc3

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Jason Chan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0, v6.0
Sprint: Service Arch 2023-05-15, Service Arch 2023-05-29, Service Arch 2023-06-12
Participants:
Linked BF Score: 0

 Description   

Thread pool task executor when it handles exhaust commands can leak resources. We noticed the leakage with SingleServerDiscoveryMonitor which is the only consumer at this moment which uses thread pool task executor to schedule exhaust commands . Below is the racy sequence with exhaust command workflow.
 
1.  Thread 1: Starts processing the streamable hello cmd response  and runs line 842, and returns true (i.e, cbState is neither canceled nor finished at this point);  and the thread pauses.

2. Thread 2: Cancels the exhaust command which will cancel the cbState and  resets the callback in cbstate  to release any  resources the callback is holding.

3. Thread 1: Thread resumes, then acquires the executor mutex lock and sets the callback in  cbState   back to the original callback function. And, this can cause memory leak.

In the BF, we were ending up with indirect leak due to this circular reference:  SingleServerDiscoveryMonitor holds a reference to TaskExecutor::CallbackHandle and which in-turn holds a reference to ThreadPoolTaskExecutor::CallbackState  and that contains CallbackFn which holds a reference to SingleServerDiscoveryMonitor.



 Comments   
Comment by Githook User [ 07/Jun/23 ]

Author:

{'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}

Message: SERVER-76621 Clarify ThreadPoolTaskExecutor mutex acquisition rules

(cherry picked from commit 00ae304e83e4e2091262fcda04fa2bc8b8075bbd)
Branch: v6.0
https://github.com/mongodb/mongo/commit/b35e368f862fed371fdad1f3cd75920c82c37547

Comment by Githook User [ 05/Jun/23 ]

Author:

{'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}

Message: SERVER-76621 Clarify ThreadPoolTaskExecutor mutex acquisition rules

(cherry picked from commit 00ae304e83e4e2091262fcda04fa2bc8b8075bbd)
Branch: v7.0
https://github.com/mongodb/mongo/commit/9d961b95ce79500808bbe19131142d8a2d6b60bc

Comment by Githook User [ 01/Jun/23 ]

Author:

{'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}

Message: SERVER-76621 Clarify ThreadPoolTaskExecutor mutex acquisition rules
Branch: master
https://github.com/mongodb/mongo/commit/00ae304e83e4e2091262fcda04fa2bc8b8075bbd

Comment by Suganthi Mani [ 27/Apr/23 ]

Two ways to fix the problem: 

1) The fix is to run this block in a single critical section

2) Break the cycle, by reseting the  handles after shutting down the singleServerDiscoveryMonitor.

I prefer fix#1 as I feel fix#2 is sub-optimal as there are other ways to cancel the requests scheduled in thread pool task executor.

To be noted, we can also get into similar leak problem when callback to last response in the exhaust cursor  races with callback to previous response. I see lot of fragile places in thread pool task executor, example, if blocks like this where we may do some work thinking that if check is still valid.

Generated at Thu Feb 08 06:33:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.