- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: 8.1.0-rc0, 8.0.0, 8.2.0-rc0
- Component/s: None
- Replication
- ALL
SERVER-92236 and, more generally, SERVER-92333 describe the following problem: when a cancellation source is long-lived and futures returned by cancelToken.onCancel() keep being associated with it, the result is effectively a memory leak, because those futures consume memory that is not released under normal circumstances.
It appears that TransactionCoordinator is vulnerable to this problem:
- TransactionCoordinatorService is a singleton that creates a single cancellation source on step up, which only gets cancelled on step down.
- The cancellation source is used to create a token when constructing a TransactionCoordinator.
- That cancellation token later gets passed to WaitForMajorityService::waitUntilMajorityForWrite.
- WaitForMajorityService then associates a future with the cancellation token.
This leak happens for every coordinated (i.e. multi-shard) transaction.
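The accumulation described above can be illustrated with a minimal, self-contained sketch. It does not use the real server classes: `ToyCancellationSource`, `waitForMajorityToy`, `pendingAfterWaits`, and `pendingAfterCancel` are hypothetical names invented for this illustration, and the real `CancellationSource`, `CancellationToken`, and `WaitForMajorityService` have different APIs and internals. The sketch only assumes that each `onCancel()` registration keeps some state alive until the source is cancelled.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy stand-in for a cancellation source. This is NOT the
// mongo::CancellationSource API; it only models "callbacks registered via
// onCancel() stay alive until the source is cancelled".
class ToyCancellationSource {
public:
    // Registers a callback to run on cancel(). If already cancelled, runs it now.
    void onCancel(std::function<void()> cb) {
        if (_cancelled) {
            cb();
            return;
        }
        _callbacks.push_back(std::move(cb));
    }

    // Runs and releases every registered callback.
    void cancel() {
        _cancelled = true;
        auto callbacks = std::move(_callbacks);
        _callbacks.clear();
        for (auto& cb : callbacks) {
            cb();
        }
    }

    // Number of callbacks (and their captured state) still held.
    std::size_t pending() const {
        return _callbacks.size();
    }

private:
    bool _cancelled = false;
    std::vector<std::function<void()>> _callbacks;
};

// Hypothetical stand-in for one majority wait made on behalf of a
// coordinated transaction. The wait "completes" immediately, but the
// callback it registered on the long-lived source stays behind.
void waitForMajorityToy(ToyCancellationSource& source) {
    auto perWaitState = std::make_shared<std::vector<char>>(256);
    source.onCancel([perWaitState] { /* would fail the wait on cancellation */ });
}

// Models a source created on step up that serves n waits without a step down.
std::size_t pendingAfterWaits(std::size_t n) {
    ToyCancellationSource stepUpSource;
    for (std::size_t i = 0; i < n; ++i) {
        waitForMajorityToy(stepUpSource);
    }
    return stepUpSource.pending();
}

// Same as above, but the source is cancelled at the end (models step down).
std::size_t pendingAfterCancel(std::size_t n) {
    ToyCancellationSource stepUpSource;
    for (std::size_t i = 0; i < n; ++i) {
        waitForMajorityToy(stepUpSource);
    }
    stepUpSource.cancel();
    return stepUpSource.pending();
}
```

In this toy model, `pendingAfterWaits(1000)` returns 1000 and `pendingAfterCancel(1000)` returns 0: the retained state grows with the number of waits and is only released when the source is cancelled, which mirrors a source that lives from step up to step down.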
The stack trace associated with the leak is:
tcmalloc::tcmalloc_internal::SampleifyAllocation()
tcmalloc::tcmalloc_internal::alloc_small_sampled_hooks_or_perthread<>()
mongo::future_details::SharedStateImpl<>::addChild()
mongo::CancellationToken::onCancel()
mongo::WaitForMajorityServiceImplBase::waitUntilMajority()
mongo::(anonymous namespace)::waitForMajorityWithHangFailpoint()
mongo::unique_function<>::makeImpl<>()::SpecificImpl::call()
mongo::future_details::FutureImpl<>::then<>()::{lambda()#1}::operator()()
mongo::future_details::FutureImpl<>::generalImpl<>()
mongo::Promise<>::setWith<>()
mongo::unique_function<>::makeImpl<>()::SpecificImpl::call()
mongo::unique_function<>::makeImpl<>()::SpecificImpl::call()
mongo::executor::ThreadPoolTaskExecutor::runCallback()
mongo::unique_function<>::makeImpl<>()::SpecificImpl::call()
mongo::ThreadPool::Impl::_doOneTask()
mongo::ThreadPool::Impl::_consumeTasks()
mongo::ThreadPool::Impl::_workerThreadBody()
std::thread::_State_impl<>::_M_run()
execute_native_thread_routine
A reproducer is attached.
is caused by
- SERVER-73915 TransactionCoordinatorService may stall primary step-up from completing when replica set shard steps down and back up quickly (Closed)
related to
- SERVER-92236 Chunk migrations should use short lived cancellation sources (Backlog)
- SERVER-92333 Audit use of long lived CancellationSources (Backlog)
- SERVER-103945 Better memory management for CancellationToken (Needs Scheduling)