Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.2.0-rc0
Affects Version/s: 8.1.0-rc0, 8.0.0, 8.2.0-rc0
Component/s: Replication
Labels: repl-shortlist
Assigned Teams: Replication
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v8.1, v8.0, v7.0
Sprint: Repl 2025-05-12, CAR Team 2025-06-09
Case:
Linked BF Score: 200
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

SERVER-92236 and, more generally, SERVER-92333 describe a problem with long-lived cancellation sources: if cancelToken.onCancel() futures keep being associated with such a source, the result is effectively a memory leak, because those futures consume memory that is not released under normal circumstances.

It appears that TransactionCoordinator is vulnerable to this problem:

- TransactionCoordinatorService is a singleton that creates a single cancellation source on step up; that source is only cancelled on step down.
- A token from that cancellation source is passed in when constructing each TransactionCoordinator.
- That cancellation token is later passed to WaitForMajorityService::waitUntilMajorityForWrite.
- WaitForMajorityService then associates a future with the cancellation token.

This leak happens for every coordinated (i.e. multi-shard) transaction.
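
To make the accumulation pattern concrete, here is a minimal, self-contained sketch. It does not use the server's real classes: ToyCancellationSource and simulateCoordinatedTransactions are hypothetical names invented for this illustration, and the toy only models the one property that matters here, namely that a callback registered via onCancel() stays allocated until the source is cancelled.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical toy model of a cancellation source. This is NOT
// mongo::CancellationSource; it only mimics the property that registered
// callbacks are retained until cancel() is called.
class ToyCancellationSource {
public:
    // Registers a callback to run on cancellation. If the source is already
    // cancelled, the callback runs immediately and nothing is retained.
    void onCancel(std::function<void()> callback) {
        if (_cancelled) {
            callback();
            return;
        }
        _callbacks.push_back(std::move(callback));
    }

    // Runs and releases every retained callback (the "step down" analogue).
    void cancel() {
        if (_cancelled) {
            return;
        }
        _cancelled = true;
        std::vector<std::function<void()>> toRun;
        toRun.swap(_callbacks);
        for (auto& cb : toRun) {
            cb();
        }
    }

    // Number of callbacks currently retained by this source.
    std::size_t pendingCallbacks() const {
        return _callbacks.size();
    }

private:
    bool _cancelled = false;
    std::vector<std::function<void()>> _callbacks;
};

// Models the reported pattern: every coordinated transaction registers one
// callback on the same long-lived source, and nothing removes it when the
// transaction finishes. Returns the number of callbacks retained afterwards.
std::size_t simulateCoordinatedTransactions(ToyCancellationSource& stepUpSource,
                                            int numTransactions) {
    for (int i = 0; i < numTransactions; ++i) {
        // Stand-in for the onCancel() registration made during the majority wait.
        stepUpSource.onCancel([] {});
    }
    return stepUpSource.pendingCallbacks();
}
```

In this toy model, calling simulateCoordinatedTransactions(source, 1000) on a fresh source leaves 1000 callbacks pending, and only source.cancel() brings the count back to 0. This mirrors the description above, where the memory is held until step down.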

The stack trace associated with the leak is:

tcmalloc::tcmalloc_internal::SampleifyAllocation()
tcmalloc::tcmalloc_internal::alloc_small_sampled_hooks_or_perthread<>()
mongo::future_details::SharedStateImpl<>::addChild()
mongo::CancellationToken::onCancel()
mongo::WaitForMajorityServiceImplBase::waitUntilMajority()
mongo::(anonymous namespace)::waitForMajorityWithHangFailpoint()
mongo::unique_function<>::makeImpl<>()::SpecificImpl::call()
mongo::future_details::FutureImpl<>::then<>()::{lambda()#1}::operator()()
mongo::future_details::FutureImpl<>::generalImpl<>()
mongo::Promise<>::setWith<>()
mongo::unique_function<>::makeImpl<>()::SpecificImpl::call()
mongo::unique_function<>::makeImpl<>()::SpecificImpl::call()
mongo::executor::ThreadPoolTaskExecutor::runCallback()
mongo::unique_function<>::makeImpl<>()::SpecificImpl::call()
mongo::ThreadPool::Impl::_doOneTask()
mongo::ThreadPool::Impl::_consumeTasks()
mongo::ThreadPool::Impl::_workerThreadBody()
std::thread::_State_impl<>::_M_run()
execute_native_thread_routine

A reproducer is attached.

SERVER-103841-repro.js (2 kB, Apr 15 2025 04:17:15 PM UTC)

is caused by:
- SERVER-73915 TransactionCoordinatorService may stall primary step-up from completing when replica set shard steps down and back up quickly (Closed)

related to:
- SERVER-92236 Chunk migrations should use short lived cancellation sources (Closed)
- SERVER-92333 Audit use of long lived CancellationSources (Open)
- SERVER-103945 Better memory management for CancellationToken (Backlog)
- SERVER-103481 Give more granular ownership to subset of files owned by 10gen/query in 'db/', 'db/commands', 'db/s', and 'db/test_output' (Closed)

Assignee: Myles Hathcock
Reporter: Joan Bruguera Micó
Participants: Githook User, Joan Bruguera Micó, Myles Hathcock
Votes: 0
Watchers: 29

Created: Apr 15 2025 04:13:17 PM UTC
Updated: Jun 26 2025 07:37:00 AM UTC
Resolved: May 07 2025 01:29:15 PM UTC
