[SERVER-56354] Investigate memory leaks in TenantMigrationDonorAccessBlocker Created: 26/Apr/21  Updated: 29/Oct/23  Resolved: 05/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.9.0-rc1, 5.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Pavithra Vetriselvan Assignee: Cheahuychou Mao
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-52713 [testing] Add stepdown/kill/terminate... Closed
Related
related to SERVER-57368 Clear tenant migration access blocker... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.9
Sprint: Sharding 2021-05-17
Participants:

 Description   

Several memory leaks were caught in the CleanEveryN hook after shutting down the donor primary. They seem to be related to the TenantMigrationDonorAccessBlocker.

This was caught while running patch builds to enable the tenant migration stepdown passthrough.



 Comments   
Comment by Githook User [ 23/May/21 ]

Author: Cheahuychou Mao (cheahuychou) <mao.cheahuychou@gmail.com>

Message: SERVER-56354 Remove extra argument for startMigration() function call
Branch: master
https://github.com/mongodb/mongo/commit/7430d4381777380936d5ac17342dd85be488ef8e

Comment by Githook User [ 05/May/21 ]

Author: Cheahuychou Mao (cheahuychou) <mao.cheahuychou@gmail.com>

Message: SERVER-56354 Ensure that the future chain in checkIfCanReadOrBlock gets interrupted when the command's opCtx is marked as killed

(cherry picked from commit 589020d961aadba1d38ea88d0b3f4b1878bf3be0)
Branch: v4.9
https://github.com/mongodb/mongo/commit/202937fa78084e21b17af47d1dc2af1b9be5a38f

Comment by Githook User [ 05/May/21 ]

Author: Cheahuychou Mao (cheahuychou) <mao.cheahuychou@gmail.com>

Message: SERVER-56354 Ensure that the future chain in checkIfCanReadOrBlock gets interrupted when the command's opCtx is marked as killed
Branch: master
https://github.com/mongodb/mongo/commit/589020d961aadba1d38ea88d0b3f4b1878bf3be0

Comment by Jack Mulrow [ 28/Apr/21 ]

From a first look, I think this has to do with the callback for the future returned by tenant_migration_access_blocker::checkIfCanReadOrBlock() (in tenant_migration_access_blocker_util.cpp) and the lifetime management of the TenantMigrationAccessBlockerExecutor decoration's executor, which that callback runs on. That executor has an atypical lifetime: the decoration holds only a weak pointer to it, while every access blocker holds it by shared pointer and creates it on construction if it doesn't already exist. The callback in that future chain captures access blockers by shared pointer, so the executor won't be destroyed and shut down until the callback runs, which I think never happened in this failure.
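
The ownership pattern described above can be sketched in standard C++. This is only an illustration under stated assumptions, not the server code: Executor, ExecutorHolder, AccessBlocker and runLifetimeDemo are hypothetical names, the "decoration" is reduced to a holder with a weak pointer, and the future chain is reduced to a single stored callback.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for the executor (not the MongoDB class).
// It records its own destruction through a flag owned by the caller.
struct Executor {
    explicit Executor(bool* destroyedFlag) : destroyed(destroyedFlag) {}
    ~Executor() { *destroyed = true; }
    bool* destroyed;
};

// Plays the role of the decoration: it only observes the executor
// through a weak pointer and never keeps it alive by itself.
class ExecutorHolder {
public:
    std::shared_ptr<Executor> getOrCreate(bool* destroyedFlag) {
        auto exec = _weak.lock();
        if (!exec) {
            // No live executor yet: create one on demand.
            exec = std::make_shared<Executor>(destroyedFlag);
            _weak = exec;
        }
        return exec;
    }

private:
    std::weak_ptr<Executor> _weak;
};

// Each access blocker owns the executor by shared pointer.
struct AccessBlocker {
    AccessBlocker(ExecutorHolder& holder, bool* destroyedFlag)
        : executor(holder.getOrCreate(destroyedFlag)) {}
    std::shared_ptr<Executor> executor;
};

struct DemoResult {
    bool aliveWhilePending;  // executor still alive while the callback is pending
    bool destroyedAfterRun;  // executor destroyed once the callback ran and was released
};

DemoResult runLifetimeDemo() {
    bool destroyed = false;
    ExecutorHolder holder;
    std::function<void()> pendingCallback;
    {
        auto blocker = std::make_shared<AccessBlocker>(holder, &destroyed);
        // The callback captures the blocker by shared pointer.
        pendingCallback = [blocker] {};
    }  // The original owner drops its reference here.

    // The pending callback still owns the blocker, which owns the executor.
    const bool aliveWhilePending = !destroyed;

    pendingCallback();          // The callback finally runs...
    pendingCallback = nullptr;  // ...and releasing it drops the last reference.
    return {aliveWhilePending, destroyed};
}
```

In this sketch, a callback that never runs (and is never released) keeps the blocker and the executor alive until process exit, which is the shape of the leak reported by LeakSanitizer below.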

Based on the logs, it looks like the CleanEveryN hook tried to shut down a server and while shutting down, the server's ServiceEntryPointImpl timed out waiting for its active sessions to drain. My guess is at least one of those sessions represented a command that was waiting on a tenant migration to complete, which prevented the callback in the checkIfCanReadOrBlock() future chain from running, so when the process finished shutting down the memory allocated for that callback was considered leaked.

From the logs:

// Node0 starts to shut down.
[j0:rs0:n0] | 2021-04-26T15:32:37.473+00:00 I  CONTROL  23377   [SignalHandler] "Received signal","attr":{"signal":15,"error":"Terminated"}
...
// Shutdown continues without shutting down all worker threads.
[j0:rs0:n0] | 2021-04-26T15:32:48.487+00:00 I  NETWORK  22947   [SignalHandler] "shutdown: exhausted grace period active workers to drain; continuing with shutdown...","attr":{"workers":1}
[j0:rs0:n0] | 2021-04-26T15:32:48.487+00:00 I  NETWORK  4907200 [SignalHandler] "Failed to shutdown ServiceExecutorSynchronous","attr":{"error":{"code":262,"codeName":"ExceededTimeLimit","errmsg":"passthrough executor couldn't shutdown all worker threads within time limit."}}
...
// After shutdown finishes, leaks are detected involving a donor access blocker and the canRead future.
[j0:rs0:n0] =================================================================
[j0:rs0:n0] ==108062==ERROR: LeakSanitizer: detected memory leaks
...
[j0:rs0:n0] Indirect leak of 408 byte(s) in 1 object(s) allocated from:
[j0:rs0:n0]     #0 0x556204f00a72 in operator new(unsigned long) /data/mci/2b29d50424b17e9bd64c23288b01e972/toolchain-builder/tmp/build-llvm.sh-h2X/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:106:3
[j0:rs0:n0]     #1 0x556208f94587 in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/ext/new_allocator.h:111:27
[j0:rs0:n0]     #2 0x556208f94587 in std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, (__gnu_cxx::_Lock_policy)2> > >::allocate(std::allocator<std::_Sp_counted_ptr_inplace<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, (__gnu_cxx::_Lock_policy)2> >&, unsigned long) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/alloc_traits.h:436
[j0:rs0:n0]     #3 0x556208f94587 in std::__allocated_ptr<std::allocator<std::_Sp_counted_ptr_inplace<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, (__gnu_cxx::_Lock_policy)2> > > std::__allocate_guarded<std::allocator<std::_Sp_counted_ptr_inplace<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, (__gnu_cxx::_Lock_policy)2> > >(std::allocator<std::_Sp_counted_ptr_inplace<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, (__gnu_cxx::_Lock_policy)2> >&) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/allocated_ptr.h:97
[j0:rs0:n0]     #4 0x556208f94587 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, mongo::ServiceContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(mongo::TenantMigrationDonorAccessBlocker*&, std::_Sp_alloc_shared_tag<std::allocator<mongo::TenantMigrationDonorAccessBlocker> >, mongo::ServiceContext*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/shared_ptr_base.h:675
[j0:rs0:n0]     #5 0x556208f94587 in std::__shared_ptr<mongo::TenantMigrationDonorAccessBlocker, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<mongo::TenantMigrationDonorAccessBlocker>, mongo::ServiceContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::_Sp_alloc_shared_tag<std::allocator<mongo::TenantMigrationDonorAccessBlocker> >, mongo::ServiceContext*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/shared_ptr_base.h:1342
[j0:rs0:n0]     #6 0x556208faa0ba in std::shared_ptr<mongo::TenantMigrationDonorAccessBlocker>::shared_ptr<std::allocator<mongo::TenantMigrationDonorAccessBlocker>, mongo::ServiceContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::_Sp_alloc_shared_tag<std::allocator<mongo::TenantMigrationDonorAccessBlocker> >, mongo::ServiceContext*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/shared_ptr.h:359:4
[j0:rs0:n0]     #7 0x556208faa0ba in std::shared_ptr<mongo::TenantMigrationDonorAccessBlocker> std::allocate_shared<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, mongo::ServiceContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::allocator<mongo::TenantMigrationDonorAccessBlocker> const&, mongo::ServiceContext*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/shared_ptr.h:705
[j0:rs0:n0]     #8 0x556208faa0ba in std::shared_ptr<mongo::TenantMigrationDonorAccessBlocker> std::make_shared<mongo::TenantMigrationDonorAccessBlocker, mongo::ServiceContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(mongo::ServiceContext*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/shared_ptr.h:721
[j0:rs0:n0]     #9 0x556208faa0ba in mongo::repl::(anonymous namespace)::onTransitionToAbortingIndexBuilds(mongo::OperationContext*, mongo::TenantMigrationDonorDocument const&) /data/mci/e8e785cb782524c554ebc2073515df69/src/src/mongo/db/repl/tenant_migration_donor_op_observer.cpp:56
[j0:rs0:n0]     #10 0x556208faa0ba in mongo::repl::TenantMigrationDonorOpObserver::onInserts(mongo::OperationContext*, mongo::NamespaceString const&, boost::optional<mongo::UUID>, __gnu_cxx::__normal_iterator<mongo::InsertStatement const*, std::vector<mongo::InsertStatement, std::allocator<mongo::InsertStatement> > >, __gnu_cxx::__normal_iterator<mongo::InsertStatement const*, std::vector<mongo::InsertStatement, std::allocator<mongo::InsertStatement> > >, bool) /data/mci/e8e785cb782524c554ebc2073515df69/src/src/mongo/db/repl/tenant_migration_donor_op_observer.cpp:185
...
[j0:rs0:n0] Indirect leak of 296 byte(s) in 1 object(s) allocated from:
[j0:rs0:n0]     #0 0x556204f00a72 in operator new(unsigned long) /data/mci/2b29d50424b17e9bd64c23288b01e972/toolchain-builder/tmp/build-llvm.sh-h2X/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:106:3
[j0:rs0:n0]     #1 0x5562061dbd62 in boost::intrusive_ptr<mongo::future_details::SharedStateImpl<std::vector<mongo::Status, std::allocator<mongo::Status> > > > mongo::make_intrusive<mongo::future_details::SharedStateImpl<std::vector<mongo::Status, std::allocator<mongo::Status> > >, void>() /data/mci/e95cf945ee841160d390a38753fd1a5d/src/src/mongo/util/intrusive_counter.h:100:16
[j0:rs0:n0]     #2 0x5562061dbd62 in mongo::Promise<std::vector<mongo::Status, std::allocator<mongo::Status> > >::makePromiseFutureImpl()::PromiseAndFuture::PromiseAndFuture() /data/mci/e95cf945ee841160d390a38753fd1a5d/src/src/mongo/util/future.h:867
[j0:rs0:n0]     #3 0x5562061dbd62 in mongo::Promise<std::vector<mongo::Status, std::allocator<mongo::Status> > >::makePromiseFutureImpl() /data/mci/e95cf945ee841160d390a38753fd1a5d/src/src/mongo/util/future.h:870
[j0:rs0:n0]     #4 0x556208f7c1d4 in auto mongo::makePromiseFuture<std::vector<mongo::Status, std::allocator<mongo::Status> > >() /data/mci/48766b764fac89573657919c5b817f31/src/src/mongo/util/future.h:1157:12
[j0:rs0:n0]     #5 0x556208f7c1d4 in mongo::SemiFuture<std::vector<mongo::Status, std::allocator<mongo::Status> > > mongo::future_util::withCancellation<mongo::SemiFuture<std::vector<mongo::Status, std::allocator<mongo::Status> > >, std::vector<mongo::Status, std::allocator<mongo::Status> > >(mongo::SemiFuture<std::vector<mongo::Status, std::allocator<mongo::Status> > >&&, mongo::CancellationToken const&) /data/mci/48766b764fac89573657919c5b817f31/src/src/mongo/util/future_util.h:671
[j0:rs0:n0]     #6 0x556208f6284c in mongo::tenant_migration_access_blocker::checkIfCanReadOrBlock(mongo::OperationContext*, mongo::OpMsgRequest const&) /data/mci/48766b764fac89573657919c5b817f31/src/src/mongo/db/repl/tenant_migration_access_blocker_util.cpp:186:12
...
// Other leaks related to the access blocker or that callback.
[j0:rs0:n0]     #4 0x556208f9660c in mongo::RepeatableSharedPromise<void>::RepeatableSharedPromise() /data/mci/48766b764fac89573657919c5b817f31/src/src/mongo/db/repl/tenant_migration_donor_access_blocker.h:56
...
[j0:rs0:n0]     #8 0x556208f5f6c4 in mongo::tenant_migration_access_blocker::checkIfCanReadOrBlock(mongo::OperationContext*, mongo::OpMsgRequest const&) /data/mci/48766b764fac89573657919c5b817f31/src/src/mongo/db/repl/tenant_migration_access_blocker_util.cpp:146

If this is true, I don't think this leak represents a serious problem, since it should only happen on shutdown with an active migration, which should be rare, and the leaked memory should be immediately reclaimed by the OS when shutdown finishes. We'd still want to fix this to avoid BFs though. One way could be to change shutting down the access blocker registry to also shut down the access blocker executor, if it exists.
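
A rough sketch of the suggested direction, in standard C++ and under loud assumptions: SketchExecutor, SketchBlocker, SketchRegistry and runShutdownDemo are hypothetical names, the executor simply drops queued tasks on shutdown (a real executor would more likely complete them with an error), and everything is single-threaded. It only illustrates why shutting down the executor from the registry's shutdown path releases the blockers captured by pending callbacks.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical executor that just queues tasks (not the MongoDB class).
class SketchExecutor {
public:
    void schedule(std::function<void()> task) {
        if (_shutDown) return;  // Reject work after shutdown.
        _pending.push_back(std::move(task));
    }
    // Drops queued tasks so that everything they captured is released.
    void shutdown() {
        _shutDown = true;
        std::vector<std::function<void()>> dropped;
        dropped.swap(_pending);
    }  // 'dropped' is destroyed here, releasing the captures.

private:
    bool _shutDown = false;
    std::vector<std::function<void()>> _pending;
};

// Hypothetical access blocker that counts live instances.
struct SketchBlocker {
    SketchBlocker(std::shared_ptr<SketchExecutor> exec, int* liveCount)
        : executor(std::move(exec)), live(liveCount) { ++*live; }
    ~SketchBlocker() { --*live; }
    std::shared_ptr<SketchExecutor> executor;
    int* live;
};

// Hypothetical registry: owns blockers, only observes the executor.
class SketchRegistry {
public:
    std::shared_ptr<SketchBlocker> add(int* liveCount) {
        auto exec = _executor.lock();
        if (!exec) {
            exec = std::make_shared<SketchExecutor>();
            _executor = exec;
        }
        auto blocker = std::make_shared<SketchBlocker>(exec, liveCount);
        _blockers.push_back(blocker);
        return blocker;
    }
    void shutDown(bool alsoShutDownExecutor) {
        _blockers.clear();
        if (!alsoShutDownExecutor) return;
        // The proposed change: shut down the executor too, if it exists.
        if (auto exec = _executor.lock()) exec->shutdown();
    }

private:
    std::weak_ptr<SketchExecutor> _executor;
    std::vector<std::shared_ptr<SketchBlocker>> _blockers;
};

struct ShutdownCounts {
    int afterRegistryOnly;  // live blockers after clearing the registry only
    int afterExecutorToo;   // live blockers after also shutting down the executor
};

ShutdownCounts runShutdownDemo() {
    int live = 0;
    SketchRegistry registry;
    {
        auto blocker = registry.add(&live);
        // A pending callback that captures the blocker by shared pointer.
        blocker->executor->schedule([blocker] {});
    }
    registry.shutDown(false);
    const int afterRegistryOnly = live;  // The pending callback still owns the blocker.
    registry.shutDown(true);
    return {afterRegistryOnly, live};
}
```

Clearing the registry alone leaves the blocker alive because the queued callback and the blocker keep each other reachable; shutting down the executor breaks that chain.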

We should also figure out why the callbacks weren't able to run, and if it is because of a hung migration, we should investigate how that happened. I noticed that the test that ran immediately before the CleanEveryN failure timed out (or at least ran for a very long time), so maybe something went wrong there.

Generated at Thu Feb 08 05:39:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.