From a first look, I think this has to do with the callback for the future returned by tenant_migration_access_blocker_util::checkIfCanReadOrBlock() and the lifetime management of the executor held by the TenantMigrationAccessBlockerExecutor decoration that it runs on. That executor has an atypical lifetime: the decoration holds only a weak pointer to it, and every access blocker holds it by shared pointer, creating it on construction if it didn't already exist. The callback in that future chain captures access blockers by shared pointer, so the executor won't be destroyed and shut down until the callback runs, which I think never happened in this failure.
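To make that lifetime pattern concrete, here's a rough sketch of what I mean; the names below are made up for illustration and are not the actual server classes:

// Rough sketch of the lifetime pattern described above. Everything here is
// illustrative; the real TenantMigrationAccessBlockerExecutor decoration and
// executor types differ.
#include <memory>
#include <mutex>

struct Executor {};  // stand-in for the real task executor

// The decoration only holds a weak_ptr, so it never keeps the executor alive.
class AccessBlockerExecutorDecoration {
public:
    std::shared_ptr<Executor> getOrCreateExecutor() {
        std::lock_guard<std::mutex> lk(_mutex);
        if (auto executor = _weakExecutor.lock()) {
            return executor;  // some access blocker is still pinning it
        }
        auto executor = std::make_shared<Executor>();
        _weakExecutor = executor;  // decoration never extends the lifetime
        return executor;
    }

private:
    std::mutex _mutex;
    std::weak_ptr<Executor> _weakExecutor;
};

// Every access blocker pins the executor with a shared_ptr, creating it on
// construction if it didn't already exist.
class AccessBlocker {
public:
    explicit AccessBlocker(AccessBlockerExecutorDecoration& decoration)
        : _executor(decoration.getOrCreateExecutor()) {}

private:
    std::shared_ptr<Executor> _executor;
};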
Based on the logs, it looks like the CleanEveryN hook tried to shut down a server, and while shutting down, the server's ServiceEntryPointImpl timed out waiting for its active sessions to drain. My guess is that at least one of those sessions represented a command that was waiting on a tenant migration to complete, which prevented the callback in the checkIfCanReadOrBlock() future chain from running, so when the process finished shutting down, the memory allocated for that callback was reported as leaked.
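As a toy illustration of why that shows up in the leak report (this uses plain std:: types, not the server's Future machinery): a queued callback that captures a blocker by shared pointer keeps that allocation alive until the callback either runs or is destroyed, so abandoning the executor's queue at shutdown leaves both for the leak checker to find.

#include <functional>
#include <memory>
#include <vector>

struct AccessBlocker {};

struct FakeExecutor {
    std::vector<std::function<void()>> queued;  // callbacks waiting to run
};

int main() {
    auto blocker = std::make_shared<AccessBlocker>();

    // Deliberately "abandon" the executor instead of shutting it down, mimicking
    // a process that exits while the callback is still queued.
    auto* executor = new FakeExecutor();

    // The callback captures the blocker by shared_ptr; until it runs or the
    // executor's queue is destroyed, that allocation stays reachable only
    // through the leaked executor.
    executor->queued.push_back([blocker] { /* would unblock the waiting command */ });

    // main() returns without running or destroying the callback, so LeakSanitizer
    // reports the executor as a direct leak and the blocker as an indirect leak.
    return 0;
}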
From the logs:
// Node0 starts to shut down.
[j0:rs0:n0] | 2021-04-26T15:32:37.473+00:00 I CONTROL 23377 [SignalHandler] "Received signal","attr":{"signal":15,"error":"Terminated"}
...
// Shutdown continues without all worker threads having shut down.
[j0:rs0:n0] | 2021-04-26T15:32:48.487+00:00 I NETWORK 22947 [SignalHandler] "shutdown: exhausted grace period active workers to drain; continuing with shutdown...","attr":{"workers":1}
[j0:rs0:n0] | 2021-04-26T15:32:48.487+00:00 I NETWORK 4907200 [SignalHandler] "Failed to shutdown ServiceExecutorSynchronous","attr":{"error":{"code":262,"codeName":"ExceededTimeLimit","errmsg":"passthrough executor couldn't shutdown all worker threads within time limit."}}
...
// After shutdown finishes, leaks are detected involving a donor access blocker and the canRead future.
[j0:rs0:n0] =================================================================
[j0:rs0:n0] ==108062==ERROR: LeakSanitizer: detected memory leaks
...
[j0:rs0:n0] Indirect leak of 408 byte(s) in 1 object(s) allocated from:
[j0:rs0:n0] #0 0x556204f00a72 in operator new(unsigned long) /data/mci/2b29d50424b17e9bd64c23288b01e972/toolchain-builder/tmp/build-llvm.sh-h2X/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:106:3
[j0:rs0:n0] #1 0x556208f94587 in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/ext/new_allocator.h:111:27
[j0:rs0:n0] #2 0x556208f94587 in std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, (__gnu_cxx::_Lock_policy)2> > >::allocate(std::allocator<std::_Sp_counted_ptr_inplace<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, (__gnu_cxx::_Lock_policy)2> >&, unsigned long) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/alloc_traits.h:436
[j0:rs0:n0] #3 0x556208f94587 in std::__allocated_ptr<std::allocator<std::_Sp_counted_ptr_inplace<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, (__gnu_cxx::_Lock_policy)2> > > std::__allocate_guarded<std::allocator<std::_Sp_counted_ptr_inplace<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, (__gnu_cxx::_Lock_policy)2> > >(std::allocator<std::_Sp_counted_ptr_inplace<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, (__gnu_cxx::_Lock_policy)2> >&) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/allocated_ptr.h:97
[j0:rs0:n0] #4 0x556208f94587 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, mongo::ServiceContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(mongo::TenantMigrationDonorAccessBlocker*&, std::_Sp_alloc_shared_tag<std::allocator<mongo::TenantMigrationDonorAccessBlocker> >, mongo::ServiceContext*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/shared_ptr_base.h:675
[j0:rs0:n0] #5 0x556208f94587 in std::__shared_ptr<mongo::TenantMigrationDonorAccessBlocker, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<mongo::TenantMigrationDonorAccessBlocker>, mongo::ServiceContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::_Sp_alloc_shared_tag<std::allocator<mongo::TenantMigrationDonorAccessBlocker> >, mongo::ServiceContext*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/shared_ptr_base.h:1342
[j0:rs0:n0] #6 0x556208faa0ba in std::shared_ptr<mongo::TenantMigrationDonorAccessBlocker>::shared_ptr<std::allocator<mongo::TenantMigrationDonorAccessBlocker>, mongo::ServiceContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::_Sp_alloc_shared_tag<std::allocator<mongo::TenantMigrationDonorAccessBlocker> >, mongo::ServiceContext*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/shared_ptr.h:359:4
[j0:rs0:n0] #7 0x556208faa0ba in std::shared_ptr<mongo::TenantMigrationDonorAccessBlocker> std::allocate_shared<mongo::TenantMigrationDonorAccessBlocker, std::allocator<mongo::TenantMigrationDonorAccessBlocker>, mongo::ServiceContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::allocator<mongo::TenantMigrationDonorAccessBlocker> const&, mongo::ServiceContext*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/shared_ptr.h:705
[j0:rs0:n0] #8 0x556208faa0ba in std::shared_ptr<mongo::TenantMigrationDonorAccessBlocker> std::make_shared<mongo::TenantMigrationDonorAccessBlocker, mongo::ServiceContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(mongo::ServiceContext*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&) /opt/mongodbtoolchain/revisions/32eb70c47bd9e9759dd05654843feb80461aaef3/stow/gcc-v3.pgi/lib/gcc/x86_64-mongodb-linux/8.3.0/../../../../include/c++/8.3.0/bits/shared_ptr.h:721
[j0:rs0:n0] #9 0x556208faa0ba in mongo::repl::(anonymous namespace)::onTransitionToAbortingIndexBuilds(mongo::OperationContext*, mongo::TenantMigrationDonorDocument const&) /data/mci/e8e785cb782524c554ebc2073515df69/src/src/mongo/db/repl/tenant_migration_donor_op_observer.cpp:56
[j0:rs0:n0] #10 0x556208faa0ba in mongo::repl::TenantMigrationDonorOpObserver::onInserts(mongo::OperationContext*, mongo::NamespaceString const&, boost::optional<mongo::UUID>, __gnu_cxx::__normal_iterator<mongo::InsertStatement const*, std::vector<mongo::InsertStatement, std::allocator<mongo::InsertStatement> > >, __gnu_cxx::__normal_iterator<mongo::InsertStatement const*, std::vector<mongo::InsertStatement, std::allocator<mongo::InsertStatement> > >, bool) /data/mci/e8e785cb782524c554ebc2073515df69/src/src/mongo/db/repl/tenant_migration_donor_op_observer.cpp:185
...
[j0:rs0:n0] Indirect leak of 296 byte(s) in 1 object(s) allocated from:
[j0:rs0:n0] #0 0x556204f00a72 in operator new(unsigned long) /data/mci/2b29d50424b17e9bd64c23288b01e972/toolchain-builder/tmp/build-llvm.sh-h2X/llvm/projects/compiler-rt/lib/asan/asan_new_delete.cc:106:3
[j0:rs0:n0] #1 0x5562061dbd62 in boost::intrusive_ptr<mongo::future_details::SharedStateImpl<std::vector<mongo::Status, std::allocator<mongo::Status> > > > mongo::make_intrusive<mongo::future_details::SharedStateImpl<std::vector<mongo::Status, std::allocator<mongo::Status> > >, void>() /data/mci/e95cf945ee841160d390a38753fd1a5d/src/src/mongo/util/intrusive_counter.h:100:16
[j0:rs0:n0] #2 0x5562061dbd62 in mongo::Promise<std::vector<mongo::Status, std::allocator<mongo::Status> > >::makePromiseFutureImpl()::PromiseAndFuture::PromiseAndFuture() /data/mci/e95cf945ee841160d390a38753fd1a5d/src/src/mongo/util/future.h:867
[j0:rs0:n0] #3 0x5562061dbd62 in mongo::Promise<std::vector<mongo::Status, std::allocator<mongo::Status> > >::makePromiseFutureImpl() /data/mci/e95cf945ee841160d390a38753fd1a5d/src/src/mongo/util/future.h:870
[j0:rs0:n0] #4 0x556208f7c1d4 in auto mongo::makePromiseFuture<std::vector<mongo::Status, std::allocator<mongo::Status> > >() /data/mci/48766b764fac89573657919c5b817f31/src/src/mongo/util/future.h:1157:12
[j0:rs0:n0] #5 0x556208f7c1d4 in mongo::SemiFuture<std::vector<mongo::Status, std::allocator<mongo::Status> > > mongo::future_util::withCancellation<mongo::SemiFuture<std::vector<mongo::Status, std::allocator<mongo::Status> > >, std::vector<mongo::Status, std::allocator<mongo::Status> > >(mongo::SemiFuture<std::vector<mongo::Status, std::allocator<mongo::Status> > >&&, mongo::CancellationToken const&) /data/mci/48766b764fac89573657919c5b817f31/src/src/mongo/util/future_util.h:671
[j0:rs0:n0] #6 0x556208f6284c in mongo::tenant_migration_access_blocker::checkIfCanReadOrBlock(mongo::OperationContext*, mongo::OpMsgRequest const&) /data/mci/48766b764fac89573657919c5b817f31/src/src/mongo/db/repl/tenant_migration_access_blocker_util.cpp:186:12
...
// Other leaks related to the access blocker or that callback.
[j0:rs0:n0] #4 0x556208f9660c in mongo::RepeatableSharedPromise<void>::RepeatableSharedPromise() /data/mci/48766b764fac89573657919c5b817f31/src/src/mongo/db/repl/tenant_migration_donor_access_blocker.h:56
...
[j0:rs0:n0] #8 0x556208f5f6c4 in mongo::tenant_migration_access_blocker::checkIfCanReadOrBlock(mongo::OperationContext*, mongo::OpMsgRequest const&) /data/mci/48766b764fac89573657919c5b817f31/src/src/mongo/db/repl/tenant_migration_access_blocker_util.cpp:146
If this is true, I don't think this leak represents a serious problem, since it should only happen on shutdown with an active migration, which should be rare, and the leaked memory is reclaimed by the OS as soon as the process exits. We'd still want to fix this to avoid BFs, though. One option could be to have shutting down the access blocker registry also shut down the access blocker executor, if it exists.
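Roughly something like the following (hypothetical names again, not the real registry, decoration, or executor types):

// Hypothetical sketch of that fix: the registry's shutdown locks the
// decoration's weak_ptr and, if an executor is still alive, shuts it down so
// its queued callbacks are destroyed before the process-exit leak check.
#include <memory>

struct Executor {
    void shutdown() { /* stop accepting work; cancel/destroy queued callbacks */ }
    void join() { /* wait for worker threads to finish */ }
};

class AccessBlockerRegistry {
public:
    void shutDown() {
        // ...existing cleanup of registered access blockers...

        // New: only if some access blocker actually created the executor.
        if (auto executor = _weakExecutor.lock()) {
            executor->shutdown();
            executor->join();
        }
    }

private:
    std::weak_ptr<Executor> _weakExecutor;  // mirrors the decoration's weak_ptr
};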
We should also figure out why the callbacks weren't able to run, and if it was because of a hung migration, we should investigate how that happened. I noticed that the test that ran immediately before the CleanEveryN failure timed out (or at least ran for a very long time), so maybe something went wrong there.