[SERVER-48650] Unit tests' ServiceContext's NetworkInterfaceMockClockSource cannot continue to depend upon the lifetime of the ReplicationCoordinator to remain valid Created: 08/Jun/20  Updated: 29/Oct/23  Resolved: 28/Jul/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.7.0, 4.4.5

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Benjamin Caimano (Inactive)
Resolution: Fixed Votes: 0
Labels: servicearch-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-46826 Instantiate the JournalFlusher thread... Closed
Problem/Incident
causes SERVER-63103 ClockSourceMock::get has no implement... Closed
Related
related to SERVER-47892 DiagnosticInfo for latches doesn't in... Closed
related to SERVER-49950 Address issues with ClockSource::setA... Closed
is related to SERVER-49863 ServiceContext time sources should ei... Backlog
is related to SERVER-49864 Tests with ServiceContext use should ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Service arch 2020-06-29, Service arch 2020-07-13, Service Arch 2020-07-27, Service Arch 2020-08-10
Participants:

 Description   

:tldr

The ServiceContext's NetworkInterfaceMockClockSource cannot continue to depend upon the lifetime of the ReplicationCoordinator to remain valid; while also being depended upon by any code with a mutex under contention via the DiagnosticInfo code.

There's a rare (so far) BF dependent on a SERVER ticket linked a couple hops from this one.

--------------------------------------------------------------------------

I tried to setup the JournalFlusher thread in our unit test test fixtures, specifically the ServiceContexMongoDTest test fixture. However, I ran into problems like so,

[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:47:57.003+0000 | 2020-06-05T20:47:57.003Z I  STORAGE  22320   [main] "Shutting down journal flusher thread"
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:47:57.004+0000 src/mongo/executor/network_interface_mock.h:567:22: runtime error: member call on address 0x61600038ee80 which does not point to an object of type 'mongo::executor::NetworkInterfaceMock'
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:47:57.004+0000 0x61600038ee80: note: object has a possibly invalid vptr: abs(offset to top) too big
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:47:57.004+0000  a4 00 00 48  8b 00 00 49 4d 56 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:47:57.004+0000               ^~~~~~~~~~~~~~~~~~~~~~~
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:47:57.004+0000               possibly invalid vptr
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.642+0000     #0 0x564d4e99e90a in mongo::executor::NetworkInterfaceMockClockSource::now() /data/mci/c67a30324c4da5f0e5f82678065dd299/src/src/mongo/executor/network_interface_mock.h:567:22
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.652+0000     #1 0x564d512227b1 in mongo::DiagnosticInfo::capture(mongo::StringData const&, mongo::DiagnosticInfo::Options) /data/mci/c67a30324c4da5f0e5f82678065dd299/src/src/mongo/util/diagnostic_info.cpp:285:73
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.652+0000     #2 0x564d512250a9 in mongo::(anonymous namespace)::_mongoInitializerFunction_DiagnosticInfo(mongo::InitializerContext*)::DiagnosticListener::onContendedLock(mongo::latch_detail::Identity const&) /data/mci/c67a30324c4da5f0e5f82678065dd299/src/src/mongo/util/diagnostic_info.cpp:192:43
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.653+0000     #3 0x564d532adfe6 in mongo::latch_detail::Mutex::_onContendedLock() /data/mci/c67a30324c4da5f0e5f82678065dd299/src/src/mongo/platform/mutex.cpp:89:19
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.653+0000     #4 0x564d532ad29e in mongo::latch_detail::Mutex::lock() /data/mci/c67a30324c4da5f0e5f82678065dd299/src/src/mongo/platform/mutex.cpp:55:5
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.678+0000     #5 0x564d4e36c544 in std::unique_lock<mongo::latch_detail::Latch>::lock() /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.U0D/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_mutex.h:267:17
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.678+0000     #6 0x564d4e36beba in std::unique_lock<mongo::latch_detail::Latch>::unique_lock(mongo::latch_detail::Latch&) /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.U0D/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_mutex.h:197:2
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.682+0000     #7 0x564d4fe2d1de in mongo::JournalFlusher::run() /data/mci/c67a30324c4da5f0e5f82678065dd299/src/src/mongo/db/storage/control/journal_flusher.cpp:127:34
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.685+0000     #8 0x564d52b09f7d in mongo::BackgroundJob::jobBody() /data/mci/c67a30324c4da5f0e5f82678065dd299/src/src/mongo/util/background.cpp:160:5
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.685+0000     #9 0x564d52b12375 in std::__invoke_result<mongo::BackgroundJob::go()::$_0>::type std::__invoke<mongo::BackgroundJob::go()::$_0>(mongo::BackgroundJob::go()::$_0&&) /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.U0D/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/invoke.h:95:14
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.685+0000     #10 0x564d52b12375 in decltype(auto) std::__apply_impl<mongo::BackgroundJob::go()::$_0, std::tuple<> >(mongo::BackgroundJob::go()::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long>) /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.U0D/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/tuple:1678
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.685+0000     #11 0x564d52b12375 in decltype(auto) std::apply<mongo::BackgroundJob::go()::$_0, std::tuple<> >(mongo::BackgroundJob::go()::$_0&&, std::tuple<>&&) /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.U0D/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/tuple:1687
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.685+0000     #12 0x564d52b12375 in mongo::stdx::thread::thread<mongo::BackgroundJob::go()::$_0, 0>(mongo::BackgroundJob::go()::$_0)::'lambda'()::operator()() /data/mci/c67a30324c4da5f0e5f82678065dd299/src/src/mongo/stdx/thread.h:186
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.685+0000     #13 0x564d52b12375 in mongo::BackgroundJob::go()::$_0 std::__invoke_impl<void, mongo::stdx::thread::thread<mongo::BackgroundJob::go()::$_0, 0>(mongo::BackgroundJob::go()::$_0)::'lambda'()>(std::__invoke_other, mongo::stdx::thread::thread<mongo::BackgroundJob::go()::$_0, 0>(mongo::BackgroundJob::go()::$_0)::'lambda'()&&) /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.U0D/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/invoke.h:60
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.686+0000     #14 0x564d535a52ee in execute_native_thread_routine /data/mci/ff40e18d703ff1f6e8f8ecf75ad7202f/toolchain-builder/tmp/build-gcc-v3.sh-NJq/build/x86_64-mongodb-linux/libstdc++-v3/src/c++11/../../../../../src/combined/libstdc++-v3/src/c++11/thread.cc:80:18
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:06.686+0000     #15 0x7f8e0e26b6da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:09.525+0000     #16 0x7f8e0dd7c88e in clone /build/glibc-OTsEL5/glibc-2.27/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:09.525+0000 
[cpp_unit_test:db_repl_coordinator_test] 2020-06-05T20:48:09.525+0000 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/mongo/executor/network_interface_mock.h:567:22 in

It turns out that a contend mutex acquisition goes into the DiagnosticInfo code, which in turn call into the NetworkInterfaceMockClockSource and tries to use the NetworkInterfaceMock. However, the JournalFlusher (which has a mutex) found the NetworkInterfaceMock to be invalid memory: the NetworkInterfaceMock was already destroyed.

The test fixture of the failing test is ReplCoordTest. The repl test fixture creates a NetworkInterfaceMock instances, passes ownership of the NetworkInterface into a ThreadPoolTaskExecutor instance, whose ownership is then passed into a ReplicationCoordinatorImpl instance. Meanwhile, NetworkInterfaceMockClockSource instances are created, with pointers to the same NetworkInterfaceMock owned by the ReplicationCoordinatorImpl, and set on the ServiceContext. So things like the JournalFlusher that use Mutexes are now dependent on the the lifetime of the ReplicationCoordinatorImpl.

The ServiceContext's NetworkInterfaceMockClockSource cannot continue to depend upon the lifetime of the ReplicationCoordinator to remain valid; while also being depended upon by any code with a mutex under contention.



 Comments   
Comment by Githook User [ 23/Feb/21 ]

Author:

{'name': 'Ben Caimano', 'email': 'ben.caimano@10gen.com'}

Message: SERVER-48650 Gave the ClockSourceMock a global impl

(cherry picked from commit 214379825c248f5a5e5f0a01ad9863b900faaf30)
Branch: v4.4
https://github.com/mongodb/mongo/commit/569ba8d82148df4e3b6faee4f285f563945da139

Comment by Benjamin Caimano (Inactive) [ 28/Jul/20 ]

dianna.hohensee, I've landed code that removes the NetworkInterfaceMockClockSource in favor of an immortal ClockSourceMock implementation. Hopefully this should unblock you! Let me know if I missed the mark somehow.

Comment by Githook User [ 28/Jul/20 ]

Author:

{'name': 'Ben Caimano', 'email': 'ben.caimano@10gen.com'}

Message: SERVER-48650 Gave the ClockSourceMock a global impl
Branch: master
https://github.com/mongodb/mongo/commit/214379825c248f5a5e5f0a01ad9863b900faaf30

Comment by Benjamin Caimano (Inactive) [ 09/Jun/20 ]

In my mind, there are two parts to this:

  • NetworkInterfaceMockClockSource should actually be a ClockSourceMock. The classes are extremely similar. We can checked_cast<ClockSourceMock>() the service context's ClockSource for use in the NetworkInterfaceMock. This would move the lifetime dependency to the lowest level test fixture.
  • The ServiceContextTest fixture (and related fixtures) should have some ability to determine clock mocking on construction instead of setUp(). I could see this going a few different ways. The easiest is probably adding a tag dispatching constructor to ServiceContextTest that starts with all clocks/tick sources mocked. That's just what comes to mind, I'm sure there are other solutions. The important bit is that we guarantee from the beginning of setUp() to the end of tearDown() we have the same clock/tick sources.
Generated at Thu Feb 08 05:17:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.