- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- None
- Fully Compatible
- ALL
- v5.0
- Replication 2021-12-27, Replication 2022-01-10, Replication 2022-01-24, Replication 2022-02-07
- 174
mongod can very rarely segfault in signalOplogWaiters(), apparently due to a race with loadCatalog. I suspect signalOplogWaiters() is being called without the Global IS lock, so its access to the LocalOplogInfo is unsafe. The crash has only been reproduced in a virtualized environment that fuzzes network events; it has not been observed in production or in Evergreen. It happens toward the end of rollback.
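Here is a minimal standalone sketch of the suspected mechanism (Collection, CappedCallback, and the global pointer below are simplified stand-ins, not the real MongoDB classes): one thread frees and replaces the cached oplog collection the way loadCatalog does when it rebuilds the catalog, while another thread reads and dereferences it with no synchronization, as signalOplogWaiters appears to. Built with -fsanitize=thread or -fsanitize=address, the race is flagged within a few iterations.

// Standalone sketch of the suspected race; all types are stand-ins.
#include <thread>

struct CappedCallback {
    void notifyCappedWaitersIfNeeded() {}
};

struct Collection {
    CappedCallback* getCappedCallback() { return &_cb; }
    CappedCallback _cb;
};

// Stands in for the Collection pointer cached in LocalOplogInfo.
Collection* g_oplog = new Collection();

// Stands in for signalOplogWaiters(): unsynchronized read and dereference.
void signalWaiters() {
    Collection* oplog = g_oplog;
    if (oplog) {
        oplog->getCappedCallback()->notifyCappedWaitersIfNeeded();  // may be freed
    }
}

// Stands in for loadCatalog(): destroys the old Collection, installs a new one.
void reloadCatalog() {
    Collection* old = g_oplog;
    g_oplog = new Collection();
    delete old;  // any thread still using `old` now touches freed memory
}

int main() {
    std::thread heartbeat([] { for (int i = 0; i < 1000000; ++i) signalWaiters(); });
    std::thread recovery([] { for (int i = 0; i < 1000000; ++i) reloadCatalog(); });
    heartbeat.join();
    recovery.join();
}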
In an example crash, the ReplCoord-6 thread segfaults with this backtrace:
Thread 1 (Thread 0x7f51cba6e700 (LWP 91)):
mongo::repl::signalOplogWaiters () at src/mongo/db/repl/oplog.cpp:1828
mongo::repl::ReplicationCoordinatorExternalStateImpl::notifyOplogMetadataWaiters (this=0x564124103600, committedOpTime=...) at src/mongo/db/repl/replication_coordinator_external_state_impl.cpp:995
mongo::repl::ReplicationCoordinatorImpl::_advanceCommitPoint (this=0x56412417b800, lk=..., committedOpTimeAndWallTime=..., fromSyncSource=<optimized out>) at src/mongo/db/repl/replication_coordinator_impl.cpp:5060
mongo::repl::ReplicationCoordinatorImpl::_handleHeartbeatResponse (this=<optimized out>, cbData=...) at src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp:237
std::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)>::operator()(mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&) const (this=0x5641299a69a0, __args=...) at /opt/mongodbtoolchain/revisions/96f92f5fe92913d80957401dc8800c2175a5d728/stow/gcc-v3.xOh/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:687
mongo::executor::TaskExecutor::scheduleRemoteCommand(mongo::executor::RemoteCommandRequestImpl<mongo::HostAndPort> const&, std::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)> const&, std::shared_ptr<mongo::Baton> const&)::$_4::operator()(mongo::executor::TaskExecutor::RemoteCommandOnAnyCallbackArgs const&) const (this=0x5641299a69a0, args=...) at src/mongo/executor/task_executor.cpp:244
std::_Function_handler<void (mongo::executor::TaskExecutor::RemoteCommandOnAnyCallbackArgs const&), mongo::executor::TaskExecutor::scheduleRemoteCommand(mongo::executor::RemoteCommandRequestImpl<mongo::HostAndPort> const&, std::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)> const&, std::shared_ptr<mongo::Baton> const&)::$_4>::_M_invoke(std::_Any_data const&, mongo::executor::TaskExecutor::RemoteCommandOnAnyCallbackArgs const&) (__functor=..., __args=...) at /opt/mongodbtoolchain/revisions/96f92f5fe92913d80957401dc8800c2175a5d728/stow/gcc-v3.xOh/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:297
Meanwhile the "RECOVERY" thread is in loadCatalog, blocked in getMemberState waiting on the ReplicationCoordinatorImpl mutex (0x56412417b820), which ReplCoord-6 likely holds in _advanceCommitPoint:
Thread 68 (Thread 0x7f51ce374700 (LWP 83)):
__lll_timedwait_tid () from /lib/x86_64-linux-gnu/libpthread.so.0
?? ()
?? ()
?? ()
?? ()
.L__sancov_gen_.38 () from /home/emptysquare/08b70f65-vanilla/install/lib/libbase.so
?? ()
?? ()
?? ()
.L__sancov_gen_.38 () from /home/emptysquare/08b70f65-vanilla/install/lib/libbase.so
mongo::latch_detail::Mutex::_onContendedLock (this=0x56412417b820) at src/mongo/platform/mutex.cpp:106
std::lock_guard<mongo::latch_detail::Latch>::lock_guard (this=<optimized out>, __m=...) at /opt/mongodbtoolchain/revisions/96f92f5fe92913d80957401dc8800c2175a5d728/stow/gcc-v3.xOh/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_mutex.h:162
mongo::repl::ReplicationCoordinatorImpl::getMemberState (this=0x7f51ce370e30) at src/mongo/db/repl/replication_coordinator_impl.cpp:1006
mongo::WiredTigerKVEngine::getGroupedRecordStore (this=0x7f5204392d5a <mongo::latch_detail::Mutex::lock()+74>, opCtx=<optimized out>, ns=..., ident="collection-4--4508029134355656785", options=..., prefix=...) at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp:1502
mongo::StorageEngineImpl::_initCollection (this=0x564124199c00, opCtx=0x2, catalogId=..., nss=..., forRepair=false, minVisibleTs=...) at src/mongo/db/storage/storage_engine_impl.cpp:322
mongo::StorageEngineImpl::loadCatalog (this=0x564124199c00, opCtx=0x5641299ae500, loadingFromUncleanShutdown=false) at src/mongo/db/storage/storage_engine_impl.cpp:288
mongo::catalog::openCatalog (opCtx=0x80, minVisibleTimestampMap=std::map with 22 elements = {...}, stableTimestamp=...) at src/mongo/db/catalog/catalog_control.cpp:122
mongo::StorageEngineImpl::recoverToStableTimestamp (this=0x564124199c00, opCtx=0x5641299ae500) at src/mongo/db/storage/storage_engine_impl.cpp:1008
non-virtual thunk to mongo::StorageEngineImpl::recoverToStableTimestamp(mongo::OperationContext*) () at src/mongo/util/invariant.h:71
mongo::repl::StorageInterfaceImpl::recoverToStableTimestamp (this=<optimized out>, opCtx=0x5641299ae500) at src/mongo/db/repl/storage_interface_impl.cpp:1315
mongo::repl::RollbackImpl::_recoverToStableTimestamp (this=<optimized out>, opCtx=<optimized out>) at src/mongo/db/repl/rollback_impl.cpp:1215
mongo::repl::RollbackImpl::_runPhaseFromAbortToReconstructPreparedTxns (this=0x564129779a00, opCtx=0x5641299ae500, commonPoint=...) at src/mongo/db/repl/rollback_impl.cpp:503
mongo::repl::RollbackImpl::runRollback (this=0x564129779a00, opCtx=0x5641299ae500) at src/mongo/db/repl/rollback_impl.cpp:239
mongo::repl::BackgroundSync::_runRollbackViaRecoverToCheckpoint(mongo::OperationContext*, mongo::HostAndPort const&, mongo::repl::OplogInterface*, mongo::repl::StorageInterface*, std::function<mongo::DBClientBase* ()>) (this=0x564128d25380, opCtx=0x5641299ae500, source=..., localOplog=0x7f51ce372690, storageInterface=0x5641241654e0, getConnection=...) at src/mongo/db/repl/bgsync.cpp:826
mongo::repl::BackgroundSync::_runRollback (this=0x564128d25380, opCtx=0x5641299ae500, fetcherReturnStatus=..., source=..., requiredRBID=2, storageInterface=0x5641241654e0) at src/mongo/db/repl/bgsync.cpp:788
mongo::repl::BackgroundSync::_produce (this=<optimized out>) at src/mongo/db/repl/bgsync.cpp:603
mongo::repl::BackgroundSync::_runProducer (this=0x564128d25380) at src/mongo/db/repl/bgsync.cpp:263
mongo::repl::BackgroundSync::_run (this=0x564128d25380) at src/mongo/db/repl/bgsync.cpp:218
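Reading the two stacks together suggests an interleaving like this: the RECOVERY thread has already torn down the old catalog (destroying the old oplog Collection) and is parked on the repl coordinator mutex, while ReplCoord-6, holding that mutex, calls through the stale Collection pointer. The standalone sketch below (again with stand-in types, not MongoDB code) replays that interleaving deterministically; built with -fsanitize=address it reports a use-after-free at the notifyCappedWaitersIfNeeded() call.

// Deterministic replay of the interleaving the two stacks suggest; stand-in types.
#include <condition_variable>
#include <mutex>
#include <thread>

struct CappedCallback {
    void notifyCappedWaitersIfNeeded() {}
};

struct Collection {
    CappedCallback* getCappedCallback() { return &_cb; }
    CappedCallback _cb;
};

std::mutex replCoordMutex;             // stands in for ReplicationCoordinatorImpl's mutex
Collection* oplog = new Collection();  // stands in for LocalOplogInfo's cached pointer

std::mutex seqMutex;                   // test-only sequencing to force the interleaving
std::condition_variable seqCv;
bool catalogClosed = false;

int main() {
    std::thread heartbeat([] {
        // _advanceCommitPoint runs with the repl coordinator mutex held.
        std::lock_guard<std::mutex> lk(replCoordMutex);
        // Wait until the recovery thread has destroyed the old Collection.
        std::unique_lock<std::mutex> seqLk(seqMutex);
        seqCv.wait(seqLk, [] { return catalogClosed; });
        // signalOplogWaiters: dereferences the stale pointer -> use-after-free.
        oplog->getCappedCallback()->notifyCappedWaitersIfNeeded();
    });

    std::thread recovery([] {
        // loadCatalog tears down the old catalog, including the oplog Collection...
        {
            std::lock_guard<std::mutex> seqLk(seqMutex);
            delete oplog;  // the cached pointer now dangles
            catalogClosed = true;
        }
        seqCv.notify_one();
        // ...then getMemberState blocks on the mutex the heartbeat thread holds.
        std::lock_guard<std::mutex> lk(replCoordMutex);
    });

    heartbeat.join();
    recovery.join();
}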
The source for signalOplogWaiters is:
void signalOplogWaiters() {
    const auto& oplog = LocalOplogInfo::get(getGlobalServiceContext())->getCollection();
    if (oplog) {
        oplog->getCappedCallback()->notifyCappedWaitersIfNeeded();  // <- segfault
    }
}
From its disassembly and the exact crash location in GDB, I think it crashes in the call to notifyCappedWaitersIfNeeded(): perhaps getCappedCallback() returns an invalid pointer, or a pointer to an object that is invalidated after it returns. (I'm not fluent in assembly.)
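For what it's worth, one way to make this kind of access safe regardless of which lock the caller holds is to hand out the cached collection as a std::shared_ptr guarded by a mutex, so a waiter-signaling thread pins the object for the duration of the call. This is only an illustrative sketch with stand-in types (OplogInfo here is a hypothetical analogue of LocalOplogInfo), not the actual fix:

// Illustrative hardening sketch, not the actual fix; stand-in types throughout.
#include <memory>
#include <mutex>

struct CappedCallback {
    void notifyCappedWaitersIfNeeded() {}
};

struct Collection {
    CappedCallback* getCappedCallback() { return &_cb; }
    CappedCallback _cb;
};

class OplogInfo {  // hypothetical analogue of LocalOplogInfo
public:
    std::shared_ptr<Collection> getCollection() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _oplog;  // the copy pins the object for the caller
    }
    void setCollection(std::shared_ptr<Collection> coll) {
        std::lock_guard<std::mutex> lk(_mutex);
        _oplog = std::move(coll);  // old Collection freed only after all readers drop it
    }

private:
    mutable std::mutex _mutex;
    std::shared_ptr<Collection> _oplog;
};

void signalWaiters(OplogInfo& info) {
    if (auto oplog = info.getCollection()) {
        oplog->getCappedCallback()->notifyCappedWaitersIfNeeded();  // safe: oplog is pinned
    }
}

With this scheme a signaling thread might still notify waiters on a collection that is about to be replaced, which is harmless; what it can no longer do is call through freed memory.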