Core Server / SERVER-54374

Race between signalOplogWaiters and StorageEngine::loadCatalog

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 5.3.0, 5.0.7
    • Affects Version/s: None
    • Component/s: Replication
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Requested: v5.0
    • Sprint: Replication 2021-12-27, Replication 2022-01-10, Replication 2022-01-24, Replication 2022-02-07
    • 174

      mongod can very rarely segfault in signalOplogWaiters(), apparently due to a race with StorageEngine::loadCatalog(). I suspect signalOplogWaiters() is being called without the Global IS lock, so its access to the LocalOplogInfo is unsafe. The crash reproduces only in a virtualized environment that fuzzes network events; it has not been observed in production or in Evergreen. It happens toward the end of rollback.
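
      As a rough illustration of the suspected interaction, here is a minimal, self-contained C++ sketch (not MongoDB code; CollectionLike and OplogInfoLike are invented stand-ins for Collection and LocalOplogInfo): one thread tears down and reinstalls the collection object during a catalog reload while another thread reads the cached pointer with no synchronization, so the reader can end up dereferencing a dangling pointer.

      // Hypothetical model of the suspected race; names like CollectionLike and
      // OplogInfoLike are invented for illustration and are not MongoDB types.
      #include <iostream>
      #include <memory>
      #include <mutex>

      struct CollectionLike {
          void notifyCappedWaitersIfNeeded() { std::cout << "notify capped waiters\n"; }
      };

      struct OplogInfoLike {
          std::mutex mutex;
          std::unique_ptr<CollectionLike> collection;

          // Roughly what a catalog reload does to each collection: the old object
          // is destroyed and a new one installed, so any raw pointer handed out
          // earlier now dangles.
          void reloadCatalog() {
              std::lock_guard<std::mutex> lk(mutex);
              collection = std::make_unique<CollectionLike>();
          }

          // Shape of the suspected bug: the cached pointer is read with no lock,
          // so it can be observed mid-reload and point at a freed object.
          void signalWaitersUnsafeShape() {
              CollectionLike* raw = collection.get();  // unsynchronized read
              if (raw) {
                  raw->notifyCappedWaitersIfNeeded();  // may touch freed memory if racing
              }
          }
      };

      int main() {
          OplogInfoLike info;
          info.reloadCatalog();
          // Safe here only because it runs single-threaded; in mongod the two
          // calls above race on different threads.
          info.signalWaitersUnsafeShape();
      }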

      In an example crash, the ReplCoord-6 thread segfaults with this backtrace:

      Thread 1 (Thread 0x7f51cba6e700 (LWP 91)):
      mongo::repl::signalOplogWaiters () at src/mongo/db/repl/oplog.cpp:1828
      mongo::repl::ReplicationCoordinatorExternalStateImpl::notifyOplogMetadataWaiters (this=0x564124103600, committedOpTime=...) at src/mongo/db/repl/replication_coordinator_external_state_impl.cpp:995
      mongo::repl::ReplicationCoordinatorImpl::_advanceCommitPoint (this=0x56412417b800, lk=..., committedOpTimeAndWallTime=..., fromSyncSource=<optimized out>) at src/mongo/db/repl/replication_coordinator_impl.cpp:5060
      mongo::repl::ReplicationCoordinatorImpl::_handleHeartbeatResponse (this=<optimized out>, cbData=...) at src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp:237
      std::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)>::operator()(mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&) const (this=0x5641299a69a0, __args=...) at /opt/mongodbtoolchain/revisions/96f92f5fe92913d80957401dc8800c2175a5d728/stow/gcc-v3.xOh/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:687
      mongo::executor::TaskExecutor::scheduleRemoteCommand(mongo::executor::RemoteCommandRequestImpl<mongo::HostAndPort> const&, std::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)> const&, std::shared_ptr<mongo::Baton> const&)::$_4::operator()(mongo::executor::TaskExecutor::RemoteCommandOnAnyCallbackArgs const&) const (this=0x5641299a69a0, args=...) at src/mongo/executor/task_executor.cpp:244
      std::_Function_handler<void (mongo::executor::TaskExecutor::RemoteCommandOnAnyCallbackArgs const&), mongo::executor::TaskExecutor::scheduleRemoteCommand(mongo::executor::RemoteCommandRequestImpl<mongo::HostAndPort> const&, std::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)> const&, std::shared_ptr<mongo::Baton> const&)::$_4>::_M_invoke(std::_Any_data const&, mongo::executor::TaskExecutor::RemoteCommandOnAnyCallbackArgs const&) (__functor=..., __args=...) at /opt/mongodbtoolchain/revisions/96f92f5fe92913d80957401dc8800c2175a5d728/stow/gcc-v3.xOh/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:297
      

      Meanwhile the "RECOVERY" thread is in loadCatalog:

      Thread 68 (Thread 0x7f51ce374700 (LWP 83)):
      __lll_timedwait_tid () from /lib/x86_64-linux-gnu/libpthread.so.0
      ?? ()
      ?? ()
      ?? ()
      ?? ()
      .L__sancov_gen_.38 () from /home/emptysquare/08b70f65-vanilla/install/lib/libbase.so
      ?? ()
      ?? ()
      ?? ()
      .L__sancov_gen_.38 () from /home/emptysquare/08b70f65-vanilla/install/lib/libbase.so
      mongo::latch_detail::Mutex::_onContendedLock (this=0x56412417b820) at src/mongo/platform/mutex.cpp:106
      std::lock_guard<mongo::latch_detail::Latch>::lock_guard (this=<optimized out>, __m=...) at /opt/mongodbtoolchain/revisions/96f92f5fe92913d80957401dc8800c2175a5d728/stow/gcc-v3.xOh/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_mutex.h:162
      mongo::repl::ReplicationCoordinatorImpl::getMemberState (this=0x7f51ce370e30) at src/mongo/db/repl/replication_coordinator_impl.cpp:1006
      mongo::WiredTigerKVEngine::getGroupedRecordStore (this=0x7f5204392d5a <mongo::latch_detail::Mutex::lock()+74>, opCtx=<optimized out>, ns=..., ident="collection-4--4508029134355656785", options=..., prefix=...) at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp:1502
      mongo::StorageEngineImpl::_initCollection (this=0x564124199c00, opCtx=0x2, catalogId=..., nss=..., forRepair=false, minVisibleTs=...) at src/mongo/db/storage/storage_engine_impl.cpp:322
      mongo::StorageEngineImpl::loadCatalog (this=0x564124199c00, opCtx=0x5641299ae500, loadingFromUncleanShutdown=false) at src/mongo/db/storage/storage_engine_impl.cpp:288
      mongo::catalog::openCatalog (opCtx=0x80, minVisibleTimestampMap=std::map with 22 elements = {...}, stableTimestamp=...) at src/mongo/db/catalog/catalog_control.cpp:122
      mongo::StorageEngineImpl::recoverToStableTimestamp (this=0x564124199c00, opCtx=0x5641299ae500) at src/mongo/db/storage/storage_engine_impl.cpp:1008
      non-virtual thunk to mongo::StorageEngineImpl::recoverToStableTimestamp(mongo::OperationContext*) () at src/mongo/util/invariant.h:71
      mongo::repl::StorageInterfaceImpl::recoverToStableTimestamp (this=<optimized out>, opCtx=0x5641299ae500) at src/mongo/db/repl/storage_interface_impl.cpp:1315
      mongo::repl::RollbackImpl::_recoverToStableTimestamp (this=<optimized out>, opCtx=<optimized out>) at src/mongo/db/repl/rollback_impl.cpp:1215
      mongo::repl::RollbackImpl::_runPhaseFromAbortToReconstructPreparedTxns (this=0x564129779a00, opCtx=0x5641299ae500, commonPoint=...) at src/mongo/db/repl/rollback_impl.cpp:503
      mongo::repl::RollbackImpl::runRollback (this=0x564129779a00, opCtx=0x5641299ae500) at src/mongo/db/repl/rollback_impl.cpp:239
      mongo::repl::BackgroundSync::_runRollbackViaRecoverToCheckpoint(mongo::OperationContext*, mongo::HostAndPort const&, mongo::repl::OplogInterface*, mongo::repl::StorageInterface*, std::function<mongo::DBClientBase* ()>) (this=0x564128d25380, opCtx=0x5641299ae500, source=..., localOplog=0x7f51ce372690, storageInterface=0x5641241654e0, getConnection=...) at src/mongo/db/repl/bgsync.cpp:826
      mongo::repl::BackgroundSync::_runRollback (this=0x564128d25380, opCtx=0x5641299ae500, fetcherReturnStatus=..., source=..., requiredRBID=2, storageInterface=0x5641241654e0) at src/mongo/db/repl/bgsync.cpp:788
      mongo::repl::BackgroundSync::_produce (this=<optimized out>) at src/mongo/db/repl/bgsync.cpp:603
      mongo::repl::BackgroundSync::_runProducer (this=0x564128d25380) at src/mongo/db/repl/bgsync.cpp:263
      mongo::repl::BackgroundSync::_run (this=0x564128d25380) at src/mongo/db/repl/bgsync.cpp:218
      

      The source for signalOplogWaiters is:

      void signalOplogWaiters() {
          const auto& oplog = LocalOplogInfo::get(getGlobalServiceContext())->getCollection();
          if (oplog) {
              oplog->getCappedCallback()->notifyCappedWaitersIfNeeded(); // <- segfault
          }
      }
      

      From the disassembly and the exact crash location in GDB, I believe the crash happens while calling notifyCappedWaitersIfNeeded(): the return value of getCappedCallback() may be invalid, or may refer to an object that is invalidated after it returns. (I'm not fluent in assembly.)
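
      One hedged sketch of the general remedy, assuming the problem really is this lifetime race (illustrative only, with invented names; not necessarily the fix that landed for this ticket): have the reader take a strong reference to the collection under the same mutex the catalog reload uses, so neither the collection nor the object returned by getCappedCallback() can be destroyed between the null check and the notify call.

      // Illustrative only; invented names, not the actual MongoDB change.
      #include <iostream>
      #include <memory>
      #include <mutex>

      struct CappedCallbackLike {
          void notifyCappedWaitersIfNeeded() { std::cout << "notify capped waiters\n"; }
      };

      struct CollectionLike {
          CappedCallbackLike callback;
          CappedCallbackLike* getCappedCallback() { return &callback; }
      };

      class OplogInfoLike {
      public:
          void setCollection(std::shared_ptr<CollectionLike> coll) {
              std::lock_guard<std::mutex> lk(_mutex);
              _collection = std::move(coll);
          }

          // Copying the shared_ptr under the mutex gives the caller a strong
          // reference, so a concurrent catalog reload cannot free the object
          // while the caller is still using it.
          std::shared_ptr<CollectionLike> getCollection() const {
              std::lock_guard<std::mutex> lk(_mutex);
              return _collection;
          }

      private:
          mutable std::mutex _mutex;
          std::shared_ptr<CollectionLike> _collection;
      };

      void signalOplogWaitersSafeShape(OplogInfoLike& info) {
          if (auto oplog = info.getCollection()) {
              // The callback pointer stays valid for the whole call because
              // `oplog` keeps the collection alive.
              oplog->getCappedCallback()->notifyCappedWaitersIfNeeded();
          }
      }

      int main() {
          OplogInfoLike info;
          info.setCollection(std::make_shared<CollectionLike>());
          signalOplogWaitersSafeShape(info);
      }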

            Assignee: Vesselina Ratcheva (Inactive) (vesselina.ratcheva@mongodb.com)
            Reporter: A. Jesse Jiryu Davis (jesse@mongodb.com)
            Votes: 0
            Watchers: 13
