[SERVER-54374] Race between signalOplogWaiters and StorageEngine::loadCatalog Created: 07/Feb/21  Updated: 29/Oct/23  Resolved: 27/Jan/22

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 5.3.0, 5.0.7

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: Vesselina Ratcheva (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Sprint: Replication 2021-12-27, Replication 2022-01-10, Replication 2022-01-24, Replication 2022-02-07
Participants:
Linked BF Score: 174

 Description   

mongod can very rarely segfault in signalOplogWaiters(), apparently due to a race with StorageEngine::loadCatalog(). I suspect signalOplogWaiters() is being called without the Global IS lock held, so its access to the LocalOplogInfo is unsafe. So far the crash reproduces only in a virtualized environment that fuzzes network events; it has not been observed in production or in Evergreen. It happens toward the end of rollback.

In an example crash, the ReplCoord-6 thread segfaults with this backtrace:

Thread 1 (Thread 0x7f51cba6e700 (LWP 91)):
mongo::repl::signalOplogWaiters () at src/mongo/db/repl/oplog.cpp:1828
mongo::repl::ReplicationCoordinatorExternalStateImpl::notifyOplogMetadataWaiters (this=0x564124103600, committedOpTime=...) at src/mongo/db/repl/replication_coordinator_external_state_impl.cpp:995
mongo::repl::ReplicationCoordinatorImpl::_advanceCommitPoint (this=0x56412417b800, lk=..., committedOpTimeAndWallTime=..., fromSyncSource=<optimized out>) at src/mongo/db/repl/replication_coordinator_impl.cpp:5060
mongo::repl::ReplicationCoordinatorImpl::_handleHeartbeatResponse (this=<optimized out>, cbData=...) at src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp:237
std::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)>::operator()(mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&) const (this=0x5641299a69a0, __args=...) at /opt/mongodbtoolchain/revisions/96f92f5fe92913d80957401dc8800c2175a5d728/stow/gcc-v3.xOh/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:687
mongo::executor::TaskExecutor::scheduleRemoteCommand(mongo::executor::RemoteCommandRequestImpl<mongo::HostAndPort> const&, std::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)> const&, std::shared_ptr<mongo::Baton> const&)::$_4::operator()(mongo::executor::TaskExecutor::RemoteCommandOnAnyCallbackArgs const&) const (this=0x5641299a69a0, args=...) at src/mongo/executor/task_executor.cpp:244
std::_Function_handler<void (mongo::executor::TaskExecutor::RemoteCommandOnAnyCallbackArgs const&), mongo::executor::TaskExecutor::scheduleRemoteCommand(mongo::executor::RemoteCommandRequestImpl<mongo::HostAndPort> const&, std::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)> const&, std::shared_ptr<mongo::Baton> const&)::$_4>::_M_invoke(std::_Any_data const&, mongo::executor::TaskExecutor::RemoteCommandOnAnyCallbackArgs const&) (__functor=..., __args=...) at /opt/mongodbtoolchain/revisions/96f92f5fe92913d80957401dc8800c2175a5d728/stow/gcc-v3.xOh/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:297

Meanwhile the "RECOVERY" thread is in loadCatalog:

Thread 68 (Thread 0x7f51ce374700 (LWP 83)):
__lll_timedwait_tid () from /lib/x86_64-linux-gnu/libpthread.so.0
?? ()
?? ()
?? ()
?? ()
.L__sancov_gen_.38 () from /home/emptysquare/08b70f65-vanilla/install/lib/libbase.so
?? ()
?? ()
?? ()
.L__sancov_gen_.38 () from /home/emptysquare/08b70f65-vanilla/install/lib/libbase.so
mongo::latch_detail::Mutex::_onContendedLock (this=0x56412417b820) at src/mongo/platform/mutex.cpp:106
std::lock_guard<mongo::latch_detail::Latch>::lock_guard (this=<optimized out>, __m=...) at /opt/mongodbtoolchain/revisions/96f92f5fe92913d80957401dc8800c2175a5d728/stow/gcc-v3.xOh/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_mutex.h:162
mongo::repl::ReplicationCoordinatorImpl::getMemberState (this=0x7f51ce370e30) at src/mongo/db/repl/replication_coordinator_impl.cpp:1006
mongo::WiredTigerKVEngine::getGroupedRecordStore (this=0x7f5204392d5a <mongo::latch_detail::Mutex::lock()+74>, opCtx=<optimized out>, ns=..., ident="collection-4--4508029134355656785", options=..., prefix=...) at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp:1502
mongo::StorageEngineImpl::_initCollection (this=0x564124199c00, opCtx=0x2, catalogId=..., nss=..., forRepair=false, minVisibleTs=...) at src/mongo/db/storage/storage_engine_impl.cpp:322
mongo::StorageEngineImpl::loadCatalog (this=0x564124199c00, opCtx=0x5641299ae500, loadingFromUncleanShutdown=false) at src/mongo/db/storage/storage_engine_impl.cpp:288
mongo::catalog::openCatalog (opCtx=0x80, minVisibleTimestampMap=std::map with 22 elements = {...}, stableTimestamp=...) at src/mongo/db/catalog/catalog_control.cpp:122
mongo::StorageEngineImpl::recoverToStableTimestamp (this=0x564124199c00, opCtx=0x5641299ae500) at src/mongo/db/storage/storage_engine_impl.cpp:1008
non-virtual thunk to mongo::StorageEngineImpl::recoverToStableTimestamp(mongo::OperationContext*) () at src/mongo/util/invariant.h:71
mongo::repl::StorageInterfaceImpl::recoverToStableTimestamp (this=<optimized out>, opCtx=0x5641299ae500) at src/mongo/db/repl/storage_interface_impl.cpp:1315
mongo::repl::RollbackImpl::_recoverToStableTimestamp (this=<optimized out>, opCtx=<optimized out>) at src/mongo/db/repl/rollback_impl.cpp:1215
mongo::repl::RollbackImpl::_runPhaseFromAbortToReconstructPreparedTxns (this=0x564129779a00, opCtx=0x5641299ae500, commonPoint=...) at src/mongo/db/repl/rollback_impl.cpp:503
mongo::repl::RollbackImpl::runRollback (this=0x564129779a00, opCtx=0x5641299ae500) at src/mongo/db/repl/rollback_impl.cpp:239
mongo::repl::BackgroundSync::_runRollbackViaRecoverToCheckpoint(mongo::OperationContext*, mongo::HostAndPort const&, mongo::repl::OplogInterface*, mongo::repl::StorageInterface*, std::function<mongo::DBClientBase* ()>) (this=0x564128d25380, opCtx=0x5641299ae500, source=..., localOplog=0x7f51ce372690, storageInterface=0x5641241654e0, getConnection=...) at src/mongo/db/repl/bgsync.cpp:826
mongo::repl::BackgroundSync::_runRollback (this=0x564128d25380, opCtx=0x5641299ae500, fetcherReturnStatus=..., source=..., requiredRBID=2, storageInterface=0x5641241654e0) at src/mongo/db/repl/bgsync.cpp:788
mongo::repl::BackgroundSync::_produce (this=<optimized out>) at src/mongo/db/repl/bgsync.cpp:603
mongo::repl::BackgroundSync::_runProducer (this=0x564128d25380) at src/mongo/db/repl/bgsync.cpp:263
mongo::repl::BackgroundSync::_run (this=0x564128d25380) at src/mongo/db/repl/bgsync.cpp:218

The source for signalOplogWaiters is:

void signalOplogWaiters() {
    const auto& oplog = LocalOplogInfo::get(getGlobalServiceContext())->getCollection();
    if (oplog) {
        oplog->getCappedCallback()->notifyCappedWaitersIfNeeded(); // <- segfault
    }
}

By reading the disassembly and seeing exactly where it crashed in GDB, I think the segfault happens while calling notifyCappedWaitersIfNeeded(): perhaps the return value of getCappedCallback() is invalid, or it refers to an object that was invalidated after getCappedCallback() returned. (I'm not fluent in assembly.)



 Comments   
Comment by Githook User [ 03/Mar/22 ]

Author:

{'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com', 'username': 'vessy-mongodb'}

Message: SERVER-54374 Do not attempt to update commit point via heartbeat propagation while in rollback state

(cherry picked from commit 7ebcd89a4f9590b0e17ad36588f6ef7c5a20a291)
Branch: v5.0
https://github.com/mongodb/mongo/commit/7b4102c1d7794b714a038bf52cfdf3165e18d619

Comment by Vishnu Kaushik [ 07/Feb/22 ]

Should this be backported to 5.2 and 5.0 as well? The linked BF-22802 is on 5.0, and recently I worked on 5.2's BF-24227.

Comment by Githook User [ 27/Jan/22 ]

Author:

{'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com', 'username': 'vessy-mongodb'}

Message: SERVER-54374 Do not attempt to update commit point via heartbeat propagation while in rollback state
Branch: master
https://github.com/mongodb/mongo/commit/7ebcd89a4f9590b0e17ad36588f6ef7c5a20a291

Generated at Thu Feb 08 05:33:22 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.