Core Server / SERVER-54374

Race between signalOplogWaiters and StorageEngine::loadCatalog

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 5.3.0, 5.0.7
    • Affects Version/s: None
    • Component/s: Replication
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Requested: v5.0
    • Sprint: Replication 2021-12-27, Replication 2022-01-10, Replication 2022-01-24, Replication 2022-02-07
    • 174

      mongod can very rarely segfault in signalOplogWaiters(), apparently due to a race with StorageEngine::loadCatalog(). I suspect signalOplogWaiters() is being called without the Global IS lock, so its access to the LocalOplogInfo is unsafe. The crash reproduces only in a virtualized environment that fuzzes network events; it has not been observed in production or in Evergreen. It happens toward the end of rollback.
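
      As a rough illustration of the suspected interaction, here is a minimal, self-contained C++ sketch (not MongoDB code; CollectionLike and OplogInfoLike are invented stand-ins for Collection and LocalOplogInfo): one thread tears down and reinstalls the collection object during a catalog reload while another thread reads the cached pointer with no synchronization, so the reader can end up dereferencing a dangling pointer.

      // Hypothetical model of the suspected race; names like CollectionLike and
      // OplogInfoLike are invented for illustration and are not MongoDB types.
      #include <iostream>
      #include <memory>
      #include <mutex>

      struct CollectionLike {
          void notifyCappedWaitersIfNeeded() { std::cout << "notify capped waiters\n"; }
      };

      struct OplogInfoLike {
          std::mutex mutex;
          std::unique_ptr<CollectionLike> collection;

          // Roughly what a catalog reload does to each collection: the old object
          // is destroyed and a new one installed, so any raw pointer handed out
          // earlier now dangles.
          void reloadCatalog() {
              std::lock_guard<std::mutex> lk(mutex);
              collection = std::make_unique<CollectionLike>();
          }

          // Shape of the suspected bug: the cached pointer is read with no lock,
          // so it can be observed mid-reload and point at a freed object.
          void signalWaitersUnsafeShape() {
              CollectionLike* raw = collection.get();  // unsynchronized read
              if (raw) {
                  raw->notifyCappedWaitersIfNeeded();  // may touch freed memory if racing
              }
          }
      };

      int main() {
          OplogInfoLike info;
          info.reloadCatalog();
          // Safe here only because it runs single-threaded; in mongod the two
          // calls above race on different threads.
          info.signalWaitersUnsafeShape();
      }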

      In an example crash, the ReplCoord-6 thread segfaults with this backtrace:

      Thread 1 (Thread 0x7f51cba6e700 (LWP 91)):
      mongo::repl::signalOplogWaiters () at src/mongo/db/repl/oplog.cpp:1828
      mongo::repl::ReplicationCoordinatorExternalStateImpl::notifyOplogMetadataWaiters (this=0x564124103600, committedOpTime=...) at src/mongo/db/repl/replication_coordinator_external_state_impl.cpp:995
      mongo::repl::ReplicationCoordinatorImpl::_advanceCommitPoint (this=0x56412417b800, lk=..., committedOpTimeAndWallTime=..., fromSyncSource=<optimized out>) at src/mongo/db/repl/replication_coordinator_impl.cpp:5060
      mongo::repl::ReplicationCoordinatorImpl::_handleHeartbeatResponse (this=<optimized out>, cbData=...) at src/mongo/db/repl/replication_coordinator_impl_heartbeat.cpp:237
      std::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)>::operator()(mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&) const (this=0x5641299a69a0, __args=...) at /opt/mongodbtoolchain/revisions/96f92f5fe92913d80957401dc8800c2175a5d728/stow/gcc-v3.xOh/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:687
      mongo::executor::TaskExecutor::scheduleRemoteCommand(mongo::executor::RemoteCommandRequestImpl<mongo::HostAndPort> const&, std::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)> const&, std::shared_ptr<mongo::Baton> const&)::$_4::operator()(mongo::executor::TaskExecutor::RemoteCommandOnAnyCallbackArgs const&) const (this=0x5641299a69a0, args=...) at src/mongo/executor/task_executor.cpp:244
      std::_Function_handler<void (mongo::executor::TaskExecutor::RemoteCommandOnAnyCallbackArgs const&), mongo::executor::TaskExecutor::scheduleRemoteCommand(mongo::executor::RemoteCommandRequestImpl<mongo::HostAndPort> const&, std::function<void (mongo::executor::TaskExecutor::RemoteCommandCallbackArgs const&)> const&, std::shared_ptr<mongo::Baton> const&)::$_4>::_M_invoke(std::_Any_data const&, mongo::executor::TaskExecutor::RemoteCommandOnAnyCallbackArgs const&) (__functor=..., __args=...) at /opt/mongodbtoolchain/revisions/96f92f5fe92913d80957401dc8800c2175a5d728/stow/gcc-v3.xOh/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:297
      

      Meanwhile the "RECOVERY" thread is in loadCatalog:

      Thread 68 (Thread 0x7f51ce374700 (LWP 83)):
      __lll_timedwait_tid () from /lib/x86_64-linux-gnu/libpthread.so.0
      ?? ()
      ?? ()
      ?? ()
      ?? ()
      .L__sancov_gen_.38 () from /home/emptysquare/08b70f65-vanilla/install/lib/libbase.so
      ?? ()
      ?? ()
      ?? ()
      .L__sancov_gen_.38 () from /home/emptysquare/08b70f65-vanilla/install/lib/libbase.so
      mongo::latch_detail::Mutex::_onContendedLock (this=0x56412417b820) at src/mongo/platform/mutex.cpp:106
      std::lock_guard<mongo::latch_detail::Latch>::lock_guard (this=<optimized out>, __m=...) at /opt/mongodbtoolchain/revisions/96f92f5fe92913d80957401dc8800c2175a5d728/stow/gcc-v3.xOh/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_mutex.h:162
      mongo::repl::ReplicationCoordinatorImpl::getMemberState (this=0x7f51ce370e30) at src/mongo/db/repl/replication_coordinator_impl.cpp:1006
      mongo::WiredTigerKVEngine::getGroupedRecordStore (this=0x7f5204392d5a <mongo::latch_detail::Mutex::lock()+74>, opCtx=<optimized out>, ns=..., ident="collection-4--4508029134355656785", options=..., prefix=...) at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp:1502
      mongo::StorageEngineImpl::_initCollection (this=0x564124199c00, opCtx=0x2, catalogId=..., nss=..., forRepair=false, minVisibleTs=...) at src/mongo/db/storage/storage_engine_impl.cpp:322
      mongo::StorageEngineImpl::loadCatalog (this=0x564124199c00, opCtx=0x5641299ae500, loadingFromUncleanShutdown=false) at src/mongo/db/storage/storage_engine_impl.cpp:288
      mongo::catalog::openCatalog (opCtx=0x80, minVisibleTimestampMap=std::map with 22 elements = {...}, stableTimestamp=...) at src/mongo/db/catalog/catalog_control.cpp:122
      mongo::StorageEngineImpl::recoverToStableTimestamp (this=0x564124199c00, opCtx=0x5641299ae500) at src/mongo/db/storage/storage_engine_impl.cpp:1008
      non-virtual thunk to mongo::StorageEngineImpl::recoverToStableTimestamp(mongo::OperationContext*) () at src/mongo/util/invariant.h:71
      mongo::repl::StorageInterfaceImpl::recoverToStableTimestamp (this=<optimized out>, opCtx=0x5641299ae500) at src/mongo/db/repl/storage_interface_impl.cpp:1315
      mongo::repl::RollbackImpl::_recoverToStableTimestamp (this=<optimized out>, opCtx=<optimized out>) at src/mongo/db/repl/rollback_impl.cpp:1215
      mongo::repl::RollbackImpl::_runPhaseFromAbortToReconstructPreparedTxns (this=0x564129779a00, opCtx=0x5641299ae500, commonPoint=...) at src/mongo/db/repl/rollback_impl.cpp:503
      mongo::repl::RollbackImpl::runRollback (this=0x564129779a00, opCtx=0x5641299ae500) at src/mongo/db/repl/rollback_impl.cpp:239
      mongo::repl::BackgroundSync::_runRollbackViaRecoverToCheckpoint(mongo::OperationContext*, mongo::HostAndPort const&, mongo::repl::OplogInterface*, mongo::repl::StorageInterface*, std::function<mongo::DBClientBase* ()>) (this=0x564128d25380, opCtx=0x5641299ae500, source=..., localOplog=0x7f51ce372690, storageInterface=0x5641241654e0, getConnection=...) at src/mongo/db/repl/bgsync.cpp:826
      mongo::repl::BackgroundSync::_runRollback (this=0x564128d25380, opCtx=0x5641299ae500, fetcherReturnStatus=..., source=..., requiredRBID=2, storageInterface=0x5641241654e0) at src/mongo/db/repl/bgsync.cpp:788
      mongo::repl::BackgroundSync::_produce (this=<optimized out>) at src/mongo/db/repl/bgsync.cpp:603
      mongo::repl::BackgroundSync::_runProducer (this=0x564128d25380) at src/mongo/db/repl/bgsync.cpp:263
      mongo::repl::BackgroundSync::_run (this=0x564128d25380) at src/mongo/db/repl/bgsync.cpp:218
      

      The source for signalOplogWaiters is:

      void signalOplogWaiters() {
          const auto& oplog = LocalOplogInfo::get(getGlobalServiceContext())->getCollection();
          if (oplog) {
              oplog->getCappedCallback()->notifyCappedWaitersIfNeeded(); // <- segfault
          }
      }
      

      From the disassembly and the exact crash location in GDB, I believe the crash happens while calling notifyCappedWaitersIfNeeded(): the return value of getCappedCallback() may be invalid, or may refer to an object that is invalidated after it returns. (I'm not fluent in assembly.)
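
      One hedged sketch of the general remedy, assuming the problem really is this lifetime race (illustrative only, with invented names; not necessarily the fix that landed for this ticket): have the reader take a strong reference to the collection under the same mutex the catalog reload uses, so neither the collection nor the object returned by getCappedCallback() can be destroyed between the null check and the notify call.

      // Illustrative only; invented names, not the actual MongoDB change.
      #include <iostream>
      #include <memory>
      #include <mutex>

      struct CappedCallbackLike {
          void notifyCappedWaitersIfNeeded() { std::cout << "notify capped waiters\n"; }
      };

      struct CollectionLike {
          CappedCallbackLike callback;
          CappedCallbackLike* getCappedCallback() { return &callback; }
      };

      class OplogInfoLike {
      public:
          void setCollection(std::shared_ptr<CollectionLike> coll) {
              std::lock_guard<std::mutex> lk(_mutex);
              _collection = std::move(coll);
          }

          // Copying the shared_ptr under the mutex gives the caller a strong
          // reference, so a concurrent catalog reload cannot free the object
          // while the caller is still using it.
          std::shared_ptr<CollectionLike> getCollection() const {
              std::lock_guard<std::mutex> lk(_mutex);
              return _collection;
          }

      private:
          mutable std::mutex _mutex;
          std::shared_ptr<CollectionLike> _collection;
      };

      void signalOplogWaitersSafeShape(OplogInfoLike& info) {
          if (auto oplog = info.getCollection()) {
              // The callback pointer stays valid for the whole call because
              // `oplog` keeps the collection alive.
              oplog->getCappedCallback()->notifyCappedWaitersIfNeeded();
          }
      }

      int main() {
          OplogInfoLike info;
          info.setCollection(std::make_shared<CollectionLike>());
          signalOplogWaitersSafeShape(info);
      }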

            Assignee: Vesselina Ratcheva (Inactive) (vesselina.ratcheva@mongodb.com)
            Reporter: A. Jesse Jiryu Davis (jesse@mongodb.com)
            Votes: 0
            Watchers: 13
