  1. Core Server
  2. SERVER-93126

FTDC collection can block on ReplicationCoordinator mutex

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Replication

      Part of the server FTDC gathering flow returns the "latestOptime" as part of the "oplog" statistics.

      The code for that is here: https://github.com/10gen/mongo/blob/7577170e2018c7d4ebbbae318b86935f945761da/src/mongo/db/repl/replication_info.cpp#L295

      getMyLastAppliedOpTime takes a mutex: 

      OpTime ReplicationCoordinatorImpl::getMyLastAppliedOpTime() const {
          stdx::lock_guard<Latch> lock(_mutex);
          return _getMyLastAppliedOpTime_inlock();
      }

      Acquiring this mutex can take an arbitrary amount of time. I unintentionally reproduced a 20-second stall in this section of code on 5.0.26 while trying to reproduce a separate FTDC stall issue, SERVER-93120. Given that the lastAppliedOpTime data is being collected for statistical purposes, I am curious whether it can be read outside of a lock, or by using atomics.
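
As a rough illustration of the "using atomics" idea, here is a minimal sketch of a sequence-counter (seqlock-style) cache that a statistics reader could consult without taking the coordinator mutex. It is not MongoDB code: `OpTimeSnapshot` and `LastAppliedCache` are hypothetical names, the two-field layout (timestamp plus term) is an assumption about what needs to be published, and the sketch assumes writers are already serialized, for example because they update the cache while holding the existing mutex. All atomic operations use the default sequentially consistent ordering for simplicity.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the value FTDC wants to report.
struct OpTimeSnapshot {
    std::uint64_t timestamp;
    std::int64_t term;
};

// Hypothetical cache: one serialized writer path, any number of readers.
class LastAppliedCache {
public:
    // Assumption: callers are serialized externally (e.g. they already
    // hold the coordinator mutex), so there is never more than one
    // store() in flight.
    void store(OpTimeSnapshot value) {
        const std::uint64_t seq = _seq.load();
        _seq.store(seq + 1);  // odd: a write is in progress
        _timestamp.store(value.timestamp);
        _term.store(value.term);
        _seq.store(seq + 2);  // even: the write is complete
    }

    // Never takes a mutex. Retries if a write overlapped the read, so the
    // returned timestamp and term always come from the same store().
    OpTimeSnapshot load() const {
        for (;;) {
            const std::uint64_t before = _seq.load();
            if (before & 1) {
                continue;  // writer is mid-update; try again
            }
            OpTimeSnapshot value{_timestamp.load(), _term.load()};
            if (_seq.load() == before) {
                return value;  // no write happened during the read
            }
        }
    }

private:
    std::atomic<std::uint64_t> _seq{0};
    std::atomic<std::uint64_t> _timestamp{0};
    std::atomic<std::int64_t> _term{0};
};
```

With something along these lines, the stats path would call `load()` instead of `getMyLastAppliedOpTime()`, at the cost of possibly reporting a value that is slightly behind the one protected by the mutex. A reader can still spin briefly while a single `store()` is in progress, but it no longer waits behind unrelated work done under the mutex.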

            Assignee:
            Unassigned
            Reporter:
            Luke Pearson (luke.pearson@mongodb.com)
            Votes:
            0
            Watchers:
            12
