Type: Task
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: Replication
Overview
Following investigation in HELP-91789, we identified that SessionCatalog::_mutex (defined in session_catalog.h:285) is a single global stdx::mutex protecting the _sessions map. This mutex is on the hot path for every CRUD operation that runs inside a transaction or a retryable write, and several code paths hold it for potentially long durations, which can add latency under high session concurrency.
Hot-Path Impact
Every command that carries a txnNumber (transactions and retryable writes) acquires _mutex twice per operation — once on session checkout and once on session release:
ServiceEntryPointShardRole::handleRequest()
  -> CheckoutSessionAndInvokeCommand::run()
    -> _checkOutSession()
      -> MongoDSessionCatalog::checkOutSession(opCtx)
        -> OperationContextSession::checkOut()
          -> SessionCatalog::_checkOutSessionInner()
             ** ACQUIRES _mutex: waits on condvar, updates session state **
    -> runCommandInvocation() [execute the CRUD operation]
    -> ~ScopedCheckedOutSession dtor
      -> SessionCatalog::_releaseSession()
         ** RE-ACQUIRES _mutex: eagerly reaps child sessions, notifies waiters **
This means any contention on _mutex — from concurrent checkouts, session reaping, or session scanning — directly adds latency to user-facing CRUD operations.
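The checkout/release pattern above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToySessionCatalog, ScopedToyCheckout, and runToyOperation are hypothetical names and do not mirror the real SessionCatalog implementation. It shows only that one operation takes the single global mutex twice, once on checkout and once on release.

```cpp
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>

// Toy model only: hypothetical names, not MongoDB's SessionCatalog.
class ToySessionCatalog {
public:
    // Checkout: take the global mutex, wait until the session is free, mark it busy.
    void checkOut(std::uint64_t sessionId) {
        std::unique_lock<std::mutex> lk(_mutex);
        ++_lockAcquisitions;
        Entry& entry = _sessions[sessionId];  // std::map references stay valid across inserts
        _available.wait(lk, [&entry] { return !entry.checkedOut; });
        entry.checkedOut = true;
    }

    // Release: take the global mutex again, mark the session free, wake waiters.
    void release(std::uint64_t sessionId) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            ++_lockAcquisitions;
            _sessions[sessionId].checkedOut = false;
        }
        _available.notify_all();
    }

    // Number of explicit lock acquisitions made by checkOut() and release().
    int lockAcquisitions() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _lockAcquisitions;
    }

private:
    struct Entry {
        bool checkedOut = false;
    };

    std::mutex _mutex;  // single global mutex, as in the ticket's description
    std::condition_variable _available;
    std::map<std::uint64_t, Entry> _sessions;
    int _lockAcquisitions = 0;
};

// RAII wrapper: checks out in the constructor, releases in the destructor.
class ScopedToyCheckout {
public:
    ScopedToyCheckout(ToySessionCatalog& catalog, std::uint64_t sessionId)
        : _catalog(catalog), _sessionId(sessionId) {
        _catalog.checkOut(_sessionId);
    }
    ~ScopedToyCheckout() {
        _catalog.release(_sessionId);
    }
    ScopedToyCheckout(const ScopedToyCheckout&) = delete;
    ScopedToyCheckout& operator=(const ScopedToyCheckout&) = delete;

private:
    ToySessionCatalog& _catalog;
    std::uint64_t _sessionId;
};

// One "operation": checkout, placeholder work, release on scope exit.
int runToyOperation(ToySessionCatalog& catalog, std::uint64_t sessionId) {
    ScopedToyCheckout scoped(catalog, sessionId);
    return 1;  // stand-in for the CRUD work
}
```

In this model, any other thread holding _mutex for a long time delays both the start and the end of every operation.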
Known Concerning Critical Section
During investigation of HELP-91789, scanSessions() was identified as a source of contention. It iterates over all parent sessions and their child sessions while holding the lock (line 201), executing a user-provided workerFn callback for each match. With a large number of sessions, the mutex stays held for the entire iteration, so concurrent session checkouts and releases block until the scan finishes. In the HELP ticket, this was triggered by currentOp being called frequently by Atlas observability tools (Real Time Performance Panel, Performance Advisor, Schema Advisor, Query Insights), which caused repeated full scans of the session map under the lock. The same pattern applies to scanParentSessions().
All other code paths that acquire _mutex should be audited for similar hold-time concerns.
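To make the hold-time concern concrete, the sketch below contrasts running a callback under the lock with copying a snapshot under the lock and running the callback afterwards. It is a toy illustration, not a proposed patch: ToyCatalog, ToySession, scanUnderLock, and scanSnapshot are hypothetical names, and the snapshot variant assumes session state is cheap to copy and that the callback tolerates slightly stale data. Neither assumption is confirmed for the real SessionCatalog.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, copyable stand-in for per-session state.
struct ToySession {
    std::uint64_t id = 0;
    std::string state;
};

class ToyCatalog {
public:
    using WorkerFn = std::function<void(const ToySession&)>;

    void add(ToySession session) {
        std::lock_guard<std::mutex> lk(_mutex);
        const std::uint64_t id = session.id;
        _sessions[id] = std::move(session);
    }

    std::size_t size() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _sessions.size();
    }

    // Pattern described in the ticket: the callback runs for every session
    // while the mutex is held, so hold time covers the whole iteration.
    void scanUnderLock(const WorkerFn& workerFn) {
        std::lock_guard<std::mutex> lk(_mutex);
        for (const auto& kv : _sessions) {
            workerFn(kv.second);
        }
    }

    // Possible alternative (assumption): hold the mutex only long enough to
    // copy the sessions, then run the callback without the lock.
    void scanSnapshot(const WorkerFn& workerFn) {
        std::vector<ToySession> snapshot;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            snapshot.reserve(_sessions.size());
            for (const auto& kv : _sessions) {
                snapshot.push_back(kv.second);
            }
        }
        for (const auto& session : snapshot) {
            workerFn(session);  // mutex is not held here
        }
    }

private:
    std::mutex _mutex;
    std::map<std::uint64_t, ToySession> _sessions;
};
```

Because scanSnapshot releases the mutex before invoking the callback, the callback may call back into the catalog (for example size()), which would self-deadlock under scanUnderLock. The trade-off is an extra copy and a view of the sessions that can be out of date by the time the callback runs.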
Acceptance Criteria
- Audit all code paths that acquire SessionCatalog::_mutex and determine worst-case hold times under realistic production workloads (high session count, retryable writes with many child sessions)
- Reduce lock scope where possible while preserving correctness of session checkout, kill, and reap semantics
is related to
SERVER-124275 Add ObservableMutex observability to SessionCatalog::_mutex (Needs Scheduling)
related to
SERVER-123831 Reduce FTDC gaps due to LogicalSessionServerStatusSection (Open)