  1. Core Server
  2. SERVER-34810

Session cache refresh can erroneously kill cursors that are still in use

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6.6, 4.0.1, 4.1.1
    • Component/s: None
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.0
    • Steps To Reproduce:
      This is a race condition that can be reproduced consistently only by instrumenting the server. The following patch makes the server sleep for 20 seconds inside LogicalSessionCacheImpl::_refresh():

      diff --git a/src/mongo/db/logical_session_cache_impl.cpp b/src/mongo/db/logical_session_cache_impl.cpp
      index e0a5e8de31..4c7aa049eb 100644
      --- a/src/mongo/db/logical_session_cache_impl.cpp
      +++ b/src/mongo/db/logical_session_cache_impl.cpp
      @@ -334,6 +334,10 @@ void LogicalSessionCacheImpl::_refresh(Client* client) {
               _stats.setLastSessionsCollectionJobEntriesEnded(explicitlyEndingSessions.size());
           }
       
      +    // We have already refreshed the SessionsCollection, but we haven't yet tried to kill cursors
      +    // for sessions that we don't see inside the SessionsCollection. A new session coming into being
      +    // in this timeframe will not be handled correctly.
      +    sleepsecs(20);
       
           // Find which running, but not recently active sessions, are expired, and add them
           // to the list of sessions to kill cursors for
      

      Start a standalone mongod with --setParameter enableTestCommands=true. From one mongo shell, force a session refresh by running the following:

      db.adminCommand({refreshLogicalSessionCacheNow: 1});
      

      While the server is sleeping inside the session refresh, run the following from a second mongo shell:

      let session = db.getMongo().startSession();
      let db = session.getDatabase("test");
      db.c.drop();
      db.c.insert({});
      db.c.insert({});
      db.c.insert({});
      let cursor = db.c.find().batchSize(2)
      cursor.next()
      

      When the session refresh completes, the cursor will no longer be open. You can observe this by running cursor.itcount() and receiving a CursorNotFound error.

    • Sprint:
      Sharding 2018-05-21, Sharding 2018-06-04, Sharding 2018-06-18, Sharding 2018-07-02, Sharding 2018-07-16
    • Case:
    • Linked BF Score:
      16

      Description

      Session information is stored in the system.sessions collection in the config database. Information about active sessions is cached in the LogicalSessionCache. The cache is periodically refreshed, which both

      1. kills cursors inside sessions that are no longer present in the underlying system.sessions collection, and
      2. flushes new cached session information out to system.sessions.

      Suppose that a cache refresh runs concurrently with a startSession command. A cursor in the new session can be killed while the client is still using it if the session record has not yet been written to the system.sessions collection. The cache refresh code attempts to write new sessions to system.sessions before killing any cursors. However, nothing synchronizes these two steps, so a new session can come into being between the write and the cursor kill. The following sequence can therefore take place:

      1. A cache refresh begins, and active cache entries are written to system.sessions.
      2. A new session is started and enters the LogicalSessionCache. A cursor is opened inside this session.
      3. The refresh code notices that there is a session with a cursor that is not represented in system.sessions. It kills the cursor, even though the cursor is still in use by the client and the session is still alive.
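
      The three steps above can be modeled in a few lines of plain JavaScript. This is only a toy sketch, not server code: the class SessionCacheModel, its fields, and the onAfterFlush hook (which stands in for the window opened by sleepsecs(20) in the reproduction patch) are hypothetical names chosen for illustration.

```javascript
// Toy model of the unsynchronized refresh; all names are hypothetical.
class SessionCacheModel {
  constructor() {
    this.activeSessions = new Set();      // stands in for the LogicalSessionCache
    this.sessionsCollection = new Set();  // stands in for system.sessions
    this.openCursors = new Map();         // cursor id -> owning session id
  }

  startSession(id) {
    this.activeSessions.add(id);
  }

  openCursor(cursorId, sessionId) {
    this.openCursors.set(cursorId, sessionId);
  }

  // onAfterFlush runs in the window between the flush and the cursor kill.
  refresh(onAfterFlush = () => {}) {
    // Step 1: write the sessions known right now to the collection.
    for (const id of this.activeSessions) this.sessionsCollection.add(id);

    onAfterFlush();

    // Step 3: kill cursors whose session is missing from the collection.
    const killed = [];
    for (const [cursorId, sessionId] of this.openCursors) {
      if (!this.sessionsCollection.has(sessionId)) {
        this.openCursors.delete(cursorId);
        killed.push(cursorId);
      }
    }
    return killed;
  }
}

const cache = new SessionCacheModel();
const killed = cache.refresh(() => {
  // Step 2: a client starts a session and opens a cursor mid-refresh.
  cache.startSession("s1");
  cache.openCursor("c1", "s1");
});
console.log(killed); // the live cursor "c1" is reported as killed
```

      In this model the session "s1" is never flushed before the kill pass runs, so its cursor is treated as belonging to a removed session even though the client still holds it.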

      Fix Implementation

      The issue is caused by a race in LogicalSessionCache.
      If LogicalSessionCacheImpl::_addToCache (https://github.com/mongodb/mongo/blob/r4.1.0/src/mongo/db/logical_session_cache_impl.cpp#L392)
      adds a session between https://github.com/mongodb/mongo/blob/r4.1.0/src/mongo/db/logical_session_cache_impl.cpp#L333 and https://github.com/mongodb/mongo/blob/r4.1.0/src/mongo/db/logical_session_cache_impl.cpp#L357, the session is considered removed because it is not yet in the sessions collection, and its cursors are killed.
      To fix this, sessions freshly added to the activeSessions set in _addToCache must carry an attribute that indicates whether they have been synced with the sessions collection. The attribute is initially false and becomes true once refreshSessions has been called.
      findRemovedSessions must then consider only sessions that have this attribute set to true.
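
      The idea of the fix can be sketched in plain JavaScript as follows. This is a simplified illustration under stated assumptions, not the actual patch: FixedSessionCacheModel and the synced field are hypothetical, and the methods only borrow the names _addToCache, refreshSessions, and findRemovedSessions from the description above without reproducing their real signatures.

```javascript
// Toy model of the proposed fix; all names are hypothetical.
class FixedSessionCacheModel {
  constructor() {
    this.activeSessions = new Map();      // session id -> { synced: boolean }
    this.sessionsCollection = new Set();  // stands in for system.sessions
  }

  addToCache(id) {
    // New entries start unsynced: they have not been flushed yet.
    this.activeSessions.set(id, { synced: false });
  }

  refreshSessions() {
    // Flush every cached session and mark it as synced.
    for (const [id, entry] of this.activeSessions) {
      this.sessionsCollection.add(id);
      entry.synced = true;
    }
  }

  // Only sessions that were flushed at least once can count as removed.
  findRemovedSessions() {
    const removed = [];
    for (const [id, entry] of this.activeSessions) {
      if (entry.synced && !this.sessionsCollection.has(id)) removed.push(id);
    }
    return removed;
  }
}

const fixed = new FixedSessionCacheModel();
fixed.addToCache("old");
fixed.refreshSessions();                 // "old" is flushed and marked synced
fixed.addToCache("new");                 // arrives after the flush
fixed.sessionsCollection.delete("old");  // simulate the record being removed
const removed = fixed.findRemovedSessions();
console.log(removed); // contains "old" but not "new"
```

      The session added after the flush is skipped because its flag is still false, so its cursors would survive until a later refresh has written it out.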
