[SERVER-34810] Session cache refresh can erroneously kill cursors that are still in use Created: 02/May/18  Updated: 29/Oct/23  Resolved: 03/Jul/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 3.6.6, 4.0.1, 4.1.1

Type: Bug Priority: Major - P3
Reporter: David Storch Assignee: Misha Tyulenev
Resolution: Fixed Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by NODE-1482 Cursor not found issue Closed
Duplicate
is duplicated by SERVER-34053 Cursor not found error when running l... Closed
is duplicated by SERVER-35484 Active cursor with Session disappears... Closed
Related
is related to SERVER-36808 Server closes cursors that are still ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0
Steps To Reproduce:

This is a race condition which is only easy to reproduce consistently by instrumenting the server. The following patch causes the server to sleep for some time during LogicalSessionCache::_refresh():

diff --git a/src/mongo/db/logical_session_cache_impl.cpp b/src/mongo/db/logical_session_cache_impl.cpp
index e0a5e8de31..4c7aa049eb 100644
--- a/src/mongo/db/logical_session_cache_impl.cpp
+++ b/src/mongo/db/logical_session_cache_impl.cpp
@@ -334,6 +334,10 @@ void LogicalSessionCacheImpl::_refresh(Client* client) {
         _stats.setLastSessionsCollectionJobEntriesEnded(explicitlyEndingSessions.size());
     }
 
+    // We have already refreshed the SessionsCollection, but we haven't yet tried to kill cursors
+    // for sessions that we don't see inside the SessionsCollection. A new session coming into being
+    // in this timeframe will not be handled correctly.
+    sleepsecs(20);
 
     // Find which running, but not recently active sessions, are expired, and add them
     // to the list of sessions to kill cursors for

Start a standalone mongod with --setParameter enableTestCommands=true. From one mongo shell, force a session refresh by running the following:

db.adminCommand({refreshLogicalSessionCacheNow: 1});

While the server is sleeping inside the session refresh, run the following:

let session = db.getMongo().startSession();
let db = session.getDatabase("test");
db.c.drop();
db.c.insert({});
db.c.insert({});
db.c.insert({});
let cursor = db.c.find().batchSize(2)
cursor.next()

When the session refresh completes, the cursor will no longer be open. You can observe this by running cursor.itcount() and receiving a CursorNotFound error.

Sprint: Sharding 2018-05-21, Sharding 2018-06-04, Sharding 2018-06-18, Sharding 2018-07-02, Sharding 2018-07-16
Participants:
Case:
Linked BF Score: 16

 Description   

Session information is stored in the system.sessions collection in the config database. Information about active sessions is cached in the LogicalSessionCache. The cache is periodically refreshed, which both

  1. kills cursors inside sessions that are no longer present in the underlying system.sessions collection, and
  2. flushes new cached session information out to system.sessions.

Suppose that a cache refresh is happening concurrently with a startSession command. It is possible for a session's cursor to be unexpectedly killed out from under the client's feet if the session record has not yet been written out to the system.sessions collection. The cache refresh code attempts to write new sessions out to system.sessions prior to killing any cursors. However, there is no synchronization to ensure that in between writing out these new sessions and killing cursors, a new session does not come into being. This means that the following can take place:

  1. A cache refresh begins, and active cache entries are written to system.sessions.
  2. A new session is started and enters the LogicalSessionCache. A cursor is opened inside this session.
  3. The refresh code notices that a there is a session with a cursor which is not represented in system.sessions. It kills the cursor, despite the cursor still being in use by the client and the session still being alive.

Fix Implementation

The issue is caused by a race in LogicalSessionCache.
If method LogicalSessionCacheImpl::_addToCache https://github.com/mongodb/mongo/blob/r4.1.0/src/mongo/db/logical_session_cache_impl.cpp#L392
adds session between https://github.com/mongodb/mongo/blob/r4.1.0/src/mongo/db/logical_session_cache_impl.cpp#L333 and https://github.com/mongodb/mongo/blob/r4.1.0/src/mongo/db/logical_session_cache_impl.cpp#L357 then it considered removed because its not in the sessions collection and get killed
To fix the sessions freshly added to the activeSessions set in the _addToCache method must have an attribute that indicates if they were synched with the sessions collections. Initially its false and once the refreshSessions is called its true.
Hence findRemovedSessions must only look at the sessions that have this attribute set to true.



 Comments   
Comment by Marc Smith [ 19/Dec/18 ]

My problem ended up being my mongos servers were sitting behind a load balancer. I didn't know Mongo requires "sticky sessions" to mongos. Since session affinity wasn't working with Kubernetes services, My solution was to give each application servers it's own mongos server in the same Kubernetes pod.

 

Comment by Eric Milkie [ 19/Dec/18 ]

Please file a new SERVER ticket with details of the problem so we can investigate anew. This message could be produced by many different issues.

Comment by Julius Sakalys [ 19/Dec/18 ]

I can double what Marc said (https://jira.mongodb.org/browse/SERVER-34810?focusedCommentId=2084929&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-2084929)

Getting lots of these on 4.0.2

Comment by Marc Smith [ 09/Dec/18 ]

I am also receiving "MongoError: Cursor not found" a LOT running 4.0.4. I don't know if this is the cause but I think it warrants being reopened.

Comment by Githook User [ 03/Jul/18 ]

Author:

{'username': 'mikety', 'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com'}

Message: SERVER-34810 do not kill new sessions in logical session cache

(cherry picked from commit 57d7938c49da06122d4d43054ff89e1881d0209f)
Branch: v3.6
https://github.com/mongodb/mongo/commit/6405d65b1d6432e138b44c13085d0c2fe235d6bd

Comment by Githook User [ 03/Jul/18 ]

Author:

{'username': 'mikety', 'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com'}

Message: SERVER-34810 do not kill new sessions in logical session cache

(cherry picked from commit 57d7938c49da06122d4d43054ff89e1881d0209f)
Branch: v4.0
https://github.com/mongodb/mongo/commit/1143fed2aec167ee0fea08f6e62fd382d4252769

Comment by Githook User [ 03/Jul/18 ]

Author:

{'username': 'mikety', 'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com'}

Message: SERVER-34810 do not kill new sessions in logical session cache
Branch: master
https://github.com/mongodb/mongo/commit/57d7938c49da06122d4d43054ff89e1881d0209f

Comment by Simon Tretter [ 27/Jun/18 ]

Did this fix make it into the 4.0 release?

Thanks

 

Update 28/Jun/18: It did not, or at least the issue still exists. Am I the only one who considers this bug as a deal breaker?

Comment by David Storch [ 02/May/18 ]

kaloian.manassiev, this is causing the parallel suites to fail in Evergreen, so I think it would be a good idea to schedule a fix.

Generated at Thu Feb 08 04:37:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.