[SERVER-36332] CursorNotFound error in GetMore on a secondary with sessions Created: 27/Jul/18  Updated: 29/Oct/23  Resolved: 28/Aug/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.6.8, 4.0.3, 4.1.3

Type: Bug Priority: Critical - P2
Reporter: Charlie Swanson Assignee: Misha Tyulenev
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File repro.js    
Issue Links:
Backports
Depends
is depended on by SERVER-34053 Cursor not found error when running l... Closed
is depended on by SERVER-34120 scoped connection not being returned ... Closed
Duplicate
is duplicated by SERVER-36808 Server closes cursors that are still ... Closed
is duplicated by SERVER-38101 Secondary Crashing with "aborting aft... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6
Steps To Reproduce:

I've attached repro.js which reproduces the issue.

Sprint: Sharding 2018-08-27, Sharding 2018-09-10
Participants:
Case:
Linked BF Score: 20

 Description   

Several issues with session synchronization during refresh are caused by the current design.
Currently, the only way to update the config.system.sessions collection is through the refresh method, which runs in a separate thread on the primary.
A secondary does not write to the collection; instead, it sends its sessions to the primary, which adds them to its logical sessions cache and eventually writes them to the collection.
However, the secondary closes the cursors associated with sessions that do not exist in the sessions collection.
This scenario is possible if the secondary and primary refreshes are out of sync, i.e.:
1) secondary adds new sessions and opens cursors
2) secondary refresh updates the primary's logical sessions cache
3) if the secondary runs a refresh now, the newly opened sessions will be considered "deleted" because the primary has not yet refreshed

The following sequence of events is an example of this scenario:

  1. Primary is unavailable for writes (say it's fsyncLocked).
  2. Client creates a session on a secondary and establishes a cursor without fully iterating it.
  3. The session cache refresh logic kicks in.
  4. The secondary sends a refreshSessionsInternal command to the primary with the sessions it believes are active, which includes this new one.
  5. The primary receives the command and inserts the new session into its cache ('_activeSessions'), but does not actually write it to system.sessions.
  6. The secondary then attempts to find which of the sessions it has open cursors for have actually timed out, so that it can kill those cursors. To do this, it issues a query to the system.sessions collection on the primary. This collection will not actually contain the new session, since the primary is fsyncLocked and, further, its session refresh logic hasn't kicked in yet.
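
The sequence above can be sketched as a toy, in-memory simulation. This is only an illustration of the race, not MongoDB code: the class and method names (Primary, Secondary, find_removed_sessions, run_scenario) are hypothetical stand-ins, and the sessions collection is modeled as a plain Python set.

```python
class Primary:
    """Toy primary: an in-memory cache plus the persisted sessions collection."""

    def __init__(self):
        self.active_sessions = set()      # stands in for '_activeSessions'
        self.sessions_collection = set()  # stands in for config.system.sessions

    def refresh_sessions_internal(self, session_ids):
        # Step 5: the primary only caches the sessions it is told about.
        self.active_sessions |= set(session_ids)

    def refresh(self):
        # The primary's own refresh is what persists cached sessions.
        self.sessions_collection |= self.active_sessions

    def find_removed_sessions(self, session_ids):
        # Step 6: report ids that have no record in the sessions collection.
        return set(session_ids) - self.sessions_collection


class Secondary:
    """Toy secondary: tracks open cursors by session id."""

    def __init__(self, primary):
        self.primary = primary
        self.open_cursors = {}  # session id -> cursor id

    def open_cursor(self, session_id, cursor_id):
        self.open_cursors[session_id] = cursor_id

    def refresh(self):
        ids = list(self.open_cursors)
        self.primary.refresh_sessions_internal(ids)         # step 4
        removed = self.primary.find_removed_sessions(ids)   # step 6
        for session_id in removed:
            del self.open_cursors[session_id]                # cursor is killed
        return removed

    def get_more(self, session_id):
        if session_id not in self.open_cursors:
            raise LookupError("CursorNotFound")
        return self.open_cursors[session_id]


def run_scenario(primary_refreshes_first):
    """Return "ok" if the cursor survives the secondary's refresh."""
    primary = Primary()
    secondary = Secondary(primary)
    secondary.open_cursor("session-1", cursor_id=42)         # step 2
    if primary_refreshes_first:
        # The primary already knows the session and has persisted it.
        primary.refresh_sessions_internal(["session-1"])
        primary.refresh()
    secondary.refresh()                                       # steps 3-6
    try:
        secondary.get_more("session-1")
        return "ok"
    except LookupError as exc:
        return str(exc)
```

In this model, run_scenario(primary_refreshes_first=False) ends in "CursorNotFound" because the secondary checks the collection before the primary has persisted the session, while run_scenario(primary_refreshes_first=True) returns "ok".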

The problem manifests as a "CursorNotFound" error on the GetMore command when it is run on the secondary.

The fix makes the secondary write to the primary so that it is always in sync, and can therefore avoid "false negative" results when checking for session existence.
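
A minimal sketch of the idea behind the fix, assuming the secondary's refresh first writes its active sessions to the primary and only then checks which sessions are missing. This is a toy model, not the actual patch: PrimaryStore, upsert_sessions, find_removed_sessions, and secondary_refresh are hypothetical names, and the sessions collection is again a plain set.

```python
class PrimaryStore:
    """Toy stand-in for the sessions collection held on the primary."""

    def __init__(self):
        self.sessions_collection = set()

    def upsert_sessions(self, session_ids):
        # The write issued by the secondary lands directly in the collection.
        self.sessions_collection |= set(session_ids)

    def find_removed_sessions(self, session_ids):
        # Report ids that have no record in the sessions collection.
        return set(session_ids) - self.sessions_collection


def secondary_refresh(primary, open_cursors):
    """Refresh the secondary; return the session ids whose cursors were killed."""
    ids = list(open_cursors)
    primary.upsert_sessions(ids)                  # write first ...
    removed = primary.find_removed_sessions(ids)  # ... then check existence
    for session_id in removed:
        del open_cursors[session_id]
    return removed


def demo_fix():
    primary = PrimaryStore()
    open_cursors = {"session-1": 42}  # session id -> cursor id
    killed = secondary_refresh(primary, open_cursors)
    return killed, open_cursors
```

Because the write happens before the existence check in this model, a session the secondary has just reported can no longer be mistaken for a removed one, so demo_fix() kills no cursors.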



 Comments   
Comment by Githook User [ 11/Sep/18 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-36332 write to primary from secondary in LogicalSessionsCache for ReplicaSet

(cherry picked from commit a3d17a55ca68ba37eb59620e04258f61f133b21f)
Branch: v4.0
https://github.com/mongodb/mongo/commit/3bbae86850a5cb6aa5f264c4a7a400a7b1cacf39

Comment by Githook User [ 06/Sep/18 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-36332 write to primary from secondary in LogicalSessionsCache for ReplicaSet

(cherry picked from commit a3d17a55ca68ba37eb59620e04258f61f133b21f)
Branch: v3.6
https://github.com/mongodb/mongo/commit/8e7efe49690f06942c73afb81dea7a3928e19896

Comment by Kelsey Schubert [ 03/Sep/18 ]

Yes, we intend to backport this fix to 3.6, ondrejk.

Comment by Ondrej Kokes [ 03/Sep/18 ]

Hi, "Fix Versions" says "4.1.3", is this to be backported to 3.6 as well? It says so in "Backport Requested"

Thanks!

Comment by Githook User [ 28/Aug/18 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-36332 write to primary from secondary in LogicalSessionsCache for ReplicaSet
Branch: master
https://github.com/mongodb/mongo/commit/a3d17a55ca68ba37eb59620e04258f61f133b21f
