  1. Core Server
  2. SERVER-36332

CursorNotFound error in GetMore on a secondary with sessions


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6.8, 4.0.3, 4.1.3
    • Component/s: Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.0, v3.6
    • Steps To Reproduce:

      I've attached repro.js which reproduces the issue.

    • Sprint:
      Sharding 2018-08-27, Sharding 2018-09-10
    • Case:
    • Linked BF Score:
      20

      Description

      Several issues with session synchronization during refresh stem from the current design.
      Currently, the only way to update the config.system.sessions collection is through the refresh method, which runs in a separate thread on the primary.
      A secondary does not write to the collection. Instead, it sends its sessions to the primary, which adds them to its logical session cache and eventually writes them to the collection.
      However, the secondary closes the cursors associated with any session that does not exist in the sessions collection.
      This scenario is possible when the secondary and primary refreshes are out of sync, i.e.:
      1) the secondary adds new sessions and opens cursors
      2) the secondary's refresh updates the primary's logical session cache
      3) if the secondary runs a refresh now, the newly opened sessions are considered "deleted" because the primary has not yet refreshed

      The following sequence of events is an example of this scenario:

      1. Primary is unavailable for writes (say it's fsyncLocked).
      2. Client creates a session on a secondary and establishes a cursor without fully iterating it.
      3. The session cache refresh logic kicks in
      4. The secondary sends a refreshSessionsInternal command to the primary with the sessions it believes are active, which includes the new one.
      5. The primary receives the command and inserts the new session into its cache ('_activeSessions'), but does not actually write it to system.sessions.
      6. The secondary then tries to find which of the sessions it has open cursors for have actually timed out, so that it can kill those cursors. To do this, it issues a query against the system.sessions collection on the primary. The collection does not contain the new session, because the primary is fsyncLocked and, further, its session refresh logic has not kicked in yet.
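The sequence above can be sketched as a small in-memory model. This is only an illustration of the race, not MongoDB's implementation: the class and method names (Primary, Secondary, refresh_sessions_internal, find_sessions, run_scenario) and the session id "lsid-A" are hypothetical, and sets stand in for the cache and the system.sessions collection.

```python
class Primary:
    """Toy stand-in for the primary; names here are illustrative assumptions."""

    def __init__(self):
        self.active_sessions = set()      # in-memory cache ('_activeSessions' in the ticket)
        self.sessions_collection = set()  # models config.system.sessions
        self.writes_blocked = False       # models the fsyncLock from step 1

    def refresh_sessions_internal(self, session_ids):
        # Step 5: the session only reaches the cache, not the collection.
        self.active_sessions |= set(session_ids)

    def refresh(self):
        # The primary's own periodic refresh flushes the cache to the collection.
        if not self.writes_blocked:
            self.sessions_collection |= self.active_sessions

    def find_sessions(self, session_ids):
        # Step 6: the existence check reads the collection, not the cache.
        return set(session_ids) & self.sessions_collection


class Secondary:
    def __init__(self, primary):
        self.primary = primary
        self.open_cursors = {}  # cursor id -> session id

    def open_cursor(self, cursor_id, session_id):
        self.open_cursors[cursor_id] = session_id

    def refresh(self):
        active = set(self.open_cursors.values())
        self.primary.refresh_sessions_internal(active)   # step 4
        present = self.primary.find_sessions(active)     # step 6
        killed = [c for c, s in self.open_cursors.items() if s not in present]
        for cursor_id in killed:
            del self.open_cursors[cursor_id]
        return killed

    def get_more(self, cursor_id):
        if cursor_id not in self.open_cursors:
            raise LookupError("CursorNotFound")
        return "next batch"


def run_scenario():
    primary = Primary()
    primary.writes_blocked = True                 # step 1
    secondary = Secondary(primary)
    secondary.open_cursor(1, "lsid-A")            # step 2
    killed = secondary.refresh()                  # steps 3 to 6
    try:
        secondary.get_more(1)
    except LookupError as exc:
        return killed, str(exc)
    return killed, None
```

In this model the session is present in the primary's cache but absent from the collection at the moment the secondary checks, so the live cursor is killed and the next get_more fails. The same happens with writes_blocked left False, as long as the primary's own refresh has not run yet.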

      The problem manifests as a "CursorNotFound" error on the GetMore command when it runs on the secondary.

      The fix makes the secondary write its sessions to the primary, so that it is always in sync and its session-existence checks no longer yield false negatives.
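A minimal sketch of the idea behind the fix, under the assumption that "write to the primary" means the secondary's refresh stores its sessions in the sessions collection before checking for their existence. The names (Primary, Secondary, upsert_sessions, find_sessions, run_fixed_scenario) are hypothetical and the actual patch may be structured differently. The sketch does not model blocked writes on the primary.

```python
class Primary:
    """Toy stand-in for the primary; not MongoDB's actual code."""

    def __init__(self):
        self.sessions_collection = set()  # models config.system.sessions

    def upsert_sessions(self, session_ids):
        # Assumed behavior of the fix: sessions are written to the collection right away.
        self.sessions_collection |= set(session_ids)

    def find_sessions(self, session_ids):
        return set(session_ids) & self.sessions_collection


class Secondary:
    def __init__(self, primary):
        self.primary = primary
        self.open_cursors = {}  # cursor id -> session id

    def open_cursor(self, cursor_id, session_id):
        self.open_cursors[cursor_id] = session_id

    def refresh(self):
        active = set(self.open_cursors.values())
        self.primary.upsert_sessions(active)           # write first
        present = self.primary.find_sessions(active)   # then check existence
        killed = [c for c, s in self.open_cursors.items() if s not in present]
        for cursor_id in killed:
            del self.open_cursors[cursor_id]
        return killed

    def get_more(self, cursor_id):
        if cursor_id not in self.open_cursors:
            raise LookupError("CursorNotFound")
        return "next batch"


def run_fixed_scenario():
    primary = Primary()
    secondary = Secondary(primary)
    secondary.open_cursor(1, "lsid-A")
    killed = secondary.refresh()
    return killed, secondary.get_more(1)
```

Because the write happens before the existence check within the same refresh, the check sees the new session and the cursor survives.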

        Attachments

        1. repro.js
          0.8 kB
          Charlie Swanson
