Core Server / SERVER-88060

"Cursor not found" error when mongos instances disappear from the SRV record

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor - P4
    • None
    • Affects Version/s: None
    • Component/s: None
    • Environment:
      MongoDB 5 data and config nodes run on GCP VMs. Mongos runs as a Kubernetes deployment, with a Kubernetes headless service acting as the load balancer that serves the SRV records. The driver we're using is the Go driver, version 1.13.2.
    • Catalog and Routing
    • ALL

      This is hard to reproduce reliably, since it's essentially a race condition, but here are the general steps that seem to trigger the error:

      1. The app starts a query that needs more than one batch of results to complete, e.g. the default batch size is unchanged and the query returns more than 101 results.
      2. A mongos instance sends the first batch to the app.
      3. The same mongos instance is taken out of rotation by the load balancer and the SRV record is updated such that the instance isn't listed there anymore.
      4. The driver refreshes the SRV record in the background and removes the mongos instance from its list of available instances.
      5. The app finishes processing the first batch of results and the driver transparently requests the second batch, but this time it has to query a different mongos instance, since the original one is no longer available. This other mongos instance doesn't have the original cursor and returns the "cursor not found" error.

      When using the mongodb+srv:// scheme in the connection string in a sharded cluster, if the load balancer in front of mongos instances takes an instance out of rotation and updates the SRV record, apps could get a "Cursor not found" error if they're in the middle of executing a multi-batch cursor read against that instance. These errors require special handling within the app code. This seems like a bug, because neither the app nor the mongos operator can do anything to prevent these errors from occurring.

      As an app developer I'd like for this situation to be handled transparently by either the server or the driver. I'm unfamiliar with Mongo internals, so the following suggestions/questions may be naive:

      1. Could mongos replicate live cursors to other mongos instances such that drivers could access any mongos instance?
      2. Could cursor information be sent to the driver, so the driver could itself detect this situation and send the cursor info to another mongos instance and have it resume cursor iteration?
      3. Could the driver preserve old mongos instances in its local SRV record cache for some configurable duration? This duration could then be set to the maximum expected query runtime within the application, and the mongos deployment updated to match (so that an instance doesn't shut down earlier than that).
      4. Could the driver transparently manage batching without resorting to stateful server-side cursors? i.e. could the driver ensure that there's only one batch per query, even if several one-batch queries need to be made to satisfy the query the app is making?
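
      To make suggestion 4 more concrete, here is a rough sketch of what cursor-free batching could look like if done by hand: each page is an independent single-batch query keyed on the last ID seen, so any mongos can serve it. `Doc`, `PageFunc` and `ReadAllPaged` are hypothetical names; `PageFunc` is a placeholder for a query that returns at most `limit` documents with an ID greater than `afterID`, sorted by ID ascending.

```go
package main

import "errors"

// Doc is a hypothetical document with a unique, sortable key.
type Doc struct {
	ID   int
	Body string
}

// PageFunc fetches up to limit documents with ID > afterID, in ascending ID order.
// Each call is assumed to be a self-contained single-batch query.
type PageFunc func(afterID, limit int) ([]Doc, error)

// ReadAllPaged collects every document after startAfter by issuing
// one-page queries until a page comes back shorter than limit.
func ReadAllPaged(fetch PageFunc, startAfter, limit int) ([]Doc, error) {
	if limit <= 0 {
		return nil, errors.New("limit must be positive")
	}
	var all []Doc
	after := startAfter
	for {
		page, err := fetch(after, limit)
		if err != nil {
			return all, err
		}
		all = append(all, page...)
		if len(page) < limit {
			return all, nil // a short page means there is nothing left
		}
		after = page[len(page)-1].ID // resume after the last ID seen
	}
}
```

      This only works for queries that can be ordered on a unique key, and the pages are not read from a single snapshot, so it may not be a drop-in replacement for a real cursor.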

            Assignee: Unassigned
            Reporter: D V (dv@glyphy.com)
            Votes: 0
            Watchers: 5
