C Driver / CDRIVER-4077

OP_KILLCURSORS incorrectly used to destroy change stream during resume attempt

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Unknown
    • Affects Version/s: None
    • Component/s: None
    • Labels: None

      It appears that there may be a libmongoc bug regarding tracking of maxWireVersion for servers when connected to a sharded cluster. I am seeing this specifically with a single-mongos cluster backed by 3 replica sets, started via mlaunch with:

      mlaunch init --replicaset --sharded 3 --setParameter enableTestCommands=1

      We have a Swift test which does the following, to test change streams' automatic resume behavior:
      1. Create a client (in pooled mode, as Swift clients always are)
      2. Use client from underlying pool to create a new collection
      3. Open a change stream on the collection
      4. Insert some documents into the collection
      5. Set the following fail point: {"configureFailPoint":"failCommand","mode":{"times":1},"data":{"errorCode":10107,"failCommands":["getMore"],"errorLabels":["ResumableChangeStreamError"]}}
      6. Iterate the change stream. The getMore failpoint will be hit and then the change stream should make a single, successful resume attempt.
      7. Inspect command monitoring events from the above and see that the aggregate was sent exactly twice.
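
      The check in step 7 can be sketched in C roughly as follows. This is a minimal illustration, not the Swift test itself: count_started_commands is a hypothetical helper, and the event list is assumed to have been collected already as an array of command names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts how many recorded command-started events
 * carry the given command name. The array of names is assumed to have been
 * collected by a command monitoring callback elsewhere. */
size_t
count_started_commands (const char *const *command_names,
                        size_t n_events,
                        const char *target)
{
   size_t count = 0;
   size_t i;

   assert (command_names != NULL && target != NULL);

   for (i = 0; i < n_events; i++) {
      if (strcmp (command_names[i], target) == 0) {
         count++;
      }
   }
   return count;
}
```

      With a single successful resume, the count for "aggregate" should be exactly 2; the failure described in this ticket shows up as a count of 3.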

      On the latest server (I first saw this on v5.0.0-alpha0-1541-ga8cf4f3 and have observed it on newer versions as well), this test started to fail because an extra aggregate attempt was observed: the first resume attempt consistently fails with a libmongoc error in the MONGOC_ERROR_STREAM domain: "Failed to send \"aggregate\" command with database \"test\": Failed to read 4 bytes: socket error or timeout".

      Notably, as part of the resume process, drivers including libmongoc attempt to kill the original cursor. The server recently merged SERVER-57457, under which a connection is automatically closed after it receives OP_KILL_CURSORS. With some printf debugging I have determined that OP_KILL_CURSORS is incorrectly being used to kill the cursor after the getMore fails, so the connection is closed and the initial resume attempt fails.

      Specifically, here server_stream->sd->max_wire_version is incorrectly 0, so the else block is hit.

      I first witnessed this with the latest server + libmongoc-1.18.0-alpha2; however, I have since tested as far back as server 4.4.3 + libmongoc 1.16.2 and found that the OP_KILLCURSORS branch is taken there too in this particular scenario. It was not an issue until now because the earlier server versions still accept OP_KILLCURSORS without closing the connection.

      I'll also note that this seems to be specific to this particular code path: in my printf testing, cleaning up a change stream normally via mongoc_change_stream_destroy does not appear to take the OP_KILLCURSORS path.

      Let me know if you need any more information or help reproducing this.

      Slack thread for context: https://mongodb.slack.com/archives/C72LB5RPV/p1626133519311700

            Assignee: Unassigned
            Reporter: Kaitlin Mahar (kaitlin.mahar@mongodb.com)
            Votes: 0
            Watchers: 4