[SERVER-33553] OP_KILL_CURSORS fails on mongos: Unable to check out cursor for killCursor Created: 28/Feb/18  Updated: 29/Oct/23  Resolved: 04/Jun/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.3
Fix Version/s: 3.6.6

Type: Bug Priority: Major - P3
Reporter: Shane Harvey Assignee: Ian Boros
Resolution: Fixed Votes: 2
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
causes PYTHON-1491 Test Failure - test_legacy_api.TestLe... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:
  1. Start a 3.6.3 sharded cluster with auth enabled.
  2. Create a cursor via the find command with a session.
  3. Send an OP_KILL_CURSORS message to mongos.
  4. The cursor is not killed.
Sprint: Query 2018-06-04, Query 2018-06-18
Participants:
Case:

 Description   

Sending OP_KILL_CURSORS to 3.6.3 Mongos fails when auth is enabled with: "Unable to check out cursor for killCursor. Namespace: 'pymongo_test.test', cursor id: 3368188369600609201."

Here is the mongos log including the initial find command, OP_KILL_CURSORS request, and a final getMore that succeeds:

2018-02-28T12:22:07.571-0800 D EXECUTOR [conn47] Scheduling remote command request: RemoteCommand 1473 -- target:localhost:27018 db:pymongo_test cmd:{ find: "test", batchSize: 1, shardVersion: [ Timestamp(0, 0), ObjectId('000000000000000000000000') ], lsid: { id: UUID("12da197b-c453-44d7-9a80-f35416b3b5c9"), uid: BinData(0, 78E672A9F1DF895A5FBE23C35A0004466CCD167529F1405D95628DE7416FF5FA) } }
2018-02-28T12:22:07.571-0800 D ASIO     [conn47] startCommand: RemoteCommand 1473 -- target:localhost:27018 db:pymongo_test cmd:{ find: "test", batchSize: 1, shardVersion: [ Timestamp(0, 0), ObjectId('000000000000000000000000') ], lsid: { id: UUID("12da197b-c453-44d7-9a80-f35416b3b5c9"), uid: BinData(0, 78E672A9F1DF895A5FBE23C35A0004466CCD167529F1405D95628DE7416FF5FA) } }
2018-02-28T12:22:07.571-0800 D ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Initiating asynchronous command: RemoteCommand 1473 -- target:localhost:27018 db:pymongo_test cmd:{ find: "test", batchSize: 1, shardVersion: [ Timestamp(0, 0), ObjectId('000000000000000000000000') ], lsid: { id: UUID("12da197b-c453-44d7-9a80-f35416b3b5c9"), uid: BinData(0, 78E672A9F1DF895A5FBE23C35A0004466CCD167529F1405D95628DE7416FF5FA) } }
2018-02-28T12:22:07.571-0800 D NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-3-0] Compressing message with zlib
2018-02-28T12:22:07.571-0800 D ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Starting asynchronous command 1473 on host localhost:27018
2018-02-28T12:22:07.572-0800 D NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-3-0] Decompressing message with zlib
2018-02-28T12:22:07.572-0800 D ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Request 1473 finished with response: { cursor: { firstBatch: [ { _id: 0 } ], id: 93684278245, ns: "pymongo_test.test" }, ok: 1.0, operationTime: Timestamp(1519849327, 208), $gleStats: { lastOpTime: { ts: Timestamp(1519849053, 208), t: 2 }, electionId: ObjectId('7fffffff0000000000000002') }, $clusterTime: { clusterTime: Timestamp(1519849327, 208), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configServerState: { opTime: { ts: Timestamp(1519849327, 7), t: 2 } } }
2018-02-28T12:22:07.572-0800 D EXECUTOR [NetworkInterfaceASIO-TaskExecutorPool-3-0] Received remote response: RemoteResponse --  cmd:{ cursor: { firstBatch: [ { _id: 0 } ], id: 93684278245, ns: "pymongo_test.test" }, ok: 1.0, operationTime: Timestamp(1519849327, 208), $gleStats: { lastOpTime: { ts: Timestamp(1519849053, 208), t: 2 }, electionId: ObjectId('7fffffff0000000000000002') }, $clusterTime: { clusterTime: Timestamp(1519849327, 208), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configServerState: { opTime: { ts: Timestamp(1519849327, 7), t: 2 } } }
2018-02-28T12:22:07.572-0800 D SHARDING [conn47] Command end db: pymongo_test msg id: 893351816
2018-02-28T12:22:08.344-0800 D NETWORK  [conn47] Request::process begin ns:  msg id: -1116753739 op: killcursors
2018-02-28T12:22:08.344-0800 D SHARDING [conn47] Unable to check out cursor for killCursor.  Namespace: 'pymongo_test.test', cursor id: 3368188369600609201.
2018-02-28T12:22:08.344-0800 D NETWORK  [conn47] Request::process end ns:  msg id: -1116753739 op: killcursors
2018-02-28T12:22:09.578-0800 D SHARDING [conn47] Command begin db: pymongo_test msg id: 1954899097
2018-02-28T12:22:09.578-0800 D EXECUTOR [conn47] Scheduling remote command request: RemoteCommand 1475 -- target:localhost:27018 db:pymongo_test cmd:{ getMore: 93684278245, collection: "test", batchSize: 1, lsid: { id: UUID("12da197b-c453-44d7-9a80-f35416b3b5c9"), uid: BinData(0, 78E672A9F1DF895A5FBE23C35A0004466CCD167529F1405D95628DE7416FF5FA) } }
2018-02-28T12:22:09.578-0800 D ASIO     [conn47] startCommand: RemoteCommand 1475 -- target:localhost:27018 db:pymongo_test cmd:{ getMore: 93684278245, collection: "test", batchSize: 1, lsid: { id: UUID("12da197b-c453-44d7-9a80-f35416b3b5c9"), uid: BinData(0, 78E672A9F1DF895A5FBE23C35A0004466CCD167529F1405D95628DE7416FF5FA) } }
2018-02-28T12:22:09.578-0800 D ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Initiating asynchronous command: RemoteCommand 1475 -- target:localhost:27018 db:pymongo_test cmd:{ getMore: 93684278245, collection: "test", batchSize: 1, lsid: { id: UUID("12da197b-c453-44d7-9a80-f35416b3b5c9"), uid: BinData(0, 78E672A9F1DF895A5FBE23C35A0004466CCD167529F1405D95628DE7416FF5FA) } }
2018-02-28T12:22:09.578-0800 D NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-0-0] Compressing message with zlib
2018-02-28T12:22:09.578-0800 D ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Starting asynchronous command 1475 on host localhost:27018
2018-02-28T12:22:09.579-0800 D NETWORK  [NetworkInterfaceASIO-TaskExecutorPool-0-0] Decompressing message with zlib
2018-02-28T12:22:09.579-0800 D ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 1475 finished with response: { cursor: { nextBatch: [ { _id: 1 } ], id: 93684278245, ns: "pymongo_test.test" }, ok: 1.0, operationTime: Timestamp(1519849327, 208), $gleStats: { lastOpTime: Timestamp(0, 0), electionId: ObjectId('7fffffff0000000000000002') }, $clusterTime: { clusterTime: Timestamp(1519849327, 208), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configServerState: { opTime: { ts: Timestamp(1519849327, 7), t: 2 } } }

I can reproduce this on 3.6.3 but not on mongodb latest version:
mongos version v3.7.2-169-g67d04f1
git version: 67d04f1a286b23ea824bdfb7042462d7da1b515d



 Comments   
Comment by Githook User [ 07/Jun/18 ]

Author:

{'name': 'Ian Boros', 'email': 'ian.boros@10gen.com'}

Message: SERVER-33553 improve legacy_kill_op_on_session_cursor.js
Branch: v3.6
https://github.com/mongodb/mongo/commit/a4fe8cbaa9909ea581331a085f1f736af92d9349

Comment by Githook User [ 04/Jun/18 ]

Author:

{'name': 'Ian Boros', 'email': 'ian.boros@10gen.com'}

Message: SERVER-33553 allow legacy killCursors kill a cursor created in a session
Branch: v3.6
https://github.com/mongodb/mongo/commit/b79d5b258fbc8db21ac7af978e706d8ea35da236

Comment by Charlie Swanson [ 10/May/18 ]

Bumping this back into "Needs Triage".

Comment by Charlie Swanson [ 23/Mar/18 ]

We believe that the fix outlined by Ian above is the correct fix, but we are not scheduling this immediately.

Comment by Ian Whalen (Inactive) [ 16/Mar/18 ]

ian.boros can you please talk with behackett about mixing sessions and OP_KILL_CURSORS and we'll come back to this in a future triage?

Comment by David Storch [ 07/Mar/18 ]

It's unlikely that we can backport the change from 3.7, since it was part of a larger effort to improve how sharded queries are killed. We would likely pursue a custom for the 3.6 branch under this ticket.

Comment by Bernie Hackett [ 02/Mar/18 ]

I like that theory.

Can we backport the 3.7 change?

Comment by Ian Boros [ 01/Mar/18 ]

Here's a theory:

In strategy.cpp (the codepath for OP_KILL_CURSORS) we try to check out the cursor:
https://github.com/mongodb/mongo/blob/ae05629559f08db040453bf2b05575c125f83999/src/mongo/s/commands/strategy.cpp#L553

There, checkOutCursor is called with the default checkSessionAuth value of kCheckSession:
https://github.com/mongodb/mongo/blob/ae05629559f08db040453bf2b05575c125f83999/src/mongo/s/query/cluster_cursor_manager.h#L317-L321

That code (in strategy.cpp) is relatively new on 3.6 (check the blame). The commit date is January 10 (note the author date is pretty old).

I recently made a change on master so that we don't attempt to check out a cursor before killing it when running OP_KILL_CURSOR.

This might explain why this is a problem only in 3.6.3 and not 3.6.0 or master, and also why it only happens with OP_KILL_CURSOR. The code path for the killcursor "command" passes kNoCheckSession to checkOutCursor. My guess is maybe we need to update the call in strategy.cpp to do this as well?

I can't reproduce this using the shell, and I think the reason is what Dave pointed out about sessions not being supported with legacy readMode.

Comment by Bernie Hackett [ 01/Mar/18 ]

Is this SERVER-32169 creeping back?

Comment by Bernie Hackett [ 01/Mar/18 ]

Implicit sessions were a requirement for drivers. Again, I agree this situation is weird. Interestingly, this bug also seems to have been in earlier versions of 3.7 (see the linked PYTHON ticket), but doesn't manifest in the current master codebase. I can't tell from the git log what might have changed to resolve it.

Comment by David Storch [ 01/Mar/18 ]

Sessions aren't even supported with legacy readMode, right? Do we have any guarantees about things working if the client combines legacy wire ops like OP_KILL_CURSORS with cursors created inside a logical session?

Comment by Shane Harvey [ 28/Feb/18 ]

OP_KILL_CURSORS works fine with PyMongo 3.5.1 where the only difference is that there's no session sent with the find command. The bug seems to be a cursor created with a session cannot be killed on mongos via OP_KILL_CURSORS.

Comment by Bernie Hackett [ 28/Feb/18 ]

The related Python test passes with MongoDB 3.6.0 and 3.7.latest. The test case is a bit weird. It tests an unfortunate situation caused by a deprecated PyMongo API we can't break before the next major version bump.

shane.harvey, can you try to reproduce this with an old PyMongo version that doesn't support the find command? 3.1.x should do it. I'm concerned that OP_KILL_CURSORS is just broken, period, which would break any application that uses an old driver.

Generated at Thu Feb 08 04:33:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.