[SERVER-36587] Committing a transaction which was started with a killCursors command can fail Created: 10/Aug/18  Updated: 29/Oct/23  Resolved: 14/Nov/18

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: 4.1.6

Type: Bug
Priority: Major - P3
Reporter: Tess Avitabile (Inactive)
Assignee: Samyukta Lanka
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-37045 Consider not allowing killCursors to ... Closed
Documented
is documented by DOCS-12204 Docs for SERVER-36587: Committing a t... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2018-11-19
Participants:
Linked BF Score: 4

 Description   

AsyncResultsMerger::blockingKill() sends a killCursors command to each of its remotes. However, it discards the callback handle for each request, so blockingKill() can return before the killCursors command reaches the remote. As a result, if a client issues a killCursors command followed by a commitTransaction command, a shard can receive the commitTransaction before the killCursors. This is a problem when the killCursors is the command that starts the transaction. Consider the following sequence of events:

  • User starts transaction 0 with killCursors sent to mongos.
  • User receives ok response.
  • User sends commitTransaction for transaction 0 to mongos.
  • mongod receives commitTransaction, which fails since transaction 0 does not yet exist.
  • mongod receives killCursors, which starts transaction 0.

Now the mongod is stuck with transaction 0 open.
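
The sequence above can be replayed with a small toy model. This is only a sketch: the Shard class, its receive() method, and the error string are hypothetical stand-ins for illustration and are not mongod code. It shows that the same two commands leave the shard in different states depending on arrival order.

```python
class Shard:
    """Toy model of the transaction state a mongod keeps (hypothetical)."""

    def __init__(self):
        self.open_txns = set()

    def receive(self, command, txn_number, start_transaction=False):
        if command == "commitTransaction":
            if txn_number not in self.open_txns:
                # Placeholder error text, not the server's actual message.
                return {"ok": 0, "errmsg": "transaction does not exist"}
            self.open_txns.remove(txn_number)
            return {"ok": 1}
        # In this model, any other command may open the transaction.
        if start_transaction:
            self.open_txns.add(txn_number)
        return {"ok": 1}


def replay(shard, arrivals):
    """Deliver messages to the shard in the given arrival order."""
    return [shard.receive(*message) for message in arrivals]


kill = ("killCursors", 0, True)   # first statement of transaction 0
commit = ("commitTransaction", 0)

# Order the client intended: killCursors, then commitTransaction.
in_order = Shard()
in_order_replies = replay(in_order, [kill, commit])

# Order the shard can observe when the kill is sent without waiting.
reordered = Shard()
reordered_replies = replay(reordered, [commit, kill])

print(in_order.open_txns)    # no transaction left open
print(reordered.open_txns)   # transaction 0 is left open
```

In the reordered run the commit fails and the late killCursors opens transaction 0, which nothing will ever commit.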



 Comments   
Comment by Githook User [ 14/Nov/18 ]

Author: Samy Lanka <samy.lanka@mongodb.com> (lankas)

Message: SERVER-36587 Disallow the first operation in a transaction to be killCursors
Branch: master
https://github.com/mongodb/mongo/commit/c1d4e0b8e1a4c197aac2530259f78eb88fb4acd3

Comment by David Storch [ 27/Aug/18 ]

Per in-person discussion with tess.avitabile, we can imagine multiple ways to fix this:

  1. Change the mongos killCursors path to wait until it hears back that the cursors on all the shards have been killed. The current design is "best effort": the AsyncResultsMerger issues killCursors once on all of the remote cursors it manages, but it does not interpret the killCursors responses or have retry logic in case there is a failure. This is typically not an issue in practice. In the unusual case that the cleanup logic fails and a cursor gets abandoned, it will eventually get reaped by a background job responsible for destroying idle cursors.
  2. Prevent killCursors commands from starting a transaction. The AsyncResultsMerger is designed to issue killCursors and getMore commands to the shards asynchronously. This is only an issue for killCursors, and not for getMore, because a transaction cannot be successfully started with a getMore command. There is no use case for opening a transaction with a killCursors command, so we could make this an error as well.
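
Option 2 amounts to a check at the point where a command would begin a transaction. The following is a minimal sketch of that idea, not the server's implementation: the function name, the exception type, and the set of rejected commands are assumptions made for illustration.

```python
class TransactionStartError(Exception):
    """Hypothetical error for a command that may not begin a transaction."""


# Assumed for this sketch: commands that must never be a transaction's
# first statement. getMore is included because the discussion above notes
# that a transaction cannot be started with it.
CANNOT_START_TRANSACTION = frozenset({"killCursors", "getMore"})


def check_can_start_transaction(command_name, start_transaction):
    """Raise if the command tries to open a transaction it is not allowed to."""
    if start_transaction and command_name in CANNOT_START_TRANSACTION:
        raise TransactionStartError(
            f"{command_name} cannot be the first command in a transaction"
        )


# A killCursors inside an already running transaction is still accepted.
check_can_start_transaction("killCursors", start_transaction=False)

# A killCursors that would start a transaction is rejected up front.
try:
    check_can_start_transaction("killCursors", start_transaction=True)
except TransactionStartError as exc:
    print(exc)
```

With such a check, a late killCursors is rejected and no longer leaves a transaction open, so the asynchronous, best-effort kill path would not need to change.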

My preference is to pursue option #2. This work would fall on the replication team, so I'm reassigning to repl for triage.

Comment by David Storch [ 27/Aug/18 ]

tess.avitabile can you elaborate on the problem? I'm not sure I understand. Also, what's the impact of the problem? What bad thing does this cause?
