|
Per in-person discussion with tess.avitabile, we can imagine multiple ways to fix this:
- Change the mongos killCursors path to wait to hear back that the cursors on all the shards have been killed. The current design is "best effort": the AsyncResultsMerger issues killCursors once on all of the remote cursors it manages, but it does not interpret the killCursors responses or retry if one fails. This is typically not an issue in practice. In the unusual case that the cleanup logic fails and a cursor gets abandoned, it will eventually get reaped by a background job responsible for destroying idle cursors.
- Prevent killCursors commands from starting a transaction. The AsyncResultsMerger is designed to issue killCursors and getMore commands to the shards asynchronously. This is only an issue for killCursors, and not for getMore, because a transaction cannot be successfully started with a getMore command. There is no use case for opening a transaction with a killCursors command, so we could make this an error as well.
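For comparison, the first option (waiting for confirmation instead of best effort) might look roughly like the sketch below. It is illustrative Python, not the AsyncResultsMerger code: `kill_remote_cursors`, `send_kill_cursors`, and `max_attempts` are hypothetical names, and the real path is asynchronous.

```python
from typing import Callable, Dict, List


def kill_remote_cursors(
    shard_cursors: Dict[str, int],
    send_kill_cursors: Callable[[str, int], bool],
    max_attempts: int = 3,
) -> List[str]:
    """Kill one cursor per shard and return the shards left unconfirmed.

    `send_kill_cursors(shard, cursor_id)` is a hypothetical callback that
    returns True once the shard acknowledges that the cursor was killed.
    """
    unconfirmed = []
    for shard, cursor_id in shard_cursors.items():
        for _ in range(max_attempts):
            if send_kill_cursors(shard, cursor_id):
                break  # this shard confirmed the kill
        else:
            # Every attempt failed; the idle-cursor reaper remains the fallback.
            unconfirmed.append(shard)
    return unconfirmed
```

The best-effort behavior described above corresponds to calling the sender once per shard and ignoring its return value.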
My preference is to pursue the second option (preventing killCursors commands from starting a transaction). This work would fall on the replication team, so I'm reassigning to repl for triage.
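To make the proposal of rejecting killCursors commands that try to start a transaction concrete, here is a minimal sketch in Python. It is not server code: `check_can_start_transaction`, `CannotStartTransaction`, and the deny-list are hypothetical, and it assumes the command name is the first field of the command document and that `startTransaction: true` is what requests a new transaction.

```python
# Hypothetical deny-list: commands that must never open a transaction.
COMMANDS_THAT_CANNOT_START_TXN = {"getMore", "killCursors"}


class CannotStartTransaction(Exception):
    """Raised when a command is not permitted to open a transaction."""


def check_can_start_transaction(command: dict) -> None:
    """Raise if `command` asks to start a transaction but is not allowed to.

    Assumes the first key of the command document is the command name.
    """
    command_name = next(iter(command), None)
    wants_new_txn = bool(command.get("startTransaction"))
    if wants_new_txn and command_name in COMMANDS_THAT_CANNOT_START_TXN:
        raise CannotStartTransaction(
            f"{command_name} cannot start a transaction"
        )
```

With a check like this, a killCursors command carrying `startTransaction: true` would fail up front, while a killCursors command without that field would pass through unchanged.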
|