[SERVER-33683] Allow aggregation $mergeCursors stage to run inside a transaction Created: 05/Mar/18  Updated: 29/Oct/23  Resolved: 17/Dec/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Sharding
Affects Version/s: None
Fix Version/s: 4.1.7

Type: Improvement Priority: Major - P3
Reporter: Charlie Swanson Assignee: Ian Boros
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-37665 Add interface to check in and check o... Closed
is depended on by SERVER-37619 Make $out writes retryable Backlog
Duplicate
is duplicated by SERVER-34015 Remove uasserts in cluster_aggregate ... Closed
Related
related to SERVER-37499 Potential deadlock when using exchang... Closed
related to SERVER-64407 Add ResourceYielder support to ARS Closed
is related to SERVER-33660 Once getMores include lsid, sharded a... Closed
is related to SERVER-33029 Support snapshot in cluster aggregate... Closed
is related to SERVER-33541 Add snapshot read support for aggrega... Closed
Backwards Compatibility: Fully Compatible
Sprint: Query 2018-12-03, Query 2018-12-17, Query 2018-12-31
Participants:

 Description   

We banned aggregations within transactions on mongos as part of SERVER-33660, because doing so could potentially cause a deadlock. We should investigate how to perform a $mergeCursors against the same shard without inducing a deadlock. One idea is to avoid going over the network for a local cursor.



 Comments   
Comment by Githook User [ 17/Dec/18 ]

Author:

{'email': 'ian.boros@10gen.com', 'name': 'Ian Boros'}

Message: SERVER-33683 Prevent deadlock in aggregate with transactions
Branch: master
https://github.com/mongodb/mongo/commit/d40d24abc025690150ccf8009ba1facb9ed1c6b2

Comment by Kaloian Manassiev [ 29/Nov/18 ]

I agree that checking-in the session should also stash the resources - sorry I missed this while working on the dependent SERVER-37665. Basically at the point where you decide to relinquish control of that thread to the remote node, the calling thread should start looking as if it returned from a getMore and be completely "pristine". So yes, this is how this should work. This logic should be on top of the OperationContextSession - baked in the MongoDOperationContextSession, which should also expose a check-in/check-out methods.

some other operation could start running under the same transaction before the getMore has a chance to check out the session

The original proposal had suggested also keeping a track of a sub-session token, so that only branches from the same invocation can check it out. This means a significant change to the networking protocol though and we decided not to do it just in order to counteract misbehaved client.

Comment by Charlie Swanson [ 28/Nov/18 ]

ian.boros while the case you describe is indeed worrisome, I can't think of anything great we could do to prevent it, and I believe it would already be a problem with something like an ordered bulk write operation through mongos. I think we will just have to be careful that we don't write any invariants that depend on well-behaved clients. This also came up during initial discussions with kaloian.manassiev and we couldn't think of a great way around this.

Comment by Ian Boros [ 28/Nov/18 ]

Oh, James Wahlin (not tagging) just made a really insightful point: Assuming we take the above approach, there's a possibility that while the $mergeCursors is blocking (with the transaction resources stashed), some other operation could start running under the same transaction before the getMore has a chance to check out the session (this might require a poorly-behaved driver). I imagine this could break a lot of things.

Comment by Ian Boros [ 28/Nov/18 ]

After looking into this ticket, having an operation yield its session while it blocks won't be enough to make this work. After an operation running $mergeCursors yields its session and blocks, the operation running getMore (on the same node) will check the session out, but then realize that the transaction is already in progress, though no resources have been stashed (they're still in use by the aggregation running $mergeCursors). This causes it to abort the transaction. To get around this, I believe $mergeCursors will also have to stash its transaction resources while it blocks on the remote requests, and unstash them when it resumes. This way, the getMore will find (and unstash) the resources once it checks out the session. kaloian.manassiev Does this seem reasonable?

CC charlie.swanson

Comment by Charlie Swanson [ 28/Sep/18 ]

This impacts a wide variety of aggregations that can be run. Anything involving a $lookup or $graphLookup, anything which uses allowDiskUse: true in combination with a $sort or $group, and possibly some other cases. I can't give you an estimate on what percentage of aggregations this would be, but my guess would be at least 10%. These cases aren't that outlandish. Once again, the impact here is that these aggregations would not be able to be run within a transaction - I have no idea how common that would be. It seems like a strange, surprising, and undesirable restriction to have to me, so I certainly think it should be at least 4.1 Required - I can't say which epic it would be in.

Comment by Kaloian Manassiev [ 28/Sep/18 ]

Yes, that's what I was looking for, thanks Jack.

charlie.swanson, do you mind summarizing what are the customer-visible limitations if we were to leave shard merging disabled in 4.2 when you get a chance? I want to make sure we don't forget to do that as part of the final sharding support Epic or if we think it is not that important we should make it 4.1 Desired.

Comment by Jack Mulrow [ 28/Sep/18 ]

kaloian.manassiev, misha.tyulenev added this uassert in ClusterAggregate in SERVER-33029 to disallow sending a merging pipeline to a shard in a transaction - is that what you're looking for? 

Comment by Kaloian Manassiev [ 27/Sep/18 ]

Thanks for the detailed explanation charlie.swanson and sorry for the late reply.

Currently, sharded $lookup with transactions is not in the scope of PM-834, but it looks like other aggregation scenarios, which involve $mergeCursors will not work either. I have no idea what is the importance of these scenarios from usability point of view - do you mind explaining that to me? I am trying to decide whether this ticket should remain in the current Epic, move to the final transactions support Epic (PM-564) or we can just make it 4.1 Desired and we get to it, if we get to it.

jack.mulrow@mongodb.com, I remember in 4.0 we did something to disable these scenarios, but I couldn't find the associated tickets or commits. Can you remind me what did we do to disable them?

Comment by Charlie Swanson [ 23/Aug/18 ]

kaloian.manassiev we have not officially designed or written down an approach for anything. This ticket is not necessarily tied to the sharded $lookup epic - since this is a problem today. The sharded $lookup project will introduce another similar scenario to this one - but I'll get back to that later.

The problem for this ticket is more of a self-deadlock within one shard on the state associated with the transaction. It does not necessarily involve a $lookup. If you run the aggregation [\{$group: {_id: "anything"}}], with {allowDiskUse: true}, it will perform a partial group on each shard, then the "group of groups" on a designated merging shard. That merging shard can now have two simultaneous operations going on in the same transaction. First, the operation performing the merge to compute the "group of groups"; This operation will be merging multiple cursors from every shard with chunks for the collection, one of which may be itself. Second, the operation computing the sub-group from this shard. While the first operation is running, it will at some point schedule a getMore on a cursor which the same shard owns. I don't recall all the details, but I think the Session object has to be "checked out" for a transaction, and could only be checked out by one operation. The first operation would check it out, schedule the getMore, then the getMore would be blocked waiting for it to be checked back in - which it never could be until the getMore returned.

Today (in a world without sharded $lookup), we could pursue a fix for this ticket which would avoid sending network requests to ourselves to perform that merge. This would essentially avoid going back through the top of command processing which will attempt to check out the transaction state. It will take a refactor to the AsyncResultsMerger to do so, since the AsyncResultsMerger only knows how to merge cursors which have IDs and can be iterated with getMore commands - but it seems possible.

However, when we start to think about sharded $lookup - we will be able to get into a scenario where two operations are happening simultaneously on one shard as part of the same transaction - and there will be no real way to avoid it. It's a little complicated to set up, but you could end up with a pipeline where one shard can be doing a $lookup which goes to a remote shard, which then has to do a $lookup again and goes back to the original shard. In order to support this scenario, the fix would likely have to involve having the ability to have multiple operations use the same Session object at once.

Comment by Kaloian Manassiev [ 23/Aug/18 ]

Now that the sharded transactions work is coming along, I would like to get back to this ticket and to see how we can meaningfully get $lookup to work with sharded transactions.

Have you guys considered this as part of the Sharded $lookup project and what it would take to fix it? Is it just the local shard which is a problem or also there was a case where a request from a different shard could come back?

To: charlie.swanson, james.wahlin
CC jack.mulrow

Comment by Charlie Swanson [ 23/Mar/18 ]

misha.tyulenev we'd like to have a discussion with the query team about how to fix this, but depending on the approach any engineer could work on this.

Comment by Misha Tyulenev [ 07/Mar/18 ]

Thanks Charlie, thats correct. SERVER-33029 will establish a global snapshot logicalTime that is most likely to be valid and send an aggregate to the individual shards.

Comment by David Storch [ 07/Mar/18 ]

Yes, helpful, thanks for the clarification. I'll retitle the ticket to help clarify this.

Comment by Charlie Swanson [ 07/Mar/18 ]

I think these are separate work items, but SERVER-33029 depends on this one. There is independent work (what I believe is described in SERVER-33029) to compute which snapshot to use cluster-wide, then append that snapshot into each command. Separately, we need to somehow modify the $mergeCursors story to avoid confusing/deadlocking itself, which is what's tracked in this ticket. I have some ideas of how to do this ticket, I have very little information (personally) about how to do SERVER-33029. Does that help?

Comment by David Storch [ 07/Mar/18 ]

misha.tyulenev charlie.swanson, we may have to coordinate on this. It's not clear to me how this is different from SERVER-33029.

Generated at Thu Feb 08 04:34:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.