[SERVER-37499] Potential deadlock when using exchange within a session Created: 05/Oct/18  Updated: 29/Oct/23  Resolved: 19/Dec/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 4.1.3
Fix Version/s: 4.1.7

Type: Bug Priority: Major - P3
Reporter: Charlie Swanson Assignee: Ian Boros
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-37665 Add interface to check in and check o... Closed
Related
is related to SERVER-33683 Allow aggregation $mergeCursors stage... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Query 2018-12-17, Query 2018-12-31
Participants:

 Description   

During an exchange, one thread may check out the session catalog state for the session and then block while waiting for another thread in the same session to perform work. Because all consumers of the exchange are in the same session, the thread being waited on cannot proceed: it first needs to check out the session catalog state, which the original thread is holding.

For example, consider an exchange with two consumers opened within a session. A single mongod will have two cursors (ID X and ID Y) open which are pulling data from the same Exchange. If there are two active getMores running, one for X and one for Y, only one can hold the session state at a time, so only one can proceed into the exchange at once. Imagine the getMore for X wins the race and checks out the session, but then finds that the buffer for its part of the exchange output is empty. This thread will then iterate the input stage to the Exchange in an attempt to fill up its buffer. However, it might find that all the subsequent results should go into the buffer for the consumer feeding into cursor Y, and that buffer is full. In this case, the thread for the getMore on X has to wait until the consumer for cursor Y consumes the results before it can proceed. Of course, the getMore for Y cannot proceed because it first needs to check out the session state, which the getMore for X still holds. Thus we have a deadlock.
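The circular wait in this scenario can be made explicit with a small wait-for graph. The sketch below is only an illustration in Python, not server code: the node labels and the `find_cycle` helper are hypothetical, and it assumes each blocked operation waits on exactly one other operation.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None.

    `waits_for` maps each blocked operation to the single operation it is
    waiting on (hypothetical model; not a MongoDB data structure).
    """
    for start in waits_for:
        path = [start]
        seen = {start}
        node = start
        while node in waits_for:
            node = waits_for[node]
            if node in seen:
                # Close the loop so the cycle reads start -> ... -> start.
                return path[path.index(node):] + [node]
            path.append(node)
            seen.add(node)
    return None


# State described in the ticket:
#  - getMore X holds the session and waits for cursor Y's buffer to drain,
#    which only getMore Y can do.
#  - getMore Y waits for the session, which getMore X holds.
scenario = {
    "getMore X": "getMore Y",
    "getMore Y": "getMore X",
}

cycle = find_cycle(scenario)
print(cycle)  # ['getMore X', 'getMore Y', 'getMore X']
```

A cycle in this graph means neither getMore can make progress; removing either edge (for example, by having X give up the session while it waits) breaks the deadlock.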

Two possible solutions we have thought of thus far:
1. Once the thread on cursor id X begins to wait for another thread to consume results, it should check back in its session state. Only once it has been signaled to proceed should it re-acquire the session state.

2. We should somehow set it up so that threads which will consume output of an exchange do not check out the session catalog state (we don't think they will need it since they do not interact with the storage engine, and further such operations would always be banned from operating within a transaction). Only when the thread has been designated to generate input and partition it or otherwise distribute it among the buffers should it actually check out the session state.
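A minimal sketch of option 1, as a toy Python model rather than the server's implementation. Everything here is hypothetical: `ToyExchange`, `get_more`, and `demo` are illustrative names, the session is modeled as a plain lock, and the exchange keeps at most one "pending" document that could not be placed because its target buffer was full. The point it tries to show is that the session is checked in before the thread blocks and is checked out again only after it has been signaled.

```python
import threading
from collections import deque

READY, BLOCKED, EOF = "ready", "blocked", "eof"


class ToyExchange:
    """Hypothetical stand-in for the Exchange; not the server's class."""

    def __init__(self, source, route, consumer_ids, capacity):
        self._source = iter(source)
        self._route = route          # maps a document to a consumer id
        self._capacity = capacity    # max documents per consumer buffer
        self._buffers = {cid: deque() for cid in consumer_ids}
        self._pending = None         # (target, doc) that found its buffer full
        self._cond = threading.Condition()

    def _pending_can_move(self):
        if self._pending is None:
            return True
        target, _ = self._pending
        return len(self._buffers[target]) < self._capacity

    def try_next(self, cid):
        """Non-blocking: return (READY, doc), (BLOCKED, None) or (EOF, None)."""
        with self._cond:
            while True:
                if self._pending is not None and self._pending_can_move():
                    target, doc = self._pending
                    self._buffers[target].append(doc)
                    self._pending = None
                if self._buffers[cid]:
                    doc = self._buffers[cid].popleft()
                    self._cond.notify_all()   # space was freed in a buffer
                    return READY, doc
                if self._pending is not None:
                    return BLOCKED, None      # another consumer's buffer is full
                try:
                    doc = next(self._source)
                except StopIteration:
                    return EOF, None
                self._pending = (self._route(doc), doc)

    def wait_for_space(self):
        """Block until the pending document could be placed."""
        with self._cond:
            self._cond.wait_for(self._pending_can_move)


def get_more(session, exchange, cid):
    """Return the next document for `cid`, or None at end of stream."""
    while True:
        with session:                         # check out the session
            status, doc = exchange.try_next(cid)
        # The session is checked in again here, before any blocking wait.
        if status != BLOCKED:
            return doc
        exchange.wait_for_space()             # re-check out only after a signal


def demo():
    session = threading.Lock()                # one checkout at a time
    # The first eight documents all route to Y, so X fills Y's buffer first.
    exchange = ToyExchange(range(10), lambda d: "Y" if d < 8 else "X",
                           ["X", "Y"], capacity=2)
    results = {"X": [], "Y": []}

    def drain(cid):
        while (doc := get_more(session, exchange, cid)) is not None:
            results[cid].append(doc)

    threads = [threading.Thread(target=drain, args=(cid,), daemon=True)
               for cid in ("X", "Y")]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    assert not any(t.is_alive() for t in threads), "consumers are stuck"
    return results
```

Calling `demo()` runs one consumer thread per cursor against a source whose first eight documents all route to Y. If `get_more` instead held the session across `wait_for_space()`, the consumer for X would block with the session checked out and the consumer for Y could never drain its buffer, which is the deadlock described above.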

 

Note that we have not observed such a scenario, and there may be other possible remedies. This is closely related to the issue described in SERVER-33683.



 Comments   
Comment by Githook User [ 20/Dec/18 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-37499 fix lint (stdx)
Branch: master
https://github.com/mongodb/mongo/commit/a35cfea6e7443769a1620f2324b4bf933b731ea1

Comment by Githook User [ 19/Dec/18 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-37499 fix lint
Branch: master
https://github.com/mongodb/mongo/commit/c959f0d1381baba1bf479c3515122ebfc4c39d45

Comment by Githook User [ 19/Dec/18 ]

Author: Ian Boros <ian.boros@10gen.com>

Message: SERVER-37499 prevent deadlock within Exchange during transaction
Branch: master
https://github.com/mongodb/mongo/commit/4eee17a5fdc14af2c3770b01cc4f906fa3620fe5

Comment by Charlie Swanson [ 13/Dec/18 ]

craig.homa ditto concern from the other ticket in this epic - why is this in a February sprint?

Comment by Charlie Swanson [ 22/Oct/18 ]

We realized after discussion on Friday that this deadlock cannot happen today because (1) the session is only checked out if there is a txnNumber attached to the request and (2) we would ban such a request with a $out, and also more generally any aggregation which involves merging on a shard (SERVER-33683). Because this is not a real issue today, and because it depends on other work outside this epic, I am removing this ticket from the "Improve $out" epic and putting it inside the cross-shard transactions epic with SERVER-33683. The query team is happy to implement when this is unblocked, just let us know and we can try to work it into our schedule.

Comment by David Storch [ 19/Oct/18 ]

charlie.swanson to clarify expected delivery date for SERVER-37665.
