[SERVER-34781] Abandoned cursor in a transaction can block other operations Created: 01/May/18 Updated: 27/Oct/23 Resolved: 02/May/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency, Querying |
| Affects Version/s: | 4.0.0-rc0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Charlie Swanson | Assignee: | Backlog - Replication Team |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Replication |
| Operating System: | ALL |
| Steps To Reproduce: | Download the attached 'repro.js', then run:
That will hang forever. |
||||||||||||
| Participants: | |||||||||||||
| Description |
|
Suppose a transaction opens a cursor, then abandons it. The locks for that cursor will be held in intent mode, even between batches. Now suppose a drop for the database or collection comes in. That drop will "get in line" with a MODE_X lock. In order to be fair to such requests, that pending MODE_X acquisition will block all future intent acquisitions. This will prevent a killCursors from killing the cursor, since it will need to take a collection lock. |
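The blocking chain described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `FairLock`, the single intent mode `"I"`, and the strict FIFO queue are hypothetical stand-ins, not the server's lock manager, which is considerably more nuanced.

```python
from dataclasses import dataclass, field

# Hypothetical two-mode compatibility table: "I" = intent, "X" = exclusive.
COMPATIBLE = {
    ("I", "I"): True,
    ("I", "X"): False,
    ("X", "I"): False,
    ("X", "X"): False,
}


@dataclass
class FairLock:
    granted: list = field(default_factory=list)  # (owner, mode) pairs holding the lock
    pending: list = field(default_factory=list)  # FIFO queue of waiting (owner, mode) pairs

    def acquire(self, owner, mode):
        """Return True if granted immediately, False if the request is queued."""
        conflicts = any(not COMPATIBLE[(held, mode)] for _, held in self.granted)
        # Fairness (simplified): any queued request blocks later requests,
        # even ones that are compatible with the current holders.
        if conflicts or self.pending:
            self.pending.append((owner, mode))
            return False
        self.granted.append((owner, mode))
        return True

    def release(self, owner):
        """Drop the owner's grants, then grant queued requests in order."""
        self.granted = [(o, m) for o, m in self.granted if o != owner]
        while self.pending:
            _, mode = self.pending[0]
            if any(not COMPATIBLE[(held, mode)] for _, held in self.granted):
                break
            self.granted.append(self.pending.pop(0))


lock = FairLock()
got_cursor = lock.acquire("txn_cursor", "I")  # abandoned cursor holds an intent lock
got_drop = lock.acquire("drop", "X")          # drop queues behind the cursor
got_kill = lock.acquire("killCursors", "I")   # queued behind the pending X request
```

In this model `killCursors` cannot run until the drop is granted and released, and the drop cannot be granted until the transaction's intent lock is released, which matches the hang described in the report.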
| Comments |
| Comment by James Wahlin [ 02/May/18 ] |
|
Created |
| Comment by James Wahlin [ 02/May/18 ] |
I agree that we should change the order of operations when killing sessions. We currently:
We should instead:
As killing transactions will kill associated cursors, step 3 would be there only to kill cursors that were opened as part of the session but outside of a transaction. |
| Comment by Dianna Hohensee (Inactive) [ 02/May/18 ] |
|
tess.avitabile, we bump it to 3 hours for testing, so that it doesn't cause random failures on slow machines using transactions. See this code. Though not all our testing is covered by that setting, |
| Comment by Tess Avitabile (Inactive) [ 02/May/18 ] |
|
james.wahlin, why would the transactionLifetimeLimitSeconds be 10800 at the start of the repro? I thought the default was 60. |
| Comment by Dianna Hohensee (Inactive) [ 02/May/18 ] |
|
On a related note, over in |
| Comment by Tess Avitabile (Inactive) [ 02/May/18 ] |
|
I'm glad to hear the transaction timeout job will successfully kill the transaction. It is unfortunate that killCursors will block. I would also expect killSessions to block, since the first thing it does is kill all cursors for the session, which requires collection locks. I think it is expected behavior that a drop that is blocked behind a transaction will block other operations. The scope document for local snapshot reads explicitly says that catalog operations will block behind transactions. One approach we could take to prevent killSessions from blocking is to kill stashed transaction resources prior to killing cursors. Then the transaction would stop holding locks, so the drop could proceed, and then the cursor kill could proceed. That might be something we want to look into. Even if killCursors did not block, killCursors is not sufficient to kill the transaction, since the transaction survives cursor kills and maintains its locks. |
| Comment by James Wahlin [ 02/May/18 ] |
|
The transaction kill mechanism is not triggering here because the transactionLifetimeLimitSeconds in effect is 10800 (3 hours). Session::_transactionExpireDate is compared to Date_t::now() to determine whether a transaction should be aborted, and its value is set at transaction start time. The attached script does not reduce transactionLifetimeLimitSeconds to 1 second until after the transaction has started, so the transaction is created with a 3 hour expiration. Moving the setParameter above the transaction start addresses this, allowing the transaction to be killed and the MODE_X lock to be acquired. |
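The timing issue can be shown with a small model. This is a sketch only: the `Transaction` class, its field names, and the `>=` comparison are assumptions for illustration, not the server's code. The one point taken from the comment above is that the expiration date is computed once, at transaction start.

```python
from datetime import datetime, timedelta


class Transaction:
    """Hypothetical model: the expiry is fixed when the transaction starts."""

    def __init__(self, now, lifetime_limit_seconds):
        self.expire_date = now + timedelta(seconds=lifetime_limit_seconds)

    def is_expired(self, now):
        # Assumed comparison; the ticket only says the two values are compared.
        return now >= self.expire_date


start = datetime(2018, 5, 1, 12, 0, 0)
one_minute_later = start + timedelta(seconds=60)

# Order used by the repro: start the transaction, then lower the limit.
limit = 10800
txn_late_setting = Transaction(start, limit)
limit = 1  # has no effect on the already-started transaction
late_expired = txn_late_setting.is_expired(one_minute_later)

# Corrected order: lower the limit first, then start the transaction.
limit = 1
txn_early_setting = Transaction(start, limit)
early_expired = txn_early_setting.is_expired(one_minute_later)
```

Here `late_expired` is false and `early_expired` is true, which mirrors why moving the setParameter ahead of the transaction start lets the timeout job abort the transaction.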
| Comment by Eric Milkie [ 01/May/18 ] |
|
Also, while attempting to kill the cursor with the killCursor command |
| Comment by James Wahlin [ 01/May/18 ] |
|
I am surprised that the transaction timeout mechanism is blocked. When a transaction times out we call Session::abortArbitraryTransactionIfExpired(), which will first release the transaction Lock and Recovery unit prior to attempting to kill associated cursors. It would be interesting to see where the transaction kill thread is blocked. |
| Comment by David Storch [ 01/May/18 ] |
|
spencer tess.avitabile, this feels like a candidate for a 4.0.0-rc0 fixVersion. Abandoning a cursor and then issuing killCursors on it isn't entirely unusual, and it seems like this could lead to a server that is in a "stuck" state. Please triage, and let me know if you'd like an assist from James or someone else on query. Nice work tracking this down charlie.swanson! |
| Comment by Spencer Brody (Inactive) [ 01/May/18 ] |
|
Hmm, the fact that it blocks the transaction timeout is the most worrisome part. |