[SERVER-34781] Abandoned cursor in a transaction can block other operations Created: 01/May/18  Updated: 27/Oct/23  Resolved: 02/May/18

Status: Closed
Project: Core Server
Component/s: Concurrency, Querying
Affects Version/s: 4.0.0-rc0
Fix Version/s: None

Type: Bug
Priority: Critical - P2
Reporter: Charlie Swanson
Assignee: Backlog - Replication Team
Resolution: Works as Designed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File repro.js    
Issue Links:
Related
related to SERVER-35217 killSessions command attempts to kill... Closed
is related to SERVER-34795 killSessions should kill transactions... Closed
Assigned Teams:
Replication
Operating System: ALL
Steps To Reproduce:

Download the attached 'repro.js', then run:

python buildscripts/resmoke.py --suites=replica_sets_jscore_passthrough repro.js

That will hang forever.

 Description   

Suppose a transaction opens a cursor, then abandons it. The locks for that cursor will be held in intent mode, even between batches.

Now suppose a drop for the database or collection comes in. That drop will "get in line" with a MODE_X lock. In order to be fair to such requests, that pending MODE_X acquisition will block all future intent acquisitions. 

This will prevent a killCursors command from killing the cursor, since killCursors itself needs to take a collection lock. It looks like it will also prevent the background transaction timeout job from killing the transaction.
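The queueing behavior described above can be sketched with a toy model. This is a minimal illustration, not MongoDB's lock manager: the `FairLock` class, the `COMPATIBLE` table, and the owner names are all hypothetical, and only the IS, IX, and X modes are modeled.

```javascript
// Toy model of a FIFO-fair lock queue. NOT MongoDB's actual lock manager;
// FairLock, COMPATIBLE, and the owner names are hypothetical.
// For each requested mode, the set of held modes it can coexist with.
const COMPATIBLE = {
  IS: new Set(["IS", "IX"]),
  IX: new Set(["IS", "IX"]),
  X: new Set(), // exclusive: compatible with nothing
};

class FairLock {
  constructor() {
    this.granted = []; // requests currently holding the lock
    this.pending = []; // waiting requests, in arrival order
  }

  // Returns true if granted immediately, false if the request had to queue.
  acquire(owner, mode) {
    const request = { owner, mode };
    const compatibleWithHolders = this.granted.every((g) =>
      COMPATIBLE[mode].has(g.mode)
    );
    // Fairness: a new request never jumps ahead of one that is already waiting.
    if (compatibleWithHolders && this.pending.length === 0) {
      this.granted.push(request);
      return true;
    }
    this.pending.push(request);
    return false;
  }

  // Releases everything held by `owner`, then grants waiters in FIFO order.
  release(owner) {
    this.granted = this.granted.filter((g) => g.owner !== owner);
    while (this.pending.length > 0) {
      const next = this.pending[0];
      const ok = this.granted.every((g) => COMPATIBLE[next.mode].has(g.mode));
      if (!ok) break;
      this.granted.push(this.pending.shift());
    }
  }
}

const lock = new FairLock();
// The abandoned cursor's transaction holds an intent lock between batches.
const txnGranted = lock.acquire("transaction", "IX");
// The drop queues behind it in exclusive mode.
const dropGranted = lock.acquire("drop", "X");
// killCursors is compatible with the holder, but must wait behind the drop.
const killGranted = lock.acquire("killCursors", "IX");
```

In this model `killGranted` is false even though IX is compatible with the transaction's IX, because the pending X request sits ahead of it in the queue. Nothing moves until the transaction releases its lock.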



 Comments   
Comment by James Wahlin [ 02/May/18 ]

Created SERVER-34795 to handle killSessions operation ordering. I will close this ticket as the reproduction script works as designed once the setParameter is moved to the correct location. 

Comment by James Wahlin [ 02/May/18 ]

tess.avitabile wrote: "One approach we could take to prevent killSessions from blocking is to kill stashed transaction resources prior to killing cursors. Then the transaction would stop holding locks, so the drop could proceed, so the cursor kill could proceed. That might be something we want to look into."

I agree that we should change the order of operations when killing sessions. We currently:

  1. Kill operations
  2. Kill cursors
  3. Kill transactions

We should instead:

  1. Kill transactions
  2. Kill operations
  3. Kill cursors

As killing transactions will kill associated cursors, step 3 would be there only to kill cursors that were opened as part of the session but outside of a transaction.
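To illustrate why the ordering matters, here is a small sketch of the two orderings under the scenario from this ticket (an idle transaction holding an intent lock, with a drop queued behind it). The function `runKillSessions` and the state fields are hypothetical, and the model assumes that killing operations needs no collection lock in this scenario.

```javascript
// Toy model of the killSessions step ordering; hypothetical names, not
// MongoDB code. Scenario: a stashed transaction holds an intent lock and
// a drop is waiting behind it for an exclusive lock.
function runKillSessions(order) {
  const state = { txnHoldsLocks: true, dropWaiting: true, cursorOpen: true };
  const completed = [];
  for (const step of order) {
    if (step === "killTransactions") {
      // Aborting the transaction releases its stashed locks, so the queued
      // drop can proceed. Killing the transaction also kills its cursors.
      state.txnHoldsLocks = false;
      state.dropWaiting = false;
      state.cursorOpen = false;
    } else if (step === "killCursors") {
      // Needs a collection lock, so it queues behind the pending drop.
      if (state.dropWaiting) {
        return { completed, blockedAt: step };
      }
      state.cursorOpen = false;
    }
    // "killOperations": assumed to need no collection lock in this scenario.
    completed.push(step);
  }
  return { completed, blockedAt: null };
}

const currentOrder = runKillSessions([
  "killOperations",
  "killCursors",
  "killTransactions",
]);
const proposedOrder = runKillSessions([
  "killTransactions",
  "killOperations",
  "killCursors",
]);
```

With the current ordering the model stops at `killCursors` and never reaches the step that would release the locks. With the proposed ordering all three steps complete.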

Comment by Dianna Hohensee (Inactive) [ 02/May/18 ]

tess.avitabile, we bump it to 3 hours for testing, so that it doesn't cause random failures on slow machines using transactions. See this code. Though not all our testing is covered by that setting, SERVER-34595 will make the coverage complete.

Comment by Tess Avitabile (Inactive) [ 02/May/18 ]

james.wahlin, why would the transactionLifetimeLimitSeconds be 10800 at the start of the repro? I thought the default was 60.

Comment by Dianna Hohensee (Inactive) [ 02/May/18 ]

On a related note, over in SERVER-34732 I'm exploring what appears to be a deadlock where PeriodicRunnerASIO, which runs the periodic task to abort expired transactions, is waiting on an IS lock behind a drop command waiting for an X lock behind an inactive transaction with an IX lock.

Comment by Tess Avitabile (Inactive) [ 02/May/18 ]

I'm glad to hear the transaction timeout job will successfully kill the transaction. It is unfortunate that killCursors will block. I would also expect killSessions to block, since the first thing it does is kill all cursors for the session, which requires collection locks.

I think it is expected behavior that a drop that is blocked behind a transaction will block other operations. The scope document for local snapshot reads explicitly says that catalog operations will block behind transactions.

One approach we could take to prevent killSessions from blocking is to kill stashed transaction resources prior to killing cursors. Then the transaction would stop holding locks, so the drop could proceed, so the cursor kill could proceed. That might be something we want to look into.

Even if killCursors did not block, killCursors is not sufficient to kill the transaction, since the transaction survives cursor kills and maintains its locks.

Comment by James Wahlin [ 02/May/18 ]

The transaction kill mechanism is not triggering here because the transactionLifetimeLimitSeconds in effect is 10800 (3 hours). Session::_transactionExpireDate is compared to Date_t::now() to determine whether a transaction should be aborted, and its value is set at transaction start time. The attached script does not reduce transactionLifetimeLimitSeconds to 1 second until after the transaction has started, so the transaction is created with a 3 hour expiration. Moving the setParameter above the transaction start addresses this, allowing the transaction to be killed and the MODE_X lock to be acquired.
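A minimal sketch of that timing issue follows. It models only the idea that the expiry is computed once at transaction start. The `serverParams` object and the `startTransaction` and `isExpired` helpers are hypothetical stand-ins, not server code, and the exact comparison used by the server is an assumption.

```javascript
// Toy model of an expiry that is fixed at transaction start. Hypothetical
// names throughout; this is not the server's Session implementation.
const serverParams = { transactionLifetimeLimitSeconds: 10800 }; // test default

function startTransaction(nowMs) {
  // The expiry is computed once, from the parameter value at start time.
  return {
    expireDateMs: nowMs + serverParams.transactionLifetimeLimitSeconds * 1000,
  };
}

function isExpired(txn, nowMs) {
  // Assumed comparison: expired once "now" reaches the stored expiry.
  return nowMs >= txn.expireDateMs;
}

const t0 = 0;

// Order used by the attached script: start the transaction, then lower the limit.
const lateTxn = startTransaction(t0);
serverParams.transactionLifetimeLimitSeconds = 1;
const lateExpired = isExpired(lateTxn, t0 + 5000);

// Corrected order: the limit is already 1 second when the transaction starts.
const earlyTxn = startTransaction(t0);
const earlyExpired = isExpired(earlyTxn, t0 + 5000);
```

Five seconds in, `lateExpired` is false because that transaction still carries the 3 hour expiry it was created with, while `earlyExpired` is true.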

Comment by Eric Milkie [ 01/May/18 ]

Also, while attempting to kill the cursor with the killCursors command won’t work, won’t killing the session work? Especially if an admin was trying to diagnose the problem, the currentOp output will show you the session to kill, not the cursor id to kill.

Comment by James Wahlin [ 01/May/18 ]

I am surprised that the transaction timeout mechanism is blocked. When a transaction times out we call Session::abortArbitraryTransactionIfExpired(), which will first release the transaction Lock and Recovery unit prior to attempting to kill associated cursors. It would be interesting to see where the transaction kill thread is blocked.

Comment by David Storch [ 01/May/18 ]

spencer tess.avitabile, this feels like a candidate for a 4.0.0-rc0 fixVersion. Abandoning a cursor and then issuing killCursors on it isn't entirely unusual, and it seems like this could lead to a server that is in a "stuck" state. Please triage, and let me know if you'd like an assist from James or someone else on query.

Nice work tracking this down charlie.swanson!

Comment by Spencer Brody (Inactive) [ 01/May/18 ]

Hmm, the fact that it blocks the transaction timeout is the most worrisome part.

milkie james.wahlin
