[SERVER-40110] ClusterCursorManager::CursorEntry::isKillPending() should not call checkForInterrupt Created: 13/Mar/19  Updated: 29/Oct/23  Resolved: 29/Jul/19

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 4.0.0
Fix Version/s: 4.0.13, 4.2.1, 4.3.1

Type: Bug
Priority: Critical - P2
Reporter: Mira Carey
Assignee: Ian Boros
Resolution: Fixed
Votes: 1
Labels: query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-40100 Mongos hangs without any log messages Closed
Related
related to SERVER-40100 Mongos hangs without any log messages Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2, v4.0
Sprint: Query 2019-04-08, Query 2019-04-22, Query 2019-05-06, Query 2019-05-20, Query 2019-06-03, Query 2019-06-17, Query 2019-07-01, Query 2019-07-15, Query 2019-07-29, Query 2019-08-12
Participants:
Linked BF Score: 30

 Description   

isKillPending()

          bool isKillPending() const {
              // A cursor is kill pending if it's checked out by an OperationContext that was
              // interrupted.
              return _operationUsingCursor &&
                  !_operationUsingCursor->checkForInterruptNoAssert().isOK();
          }

calls checkForInterruptNoAssert() on the OperationContext that has the cursor checked out.

That's a problem because:

  • The rule for opCtx is that the owning thread may call any method, while other threads may call only some methods, and only while holding the client lock.
    • isKillPending is called from ClusterCursorManager::stats, which is called by FTDC, which does not hold any client lock.
  • checkForInterruptNoAssert isn't meant to be called off-thread. opCtx->isKillPending() is the way to do that.
  • checkForInterrupt can actually invoke markKilled if the op has now timed out.
  • markKilled does a dance with waitForConditionOrInterrupt that, when the latter is active, does the following in order:
    • unlocks the client lock (this probably kicks the system into an unanticipated state, since the off-thread caller never held it)
    • locks the mutex specified in waitForConditionOrInterrupt
    • locks the client lock
    • sets the kill code
    • unlocks the mutex specified in waitForConditionOrInterrupt
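
The safer off-thread pattern the list points to can be sketched with toy stand-ins. This is only an illustration under stated assumptions: ToyClient, ToyOpCtx, ToyCursorEntry and killOperation are hypothetical names, not the server's classes, and the sketch is not the committed fix. It shows an observer taking the client lock and reading a kill flag without changing any state.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical stand-in for the server's Client; it only models the client lock.
struct ToyClient {
    std::mutex lock;
};

// Hypothetical stand-in for OperationContext.
class ToyOpCtx {
public:
    explicit ToyOpCtx(ToyClient* client) : _client(client) {}

    // Records that the operation was killed. In this toy model the caller
    // is expected to hold the client lock (see killOperation below).
    void markKilled() {
        _killed.store(true);
    }

    // Read-only query: it never modifies the operation's state.
    bool isKillPending() const {
        return _killed.load();
    }

    ToyClient* client() const {
        return _client;
    }

private:
    ToyClient* _client;
    std::atomic<bool> _killed{false};
};

// Kills an operation from another thread while holding the client lock.
inline void killOperation(ToyOpCtx& opCtx) {
    std::lock_guard<std::mutex> lk(opCtx.client()->lock);
    opCtx.markKilled();
}

// Hypothetical stand-in for ClusterCursorManager::CursorEntry.
struct ToyCursorEntry {
    ToyOpCtx* operationUsingCursor = nullptr;

    // Safe for a stats thread: takes the client lock, then only reads the flag.
    bool isKillPending() const {
        if (!operationUsingCursor) {
            return false;
        }
        std::lock_guard<std::mutex> lk(operationUsingCursor->client()->lock);
        return operationUsingCursor->isKillPending();
    }
};
```

With this shape, a stats collector can call ToyCursorEntry::isKillPending() at any time without risking the unlock/relock sequence described above, because nothing on that path can reach markKilled().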

I think this doesn't usually come up because ops using cluster cursors usually haven't exceeded maxTimeMS. That may be an avenue for an isolated repro.
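
Why the deadline matters can be shown with a small model. This is a sketch under assumptions, not server code: ToyTimedOp and markKilledCalls are hypothetical, and the deadline stands in for an expired maxTimeMS. The interrupt check is side-effect free before the deadline and mutates the operation after it, which is the window a repro would need to hit.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical model of an operation with a time limit.
class ToyTimedOp {
public:
    explicit ToyTimedOp(Clock::time_point deadline) : _deadline(deadline) {}

    // Models the behavior described in the ticket: once the deadline has
    // passed, the check itself kills the operation. Returns true while the
    // operation is still OK (not interrupted).
    bool checkForInterruptNoAssert(Clock::time_point now) {
        if (!_killed && now >= _deadline) {
            _killed = true;      // side effect standing in for markKilled()
            ++_markKilledCalls;  // counted so the mutation is observable
        }
        return !_killed;
    }

    // Number of times the check ended up killing the operation.
    int markKilledCalls() const {
        return _markKilledCalls;
    }

private:
    Clock::time_point _deadline;
    bool _killed = false;
    int _markKilledCalls = 0;
};
```

In this model, a caller that polls before the deadline never observes a mutation, which matches the observation that the problem rarely shows up. A repro would have the off-thread caller poll an operation whose deadline has already expired.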



 Comments   
Comment by Githook User [ 09/Sep/19 ]

Author: Ian Boros (puppyofkosh, ian.boros@mongodb.com)

Message: SERVER-40110 don't call OpContext::checkForInterrupt() off-thread

This commit also includes a modification to the test done under SERVER-43156.
Branch: v4.0
https://github.com/mongodb/mongo/commit/f2492afb1c5a1c50406890791d5f22ea0ae10be7

Comment by Githook User [ 06/Sep/19 ]

Author: Ian Boros (puppyofkosh, ian.boros@mongodb.com)

Message: SERVER-40110 don't call OpContext::checkForInterrupt() off-thread

This commit also includes a modification to the test done under SERVER-43156.
Branch: v4.2
https://github.com/mongodb/mongo/commit/19509c977c96ded7b3e7c954fc5f8885065a0d9f

Comment by Githook User [ 26/Jul/19 ]

Author: Ian Boros (puppyofkosh, puppyofkosh@gmail.com)

Message: SERVER-40110 don't call OpContext::checkForInterrupt() off-thread
Branch: master
https://github.com/mongodb/mongo/commit/7a4fce6cfde41529c447417318e9c79ae42e92f0
