[SERVER-21997] kill_cursors.js deadlocks Created: 21/Dec/15  Updated: 16/Aug/22  Resolved: 22/Dec/15

Status: Closed
Project: Core Server
Component/s: MMAPv1, Querying
Affects Version/s: None
Fix Version/s: 3.2.3, 3.3.0

Type: Bug Priority: Major - P3
Reporter: Eric Milkie Assignee: David Storch
Resolution: Done Votes: 0
Labels: test-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-68874 Consider making waitAfterPinningCurso... Closed
is related to SERVER-21600 Increase test coverage for killCursor... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: QuInt E (01/11/16)
Participants:

 Description   

In MMAPv1 tests, the new fail point in kill_cursors.js can deadlock with the journal flush, because the getMore thread spins in a tight while loop while holding a database lock.

Example callstacks:
https://evergreen.mongodb.com/task/mongodb_mongo_master_suse12_jsCore_a3d8fbcadfeae8418ba17e5e90e51b929ff3ff93_15_12_18_22_40_14



 Comments   
Comment by Githook User [ 11/Jan/16 ]

Author: David Storch (dstorch) <david.storch@10gen.com>

Message: SERVER-21997 periodically drop locks while keepCursorPinnedDuringGetMore fail point is enabled

(cherry picked from commit cff8decf7ecebb69f82231c994a8b1a52234ba08)
Branch: v3.2
https://github.com/mongodb/mongo/commit/6fc3b8d0c6a4be82f2b29a222e9eaa2c66f7660f

Comment by Githook User [ 22/Dec/15 ]

Author: David Storch (dstorch) <david.storch@10gen.com>

Message: SERVER-21997 periodically drop locks while keepCursorPinnedDuringGetMore fail point is enabled
Branch: master
https://github.com/mongodb/mongo/commit/cff8decf7ecebb69f82231c994a8b1a52234ba08
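
The commit message describes the fix as periodically dropping locks while the fail point is enabled. The ticket does not show the server change itself, so the following is only a minimal Python sketch of that idea. It assumes a toy shared/exclusive lock in which new shared requests queue behind a waiting exclusive request; the class, function, and thread names are all hypothetical and are not the server's lock manager or its identifiers.

```python
import threading
import time


class ToySharedExclusiveLock:
    """Toy lock (an assumption for illustration): new shared requests
    queue behind a waiting exclusive request."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared_holders = 0
        self._exclusive_held = False
        self._exclusive_waiting = 0

    def acquire_shared(self):
        with self._cond:
            while self._exclusive_held or self._exclusive_waiting > 0:
                self._cond.wait()
            self._shared_holders += 1

    def release_shared(self):
        with self._cond:
            self._shared_holders -= 1
            self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            self._exclusive_waiting += 1
            while self._exclusive_held or self._shared_holders > 0:
                self._cond.wait()
            self._exclusive_waiting -= 1
            self._exclusive_held = True

    def release_exclusive(self):
        with self._cond:
            self._exclusive_held = False
            self._cond.notify_all()


def run_scenario(timeout_secs=5.0):
    """Return True if all three toy threads finish (no deadlock)."""
    lock = ToySharedExclusiveLock()
    cursor_pinned = threading.Event()
    cursor_killed = threading.Event()

    def get_more():
        lock.acquire_shared()
        cursor_pinned.set()
        # Stand-in for the fail point's busy wait: rather than holding
        # the shared lock for the whole loop, drop and reacquire it on
        # every iteration so a waiting exclusive request can proceed.
        while not cursor_killed.is_set():
            lock.release_shared()
            time.sleep(0.001)
            lock.acquire_shared()
        lock.release_shared()

    def journal_flush():
        cursor_pinned.wait()
        lock.acquire_exclusive()
        lock.release_exclusive()

    def kill_cursors():
        cursor_pinned.wait()
        lock.acquire_shared()
        cursor_killed.set()
        lock.release_shared()

    threads = [threading.Thread(target=f, daemon=True)
               for f in (get_more, journal_flush, kill_cursors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout_secs)
    return not any(t.is_alive() for t in threads)


print(run_scenario())
```

In this toy model, removing the release/reacquire pair inside the loop leaves the exclusive request blocked behind the spinning thread and the second shared request blocked behind the exclusive one, so `run_scenario` would report that threads are still alive after the timeout.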

Comment by David Storch [ 22/Dec/15 ]

kill_cursors.js includes a test for killing a pinned cursor. Since cursors are generally pinned for short periods of time (i.e. the time required to complete the getMore operation against that cursor), the test enables a fail point which causes the getMore thread to busy wait after pinning the cursor.

This can cause deadlock as follows:

  • The test enables the keepCursorPinnedDuringGetMore fail point.
  • It then runs a getMore command against an existing cursor id. Once the cursor is pinned, the getMore will spin until the fail point is disabled. While spinning, the thread servicing the getMore is holding the MMAPv1 flush lock in shared mode.
  • The MMAPv1 dur thread attempts to acquire the flush lock in exclusive mode. It blocks waiting for the getMore to release its shared lock.
  • The test runs a killCursors command in order to test killing the pinned cursor. The thread servicing the killCursors command attempts to acquire the MMAPv1 flush lock in shared mode.
  • At this point there is a deadlock: the getMore is waiting to be killed, the killCursors is waiting for the dur thread to finish, and the dur thread is waiting for the getMore.
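
The circular wait in the last step can be modeled as a small wait-for graph. This is only an illustrative Python sketch: the node names are labels taken from the description above, and `find_cycle` is a hypothetical helper, not part of the server or the test suite.

```python
# Wait-for edges as described in the steps above: each key is blocked
# on the value. The names are labels for illustration only.
waits_for = {
    "getMore": "killCursors",  # spins until the pinned cursor is killed
    "killCursors": "dur",      # waits for the dur thread to finish
    "dur": "getMore",          # waits for getMore's shared flush lock
}


def find_cycle(graph, start):
    """Follow wait-for edges from start; return the cycle as a list,
    or None if the chain ends without revisiting a node."""
    path = []
    node = start
    while node in graph:
        if node in path:
            return path[path.index(node):]
        path.append(node)
        node = graph[node]
    return None


cycle = find_cycle(waits_for, "getMore")
print(" -> ".join(cycle + [cycle[0]]))
# prints: getMore -> killCursors -> dur -> getMore
```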

Comment by Mark Benvenuto [ 22/Dec/15 ]

It also hangs on my PPC64le machine. The blocking thread holds a lock that the killCursor, clientCursorManager, and TTLMonitor threads are all waiting on.

Generated at Thu Feb 08 03:59:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.