[SERVER-48126] kill_pinned_cursor.js is not robust to periodic sharded index consistency checker Created: 12/May/20  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: Backlog - Cluster Scalability
Resolution: Unresolved Votes: 0
Labels: sharding-nyc-subteam2, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-48502 Tighten $currentOp and pinned cursor ... Closed
Assigned Teams:
Cluster Scalability
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Linked BF Score: 0
Story Points: 2

 Description   

kill_pinned_cursor.js targets its getMores fairly loosely here and here. If another getMore is running on the system at the same time (e.g. from the periodic sharded index consistency checker), the $currentOp can return more than one matching getMore, making the test fail here or here.

Similarly, if some other internal getMore is running and has cursors pinned when the parallel shell is started, the sleep that waits for the parallel shell to start up can short-circuit. Combined with the loose targeting above, this can cause the killFunc to kill the internal getMore instead of the test getMore. The test getMore is then never interrupted, and because it is waiting on a failpoint that is only switched off after the getMore returns, the test deadlocks and times out.

Based on this, kill_pinned_cursor.js should be updated to use $currentOp more precisely, so that it finds only the test getMore. For example, have the original find include a dummy query predicate (one that increments for each test) such as "foo1": {$exists: false}, and then have the $currentOp look only for getMores with that query predicate in the originating command.
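
A minimal sketch of the tighter targeting proposed above, not the actual change to the test. All helper names (nextMarkerField, buildFindFilter, buildCurrentOpPipeline, selectTestGetMores) are hypothetical and do not come from kill_pinned_cursor.js. It also assumes that $currentOp reports the originating find under cursor.originatingCommand, which may differ between server versions. The $match checks only that the marker field is present in the originating filter, which is enough to tell the test getMore apart from internal ones.

```javascript
// Hypothetical helpers: none of these names are taken from kill_pinned_cursor.js.
let markerCounter = 0;

// Returns a fresh marker field name ("foo1", "foo2", ...) for each test case.
function nextMarkerField() {
    markerCounter += 1;
    return "foo" + markerCounter;
}

// Filter for the original find. No document has the marker field, so the
// predicate matches every document and only serves to tag the command.
function buildFindFilter(markerField) {
    return {[markerField]: {$exists: false}};
}

// $currentOp pipeline that keeps only getMores whose originating find carries
// the marker field. The "cursor.originatingCommand" path is an assumption
// about the $currentOp output shape.
function buildCurrentOpPipeline(collName, markerField) {
    return [
        {$currentOp: {}},
        {
            $match: {
                "command.getMore": {$exists: true},
                "cursor.originatingCommand.find": collName,
                ["cursor.originatingCommand.filter." + markerField]: {$exists: true},
            }
        },
    ];
}

// Plain-JavaScript equivalent of the $match stage, so the selection logic can
// be checked against sample operation documents without a running cluster.
function selectTestGetMores(ops, collName, markerField) {
    return ops.filter(op => {
        const orig = op.cursor ? op.cursor.originatingCommand : undefined;
        return Boolean(op.command) && op.command.getMore !== undefined &&
            Boolean(orig) && orig.find === collName && Boolean(orig.filter) &&
            Object.prototype.hasOwnProperty.call(orig.filter, markerField);
    });
}
```

In the test itself, the pipeline would presumably be run with something like db.getSiblingDB("admin").aggregate(buildCurrentOpPipeline(collName, markerField)), and the killFunc would then act only on the single operation returned.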



 Comments   
Comment by Kevin Pulo [ 03/Jun/20 ]

Note that SERVER-48502 has fixed the issue with incorrect waiting for failpoints, but not the targeting of the killProc functions to kill the correct cursorId. This ticket therefore remains open to do that latter work.

Generated at Thu Feb 08 05:16:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.