[SERVER-50177] snapshot_read_at_cluster_time_crud_operations.js fails with CursorNotFound Created: 07/Aug/20  Updated: 29/Oct/23  Resolved: 16/Dec/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-49880 snapshot_read_at_cluster_time_crud_op... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2020-12-14, Repl 2020-12-28
Participants:
Linked BF Score: 12

 Description   

Similar to SERVER-49880. At random times, fsm_workloads/snapshot_read_at_cluster_time_crud_operations.js searches currentOp for getMore operations and calls killOp on them, which causes them to fail with "ShutdownInProgress" or "Interrupted". Sometimes, getMore fails with CursorNotFound (see Suganthi's explanation below). Update the test to handle this scenario.



 Comments   
Comment by Githook User [ 16/Dec/20 ]

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>

Message: SERVER-50177 snapshot_read_at_cluster_time_crud_operations.js should expect CursorNotFound
Branch: master
https://github.com/mongodb/mongo/commit/6ee6a9d6f6ff15fc65bf13baeab3717dfb72eb20

Comment by A. Jesse Jiryu Davis [ 13/Dec/20 ]

Reopening the original CR.

Comment by Lingzhi Deng [ 20/Nov/20 ]

This makes sense. Thanks for the investigation, Suganthi.

Comment by Suganthi Mani [ 20/Nov/20 ]

I think the original patch in this CR was abandoned because of the question below, raised by lingzhi.deng in the CR:

I think killOp only interrupts the operations in progress. And yes, the
interrupted getMore would clean up the cursor. But then the getMore should get
an Interrupted error and the test should stop running getMore using the same
cursor:
https://github.com/mongodb/mongo/blob/915402884c52da861b1660cd6a7172c552ce1806/jstests/concurrency/fsm_workload_helpers/snapshot_read_utils.js#L158-L160

Am I missing something here? I have concerns about adding CursorNotFound as an
expected error code here because this might mask real server bugs in cases
where the server mistakenly cleans up in-progress snapshot read cursors
when it shouldn't.

If killOp is an issue here, I would probably prefer to give up the test
coverage of the interaction with killOp. But it is still not clear to me how
exactly we got CursorNotFound.

When a find/getMore command is interrupted by a killOp command after it has generated the batch result (say, at line 545 for the find cmd and line 715 for the getMore cmd), the response returned by the find/getMore command still contains cursor.id. But when the ClientCursorPin destructor is called, the find/getMore notices that it was interrupted, which causes it to remove the cursor id from the cursorMap and destroy the cursor. Only the subsequent getMore command notices that the cursor id is no longer in the cursorMap and returns ErrorCodes::CursorNotFound. So it is valid for this line in the test snapshot_read_at_cluster_time_crud_operations.js to fail with ErrorCodes::CursorNotFound.
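The sequence described above can be illustrated with a toy model. This is only a sketch and not server code: ToyCursorManager, its methods, and the afterBatchHook parameter are hypothetical names invented for illustration, and the model ignores cursor exhaustion, timeouts, and concurrency. It only shows why the interrupted command still returns a cursor id while the next getMore fails with CursorNotFound (code 43).

```javascript
// Toy model of the race. All names here are hypothetical stand-ins.
const CURSOR_NOT_FOUND = 43;  // numeric value of ErrorCodes.CursorNotFound

class ToyCursorManager {
    constructor() {
        this.cursorMap = new Map();  // cursorId -> {docs, pos, interrupted}
        this.nextId = 1;
    }

    find(docs, batchSize) {
        const id = this.nextId++;
        this.cursorMap.set(id, {docs, pos: 0, interrupted: false});
        return this._runBatch(id, batchSize, () => {});
    }

    // afterBatchHook runs after the batch is built but before the cursor is
    // unpinned; it models the window in which killOp can land.
    getMore(id, batchSize, afterBatchHook = () => {}) {
        if (!this.cursorMap.has(id)) {
            const err = new Error("cursor id " + id + " not found");
            err.code = CURSOR_NOT_FOUND;
            throw err;
        }
        return this._runBatch(id, batchSize, afterBatchHook);
    }

    killOp(id) {
        const cursor = this.cursorMap.get(id);
        if (cursor) cursor.interrupted = true;
    }

    _runBatch(id, batchSize, afterBatchHook) {
        const cursor = this.cursorMap.get(id);
        const batch = cursor.docs.slice(cursor.pos, cursor.pos + batchSize);
        cursor.pos += batch.length;
        const response = {cursor: {id, batch}};  // response already has the id
        afterBatchHook();
        // Models the pin destructor: an interrupted op destroys its cursor.
        if (cursor.interrupted) this.cursorMap.delete(id);
        return response;
    }
}

const mgr = new ToyCursorManager();
const first = mgr.find([1, 2, 3, 4, 5, 6], 2);
// killOp lands after the batch is generated: the command still succeeds.
const second = mgr.getMore(first.cursor.id, 2, () => mgr.killOp(first.cursor.id));
let thirdErrorCode = null;
try {
    mgr.getMore(second.cursor.id, 2);
} catch (e) {
    thirdErrorCode = e.code;  // only this later getMore sees the missing cursor
}
```

In this model the killed getMore never reports Interrupted itself, which is why a client can legitimately reuse the cursor id and then hit CursorNotFound.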

Reopening this ticket so that the closed CR can be made active again.
Note that we have a similar test, snapshot_read_kill_operations.js, which also kills the find/getMore cmd and treats ErrorCodes.CursorNotFound as an acceptable error for the getMore cmd.
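One way a test could tolerate these outcomes is sketched below. This is an assumption-laden illustration, not the code in snapshot_read_utils.js or in either test: classifyGetMoreResponse and ACCEPTABLE_KILL_CODES are hypothetical names, and responses are modeled as plain objects with ok and code fields. The numeric values are the server's codes for CursorNotFound, ShutdownInProgress, and Interrupted.

```javascript
// Hypothetical helper: decide what a killOp-tolerant workload should do
// with a getMore response. Not taken from the actual jstests.
const ErrorCodes = {
    CursorNotFound: 43,
    ShutdownInProgress: 91,
    Interrupted: 11601,
};

// Errors that are expected when killOp races with find/getMore.
const ACCEPTABLE_KILL_CODES = new Set([
    ErrorCodes.CursorNotFound,
    ErrorCodes.ShutdownInProgress,
    ErrorCodes.Interrupted,
]);

function classifyGetMoreResponse(res) {
    if (res.ok === 1) {
        return "continue";  // keep iterating the same cursor
    }
    if (ACCEPTABLE_KILL_CODES.has(res.code)) {
        return "abandon";  // cursor is gone; stop issuing getMore on it
    }
    // Anything else is unexpected and should still fail the test.
    throw new Error("unexpected getMore failure, code " + res.code);
}
```

Keeping the accepted set this narrow addresses part of the concern raised in the CR: any other failure code still surfaces as a test failure instead of being silently ignored.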

Comment by A. Jesse Jiryu Davis [ 10/Aug/20 ]

Closing this ticket and unassigning the BF from myself. Perhaps someone else can see the problem from a fresh perspective.
