[SERVER-41327] Investigate a query failure with CursorKilled along with failure in obtaining lock for stats collection Created: 27/May/19  Updated: 27/Oct/23  Resolved: 12/Aug/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Miguel Angel Nieto Assignee: Anton Korshunov
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-41351 Improve error message from failure to... Closed
Sprint: Query 2019-07-01, Query 2019-07-15, Query 2019-07-29, Query 2019-08-12, Query 2019-08-26
Participants:
Case:

 Description   

When a long running query is executed MongoDB tries to acquire storage statistics. If the lock request for that operation times out, it fails with the following message:

[conn4650] Timed out obtaining lock while trying to gather storage statistics for a slow operation

Then, we can see that the cursor is killed:

Error in $cursor stage :: caused by :: operation was interrupted" errName:CursorKilled errCode:237 reslen:292

That query needed 257292ms to execute, to then get killed because it couldn't acquire a global lock to gather statistics (500ms timeout).

I think that in the cases where the lock request timeouts MongoDB should skip the gathering instead of killing the cursor.

Edit:
We already skip storage statistics collection in case of a failure to obtain MODE_IS global lock. In such a case an error is reported but the query proceeds as normal.
This ticket should root cause the reason why the query failed with CursorKilled. A failure to obtain MODE_IS global lock might be hinting towards the state of the system.



 Comments   
Comment by Anton Korshunov [ 12/Aug/19 ]

As agreed, closing this ticket for now. Feel free to reopen if you need further assistance.

Comment by Miguel Angel Nieto [ 02/Aug/19 ]

Thanks for the analysis. Just to add something extra, there is no mongos here. It is a 3 nodes replica set.

I will contact the customer.

Comment by Sulabh Mahajan [ 28/May/19 ]

miguel.nieto, I do not think the query got killed because of failure to acquire a lock here.
It will help if you can share logs from when this occurred and any steps that could help reproduce this issue.

Comment by Sulabh Mahajan [ 27/May/19 ]

            try {                                                                
                Lock::GlobalLock lk(opCtx,                                       
                                    MODE_IS,                                     
                                    Date_t::now() + Milliseconds(500),           
                                    Lock::InterruptBehavior::kLeaveUnlocked);    
                if (lk.isLocked()) {                                             
                    _debug.storageStats = opCtx->recoveryUnit()->getOperationStatistics();
                } else {                                                         
                    log(component) << "Timed out obtaining lock while trying to gather storage "
                                      "statistics for a slow operation";         
                }                                                                
            } catch (const ExceptionForCat<ErrorCategory::Interruption>&) {      
                log(component)                                                   
                    << "Interrupted while trying to gather storage statistics for a slow operation";
            }                                                                    
        }                                                                        

I think that in the cases where the lock request timeouts MongoDB should skip the gathering instead of killing the cursor.

That is how we intend it to work. From the code referenced above, if we do not get the lock, we skip storage statistics collection and proceed with report generation. Failure to get the lock or a lock interrupt results in logging a message and nothing more.

Generated at Thu Feb 08 04:57:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.