[SERVER-58356] Cannot kill the dataSize operation Created: 07/Jul/21  Updated: 29/Oct/23  Resolved: 23/Sep/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.6
Fix Version/s: 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Ivan Grigolon Assignee: Denis Grebennicov
Resolution: Fixed Votes: 2
Labels: query-director-triage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-58358 Datasize command does not validate th... Closed
related to SERVER-71759 dataSize command doesn't yield Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Testing:

 

Create a significant amount of data so that the dataSize command runs long enough for you to have time to attempt to kill the operation.

 
Run the dataSize command and quickly find the op to kill (next step):

 MongoDB Enterprise > db.runCommand({ dataSize: "database.collection" })
{ "estimate" : false, "size" : 9393026191, "numObjects" : 3293296, "millis" : 14353, "ok" : 1 }

 

Find and kill the op (as shown below, the op does not actually get killed):

MongoDB Enterprise > db.getSiblingDB("admin").aggregate( [{ $currentOp : { allUsers: true, idleSessions: true } },{$match: {"command.dataSize":  "database.collection"}},{$project:{opid:1}}] )
{ "opid" : 1133128 }
MongoDB Enterprise > db.killOp(1133128)
{ "info" : "attempting to kill op", "ok" : 1 }
MongoDB Enterprise > db.getSiblingDB("admin").aggregate( [{ $currentOp : { allUsers: true, idleSessions: true } },{$match: {"command.dataSize":  "database.collection"}},{$project:{opid:1}}] )
{ "opid" : 1133128 }
MongoDB Enterprise > db.getSiblingDB("admin").aggregate( [{ $currentOp : { allUsers: true, idleSessions: true } },{$match: {"command.dataSize":  "database.collection"}},{$project:{opid:1}}] )
{ "opid" : 1133128 }
MongoDB Enterprise > db.getSiblingDB("admin").aggregate( [{ $currentOp : { allUsers: true, idleSessions: true } },{$match: {"command.dataSize":  "database.collection"}},{$project:{opid:1}}] )
{ "opid" : 1133128 }
MongoDB Enterprise > db.getSiblingDB("admin").aggregate( [{ $currentOp : { allUsers: true, idleSessions: true } },{$match: {"command.dataSize":  "database.collection"}},{$project:{opid:1}}] )
{ "opid" : 1133128 }
MongoDB Enterprise > db.getSiblingDB("admin").aggregate( [{ $currentOp : { allUsers: true, idleSessions: true } },{$match: {"command.dataSize":  "database.collection"}},{$project:{opid:1}}] )
{ "opid" : 1133128 }
MongoDB Enterprise > db.getSiblingDB("admin").aggregate( [{ $currentOp : { allUsers: true, idleSessions: true } },{$match: {"command.dataSize":  "database.collection"}},{$project:{opid:1}}] )
{ "opid" : 1133128 }
MongoDB Enterprise > db.getSiblingDB("admin").aggregate( [{ $currentOp : { allUsers: true, idleSessions: true } },{$match: {"command.dataSize":  "database.collection"}},{$project:{opid:1}}] )
{ "opid" : 1133128 }
MongoDB Enterprise > db.getSiblingDB("admin").aggregate( [{ $currentOp : { allUsers: true, idleSessions: true } },{$match: {"command.dataSize":  "database.collection"}},{$project:{opid:1}}] )
{ "opid" : 1133128 }
MongoDB Enterprise > db.getSiblingDB("admin").aggregate( [{ $currentOp : { allUsers: true, idleSessions: true } },{$match: {"command.dataSize":  "database.collection"}},{$project:{opid:1}}] )
{ "opid" : 1133128 }
MongoDB Enterprise > db.getSiblingDB("admin").aggregate( [{ $currentOp : { allUsers: true, idleSessions: true } },{$match: {"command.dataSize":  "database.collection"}},{$project:{opid:1}}] )
MongoDB Enterprise >
 

From the logs:

{"t":{"$date":"2021-07-07T11:25:07.476+10:00"},"s":"I",  "c":"COMMAND",  "id":20482,   "ctx":"conn12","msg":"Going to kill op","attr":{"opId":1133128}}
{"t":{"$date":"2021-07-07T11:25:07.477+10:00"},"s":"I",  "c":"COMMAND",  "id":20884,   "ctx":"conn12","msg":"Killed operation","attr":{"opId":1133128}}        >> This says killed, although the operation kept going
{"t":{"$date":"2021-07-07T11:25:16.670+10:00"},"s":"W",  "c":"COMMAND",  "id":20525,   "ctx":"conn17","msg":"Failed to gather storage statistics for slow operation","attr":{"opId":1133128,"error":"lock acquire timeout"}}
{"t":{"$date":"2021-07-07T11:25:16.671+10:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn17","msg":"Slow query","attr":{"type":"command","ns":"database.collection","appName":"MongoDB Shell","command":{"dataSize":"database.collection","lsid":{"id":{"$uuid":"12825971-559f-4a68-9022-86bc601ca7ac"}},"$db":"test"},"numYields":0,"reslen":91,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":14353}}.  >> THIS indicate that the operation started around "2021-07-07T11:25:02 more or less"
{"t":{"$date":"2021-07-07T11:25:25.269+10:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn17","msg":"Connection ended","attr":{"remote":"127.0.0.1:58530","connectionId":17,"connectionCount":2}}

  
 
 
This is a small test, so the gap between the kill attempt and the operation's completion is small. For customers with a larger data set, however, the only workaround is to restart the node.

Sprint: QE 2021-08-09, QE 2021-08-23, QE 2021-09-20, QE 2021-10-04
Participants:
Case:

 Description   

Cannot kill the dataSize operation; see the steps to reproduce.



 Comments   
Comment by Githook User [ 21/Sep/21 ]

Author:

{'name': 'Denis Grebennicov', 'email': 'denis.grebennicov@mongodb.com', 'username': 'denis631'}

Message: SERVER-58356 Cannot kill the dataSize operation
Branch: master
https://github.com/mongodb/mongo/commit/17248c3102c9e0ded724069c34f3e43a7da6382b

Comment by Kyle Suarez [ 24/Aug/21 ]

After discussion in Query Execution triage, we suspect that the original author of the command wrote it without yielding because they wanted to guarantee an accurate data size. If the query were to yield and later resume, the final size could be a number that doesn't represent the actual size of the collection at any single point in time.

ian.boros suggests that changing the yield policy here to INTERRUPT_ONLY would make the operation interruptible without forcing it to yield, so an uninterrupted run would still return an accurate count.

Moving this into the Quick Wins bucket. When we pick this up we need to ensure that any change to the yield policy has the intended interruptible effect without affecting data size correctness when the command is not interrupted.
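
A minimal sketch of that suggestion, not the actual patch: the helper name makeDataSizeExecutor below is hypothetical and stands in for however the command really builds its executor; the only relevant change is the policy value.

    // Hypothetical helper name; the real command obtains its executor from the
    // server's internal planner. With INTERRUPT_ONLY the executor checks its
    // OperationContext for interruption at yield points without ever releasing
    // locks, so the scan stays accurate while becoming killable via killOp.
    auto exec = makeDataSizeExecutor(
        opCtx, collection, PlanYieldPolicy::YieldPolicy::INTERRUPT_ONLY /* was NO_YIELD */);

Because the executor never actually gives up its locks under this policy, the "accurate size" property that motivated NO_YIELD is preserved; the operation simply starts honoring kill requests.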

Comment by Kyle Suarez [ 20/Aug/21 ]

After a quick chat with Benety and some code examination, I am inclined to think this code does lie in the Query Execution realm:

During the execution of the command, we initiate collection or index scans with PlanYieldPolicy::YieldPolicy::NO_YIELD. We then manually iterate over the PlanExecutor in a loop and never check for interrupt:

        try {
            RecordId loc;
            while (PlanExecutor::ADVANCED == exec->getNext(static_cast<BSONObj*>(nullptr), &loc)) {
                if (estimate)
                    size += avgObjSize;
                else
                    size += collection->getRecordStore()->dataFor(opCtx, loc).size();
 
                numObjects++;
 
                if ((maxSize && size > maxSize) || (maxObjects && numObjects > maxObjects)) {
                    result.appendBool("maxReached", true);
                    break;
                }
            }
        } catch (DBException& exception) {

Sending back to the Query Execution triage queue for scheduling and discussing the viability of

  • updating the yield policy used in the collection and index scans; and
  • checking for interrupt inside that manual loop, or otherwise replacing the loop with a mechanism that respects yielding or interruption (a sketch follows below).
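
A minimal sketch of the second bullet, reusing the loop quoted above (the placement and frequency of the check are illustrative only, not the actual fix):

        try {
            RecordId loc;
            while (PlanExecutor::ADVANCED == exec->getNext(static_cast<BSONObj*>(nullptr), &loc)) {
                // Throws if killOp (or stepdown/shutdown) has marked this operation as killed.
                opCtx->checkForInterrupt();

                if (estimate)
                    size += avgObjSize;
                else
                    size += collection->getRecordStore()->dataFor(opCtx, loc).size();

                numObjects++;

                if ((maxSize && size > maxSize) || (maxObjects && numObjects > maxObjects)) {
                    result.appendBool("maxReached", true);
                    break;
                }
            }
        } catch (DBException& exception) {
            // ...existing error handling, truncated in the excerpt above...
        }

In practice the check would likely be rate-limited (e.g. once every few hundred documents) or delegated to the yield policy as discussed above, rather than performed on every record.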
Comment by Ana Meza [ 19/Aug/21 ]

Kyle, could you please find out who owns the dataSize command?
