[SERVER-79434] Bug running spill to disk in mongosync passthroughs Created: 27/Jul/23  Updated: 16/Oct/23  Resolved: 16/Oct/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Rohan Sharan Assignee: Ivan Fefer
Resolution: Won't Fix Votes: 0
Labels: greenerbuild
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Assigned Teams:
Query Execution
Operating System: ALL
Sprint: QE 2023-10-16, QE 2023-10-30
Participants:

 Description   

I tried re-enabling spill_to_disk after SERVER-77185 was done and backported, and even with the backports, I'm still seeing problems. Here are the problematic logs: https://parsley.mongodb.com/resmoke/e3675d4e2151a916de0ca82a4fb29a9c/all?bookmarks=0,4585,129264,173946&filters=000localhost%253A20753,000js_test_try,000j3%253Acl0%253As1%253Aprim&shareLine=129264

 

Essentially, it seems that at some point while the test is running, j3:cl0:s1:prim throws a QueryExceededMemoryLimitNoDiskUseAllowed error, and after that it never recovers and is no longer reachable by the other nodes.
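For reference, a minimal mongo shell sketch (an illustration only, not the actual spill_to_disk.js contents; collection and field names are made up) of the kind of command that produces this error: a memory-intensive $group run with allowDiskUse disabled fails with QueryExceededMemoryLimitNoDiskUseAllowed once the in-memory limit is exceeded.

// Hypothetical collection/field names, for illustration only.
const res = db.getSiblingDB("test").runCommand({
    aggregate: "coll",
    pipeline: [{$group: {_id: "$key", vals: {$push: "$payload"}}}],
    allowDiskUse: false,  // forbid spilling, so the $group must stay within its memory cap
    cursor: {}
});
// With enough data, res.codeName is expected to be "QueryExceededMemoryLimitNoDiskUseAllowed".
printjson(res);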



 Comments   
Comment by Ivan Fefer [ 16/Oct/23 ]

Unable to reproduce the issue. It looks like the problem is outside spill_to_disk.js test and the server.

Comment by Ivan Fefer [ 09/Oct/23 ]

Also, the latest execution succeeded, but accumulator_js_size_limits.js took 220 seconds to complete, with a single query accounting for most of that time.
[j1:cl0:s1:prim] | 2023-10-09T14:18:18.877+00:00 I COMMAND 51803 [conn64] "Slow query","attr":{"type":"command","ns":"staging.watermelon","appName":"MongoDB Shell","command":{"aggregate":"watermelon","pipeline":[{"$addFields":{"a":{"$range":[0,1000000]}}},{"$unwind":"$a"},{"$group":{"_id":"$groupBy","count":{"$accumulator":{"init":{"$code":"function() {\n return \"\";\n }"},"accumulateArgs":[{"fieldName":"$a"}],"accumulate":{"$code":"function(state, args) {\n return state + \"a\";\n }"},"merge":{"$code":"function(state1, state2) {\n return state1 + state2;\n }"},"finalize":{"$code":"function(state) {\n return state.length;\n }"}}}}},{"$limit":10}],"cursor":{"batchSize":101},"let":{},"fromMongos":false,"$_generateV2ResumeTokens":false,"readConcern":{"level":"local","provenance":"implicitDefault"},"writeConcern":{"w":"majority","wtimeout":0,"provenance":"implicitDefault"},"shardVersion":{"t":{"$timestamp":{"t":0,"i":0}},"e":{"$oid":"000000000000000000000000"},"v":{"$timestamp":{"t":0,"i":0}}},"databaseVersion":{"uuid":{"$uuid":"5952481b-83b1-47af-b8d4-ba9c5bcf7ffc"},"timestamp":{"$timestamp":{"t":1696860872,"i":1}},"lastMod":1},"lsid":{"id":{"$uuid":"bd5169cf-043c-4e09-8049-0f8db356c9fd"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1696860879,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1696860877,"i":1}},"$topologyTime":{"$timestamp":{"t":1696860862,"i":11}},"$client":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"6.0.8-43-g806d10c"},"os":{"type":"Linux","name":"Red Hat Enterprise Linux release 8.0 (Ootpa)","architecture":"x86_64","version":"Kernel 4.18.0-80.1.2.el8_0.x86_64"},"mongos":{"host":"ip-10-128-213-98.ec2.internal:20255","client":"127.0.0.1:57474","version":"6.0.8-43-g806d10c"}},"mayBypassWriteBlocking":false,"$db":"staging"},"planSummary":"COLLSCAN","numYields":0,"queryHash":"9B1F31E9","queryFramework":"classic","ok":0,"errMsg":"PlanExecutor error during aggregation :: caused by :: $accumulator arguments exceed max BSON size: 16777228","errName":"Location4545000","errCode":4545000,"reslen":396,"locks":{"FeatureCompatibilityVersion":{"acquireCount":{"r":2}},"Global":{"acquireCount":{"r":2}},"Mutex":{"acquireCount":{"r":3}}},"readConcern":{"level":"local","provenance":"implicitDefault"},"writeConcern":{"w":"majority","wtimeout":0,"provenance":"implicitDefault"},"storage":{},"remote":"127.0.0.1:47074","protocol":"op_msg","durationMillis":219626}
 

Comment by Ivan Fefer [ 09/Oct/23 ]

I have restarted the task a couple of times, and it has failed 3 times out of 6.

The failures are similar to the original failure; however, it seems that the spill_to_disk.js test finished successfully in these runs, and they hung on accumulator_js_size_limits.js instead.

The common denominator of the two tests is that they both spill data to disk, and according to the logs, the hang or death of one of the mongod processes happened after some data had been spilled to disk.

The lack of logs suggests the process is either stuck waiting for something or was killed with SIGKILL or something similar; most other failures leave at least "Segmentation fault (core dumped)" in the worst case, and a full stack trace plus an error message in the best case.

I suspect that the problem is on the environment side, not on the mongod side: the disk ran out of inodes, the process exceeded its open-file-descriptor limit, or something along those lines.

Generated at Thu Feb 08 06:40:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.