[SERVER-25868] Incomplete FTDC files on Windows Created: 30/Aug/16  Updated: 28/Apr/17  Resolved: 07/Sep/16

Status: Closed
Project: Core Server
Component/s: Diagnostics
Affects Version/s: 3.2.0, 3.2.9, 3.3.9, 3.3.10, 3.3.11
Fix Version/s: 3.2.13, 3.3.14

Type: Task Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Mark Benvenuto
Resolution: Done Votes: 0
Labels: bkp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 338.png     PNG File 339.png     File dd-338.tar     File dd-339.tar     PNG File ftdc-stacks.png    
Issue Links:
Backports
Related
is related to SERVER-25865 $group operation is slow since MongoD... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.2
Sprint: Platforms 2016-09-19
Participants:

 Description   

Using the test case from SERVER-25865, in 3.3.9 and later, for the duration of the unusally slow $group operation, marked by the interval from A to B below, no FTDC samples are captured:

This did not occur in 3.3.8:

Since the $group operation in the test case from SERVER-25865 is only slow on Windows, we can't tell from this test case whether this problem only affects Windows or also affects Linux.

Also note that while the FTDC stall only occurs in 3.3.9 and later, the slow $group operation on Windows from SERVER-25865 occurs all the way back to 3.2.

FTDC data for 3.3.8 and 3.3.9 is attached.



 Comments   
Comment by Githook User [ 17/Apr/17 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-25868 Incomplete FTDC files on Window

(cherry picked from commit f73ae5650e6fb65268f26c8eeb9dc6aa0a2f474f)
Branch: v3.2
https://github.com/mongodb/mongo/commit/204f1ea4992d45c561a888967d791751328cd677

Comment by Githook User [ 17/Apr/17 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-25868 Incomplete FTDC files on Windows

(cherry picked from commit b5277068ff66e7fe2b2a98144aa6993ca6e17f5c)
Branch: v3.2
https://github.com/mongodb/mongo/commit/145be8c08243566275b86eefa8939f05eb823173

Comment by Githook User [ 07/Sep/16 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-25868 Incomplete FTDC files on Window
Branch: master
https://github.com/mongodb/mongo/commit/f73ae5650e6fb65268f26c8eeb9dc6aa0a2f474f

Comment by Githook User [ 07/Sep/16 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-25868 Incomplete FTDC files on Windows
Branch: master
https://github.com/mongodb/mongo/commit/b5277068ff66e7fe2b2a98144aa6993ca6e17f5c

Comment by Bruce Lucas (Inactive) [ 30/Aug/16 ]

It looks like the problem of incomplete metrics.* files was present in 3.2 as well - confirmed on 3.2.9 and 3.2.0.

Comment by Bruce Lucas (Inactive) [ 30/Aug/16 ]

Good catch; looks like the 3.3.8 metrics.2016-* file (but not metrics.interim file) is also short (last BSON document incomplete). The lengths of both files are multiples of 4 kB, so I suspect an incomplete write somewhere. I imagine the difference between 3.3.8 and 3.3.9 in my test relates to the exact chunk structure of the FTDC file, and the alignment with the $group command relates to that chunk structure as well, since execution of a command could change the serverStatus schema and require starting a new chunk. In other words, looks like a more general issue with incomplete FTDC files, and goes back prior to 3.3.9.

Comment by Mark Benvenuto [ 30/Aug/16 ]

I have not been able to repro this on 6 core x64, 32 GB RAM Windows 10 with either 3.3.9 or 3.3.12. BTW, the 3.3.9 diagnostic data file: metrics.2016-08-30T14-03-24Z-00000 is missing data as it comes up short for the last BSON document.

Here is the test I used:

db.collection.drop();
for (var i = 0; i < 40000; ++i) { db.collection.insert({x: Math.floor(Math.random()*1000000)});}
db.collection.createIndex({x: 1});
print("run 1");
var start = new Date().getTime(); db.collection.aggregate( [{$group: {_id: "$x", value: {$sum: 1}}}] ); var end = new Date().getTime(); var time = end - start; print(time);
sleep(5000);
print("run 2");
var start = new Date().getTime(); db.collection.aggregate( [{$group: {_id: "$x", value: {$sum: 1}}}] ); var end = new Date().getTime(); var time = end - start; print(time);
sleep(5000);
print("run 3");
var start = new Date().getTime(); db.collection.aggregate( [{$group: {_id: "$x", value: {$sum: 1}}}] ); var end = new Date().getTime(); var time = end - start; print(time);

Comment by Bruce Lucas (Inactive) [ 30/Aug/16 ]

On Windows. On Linux the $group operation completes very quickly so we can't tell (from this test case) whether it also affects Linux.

Comment by Daniel Pasette (Inactive) [ 30/Aug/16 ]

was this taken on windows or linux?

Comment by Bruce Lucas (Inactive) [ 30/Aug/16 ]

Periodic stack traces captured on Windows during the test run show the slow $group operation, marked by the blue lines below. However during this period, when no FTDC samples were captured, we don't see the FTDC thread doing anything unusual - as expected, we never see it doing anything but sleeping waiting to produce the next sample. Maybe for some reason the sleep isn't waking up at the expected time during the slow $group operation?

Generated at Thu Feb 08 04:10:29 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.