[SERVER-67518] Aggregate metric value continually increments when no aggregates are run Created: 24/Jun/22  Updated: 27/Oct/23  Resolved: 11/Aug/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 6.0.0-rc11
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Oliver Bucaojit Assignee: Allison Easton
Resolution: Gone away Votes: 0
Labels: shardingemea-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File data-issues.tar     File data-no-issues.tar     PNG File image-2022-07-07-14-03-13-999.png     PNG File lv with issues.png    
Issue Links:
Depends
Problem/Incident
Related
related to SERVER-66943 Do not run aggregation for orphans ag... Closed
related to SERVER-64242 Make `collStats` aggregation stage re... Closed
Operating System: ALL
Steps To Reproduce:

$ curator artifacts download  --target osx --arch x86_64 --version 6.0-stable --edition enterprise --path /tmp/v60
 
$ mlaunch init --replicaset  --binarypath /tmp/v60/mongodb-macos-x86_64-enterprise-6.0.0-rc11/bin
 
$ mongo 
# Expecting "total" to be 0
 
MongoDB Enterprise replset:PRIMARY> db.serverStatus().metrics.commands.aggregate
{ "failed" : NumberLong(0), "total" : NumberLong(30) }
MongoDB Enterprise replset:PRIMARY> db.serverStatus().metrics.commands.aggregate
{ "failed" : NumberLong(0), "total" : NumberLong(31) }
MongoDB Enterprise replset:PRIMARY> db.serverStatus().metrics.commands.aggregate
{ "failed" : NumberLong(0), "total" : NumberLong(32) }
MongoDB Enterprise replset:PRIMARY> db.serverStatus().metrics.commands.aggregate
{ "failed" : NumberLong(0), "total" : NumberLong(33) }
MongoDB Enterprise replset:PRIMARY> db.serverStatus().metrics.commands.aggregate
{ "failed" : NumberLong(0), "total" : NumberLong(34) }
MongoDB Enterprise replset:PRIMARY> db.serverStatus().metrics.commands.aggregate
{ "failed" : NumberLong(0), "total" : NumberLong(35) }

Sprint: Sharding EMEA 2022-08-08, Sharding EMEA 2022-08-22
Participants:
Story Points: 3

 Description   

MongoDB v6.0
Replica set setup

The db.serverStatus().metrics.commands.aggregate value increments about every second on all nodes of the replica set, even on a system where no aggregations have been run by the user.

This behavior differs from version 5.3 and earlier. We have tests that expect the value to be 0 when no queries have been run; those tests now fail.
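A minimal sketch of how a test could measure this background increment by sampling the counter twice (plain JavaScript, with the serverStatus() documents stubbed as objects here; the helper name is illustrative, not part of any MongoDB API):

```javascript
// Hypothetical helper: given two serverStatus() documents sampled a few
// seconds apart with no user aggregations in between, return how much
// the aggregate "total" counter climbed on its own.
function backgroundAggregateDelta(before, after) {
    const totalBefore = before.metrics.commands.aggregate.total;
    const totalAfter = after.metrics.commands.aggregate.total;
    return totalAfter - totalBefore;
}

// On an idle pre-6.0 node this delta should be 0; per this report,
// on 6.0.0-rc11 it grows by roughly one per second.
const idleDelta = backgroundAggregateDelta(
    { metrics: { commands: { aggregate: { failed: 0, total: 30 } } } },
    { metrics: { commands: { aggregate: { failed: 0, total: 35 } } } }
);
// idleDelta is 5 here, matching the transcript above.
```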



 Comments   
Comment by Allison Easton [ 11/Aug/22 ]

Perfect, I will close this then. Let me know if you need any further information.

Comment by Oliver Bucaojit [ 10/Aug/22 ]

Thanks allison.easton@mongodb.com for the details and explanation. 

Yes, the changes are sufficient; we are checking the aggregation values for the replica sets, and this fix covers that case. The options for setting the expected behavior on a sharded cluster will be helpful as well.

Comment by Allison Easton [ 10/Aug/22 ]

Hi oliver.bucaojit@mongodb.com and chris.kelly@mongodb.com.
To summarize the earlier comment: this has already been changed on master and the 6.0 branch and will be included in the next release of 6.0. On replica sets this will no longer happen; on sharded clusters, the aggregation counter can still increase during step-up, but not continuously as it did on previously released 6.0 versions.
Is this sufficient, or is there something else needed from this ticket?

Comment by Allison Easton [ 28/Jul/22 ]

Hi chris.kelly@mongodb.com, I can give some information on why this is happening and how to work around it if needed. The behavior has changed recently on master and the 6.0 branch; I have included a description of what behavior to expect where.

The aggregation in question was added to the collStats command to return the number of orphaned documents as part of the collStats output. The collStats command is run as part of gathering FTDC data, which is why the aggregation happens about once a second. It was added in 6.0, which is why this doesn't happen on 5.3.

On 6.0.0, this aggregation is run every time collStats is called (on both replica sets and sharded clusters). On master and the current 6.0 branch (but not the released version of 6.0), the aggregation is skipped for replica sets and runs much less often for sharded clusters.

On replica sets, the aggregation was removed by SERVER-66943 on master and by BACKPORT-12944 on 6.0. After these commits, the aggregation counter should not increase on a replica set due to this aggregation.

On sharded clusters, the aggregation was made less common by SERVER-64242 on master and BACKPORT-12945 on 6.0. After these commits, the aggregation will only be run on startup and during step-up, while an internal component of sharding is still being initialized.

One option to prevent the aggregation on sharded clusters, or on replica sets before BACKPORT-12944, is to disable FTDC. That way the extra aggregations only happen if collStats is called directly.

Disabling FTDC can be done by using the setParameter flag on all nodes to set "diagnosticDataCollectionEnabled" to 0.

Ex: mongod --setParameter "diagnosticDataCollectionEnabled=0"

If this option is passed as a startup parameter for the nodes, the aggregation count should be 0, the same as the behavior before 6.0. If the option is set after starting a node, it will prevent any more aggregations from happening, but some will likely have occurred before the parameter was set, leaving the value greater than 0.
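For completeness, the same parameter can also be flipped at runtime from the shell (with the caveat above: this only stops future FTDC-driven aggregations, and counts already accumulated remain). A sketch, assuming a shell connected to each node with admin privileges:

```javascript
// Run against each node. Disables FTDC data collection at runtime,
// which stops the periodic collStats call (and its orphan-counting
// aggregation) from that point on.
db.adminCommand({ setParameter: 1, diagnosticDataCollectionEnabled: false });

// Verify the parameter's current value.
db.adminCommand({ getParameter: 1, diagnosticDataCollectionEnabled: 1 });
```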

Comment by Chris Kelly [ 07/Jul/22 ]

At log level 2, this is what's getting captured in the problematic 6.0 copy.

By contrast, absolutely nothing is captured in the working 6.0 copy I shared.

I see some tests that pertain to rangeDeletions, orphans, and the FCV value here:

https://github.com/mongodb/mongo/commits/9022ee2c1454336265e3f50d2bf43a86ec56c0e9/jstests/sharding/range_deletions_setFCV.js

It relies on the FCV value being 6.0 to trigger: if FCV isn't set to 6.0, the incrementing doesn't happen, and changing it from 6.0 to something else stops it.
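To check which FCV a node is running (and hence whether this trigger applies), the standard admin commands can be used; a sketch against a connected shell, shown only to illustrate the trigger described above, not as a recommended workaround:

```javascript
// Inspect the current featureCompatibilityVersion.
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 });

// Lowering FCV (e.g. to "5.0") stops the incrementing, per the
// observation above.
db.adminCommand({ setFeatureCompatibilityVersion: "5.0" });
```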

Comment by Chris Kelly [ 07/Jul/22 ]

I've noticed something strange with this. I observe the reported behavior on both community and enterprise 6.0.0-rc11 on Evergreen, as well as externally on Ubuntu 20.04 in WSL.

However, I have somehow managed to create a reproducible situation where 6.0 does not increment this metric. I used m/mlaunch to create two data folders while running 6.0.0-rc11. With one (data-issues.tar), the metric increases as described; with the other (data-no-issues.tar), it does not, even on the same version. To reproduce:

  • Download the tar
  • Extract it (tar -xvf) and rename the folder to "data"
  • In .mlaunch_startup, replace all instances of "likai" with your username ("ubuntu" on Evergreen by default)
  • Install 6.0.0-rc11 (using m: 6.0.0-rc11 or 6.0.0-rc11-ent)
  • mlaunch start
  • Repeat for the other folder to observe the different behavior when running the metric command

I also tested this on 5.3 and some other 6.0 RCs. It was not present on 5.3, but was present on all the 6.0 builds I tested.

CURRENTLY TESTING:

SPECULATION:

I'm not sure what I did to make 6.0 stop incrementing this metric with this data folder. The only modification I recall making was updating the required libraries for enterprise MongoDB using https://www.mongodb.com/docs/v6.0/tutorial/install-mongodb-enterprise-on-ubuntu-tarball/. Then reinstalling via m (an m rm 6.0.0-rc11, then m 6.0.0-rc11) plus another mlaunch init made subsequent versions show the issue. However, if that were the cause, I would be confused as to why I can swap between these two data folders and observe different behavior in the same environment.

Generated at Thu Feb 08 06:08:20 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.