[SERVER-26171] Gradual Degradation of Performance over days Created: 19/Sep/16  Updated: 06/Jan/17  Resolved: 06/Jan/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.0.12
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Michael Tewner Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux on AWS - Sharded Replica Set - 2 Shards, 3-node Replica Sets


Attachments: PNG File Screen Shot 2016-09-19 at 11.05.49 AM.png     PNG File Screen Shot 2016-10-06 at 7.10.12 PM.png     PNG File Screen Shot 2016-10-06 at 7.10.36 PM.png     Text File mongo.txt     Text File shard1iostat.log     File shard1ss.tar.gz     Text File shard2iostat.log     File shard2ss.tar.gz    
Operating System: ALL
Participants:

 Description   

Wired Tiger performance degrades over time:

I've attached a graph of CPU "idle" time.

The first 2/3 of the attached graph shows our database primaries working harder each day. The actual number of DB Ops is generally stable over this time period.

On September 10th, I bounced the primaries over to a separate set of servers. These new servers also experienced gradual degradation, and the previous primaries (now secondaries) continued to have less idle time than the new primaries and the other set of secondaries.

On September 13th, I upgraded the servers to larger instances and rebooted a number of servers. After a reboot, all servers behave normally at first, but the new primaries are now starting to slow down.

(1) Has anyone seen something like this before?
(2) How can I help track this down?
-Mike



 Comments   
Comment by Kelsey Schubert [ 06/Jan/17 ]

Hi tewner,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you after upgrading, please provide additional information and we will reopen the ticket.

Kind regards,
Thomas

Comment by Kelsey Schubert [ 18/Nov/16 ]

Hi tewner and may,

We have not been able to identify the root cause of the gradual change in CPU load. My understanding is that this does not impact the performance of the cluster, as the number of database operations remains consistent. If this is still an issue for you, would you please consider upgrading to MongoDB 3.2 or testing the latest MongoDB 3.4 release candidate? These branches contain many improvements, which may significantly change the behavior you are observing.

Thank you,
Thomas

Comment by may zahav [ 06/Oct/16 ]

Hi Thomas Schubert,

We are doing a weekly backup of MongoDB. Looking at Grafana, it seems that the day after a backup the servers have less load than usual, and the load then increases a little more each day until the next backup. I attached a screenshot of the CPU idle graph over a 21-day time range.
Furthermore, we looked at the graph of the dirty cache over the exact same time range and saw that it is relatively stable (also attached).
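
For a numeric cross-check of the dirty-cache graph, here is a minimal mongo shell sketch (the serverStatus field names below are assumptions based on typical WiredTiger output and may differ by version):

    // Compare dirty cache bytes against total cache usage and the configured maximum.
    var cache = db.serverStatus().wiredTiger.cache;
    printjson({
        maxConfiguredBytes: cache["maximum bytes configured"],
        currentBytes:       cache["bytes currently in the cache"],
        dirtyBytes:         cache["tracked dirty bytes in the cache"]
    });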

Thank you!

Comment by Kelsey Schubert [ 27/Sep/16 ]

Thanks for uploading the logs, may. We're still investigating this issue and will update this ticket when we know more.

Comment by may zahav [ 22/Sep/16 ]

Hi, I attached the logs with the relevant output.

Thank you, May.

Comment by Kelsey Schubert [ 21/Sep/16 ]

Hi may,

Thanks for providing the output of the command. I see that the maximum tree depth of this collection is 3, which indicates that this collection is not affected by SERVER-21063.

Unfortunately, to continue to investigate, we will need to collect additional metrics. If you are unable to upgrade to MongoDB 3.2.9 as I suggested, would you please run the following script on your Primary?

delay=1
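# Sample serverStatus (including tcmalloc allocator stats) every ${delay} seconds in the background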
mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:true}))); sleep(1000*${delay:?})}" >ss.log &
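# Capture extended per-device I/O statistics at the same interval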
iostat -k -t -x ${delay:?} >iostat.log &

This will collect much of the same information that is captured by later versions of MongoDB, and will allow us to continue to investigate this behavior. Please leave this script running for a day before uploading the files so we can see the performance degradation in the stats.
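
Before leaving the collectors running overnight, one way to confirm the expected counters are present is to print a single sample in the mongo shell; a minimal sketch (the tcmalloc field paths below are assumptions and may vary by version):

    // Print one sample of the allocator counters that the background script records.
    var ss = db.serverStatus({tcmalloc: true});
    printjson({
        heapSizeBytes:     ss.tcmalloc.generic.heap_size,
        allocatedBytes:    ss.tcmalloc.generic.current_allocated_bytes,
        pageheapFreeBytes: ss.tcmalloc.tcmalloc.pageheap_free_bytes,
        threadCacheBytes:  ss.tcmalloc.tcmalloc.current_total_thread_cache_bytes
    });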

Thank you for your help,
Thomas

Comment by may zahav [ 20/Sep/16 ]

Hi, I've attached a file with the output of the "db.collection.stats({indexDetails:true})" command. The file contains output for both shards.
Thank you, May.

Comment by Kelsey Schubert [ 20/Sep/16 ]

Hi tewner,

The behavior you are observing may be explained by SERVER-21063. If this is the case, I would recommend performing an initial sync to create balanced trees.

To determine if your data files are affected by this bug, please provide the following information:

  1. Please confirm which versions of MongoDB have run on top of your data files.
  2. Please provide the output of the following command against a collection that is accessed by your workload:

    db.collection.stats({indexDetails:true})
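
As a hedged illustration of where the relevant counter lives in that output (the field paths are assumptions based on typical WiredTiger stats, and "mycoll" is a placeholder collection name):

    // Placeholder collection name "mycoll"; adjust to a collection used by your workload.
    var s = db.mycoll.stats({indexDetails: true});
    // WiredTiger btree depth of the collection itself:
    print("collection maximum tree depth: " + s.wiredTiger.btree["maximum tree depth"]);
    // WiredTiger btree depth of each index:
    for (var idx in s.indexDetails) {
        print(idx + " maximum tree depth: " + s.indexDetails[idx].btree["maximum tree depth"]);
    }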

Thank you,
Thomas

Comment by Kelsey Schubert [ 19/Sep/16 ]

Hi tewner,

Thanks for reporting this behavior. Please note that there have been significant improvements to the performance of WiredTiger in MongoDB 3.2. In addition, MongoDB 3.2 includes full-time diagnostic data capture (FTDC), which would help us to identify the cause of this behavior.

Are you able to upgrade to the latest version of MongoDB to see if it resolves the issue?

Kind regards,
Thomas

Comment by Michael Tewner [ 19/Sep/16 ]

Note that the "arb01" servers on the graph are full data nodes.
