[SERVER-26171] Gradual Degradation of Performance over days Created: 19/Sep/16 Updated: 06/Jan/17 Resolved: 06/Jan/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.0.12 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Michael Tewner | Assignee: | Kelsey Schubert |
| Resolution: | Incomplete | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Linux on AWS - Sharded Replica Set - 2 Shards, 3-node Replica Sets |
||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
WiredTiger performance degrades over time: I've attached a graph of CPU "idle" time. The first two-thirds of the attached graph show our database primaries working harder each day, while the actual number of DB ops remains generally stable over the same period. On September 10th, I bounced the primaries over to a separate set of servers. These new servers also experienced gradual degradation, and the previous primaries (now secondaries) continued to have less idle time than both the new primaries and the other set of secondaries. On September 13th, I upgraded the servers to larger instances and rebooted a number of servers. After a reboot, all servers started behaving normally, but the new masters are now starting to slow down. Has anyone seen something like this before? |
| Comments |
| Comment by Kelsey Schubert [ 06/Jan/17 ] | |||
|
Hi tewner, We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you after upgrading, please provide additional information and we will reopen the ticket. Kind regards, | |||
| Comment by Kelsey Schubert [ 18/Nov/16 ] | |||
|
We have not been able to identify the root cause of the gradual change in CPU load. My understanding is that this does not impact the performance of the cluster, as the number of database operations remains consistent. If this is still an issue for you, would you please consider upgrading to MongoDB 3.2 or testing the latest MongoDB 3.4 release candidate? These branches contain many improvements, which may significantly change the behavior you are observing. Thank you, | |||
| Comment by may zahav [ 06/Oct/16 ] | |||
|
Hi Thomas Schubert, we are doing a weekly backup of Mongo. We took a look at Grafana, and it appears that the day after the backup the servers have less load than usual, and then each day until the next backup there is more and more load. I attached a screenshot of the CPU idle graph over a 21-day time range. Thank you! | |||
| Comment by Kelsey Schubert [ 27/Sep/16 ] | |||
|
Thanks for uploading the logs, may. We're still investigating this issue and will update this ticket when we know more. | |||
| Comment by may zahav [ 22/Sep/16 ] | |||
|
Hi, I attached the logs with the relevant output. Thank you, May. | |||
| Comment by Kelsey Schubert [ 21/Sep/16 ] | |||
|
Hi may, Thanks for providing the output of the command. I see that the maximum tree depth of this collection is 3, which indicates that this collection is not affected by that bug. Unfortunately, to continue the investigation, we will need to collect additional metrics. If you are unable to upgrade to MongoDB 3.2.9 as I suggested, would you please run the following script on your Primary?
This will collect much of the same information that is captured by later versions of MongoDB, and will allow us to continue investigating this behavior. Please leave the script running for a day before uploading the files, so that we can see the performance degradation reflected in the stats. Thank you for your help, | |||
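The script attachment itself is not preserved in this export. As a rough illustration only (a hypothetical stand-in, not the original script), a periodic metrics collector of the kind described would append one timestamped status document per interval; here the status source is a pluggable callable so the loop can be exercised without a running mongod, whereas in practice it would wrap `db.serverStatus()` via a driver such as PyMongo:

```python
import json
import time
from datetime import datetime, timezone

def collect_metrics(get_status, out_path, interval_s=1.0, iterations=5):
    """Append one timestamped JSON status document per interval.

    `get_status` is any callable returning a dict. In a real deployment
    it would call db.serverStatus() through a driver; it is injectable
    here so the collection loop itself is runnable standalone.
    """
    with open(out_path, "a") as f:
        for _ in range(iterations):
            doc = {
                "ts": datetime.now(timezone.utc).isoformat(),
                "status": get_status(),
            }
            f.write(json.dumps(doc) + "\n")
            time.sleep(interval_s)

# Illustrative stub standing in for db.serverStatus() output:
def fake_status():
    return {"connections": {"current": 42}, "opcounters": {"query": 100}}

collect_metrics(fake_status, "metrics.jsonl", interval_s=0.01, iterations=3)
```

Leaving such a loop running for a day, as requested above, yields a newline-delimited JSON file that can be correlated against the CPU idle graph.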
| Comment by may zahav [ 20/Sep/16 ] | |||
|
Hi, I've attached a file with the output of the "db.collection.stats({indexDetails: true})" command. The file contains output for both shards. | |||
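The tree-depth check discussed in this thread reads the per-index WiredTiger details out of that stats document. A minimal sketch of such a check is below; the `sample` document is an illustrative, heavily trimmed assumption about the shape of `collStats` output (the real document contains many more fields under `wiredTiger.btree` and `indexDetails`):

```python
def max_tree_depth(stats):
    """Return the largest 'maximum tree depth' found across the
    collection's btree and all of its indexes in a collStats-style
    document. Missing sections are treated as depth 0."""
    depths = []
    wt = stats.get("wiredTiger", {})
    if "btree" in wt:
        depths.append(wt["btree"].get("maximum tree depth", 0))
    for idx_stats in stats.get("indexDetails", {}).values():
        depths.append(idx_stats.get("btree", {}).get("maximum tree depth", 0))
    return max(depths, default=0)

# Illustrative sample only -- trimmed stand-in for real collStats output:
sample = {
    "wiredTiger": {"btree": {"maximum tree depth": 3}},
    "indexDetails": {
        "_id_": {"btree": {"maximum tree depth": 3}},
        "user_1": {"btree": {"maximum tree depth": 2}},
    },
}

print(max_tree_depth(sample))  # prints 3
```

A depth of 3, as reported for this collection, is what the following comment uses to rule out the suspected bug.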
| Comment by Kelsey Schubert [ 20/Sep/16 ] | |||
|
Hi tewner, It is possible that the behavior you are observing may be explained by a known bug. To determine whether your data files are affected by this bug, please provide the following information:
Thank you, | |||
| Comment by Kelsey Schubert [ 19/Sep/16 ] | |||
|
Hi tewner, Thanks for reporting this behavior. Please note that there have been significant improvements to the performance of WiredTiger in MongoDB 3.2. In addition, MongoDB 3.2 includes full-time diagnostic data capture, which would help us identify the cause of this behavior. Are you able to upgrade to the latest version of MongoDB to see if it resolves the issue? Kind regards, | |||
| Comment by Michael Tewner [ 19/Sep/16 ] | |||
|
Note that the "arb01" servers on the graph are full data nodes. |