[SERVER-26021] Primary mongod instance stops serving traffic every few (5) hours Created: 08/Sep/16 Updated: 07/Apr/23 Resolved: 13/Feb/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, WiredTiger |
| Affects Version/s: | 3.0.12 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Francis Pereira | Assignee: | Kelsey Schubert |
| Resolution: | Incomplete | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
A running primary node's throughput degrades over time. Switching primary<->secondary solves the problem. As a precaution, we proactively shut down the old primary once it is a secondary so that it is ready for the next switch. An MMS graph showing the degradation is attached. This happens to primary nodes running WiredTiger; MMAP is not affected. We ran into this when we re-synced one of the secondary nodes with the storage engine set to WiredTiger and promoted it to primary. |
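For reference, the primary<->secondary switch described above can be forced from the mongo shell with rs.stepDown(); a minimal sketch, where the hostname and step-down window are assumptions rather than details from this ticket:

    # Sketch only: ask the current primary to step down so a secondary is elected.
    # 60 = seconds during which this node will not seek re-election (assumed value).
    mongo --host current-primary.example.net --eval 'rs.stepDown(60)'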
| Comments |
| Comment by Kelsey Schubert [ 31/Jan/17 ] | |||
|
Hi francispereira, We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. Unfortunately, without additional information, we cannot continue to investigate this behavior. If this is still an issue for you, please provide iostats and diagnostic.data and we will reopen the ticket. Kind regards, | |||
| Comment by Ramon Fernandez Marina [ 22/Oct/16 ] | |||
|
Hi francis@wizrocket.com, have you had a chance to collect the iostat data Thomas requested, so we can determine whether the issue is indeed due to hardware limitations or something else? Please let us know. Thanks, | |||
| Comment by Kelsey Schubert [ 06/Oct/16 ] | |||
|
Hi francispereira, Thanks again for providing the diagnostic.data. After reviewing the data, it appears that this node is running into hardware limitations. The read/write load is stable and similar in both sets of diagnostic data. In addition, there is an average of 25 filesystem read calls active at a time, which suggests that the workload is I/O bound. To help us continue to investigate this issue, we will need to remove this hardware constraint. Here are two options
If you choose to execute the workload yourself on a larger machine, during your repro please execute the following shell script to collect iostat data each second and upload it along with the diagnostic.data
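A minimal sketch of such a per-second iostat collection (the exact script attached to the ticket is not shown here; the flags and output filename below are assumptions):

    # Sketch: collect extended iostat output once per second until interrupted.
    # -x extended statistics, -m throughput in MB/s, -t timestamp each report
    iostat -xmt 1 >> iostats.log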
This will help us to correlate the I/O numbers to events recorded in the diagnostic.data. Thank you for your help, | |||
| Comment by Kelsey Schubert [ 05/Oct/16 ] | |||
|
Hi francispereira, Thank you very much for uploading the diagnostic.data for both MongoDB 3.2.9 and 3.2.10. We are investigating this issue and will update you when we know more. Kind regards, | |||
| Comment by Francis Pereira [ 05/Oct/16 ] | |||
|
Diagnostics data from 3.2.9 and 3.2.10 | |||
| Comment by Francis Pereira [ 30/Sep/16 ] | |||
|
Thanks for the update @Thomas.Schubert. We have a way to reproduce this issue in an isolated environment. I will run our workload against 3.2.10-rc2 and submit diagnostic data; expect it by Monday. | |||
| Comment by Kelsey Schubert [ 30/Sep/16 ] | |||
|
Hi francispereira, I've taken another look at your cloud group, and see that you have a large number of small databases, collections, and indexes. This schema design works well under MMAPv1, since spreading data across many databases mitigates contention on its database-level locks. However, the resulting large number of files may hinder performance during WiredTiger checkpoints. As I mentioned previously, if you could provide the diagnostic.data, we would be able to better determine what is going on here. Thank you, | |||
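For context, the rough namespace count, and so the number of files a WiredTiger checkpoint has to touch, can be estimated from the mongo shell; a minimal sketch, not part of the original comment:

    # Sketch: count databases and collections to estimate how many files
    # (by default one per collection plus one per index) a checkpoint touches.
    mongo --quiet --eval '
      var dbs = db.adminCommand({listDatabases: 1}).databases;
      var colls = 0;
      dbs.forEach(function(d) {
        colls += db.getSiblingDB(d.name).getCollectionNames().length;
      });
      print("databases: " + dbs.length + ", collections: " + colls);
    '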
| Comment by Kelsey Schubert [ 29/Sep/16 ] | |||
|
I’m sorry that you have encountered this issue, and that it has caused such trouble for your team. I see that you recently published a blog post that provides additional context around the issue that you are observing. As you have surmised, the problem is caused by WiredTiger’s behavior when its cache exceeds 95% utilization. A significant amount of work has gone into correcting this issue in MongoDB 3.2.10, which is in testing now and is scheduled for release in the coming days. If you have diagnostic.data from the 3.2.9 node, I would be happy to examine it to confirm this diagnosis. Finally, I would like to thank you for sharing your experiences and recommending stronger safety practices to the community when performing major upgrades to MongoDB. Kind regards, | |||
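For anyone hitting this on 3.2.9, cache pressure can be checked directly from serverStatus; a minimal sketch using the standard WiredTiger statistic names (the behavior described above kicks in as utilization approaches 95%):

    # Sketch: report WiredTiger cache utilization from serverStatus.
    mongo --quiet --eval '
      var c = db.serverStatus().wiredTiger.cache;
      var used = c["bytes currently in the cache"];
      var max  = c["maximum bytes configured"];
      print("WT cache utilization: " + (100 * used / max).toFixed(1) + "%");
    '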
| Comment by Kelsey Schubert [ 12/Sep/16 ] | |||
|
Hi francispereira, Thanks for the additional information. Would you please run the following script until the issue occurs so we can get a better idea of what is going on here?
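A minimal sketch of such a collection loop, producing the two files named below (the exact script attached to this comment is not shown here; intervals and connection details are assumptions):

    #!/bin/bash
    # Sketch: collect per-second iostat output and periodic serverStatus
    # snapshots until interrupted, producing iostats.log and ss.log.
    iostat -xmt 1 >> iostats.log &
    while true; do
      date >> ss.log
      mongo --quiet --eval 'printjson(db.serverStatus())' >> ss.log
      sleep 1
    done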
Afterwards please attach both the ss.log and the iostats.log to this ticket. Thank you, | |||
| Comment by Francis Pereira [ 09/Sep/16 ] | |||
|
I just figured out that a shutdown is not required as described earlier. When throughput degrades, switching P<->S brings things back to normal. | |||
| Comment by Francis Pereira [ 08/Sep/16 ] | |||
|
Hi Thomas, I have uploaded the log files from members of a replica set that showed this problem. I don't have diagnostic.data since I am running 3.0.12. | |||
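For reference, diagnostic.data is the FTDC directory that mongod 3.2+ writes automatically under its dbPath, which is why it is unavailable on 3.0.12; once on 3.2 it can be packaged for upload with something like the sketch below, where the dbPath is an assumed default:

    # Sketch: package the FTDC directory for upload (dbPath is an assumption).
    tar czf diagnostic.data.tar.gz -C /var/lib/mongodb diagnostic.data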
| Comment by Kelsey Schubert [ 08/Sep/16 ] | |||
|
Hi francispereira, Thanks for reporting this issue. Would you please provide the following information so we can continue to investigate?
I've created a secure upload portal for you to use here. Thanks again, | |||
| Comment by Francis Pereira [ 08/Sep/16 ] | |||
|
I can provide debugging info when this happens by promoting the WT node to primary and letting it run for a few hours. Let me know what you need. |