[SERVER-67300] Increase in page eviction queue as well as page eviction failures Created: 15/Jun/22 Updated: 29/Jul/22 Resolved: 29/Jul/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.6.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tejas Jadhav | Assignee: | Chris Kelly |
| Resolution: | Done | Votes: | 0 |
| Labels: | page-eviction, page-eviction-failure, segmentation-fault |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: |
VM details for primary and secondary: 8 vCPU, 64 GB RAM, 7.8TB SSD |
||
| Attachments: |
|
| Operating System: | ALL |
| Steps To Reproduce: | Not sure if this would reproduce this issue exactly,
|
| Participants: |
| Description |
| Comments |
| Comment by Chris Kelly [ 29/Jul/22 ] |
|
Tejas, because this is on 3.6, I'm going to close this ticket for now. Once you upgrade to 4.2 or newer, we can revisit this if your issue persists and refer to your old FTDC if needed for comparison. Regards, |
| Comment by Tejas Jadhav [ 04/Jul/22 ] |
|
Chris, We understand that 3.6 is unsupported, so we have already started the upgrade process to move to Mongo 5. Regarding the FTDC data, I've attached the data for 15th June to the ticket. As mentioned earlier, I won't be able to send you the FTDC from the time of the issue described in the ticket, since that FTDC file was removed (rotated). The attached file is for checking that we are still seeing high page eviction failures, even weeks after the issue happened. |
| Comment by Chris Kelly [ 30/Jun/22 ] |
|
Hi Tejas, I see you've brought up discussion of this on the community forums - I'd say that is probably the best option for this, since you're more likely to get suggestions on this behavior. As mentioned, page eviction is generally a symptom, and not a cause in and of itself. You can upload your FTDC to the ticket if you'd like, but because this is on 3.6, this isn't likely to lead to a resolution better than upgrading at this point. Even if we were to discover an issue, that version is no longer maintained and it's possible it would be fixed in a newer version. These metrics could still be interesting to capture on the off chance they persist after upgrading though. I can do a courtesy lookover of your FTDC if you upload it, but just keep in mind that there isn't going to be much we recommend besides upgrading at this point. As such, I won't be looking deep into this unless your issue persists after upgrading. If you do upload it, just make sure to point out:
Christopher
|
| Comment by Tejas Jadhav [ 29/Jun/22 ] |
|
Sorry for bumping this up. Any update on the above? |
| Comment by Tejas Jadhav [ 22/Jun/22 ] |
|
Thanks for getting back, Chris. Yes, I understand that debugging the segmentation fault would be really difficult because the diagnostics data is missing. We have faced that issue in the past as well, and we also understand that such crashes could very well be caused by the outdated Mongo version. We are currently planning to upgrade our cluster to 4.4 or 5.0. However, is there any resolution for the page eviction failures that we are currently seeing? Since it is an ongoing issue, I can provide you with the FTDC data. |
| Comment by Chris Kelly [ 21/Jun/22 ] |
|
Hi Tejas, Thank you for your patience. Without FTDC, it's hard to see exactly what state the system was in leading up to this event. However, there is a chance this may be related to However, 3.6 reached end of life in April 2021 and is no longer supported. It would be best to upgrade your major version of MongoDB to 4.2 or newer for better support in the future. If you stay on 3.6, we will not be able to provide further support on this issue. Give upgrading a try and let us know if your problem persists! Regards,
|
| Comment by Tejas Jadhav [ 20/Jun/22 ] |
|
Chris, any update on this issue? |
| Comment by Tejas Jadhav [ 15/Jun/22 ] |
|
Also adding the backtrace for the segmentation fault on the secondary since it was not captured in the logs above
|
| Comment by Tejas Jadhav [ 15/Jun/22 ] |
|
Thanks for looking into this issue, Chris. Unfortunately, the files in diagnostic.data for the time when the issue happened seem to have been removed (the incident happened on 29th May), so I won't be able to provide you with diagnostics data for that day. However, I can provide you with the same for the last 10 days (since we are still seeing page eviction failures in metrics). I've attached logs for the time around the incident,
|
| Comment by Chris Kelly [ 15/Jun/22 ] |
|
Hi Tejas, Thanks for your report. In order to look into this further, we'll need more info. For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to the ticket:
It is going to be especially important to make sure the logs cover the timeline you are reporting, so we can see what happened leading up to and after the segmentation fault. Regards, |
| Comment by Tejas Jadhav [ 15/Jun/22 ] |
|
Screenshot of some graphs during the incident timeline attached. Please let me know if you need any additional data points. |