[SERVER-67300] Increase in page eviction queue as well as page eviction failures Created: 15/Jun/22  Updated: 29/Jul/22  Resolved: 29/Jul/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tejas Jadhav Assignee: Chris Kelly
Resolution: Done Votes: 0
Labels: page-eviction, page-eviction-failure, segmentation-fault
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

VM details for primary and secondary: 8 vCPU, 64 GB RAM, 7.8TB SSD
OS: Ubuntu 16.04.6 LTS
MongoDB: 3.6.2


Attachments: Text File 2022-05-29-mongo-01.log     Text File 2022-05-29-mongo-02.log     PNG File Screenshot 2022-06-15 at 17-58-14 MongoDB Overview - Grafana.png     File metrics.2022-06-27T06-48-02Z-00000    
Operating System: ALL
Steps To Reproduce:

Not sure if this would reproduce the issue exactly:

  1. Crash the secondary with a segmentation fault.
  2. See the impact on primary.
Participants:

 Description   

Cluster state

  • Configuration
    • 1 Primary
    • 2 Secondaries (one secondary does not participate in voting and acts as a standby)
    • 1 Arbiter
  • Data size
    • 1.9TB of data in one collection
    • 30GB index size

Incident timeline (all times are in IST)

Incident date: 29th May 2022

  • 12:12: One of the secondaries crashed due to a segmentation fault.
  • 12:12: Increase in pages queued for eviction, along with page eviction errors.
  • 12:12 - 14:20: Pages queued for eviction and page eviction errors keep increasing, with occasional dips that correspond to checkpoint activity.
  • 14:30 - 14:52: The secondary comes back up and starts syncing from the primary. Pages queued for eviction and page eviction errors increase again.
  • 14:52: All metrics normalize.
  • 16:00 - now: Pages queued for eviction and page eviction errors are increasing again and have not subsided; this is still ongoing.
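For reference, the counters tracked in the timeline above come from the WiredTiger cache section of serverStatus. A minimal sketch of reading them with pymongo follows (assuming a local mongod on the default port; the exact stat names can vary between MongoDB versions):

from pymongo import MongoClient

# Assumes mongod is reachable on localhost:27017 without auth; adjust the URI as needed.
client = MongoClient("mongodb://localhost:27017")
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

# Counters commonly charted as "pages queued for eviction" and "eviction failures";
# the exact key names are version-dependent, so fall back to "n/a" if one is missing.
for stat in ("pages queued for eviction",
             "pages selected for eviction unable to be evicted"):
    print(stat, "=", cache.get(stat, "n/a"))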

 



 Comments   
Comment by Chris Kelly [ 29/Jul/22 ]

Tejas,

Because this is on 3.6, I'm going to close this ticket for now. Once you upgrade to 4.2 or newer, we can revisit this if your issue persists and refer to your old FTDC if needed for comparison.

Regards,
Christopher

Comment by Tejas Jadhav [ 04/Jul/22 ]

Chris,

We understand that 3.6 is unsupported, so we have already started the upgrade process to move to MongoDB 5.0.

Regarding the FTDC data, I've attached the data for 15th June in the ticket.

As mentioned earlier, I won't be able to send the FTDC from the time of the original incident, since that FTDC file was removed (rotated). The file attached above shows that we are still seeing high page eviction failures even weeks after the incident.
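For what it's worth, a quick way to confirm which FTDC files are still present before they rotate out is to list the diagnostic.data directory; a minimal sketch (with a hypothetical dbpath) is below:

import os
import time

# Assumption: adjust to the actual --dbpath of the node in question.
dbpath = "/var/lib/mongodb"
ftdc_dir = os.path.join(dbpath, "diagnostic.data")

# Print each metrics file with its last-modified time (UTC) and size in bytes.
for name in sorted(os.listdir(ftdc_dir)):
    path = os.path.join(ftdc_dir, name)
    mtime = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(os.path.getmtime(path)))
    print(f"{mtime}  {os.path.getsize(path):>12}  {name}")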

Comment by Chris Kelly [ 30/Jun/22 ]

Hi Tejas,

I see you've brought up discussion of this on the community forums - I'd say that is probably the best option for this, since you're more likely to get suggestions on this behavior. As mentioned, page eviction is generally a symptom, and not a cause in and of itself. 

You can upload your FTDC to the ticket if you'd like, but because this is on 3.6, it isn't likely to lead to a resolution better than upgrading at this point. Even if we were to discover an issue, that version is no longer maintained, and it may already be fixed in a newer release. These metrics could still be interesting to capture on the off chance the behavior persists after upgrading, though.

I can do a courtesy lookover of your FTDC if you upload it, but just keep in mind that there isn't going to be much we recommend besides upgrading at this point. As such, I won't be looking deep into this unless your issue persists after upgrading. If you do upload it, just make sure to point out:

  • The exact timestamp(s) of when you observe errors
    • What are you experiencing? High query response times after a secondary node crash/restart?
    • Does it persist or normalize after some time?

 

Christopher

 

Comment by Tejas Jadhav [ 29/Jun/22 ]

Sorry for bumping this up. Any update on the above?

Comment by Tejas Jadhav [ 22/Jun/22 ]

Thanks for getting back Chris.

Yes, I understand that debugging the segmentation fault would be really difficult because the diagnostic data is missing. We have faced this issue in the past as well, and we understand that such crashes could very well be due to the outdated MongoDB version. We are currently planning to upgrade our cluster to 4.4 or 5.0.

However, regarding the page eviction failures that we are seeing currently, is there any resolution for that? Since it is an ongoing issue, I can provide you with the FTDC data.

Comment by Chris Kelly [ 21/Jun/22 ]

Hi Tejas,

Thank you for your patience. Without FTDC it's hard to see exactly what state the system was in leading up to this event. However, there is a chance this may be related to SERVER-38292, which was resolved in 3.6.5. I notice you are running 3.6.2, so you could try a minor version upgrade to the latest patch of 3.6 (3.6.23) and see if your issue persists.

However, 3.6 reached end of life in April 2021 and is no longer supported. It would be more ideal to upgrade your major version of MongoDB to 4.2 or newer for better support in the future. If you stay on 3.6, we will not be able to provide further support on this issue.

Give upgrading a try and let us know if your problem persists!

Regards,
Christopher

 

 

Comment by Tejas Jadhav [ 20/Jun/22 ]

Chris, any update on this issue?

Comment by Tejas Jadhav [ 15/Jun/22 ]

Also adding the backtrace for the segmentation fault on the secondary, since it was not captured in the logs above.

----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"5630E0B6C000","o":"21F34F1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"5630E0B6C000","o":"21F2709"},{"b":"5630E0B6C000","o":"21F2D76"},{"b":"7F4A75583000","o":"11390"},{"b":"5630E0B6C000","o":"AEAF0F"},{"b":"5630E0B6C000","o":"AEDE8B","s":"__wt_split_reverse"},{"b":"5630E0B6C000","o":"A78294"},{"b":"5630E0B6C000","o":"A792E2","s":"__wt_evict"},{"b":"5630E0B6C000","o":"A722B3"},{"b":"5630E0B6C000","o":"A72647"},{"b":"5630E0B6C000","o":"A74A22","s":"__wt_evict_thread_run"},{"b":"5630E0B6C000","o":"ABDC79"},{"b":"7F4A75583000","o":"76BA"},{"b":"7F4A751B9000","o":"10751D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.2", "gitVersion" : "489d177dbd0f0420a8ca04d39fd78d0a2c539420", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.0-1128-aws", "version" : "#142-Ubuntu SMP Fri Apr 16 12:42:33 UTC 2021", "machine" : "x86_64" }, "somap" : [ { "b" : "5630E0B6C000", "elfType" : 3, "buildId" : "90F4CC751C09ABD90756CE2480F0217355B846B5" }, { "b" : "7FFEAE1DC000", "elfType" : 3, "buildId" : "F157DC5CAFAD84C3EC7DC8E12083E764F89804EB" }, { "b" : "7F4A76778000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "C8ACB5535F4B7B9880A62B4B38EBE924668F6446" }, { "b" : "7F4A76510000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "AC6EB239181BE92EF90A74D344006276841F1102" }, { "b" : "7F4A760CB000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "819406AC9B59B46936A823F9F96A3E55E5930EE8" }, { "b" : "7F4A75EC3000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "BB404D52807964CCC7F0815BC2666688A74B958F" }, { "b" : "7F4A75CBF000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "BBA6A2E958188C44B9BDA990278EBE8868B85379" }, { "b" : "7F4A759B6000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "E178A25E6DB28598588C03D898E44FD79BD16E4D" }, { "b" : "7F4A757A0000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "68220AE2C65D65C1B6AAA12FA6765A6EC2F5F434" }, { "b" : "7F4A75583000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "C557B8146E8079AF46310B549DE6912D1FC4EA86" }, { "b" : "7F4A751B9000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30773BE8CF5BFED9D910C8473DD44EAAB2E705AB" }, { "b" : "7F4A76993000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "98D7BC4313D0D8D5E127E06ACF2319829C5CE61D" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x5630e2d5f4f1]
 mongod(+0x21F2709) [0x5630e2d5e709]
 mongod(+0x21F2D76) [0x5630e2d5ed76]
 libpthread.so.0(+0x11390) [0x7f4a75594390]
 mongod(+0xAEAF0F) [0x5630e1656f0f]
 mongod(__wt_split_reverse+0x11B) [0x5630e1659e8b]
 mongod(+0xA78294) [0x5630e15e4294]
 mongod(__wt_evict+0xFE2) [0x5630e15e52e2]
 mongod(+0xA722B3) [0x5630e15de2b3]
 mongod(+0xA72647) [0x5630e15de647]
 mongod(__wt_evict_thread_run+0x72) [0x5630e15e0a22]
 mongod(+0xABDC79) [0x5630e1629c79]
 libpthread.so.0(+0x76BA) [0x7f4a7558a6ba]
 libc.so.6(clone+0x6D) [0x7f4a752c051d]
-----  END BACKTRACE  -----  

Comment by Tejas Jadhav [ 15/Jun/22 ]

Thanks for looking into this issue, Chris.

Unfortunately, the files in diagnostic.data for the time when the issue happened seem to have been removed (the incident happened on 29th May), so I won't be able to provide you with diagnostic data for that day. However, I can provide the same for the last 10 days (since we are still seeing page eviction failures in metrics).

I've attached logs for the time around the incident:

  • Sensitive information about the data has been redacted.
  • Log timestamps are in UTC. I've added logs between 6:30 AM - 11:00 AM UTC on the day of the incident (May 29th). The segmentation fault happened at 6:42 AM UTC.
  • 2022-05-29-mongo-01.log contains log for the PRIMARY. 2022-05-29-mongo-02.log contains the log for the SECONDARY.
  • Segmentation fault happened on the SECONDARY.
Comment by Chris Kelly [ 15/Jun/22 ]

Hi Tejas,

Thanks for your report. In order to look into this further, we'll need more info.

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to the ticket:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

It is going to be especially important to make sure the logs cover the timeline you are reporting, so we can see what happened leading up to and after the segmentation fault.
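If it helps, here is a minimal sketch of bundling those artifacts into a single archive per node before uploading (the log and dbpath locations below are assumptions; substitute the actual paths from your configuration):

import tarfile

# Hypothetical paths: replace with the actual log file and --dbpath for each node.
files_to_archive = [
    "/var/log/mongodb/mongod.log",
    "/var/lib/mongodb/diagnostic.data",
]

# Directories passed to tar.add() are archived recursively.
with tarfile.open("mongo-01-diagnostics.tar.gz", "w:gz") as tar:
    for path in files_to_archive:
        tar.add(path)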

Regards,
Christopher

Comment by Tejas Jadhav [ 15/Jun/22 ]

Screenshot of some graphs during the incident timeline attached. Please let me know if you need any additional data points.
