[SERVER-81803] Understanding cache eviction failures Created: 19/Sep/23  Updated: 03/Oct/23  Resolved: 03/Oct/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Yuvaraj Anbarasan Assignee: Unassigned
Resolution: Done Votes: 2
Labels: external-user
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File cache_eviction_gave_up_due_to_detecting_an_out_of_order_tombstone_ahead_of_the_selected_on_disk_update.png     PNG File checkpoint_blocked_page_evictions.png     PNG File eviciton_strategy_dirty_pages.png     PNG File eviction-strategy_clean_pages.png     PNG File in_memory_page_splits.png     PNG File update_bytes_percentage.png    
Participants:

 Description   

Our production sharded cluster has been facing an issue for a while now where the WiredTiger cache on the primary node of each shard uses application threads to evict pages from cache, resulting in read and write failures. This happens when the update bytes breach 10% of the cache size. We already tried increasing the eviction thread count to 20 and the cache size to 70%, but didn't notice any improvement in cache evictions.
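
For reference, this is roughly how the changes above were applied: a minimal sketch assuming pymongo, with a placeholder connection string and the values mentioned in this ticket.

{code:python}
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder; run against each shard primary

# Raise the WiredTiger eviction worker thread count at runtime.
client.admin.command(
    "setParameter", 1,
    wiredTigerEngineRuntimeConfig="eviction=(threads_min=20,threads_max=20)",
)

# The cache size itself was raised via the mongod configuration file, e.g.:
#   storage:
#     wiredTiger:
#       engineConfig:
#         cacheSizeGB: <about 70% of host RAM>
{code}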

 

Recently one of our nodes auto-recovered from this issue, and since then the update bytes percentage of that node has stayed under the limit (consistently within 2.5% - 3%). We made the following observations after the auto recovery (a sketch of how we read these counters follows the list):

  • Decrease in "checkpoint blocked page eviction".
  • Decrease in "eviction gave up due to detecting an out of order tombstone ahead of the selected on disk update".
  • Change in eviction walk strategy: before the recovery the strategy was only dirty pages; currently it is only clean pages.
  • Decrease in eviction failures due to "failure during reconciliation".
  • Increase in pages read from disk into cache.
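
These counters come from the cache section of the wiredTiger output of serverStatus; this is roughly how we read them (a minimal sketch assuming pymongo; exact stat names may differ slightly between MongoDB/WiredTiger versions):

{code:python}
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

# Update bytes as a percentage of the configured cache size (the 10% limit above).
update_pct = 100.0 * cache["bytes allocated for updates"] / cache["maximum bytes configured"]
print("update bytes % of cache:", round(update_pct, 2))

# Counters behind the other observations (stat names assumed; check your version).
for stat in (
    "checkpoint blocked page eviction",
    "eviction gave up due to detecting an out of order tombstone ahead of the selected on disk update",
    "pages read into cache",
):
    print(stat, "=", cache.get(stat))
{code}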

 

We also restarted the mongod process on a few of the nodes, after which the update bytes are within the limit. It would be great to have your suggestions or views on this issue, which would help us understand it better and fix it.

 

Thank you



 Comments   
Comment by Eric Sedor [ 03/Oct/23 ]

Hi yuvaraj.klei@gmail.com,

In general for this issue, we'd like to encourage you to start by asking our community for help by posting on the MongoDB Developer Community Forums. Briefly, I can suggest that the configuration options discussed in WT-6175 are relevant here and more likely to influence behavior than changing the eviction worker thread count.
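
For reference, settings of that kind can be passed at runtime through the same wiredTigerEngineRuntimeConfig parameter. The sketch below assumes pymongo and that the options in question are the eviction dirty/update targets and triggers; the values are purely illustrative, so confirm them against WT-6175 and your WiredTiger version before applying anything in production.

{code:python}
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string

# Illustrative values only; do not copy these into production without testing.
client.admin.command(
    "setParameter", 1,
    wiredTigerEngineRuntimeConfig=(
        "eviction_dirty_target=5,eviction_dirty_trigger=20,"
        "eviction_updates_target=2,eviction_updates_trigger=10"
    ),
)
{code}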

If the discussion there leads you to suspect a bug in the MongoDB server, then we'd want to investigate it as a possible bug here in the SERVER project. We'll be happy to leverage the WT team for such an issue. I'll close this ticket for now, but we can reopen it if there is a bug. Note that the contents of the diagnostic.data directory of your dbPath will be critical for doing so.

Sincerely,
Eric

Comment by Yuvaraj Anbarasan [ 21/Sep/23 ]

Hi vamsi.krishna@mongodb.com, I think it wouldn't be possible to share the FTDC file; we're still checking with our infra team regarding this. I have attached some of the metrics to this ticket. If any other metrics are required, we can share them.

Comment by Vamsi Boyapati [ 20/Sep/23 ]

Could you attach the FTDC data to this ticket?

Comment by Yuvaraj Anbarasan [ 20/Sep/23 ]

We also notice an increase in in-memory page splits.
