Type: Improvement
Resolution: Works as Designed
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Checkpoints
Labels: code-quality
Environment: https://github.com/wiredtiger/wiredtiger/tree/mongodb-5.0.13
Assigned Teams: Storage Engines
Sprint: 2024-05-28 - FOLLOW ON SPRINT
Story Points: None

One of our mongod nodes stores 2 TB of data and has a freeStorageSize of 120 GB. At a certain point during checkpoint, a large number of slow queries occurred on it.
The stack traces showed that these user requests were stuck acquiring a hazard pointer while the checkpoint thread was modifying the allocated, available, and discarded lists.

So I rebuilt the mongod node. On the rebuilt node, freeStorageSize dropped to 10 GB and the slow queries disappeared.

I suspect that the large freeStorageSize makes the available list structure more complex, so the checkpoint takes a particularly long time to process it.

In __ckpt_process, live_lock is held for a long time.

As a result, the eviction thread is blocked on live_lock while the page state is WT_REF_LOCKED, and the corresponding request waits in __wt_page_in_func to get a hazard pointer for that page.

Is my suspicion correct?
Does processing the available list during checkpoint need to be mutually exclusive with eviction?
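
To make the suspected chain concrete, here is a minimal toy model in Python. It is not WiredTiger code: `live_lock`, the `LOCKED` page state, and the three thread functions are simplified stand-ins for the names in the stack traces, and the "long list processing" is simulated by having the checkpoint thread hold the lock until a reader is observed waiting. It only illustrates the ordering I believe I am seeing, under those assumptions.

```python
import threading
import time


def simulate():
    """Toy model of the suspected stall; returns the ordered event log."""
    live_lock = threading.Lock()   # stand-in for the block manager's live_lock
    page = {"state": "MEM"}        # stand-in for a page's WT_REF state
    events = []                    # list.append is atomic in CPython
    ckpt_holding = threading.Event()
    page_locked = threading.Event()
    reader_waiting = threading.Event()

    def checkpoint():
        with live_lock:
            events.append("checkpoint: holds live_lock")
            ckpt_holding.set()
            # Stand-in for a long pass over the extent lists: keep the
            # lock until a reader is known to be waiting on the page.
            reader_waiting.wait()
            events.append("checkpoint: releases live_lock")

    def evict():
        ckpt_holding.wait()
        page["state"] = "LOCKED"   # eviction locks the page first
        events.append("evict: page LOCKED")
        page_locked.set()
        with live_lock:            # blocks while the checkpoint holds it
            events.append("evict: got live_lock")
        events.append("evict: page unlocked")
        page["state"] = "MEM"

    def reader():
        page_locked.wait()
        events.append("reader: waiting for hazard pointer")
        reader_waiting.set()
        while page["state"] == "LOCKED":   # retry until the page is usable
            time.sleep(0.001)
        events.append("reader: got hazard pointer")

    threads = [threading.Thread(target=f) for f in (checkpoint, evict, reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for line in simulate():
        print(line)
```

In this model the reader can only proceed after the checkpoint thread releases the lock, even though the reader never touches the lock itself. The longer the checkpoint holds it, the longer the user request stalls.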


image-2024-04-30-14-52-04-437.png
424 kB
Apr 30 2024 06:52:07 AM UTC

depends on

WT-12992 If the freeStorageSize is too large, a large number of slow queries will occur during checkpoint.

Closed

Assignee: Chenhao Qu
Reporter: Chao Yin
Votes: 0
Watchers: 9

Created: Apr 29 2024 03:41:58 AM UTC
Updated: May 31 2024 04:47:18 PM UTC
Resolved: May 28 2024 02:22:14 AM UTC
