[SERVER-48310] Primary node GlobalLock suddenly rises, making service unavailable Created: 20/May/20  Updated: 27/Oct/23  Resolved: 07/Jul/20

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 4.0.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Cen Zheng Assignee: Dmitry Agranat
Resolution: Community Answered Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File diagnostic.data.tar.gz     File mongod.log.2020-05-18T05-01-07.gz     File mongod.log.tar.gz     PNG File screenshot.png    
Operating System: ALL
Participants:

 Description   

Hi, 

We have a hard-to-identify issue here and need your help. One instance's primary node suffered from a sudden rise in GlobalLock, which consumed all of the available connections. It looks related to a WiredTiger cache eviction problem, because at that time the cache usage had reached 95%. The issue happened at 2020-05-18 12:36:45 GMT+8, and I have attached the FTDC data and the relevant logs. Please help look into this issue. Thanks!
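
In case it helps to reproduce the observation, here is a minimal sketch of how the WiredTiger cache fill ratio and the globalLock queue can be read from serverStatus (assuming pymongo; the connection URI is a placeholder):

{code:python}
# Hypothetical sketch: read the WiredTiger cache fill ratio and the global lock
# queue from serverStatus. Assumes pymongo; the connection URI is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
status = client.admin.command("serverStatus")

cache = status["wiredTiger"]["cache"]
fill_ratio = cache["bytes currently in the cache"] / cache["maximum bytes configured"]
queue = status["globalLock"]["currentQueue"]

print(f"WiredTiger cache fill ratio: {fill_ratio:.1%}")
print(f"Queued readers/writers: {queue['readers']}/{queue['writers']}")
{code}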



 Comments   
Comment by Dmitry Agranat [ 07/Jul/20 ]

Hi mingyan.zc@gmail.com,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Dima

Comment by Cen Zheng [ 28/May/20 ]

Hi Carl,

OK. I will try turning off the profiler and see if the issue still happens. Thanks!

Comment by Carl Champain (Inactive) [ 27/May/20 ]

Hi mingyan.zc@gmail.com,

I saw that the Metadata W lock started rising at 13:30 GMT+8, while the issue happened at 12:36:45 GMT+8.

The locks Metadata timeAcquiringMicros W starts rising at 12:35:44.085 GMT+8 (marker A on the screenshot in my previous comment).


Yes, we have the profiler on by default. But I think this might not be the root cause; it is just making things worse.

Have you tried turning the profiler off? We are happy to keep investigating if the problem persists after disabling profiling.
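
If it helps, here is a minimal sketch of sampling the locks Metadata timeAcquiringMicros W counter from serverStatus to see when the waiting starts to climb (assuming pymongo; the URI and sampling interval are placeholders):

{code:python}
# Hypothetical monitoring sketch: sample the cumulative
# locks.Metadata.timeAcquiringMicros.W counter from serverStatus and print how
# much it grows per interval. Assumes pymongo; URI and interval are placeholders.
import time

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI


def metadata_w_wait_micros():
    status = client.admin.command("serverStatus")
    # The path can be absent if no Metadata W acquisition has had to wait yet.
    return (status.get("locks", {})
                  .get("Metadata", {})
                  .get("timeAcquiringMicros", {})
                  .get("W", 0))


previous = metadata_w_wait_micros()
for _ in range(30):  # 30 samples, 10 seconds apart
    time.sleep(10)
    current = metadata_w_wait_micros()
    print(f"Metadata W wait grew by {current - previous} microseconds in the last 10s")
    previous = current
{code}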

Thank you,
Carl

Comment by Cen Zheng [ 27/May/20 ]

Hi Carl,

Thanks for your analysis. Yes, we have the profiler on by default. But I think this might not be the root cause; it is just making things worse. When the WiredTiger cache usage reached 95%, all application threads needed to help with eviction, so all normal requests became slow. I saw that the Metadata W lock started rising at 13:30 GMT+8, while the issue happened at 12:36:45 GMT+8.

Comment by Carl Champain (Inactive) [ 26/May/20 ]

Hi mingyan.zc@gmail.com,

The bottleneck seems to be locks Metadata timeAcquiringMicros W, with 177 threads waiting on locks. Please note that this metric involves capped collections, and we saw that the profiler is on (the profiler writes to a capped collection). We've seen similar issues in the past caused by having the profiler on. Can you please turn off the profiler and see if that solves your issue?
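
For reference, here is a minimal sketch of disabling the profiler and confirming the setting via the profile command (assuming pymongo; the URI and database name are placeholders). The equivalent in the mongo shell is db.setProfilingLevel(0), and profiling is configured per database:

{code:python}
# Hypothetical sketch: turn the profiler off for one database and confirm the
# setting. Assumes pymongo; the URI and database name are placeholders.
# Profiling is per database, so repeat for every database that has it enabled,
# or set operationProfiling.mode to "off" in the mongod configuration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["mydatabase"]                          # placeholder database name

db.command("profile", 0)              # level 0 disables the profiler
settings = db.command("profile", -1)  # level -1 only reads the current settings
print(f"Profiling level is now {settings['was']}")
{code}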

Kind regards,
Carl

Comment by Cen Zheng [ 21/May/20 ]

Hi Carl,

I have uploaded the log file from before 2020-05-18 12:36:45 GMT+8; please check.

Thanks!

Comment by Carl Champain (Inactive) [ 20/May/20 ]

Hi mingyan.zc@gmail.com,

Thank you for the report.

The earliest log line in the mongod.log files is at 2020-05-18T13:09:07.762+0800. To help us investigate this issue, we would also need the mongod.log files covering the period before this behavior started (so a bit before 2020-05-18 12:36:45 GMT+8). Can you please provide them?

Kind regards,
Carl
