[SERVER-22964] IX GlobalLock being held while waiting for WT cache eviction Created: 04/Mar/16  Updated: 19/Nov/16  Resolved: 30/Mar/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.3
Fix Version/s: 3.0.12, 3.2.5, 3.3.4

Type: Bug Priority: Critical - P2
Reporter: CenZheng Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: WTplaybook, code-and-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File diags1.tgz     File global_lock_hang.ts     File mongod.conf     File test.sh    
Issue Links:
Depends
depends on WT-2451 Allow eviction of metadata Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: Linux
Backport Completed:
Steps To Reproduce:

1. Set up a two-node replica set using the attached mongod.conf (only storage-related settings are specified)
2. Run test.sh
3. mongod will hang after the script has been running for a long time (2+ days)

Participants:

 Description   

Hi,
I'm currently running a stress test against 3.2.3 using the attached test script. The test environment is a two-node replica set using WiredTiger. After the script has been running for a long time (about 2 days), the mongod process hangs. As you can see in the attached thread stacks, a lot of threads were waiting for the GlobalLock, which was being held by another thread (probably the one writing profiling data, i.e. Thread 501). It looks like that thread took the lock in MODE_IX and was doing a cappedDelete operation to make room for a new profiling document, but it was waiting on WiredTiger's cache evict_waiter_cond and hung there forever.
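For context, the profiling writes mentioned above come from the database profiler, which stores its documents in the capped system.profile collection; when that collection is full, the server performs a cappedDelete (under the GlobalLock, taken in MODE_IX) to make room for the next document. A minimal sketch of enabling the profiler with pymongo follows; the host, port and threshold values are illustrative assumptions, not values taken from the attached files:

    # Hypothetical illustration -- not the attached test.sh.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # host/port assumed
    db = client["test"]

    # Level 2 profiles all operations; slowms is an arbitrary example value.
    # Enabling profiling creates the capped system.profile collection.
    db.command("profile", 2, slowms=100)
    print(db.command("profile", -1))  # read back the current profiler settings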



 Comments   
Comment by CenZheng [ 11/Mar/16 ]

Got it, thank you Ramón~

Comment by Ramon Fernandez Marina [ 10/Mar/16 ]

Thanks for the additional information, zhcn381. We believe we've identified the source of the problem and have produced a fix in WT-2451. Here's the explanation from a colleague:

What's causing the issue is the large amount of create and drop operations: each time a create or drop is done we update the WiredTiger metadata. The metadata is currently never evicted from the cache which means that we never remove entries created for deleted items, so if there are enough create/drop operations the metadata can grow quite large. If the cache is reasonably constrained we can end up with a situation where the metadata fills the entire cache and prevents other operations from completing.
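To make that pattern concrete, here is a minimal sketch of a create/drop-heavy workload of the kind described above (pymongo; the host and names are placeholders, and this is an illustration of the pattern rather than the attached test.sh):

    # Each create and each drop updates the WiredTiger metadata, which is
    # never evicted from the cache, so repeating this inflates the metadata.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # host/port assumed
    db = client["stress_test"]

    for i in range(1_000_000):
        coll = db[f"tmp_coll_{i}"]
        coll.insert_one({"i": i})  # implicitly creates the collection
        coll.drop()                # drops it again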

This volume of create and drop operations is not very common in production, but it's not unheard of either. If this matches your production use case, note that until the fix makes it into a stable release, using a larger WiredTiger cache should help you work around the issue.
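As a rough illustration of that workaround, a minimal mongod.conf sketch (mirroring the format of the attached mongod.conf; the path is a placeholder and the 4 GB figure is only an example to be sized against your workload):

    storage:
      dbPath: /data/db            # placeholder path
      engine: wiredTiger
      wiredTiger:
        engineConfig:
          cacheSizeGB: 4          # example: a larger cache leaves headroom for metadata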

Regards,
Ramón.

Comment by CenZheng [ 10/Mar/16 ]

Hi, Ramon

The situation occurred again; I have uploaded the latest diagnostics data (metrics.2016-03-09T13-01-56Z-00000.tar.gz). Please have a look.

Thanks~

Comment by Ramon Fernandez Marina [ 08/Mar/16 ]

Thanks for uploading, zhcn381. I can see the diagnostics data and I'm taking a look.

Comment by CenZheng [ 08/Mar/16 ]

hi Ramon, I have uploaded the diagnostic.data file from the previous occurrence, please check!

Thanks

Comment by Ramon Fernandez Marina [ 07/Mar/16 ]

zhcn381, I have been running your repro script for almost 48h but my setup hasn't hung. We do have some ideas about what the issue could be, but if it is possible for you to upload the diagnostic.data folder, that information would still be useful in investigating this ticket.

Thanks,
Ramón.

Comment by CenZheng [ 05/Mar/16 ]

hi Ramon,
I can't upload the diagnostic.data directory due to poor network conditions. I'll try again next Monday if you still need it then.
Thanks

Comment by Ramon Fernandez Marina [ 04/Mar/16 ]

zhcn381, can you please upload here the contents of the diagnostic.data directory from your dbpath and the full log file of the affected node?

I'll try to reproduce this behavior on my end in the meantime, but since you say it takes two days, it will be faster to look at the diagnostic data already captured by your system.

Thanks,
Ramón.
