[SERVER-48388] MongoDB processes get hung up when trying to acquire lock Created: 22/May/20  Updated: 07/Jul/20  Resolved: 07/Jul/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Raghu c Assignee: Dmitry Agranat
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File diagnostic.data.tar.gz     HTML File pstatck_file    
Issue Links:
Related
related to WT-3972 Allow more than 64K cursors to be ope... Closed
Operating System: ALL
Participants:

 Description   

We have a three-node MongoDB replica set deployed in our production environment. The primary mongod process hangs after running for roughly 12 hours. We can see a very large number of threads (around 15,000) stuck in the same stack:

 

#0 0x00007efd9c853c21 in do_futex_wait () from /lib64/libpthread.so.0
#1 0x00007efd9c853ce7 in __new_sem_wait_slow () from /lib64/libpthread.so.0
#2 0x00007efd9c853d85 in sem_timedwait () from /lib64/libpthread.so.0
#3 0x00005580f6d94c6c in mongo::TicketHolder::waitForTicketUntil(mongo::Date_t) ()
#4 0x00005580f681aedc in mongo::LockerImpl<false>::_lockGlobalBegin(mongo::LockMode, mongo::Duration<std::ratio<1l, 1000l> >) ()
#5 0x00005580f680a724 in mongo::Lock::GlobalLock::_enqueue(mongo::LockMode, unsigned int) ()
#6 0x00005580f680a79e in mongo::Lock::GlobalLock::GlobalLock(mongo::OperationContext*, mongo::LockMode, unsigned int, mongo::Lock::GlobalLock::EnqueueOnly) ()
#7 0x00005580f680a7e8 in mongo::Lock::GlobalLock::GlobalLock(mongo::OperationContext*, mongo::LockMode, unsigned int) ()

 

Attaching the pstack and diagnostic metrics.

Due to the sensitive nature of the db logs, they cannot be shared. The db logs contained statements showing around 14,484 open connections:

 

2020-05-22T13:38:49.633+0000 I NETWORK [listener] connection accepted from 172.28.96.186:43437 #18988 (14484 connections now open)
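For context, both symptoms above (the open connection count and the concurrency tickets that TicketHolder::waitForTicketUntil blocks on) can be read from serverStatus. A minimal pymongo sketch, not part of the attached diagnostics and with an assumed connection string:

from pymongo import MongoClient

# Assumed connection string; the real deployment details are not in this report.
client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")

# Open client connections; compare with the "14484 connections now open" log line.
print("connections current:", status["connections"]["current"])

# WiredTiger read/write concurrency tickets. When "available" stays at 0 while
# "out" is pegged at the maximum, new operations queue in
# TicketHolder::waitForTicketUntil, matching the stack trace above.
for mode in ("read", "write"):
    tickets = status["wiredTiger"]["concurrentTransactions"][mode]
    print(mode, "tickets out:", tickets["out"], "available:", tickets["available"])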

 



 Comments   
Comment by Dmitry Agranat [ 07/Jul/20 ]

Hi raghu.9208@gmail.com,

Glad to hear the issue did not occur on MongoDB 4.0.10. I will go ahead and close this case.

Regards,
Dima

Comment by Raghu c [ 03/Jul/20 ]

Hi dmitry.agranat,

I got a chance to test this on 4.0.10, and the same issue did not occur.

The issue also did not occur on 3.6.2 when using the MMAPv1 storage engine.

Comment by Dmitry Agranat [ 30/Jun/20 ]

Hi raghu.9208@gmail.com,

Did you have a chance to test this on MongoDB version 3.6.4 or later? If you did, did this issue occur again?

Thanks,
Dima

Comment by Raghu c [ 27/May/20 ]

Hi dmitry.agranat,

Thank you so much for the quick reply. I will update this thread after testing with one of the latest MongoDB versions.

Can you please help me understand what went wrong, so that we can include it as part of our stress testing?

The issue you've linked occurs only if more than 64K cursors are opened simultaneously on a data source. I'm fairly positive that our application does not open that many cursors in such a short time. Also, something I forgot to mention earlier is that even the secondaries become unresponsive, and we are not able to connect to any of the running instances.
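For what it's worth, the server-level open-cursor count can be read from serverStatus during normal operation; it is only a rough proxy for WiredTiger-level cursors, but it gives a sense of how far we are from that limit. A small pymongo sketch (connection string is an assumption, not from this thread):

from pymongo import MongoClient

# Assumed connection string; run against the primary during normal operation,
# before the node becomes unresponsive.
client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")

# serverStatus().metrics.cursor.open reports currently open server cursors;
# the linked issue applies only when more than 64K cursors are open on a data source.
open_cursors = status["metrics"]["cursor"]["open"]
print("open cursors (total):", open_cursors["total"])
print("open cursors (pinned):", open_cursors["pinned"])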

Also, what is a WiredTiger table? Is it an in-memory data structure that holds all the data before it is persisted to disk?

Thanks in advance.

Comment by Dmitry Agranat [ 26/May/20 ]

Hi raghu.9208@gmail.com, thank you for the report.

Based on the stack trace and your current MongoDB version (3.6.2), this might be related to WT-3972. Please let us know if you still see this issue after upgrading to MongoDB 3.6.4 or later. In general, we always recommend using the latest MongoDB versions (3.6.18, 4.0.18, 4.2.7).

Thanks,
Dima
