[SERVER-53852] MongoDB hangs randomly Created: 16/Jan/21  Updated: 29/Oct/23  Resolved: 20/Feb/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.2
Fix Version/s: 4.4.6, 5.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Ashish Madeti Assignee: Sergey Galtsev (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2021-01-19 at 2.28.30 PM.png     Text File gdb_2021-01-16_13-02-39.txt     File metrics.2021-01-16T01-00-10Z-00000     File metrics.interim     File mongod_500l.log    
Issue Links:
Backports
Depends
Documented
is documented by DOCS-14239 Investigate changes in SERVER-53852: ... Closed
Problem/Incident
causes SERVER-54680 logv2: use mongo::quickExit instead o... Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Steps To Reproduce:

Sorry, but I actually don't know how to reproduce it. Like I said, it randomly hangs.

Sprint: Security 2021-02-08, Security 2021-02-22
Participants:
Case:

 Description   

I am running a MongoDB 4.4.2 cluster with one Primary, one Secondary, and one hidden Secondary. On the hidden Secondary, MongoDB sometimes just hangs (about once every 2 days; it also happened once on the Primary). By "hangs", I mean:

  • I am not able to connect to mongod via the mongo shell
  • The Secondary stops replicating and starts lagging (until I restart it manually)
  • but running `rs.status()` on the Primary server shows that the hung Secondary is reachable

I referred to https://jira.mongodb.org/browse/SERVER-34190 which looked like a similar issue (but it was fixed in 3.6.4). So I have attached the files that were requested in that issue:

  1. Output of the gdb command: gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > gdb_`date +"%Y-%m-%d_%H-%M-%S"`.txt
  2. Last 500 lines of mongod.log
  3. The latest files from the diagnostic.data folder
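
For anyone collecting the same diagnostics from a script, here is a minimal Python sketch of the first two items. The function names `gdb_backtrace_command` and `last_lines` are hypothetical helpers made up for illustration; only the gdb flags and the timestamped file name come from the command above. The sketch builds the command without running it, so attaching gdb (which needs sufficient privileges and pauses the process while it runs) stays a deliberate step.

```python
import collections
import datetime


def gdb_backtrace_command(pid, now=None):
    """Return (argv, output_name) for a batch gdb dump of all thread backtraces.

    Hypothetical helper: it only assembles the arguments, it does not run gdb.
    """
    now = now or datetime.datetime.now()
    argv = ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]
    output_name = "gdb_" + now.strftime("%Y-%m-%d_%H-%M-%S") + ".txt"
    return argv, output_name


def last_lines(path, n=500):
    """Return the last n lines of a text file, keeping at most n lines in memory."""
    with open(path, "r", errors="replace") as f:
        return list(collections.deque(f, maxlen=n))
```

To use it, pass `argv` to `subprocess.run` with stdout redirected to `output_name`, and write the result of `last_lines` for the mongod log to a separate file.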

Please let me know if you need anything else or you want me to try running some commands.



 Comments   
Comment by Githook User [ 07/Apr/21 ]

Author:

{'name': 'Sergey Galtsev', 'email': 'sergey.galtsev@mongodb.com', 'username': 'brushless-glitch'}

Message: SERVER-53852 MongoDB hangs randomly (combined patches)

(cherry picked from commit 51142d61eeea0a30b2691680663d60c17441afce)
(cherry picked from commit 77d144dad2f49d78903c98985f61bf9245145e49)
Branch: v4.4
https://github.com/mongodb/mongo/commit/ae83ae0fa283efabb93c0fc55bf640cedd4916d7

Comment by Ashish Madeti [ 21/Feb/21 ]

Hello.

 

Just wanted to know which version the fix will be live in. Does the 'Fix Version' mean that the fix will be live in MongoDB 5.0?

Comment by Sergey Galtsev (Inactive) [ 20/Feb/21 ]

A separate ticket, SERVER-54680, was created to track the quick_exit issue; closing the present ticket.

Comment by Sergey Galtsev (Inactive) [ 20/Feb/21 ]

The patch broke the Mac build because std::quick_exit is not supported there.

Comment by Githook User [ 19/Feb/21 ]

Author:

{'name': 'Sergey Galtsev', 'email': 'sergey.galtsev@mongodb.com', 'username': 'brushless-glitch'}

Message: SERVER-53852 MongoDB hangs randomly
Branch: master
https://github.com/mongodb/mongo/commit/51142d61eeea0a30b2691680663d60c17441afce

Comment by Bruce Lucas (Inactive) [ 12/Feb/21 ]

sergey.galtsev, mark.benvenuto, I think a customer may be unlikely to see a message written to stderr. I wonder if it would be a good idea to write a message to the log file, instead of or in addition to writing a message to stderr, but without taking a lock. I imagine this might result in a log file that's not valid JSON, but that might be better than not recording the error anywhere.
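
As a rough illustration of the idea of writing without taking a lock, here is a minimal Python sketch. It is an analogy only, not the server's C++ code or the actual patch, and the names `open_log_fd` and `emergency_write` are hypothetical. The message goes to raw file descriptors in one unbuffered write each, so no logging lock is involved. As noted in the comment above, a line written this way may not be valid JSON.

```python
import os


def open_log_fd(path):
    """Open the log file for appending and return a raw file descriptor."""
    return os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)


def emergency_write(message, fds):
    """Write message to each descriptor directly, bypassing any logger and its lock.

    Failures are ignored on purpose: this is a last-resort path and must not raise.
    """
    data = (message + "\n").encode("utf-8", "replace")
    for fd in fds:
        try:
            os.write(fd, data)
        except OSError:
            pass
```

Calling `emergency_write("fatal error while logging", [2, log_fd])` would record the message on both stderr (descriptor 2) and the log file.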

Comment by Edwin Zhou [ 19/Jan/21 ]

Hi ashish@provakil.com,

Thank you for your detailed description and attaching all of the necessary files! It really helped expedite the investigation.

We believe that a lock was acquired for logging, and that the logging code then encountered an issue that caused logging to stop. We end up handling the resulting signal, and the handler tries to log the issue, re-entering the logging code. However, we believe that the logging mechanism then attempts to acquire that same lock, which is still held, causing it to hang.

I'll be passing this along to the security team for further investigation.

Best,
Edwin

Comment by Ashish Madeti [ 16/Jan/21 ]

I failed to mention in my initial description that I recently upgraded this cluster from MongoDB 3.6 to MongoDB 4.4 (via 4.0 and 4.2). The issue started happening only after that upgrade.

I am running the hidden secondary on a Digital Ocean droplet with 12 vCPUs and 48 GB RAM.

Generated at Thu Feb 08 05:32:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.