[SERVER-52655] Mongo thread hangs intermittently. Created: 06/Nov/20  Updated: 02/Dec/20  Resolved: 02/Dec/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.16
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Nitesh Vaidyanath
Assignee: Edwin Zhou
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Screen Shot 2020-11-09 at 4.27.54 PM.png, Screen Shot 2020-11-09 at 4.28.45 PM.png, Screenshot 2020-11-06 at 11.47.04 AM.png, Screenshot 2020-11-06 at 12.01.46 PM.png, gdb.txt, metrics.2020-11-05T19-02-22Z-00000, metrics.2020-11-05T23-13-52Z-00000, metrics.2020-11-06T03-11-36Z-00000
Operating System: ALL

 Description   

Hello,

mongod threads hang in the "recvmsg" system call, and because of this we are seeing very high load on the replica set. I don't see any COLLSCAN in the logs. When the threads hang, the read and write queues grow, which is expected. I'm not sure what is happening with this replica set.

We are currently running mongod 3.6.16 on an AWS i3.16xlarge instance. The PRIMARY fails over every time all the threads hang.

 



 Comments   
Comment by Edwin Zhou [ 02/Dec/20 ]

Hi Nitesh,

Glad to hear it's been resolved. I'll close this ticket now as requested.

 

Best,

Edwin

Comment by Nitesh Vaidyanath [ 26/Nov/20 ]

Hi Edwin, feel free to close the Jira case. The issue was with one of our clients; after we restarted the client service, the issue was resolved. Thanks for your help and quick response.

Comment by Edwin Zhou [ 24/Nov/20 ]

Hi nvaidyanath@paloaltonetworks.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please collect perf output, the mongod logs, and diagnostic.data, each with timestamps, and attach them to this ticket?

Thanks,
Edwin

Comment by Edwin Zhou [ 10/Nov/20 ]

Hi nvaidyanath@paloaltonetworks.com,

Thanks for your report and for providing the gdb output, FTDC data, and screenshots detailing the events. After some investigation, we were unable to pinpoint the exact reason you're seeing these performance issues. We also could not make any concrete correlations, because the screenshots provided are missing a timezone.

Could you provide a detailed timeline of events when the queuing occurs, when the hanging occurs, when the node fails over, and when the stack traces were collected?
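One way to build such a timeline is to log each event with an explicit UTC timestamp as it happens, which sidesteps the missing-timezone problem with the screenshots. The sketch below is a hypothetical helper, not part of any MongoDB tooling; the function name log_event and the file name timeline.txt are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MongoDB tooling): append one event to a
# timeline file, prefixed with the current time in UTC (ISO 8601), so the
# entry can be lined up with mongod log lines and diagnostic.data samples.
log_event() {
  local timeline=$1
  shift
  # "$*" joins the remaining arguments with single spaces.
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$timeline"
}
```

For example, run log_event timeline.txt "read/write queues growing" when queuing starts, and make similar calls when threads hang, when the node fails over, and just before gdb stack traces are collected.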

Would you be able to collect perf data while the mongod is running and the incidents you described are occurring? Please make sure an exact timestamp is included. If it's not, you can run perf with the --start option.

# record call stack samples and generate text output on the test node;
# note the exact time at which the recording was done in order
# to allow correlation with other events
date -u
perf record -a -g -F 99 sleep 60
perf script > perf.txt

We will also need the diagnostic.data directory, the mongod logs, and perf.txt.
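A minimal sketch of gathering everything requested into one archive, assuming a Linux host with perf installed and permission to sample system-wide (typically root). The function names are hypothetical, and the dbpath /var/lib/mongo and log path /var/log/mongodb/mongod.log used in the usage example are assumptions; substitute the storage.dbPath and systemLog.path values from your own mongod configuration. UTC start and end times are written next to the samples so the perf window can be matched against the logs and diagnostic.data.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for collecting the artifacts requested in this ticket.

# Record 60 s of system-wide call-stack samples at 99 Hz (same options as
# above), noting UTC start/end times, then convert the samples to text.
collect_perf() {
  local out_dir=$1
  mkdir -p "$out_dir"
  echo "perf start: $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$out_dir/perf-times.txt"
  perf record -a -g -F 99 -o "$out_dir/perf.data" sleep 60
  echo "perf end:   $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$out_dir/perf-times.txt"
  perf script -i "$out_dir/perf.data" > "$out_dir/perf.txt"
}

# Copy diagnostic.data and the mongod log next to the perf output, then pack
# the whole directory into one gzip-compressed tar archive for upload.
# The archive path should lie outside out_dir.
bundle_artifacts() {
  local out_dir=$1 dbpath=$2 logfile=$3 archive=$4
  mkdir -p "$out_dir"
  cp -r "$dbpath/diagnostic.data" "$out_dir/"
  cp "$logfile" "$out_dir/"
  tar -czf "$archive" -C "$out_dir" .
}
```

Usage, under the path assumptions above: run collect_perf /tmp/server-52655 while the incident is in progress, then bundle_artifacts /tmp/server-52655 /var/lib/mongo /var/log/mongodb/mongod.log /tmp/server-52655.tar.gz and attach the resulting archive.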

Best,

Edwin

Generated at Thu Feb 08 05:28:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.