[SERVER-34190] MongoDB process hangs after some random time Created: 29/Mar/18  Updated: 24/Apr/18  Resolved: 10/Apr/18

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 3.4.13, 3.4.14, 3.6.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Piotr Klimek Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Zrzut ekranu 2018-04-04 18.50.39.png    
Issue Links:
Depends
depends on SERVER-33445 Add signal handler to generate stack ... Closed
Duplicate
duplicates WT-3972 Allow more than 64K cursors to be ope... Closed
Operating System: ALL
Steps To Reproduce:

Not found yet.

Participants:

 Description   

We have a problem with our MongoDB (3.6.3) PRIMARY server. After some time it gets into a state where it is still PRIMARY but no longer accepts connections. The problem is that it keeps the PRIMARY role, and because of that our app crashes. Restarting mongod on the PRIMARY server helps and everything goes back to normal.

We are hosting mongodb in Amazon on 3 Ubuntu m5.4xlarge instances with 3000 IOPS EBS volumes.

During the crash we have ~30% more connections to MongoDB than usual, but they are still far below the limits and far below the fs.file-max setting, which is set to 6430188. No other metric looks suspicious: RAM, CPU, disk and network usage are at the same level as just before the crash and right after the restart of the PRIMARY. We have already migrated MongoDB from 3.4.14 to 3.6.3 and the problem still occurs every 1-2 days. We have also changed the priority for the PRIMARY server and migrated this role to another host, so it's not tied to any specific machine.
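(For reference, a quick way to confirm that headroom on a live node is to compare the serverStatus connection counts against the process and kernel file-descriptor limits. This is only a sketch reusing the root credentials from the commands below; the paths and field names are the standard Linux and serverStatus ones.)

cat /proc/sys/fs/file-max                        # kernel-wide fd limit
grep 'open files' /proc/$(pidof mongod)/limits   # per-process fd limit
mongo -u root -p pass --authenticationDatabase admin \
  --eval 'db.serverStatus().connections'         # current vs. available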

There is nothing interesting in the logs.

Here is the output of some commands that we ran while the server was in the unresponsive state:

$ mongo -u root -p pass --authenticationDatabase admin --eval 'rs.status()'
MongoDB shell version v3.6.3
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.6.3
2018-03-29T02:21:48.272+0000 E QUERY    [thread1] Error: network error while attempting to run command 'saslStart' on host '127.0.0.1:27017'  :
DB.prototype._authOrThrow@src/mongo/shell/db.js:1608:20
@(auth):6:1
@(auth):1:2
 
$ mongo -u root -p pass --authenticationDatabase admin --eval 'db.runCommand( { "connPoolStats" : 1 } )'
MongoDB shell version v3.6.3
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.6.3
2018-03-29T02:21:48.272+0000 E QUERY    [thread1] Error: network error while attempting to run command 'saslStart' on host '127.0.0.1:27017'  :
DB.prototype._authOrThrow@src/mongo/shell/db.js:1608:20
@(auth):6:1
@(auth):1:2
 
$  mongo -u root -p pass --authenticationDatabase admin --eval 'db.runCommand( { serverStatus: 1 } )'
MongoDB shell version v3.6.3
connecting to: mongodb://127.0.0.1:27017
2018-03-29T02:21:48.382+0000 W NETWORK  [thread1] Failed to connect to 127.0.0.1:27017, in(checking socket for error after poll), reason: Connection refused
2018-03-29T02:21:48.382+0000 E QUERY    [thread1] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed :
connect@src/mongo/shell/mongo.js:251:13
@(connect):1:6

Any idea what else we should check to debug this?



 Comments   
Comment by Kelsey Schubert [ 10/Apr/18 ]

Hi pklimek,

Thank you for the stacks. After review, it appears that you are encountering WT-3972. A fix for this issue will be included in MongoDB 3.6.4 and the release candidate for this version, 3.6.4-rc0, is currently available if you would like to upgrade to take advantage of this fix immediately.
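(For anyone else landing on this ticket: one way to check whether a node is approaching the 64K cursor limit described in WT-3972 is to watch the WiredTiger session statistics. This is a sketch, not an official diagnostic; "open cursor count" is the stat WiredTiger exposes through serverStatus, assuming your build reports it.)

mongo -u root -p pass --authenticationDatabase admin \
  --eval 'db.serverStatus().wiredTiger.session'
# Values of "open cursor count" approaching 65536 would be consistent
# with hitting the WT-3972 limit.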

Kind regards,
Kelsey

Comment by Piotr Klimek [ 04/Apr/18 ]

Hi,
The debug output that I sent was gathered while MongoDB was frozen. Anyway, it's happening quite often, so I've just sent the gdb output from last night. This one looks better than the previous one and contains more than just [New LWP xxxx] lines. I hope it will help you find the root cause of the problem.

We have already changed:

  • MongoDB version from 3.4 to 3.6
  • EC2 node type and size, a few times
  • We have been rebooting the server to prevent crashes
  • We have changed the node that we use as PRIMARY several times

All these changes just led us to another crash, and those crashes are happening every 24-48 hours.
All the graphs and metrics look more or less fine just before the crash; the only one that changes is the number of available tickets, which drops just before the crash. Today's crash was around 03:54 CET; at 04:05, after a restart, everything was back to normal.
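(One way to capture that ticket drop with timestamps, so it can be correlated with the crash window, is a simple polling loop. A sketch assuming the same root credentials used earlier; wiredTiger.concurrentTransactions is where serverStatus reports read/write tickets, and the timeout means a hung mongod shows up as failed probes rather than a stuck script.)

while true; do
  date -u +%FT%TZ
  timeout 10 mongo --quiet -u root -p pass --authenticationDatabase admin \
    --eval 'JSON.stringify(db.serverStatus().wiredTiger.concurrentTransactions)'
  sleep 30
done >> tickets.log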

Comment by Kelsey Schubert [ 04/Apr/18 ]

Hi pklimek,

Unfortunately, to determine the root cause of this behavior we'll need the stack traces generated by running the following command while the mongod is hung:

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > gdb_`date +"%Y-%m-%d_%H-%M-%S"`.txt

I'm not sure exactly what occurred with the previous invocation of gdb; it's possible that it was interrupted. Would you please try to collect the gdb stack traces again? Please use the command that I have provided without modification (unless there are multiple mongods running), as the date timestamp lets us know when the command was executed relative to the other diagnostic information provided.

The complete log files would also help give us more context, but I'm afraid that without the gdb stack traces we'll struggle to determine the root cause.

Thank you,
Kelsey

Comment by Piotr Klimek [ 31/Mar/18 ]

I'm sorry, somehow I missed this. The tar archive has now been uploaded.

Comment by Kelsey Schubert [ 31/Mar/18 ]

Thanks pklimek, we'll take a look. Could you also provide an archive of the diagnostic.data directory, as I previously requested?

Thanks again,
Kelsey

Comment by Piotr Klimek [ 31/Mar/18 ]

Today during the night the server crashed again. I have the logs from gdb (I've uploaded them to the same place as the rest of the logs), but those logs look weird: it's 53k lines of this:

[New LWP 13523]
[New LWP 13559]
[New LWP 13573]
[New LWP 13574]

Of course the numbers change, but there is nothing more than that.

Comment by Piotr Klimek [ 30/Mar/18 ]

Hello Kelsey,
Thanks for the quick response. I have added this gdb command to the debug tool that we run when problems occur. Unfortunately this issue has not shown up again yet. I have just sent some logs from the last crash via the link that you provided. The most interesting part is here:

2018-03-29T02:12:21.504+0000 I COMMAND  [conn36451] command xxxxxxxxxx
2018-03-29T02:12:53.346+0000 I NETWORK  [conn37858] end connection x.x.x.x.x:34500 (2450 connections now open)

As you can see, there is a 30-second gap in the logs, which is quite strange on a heavily loaded production database; after this gap only new connections are logged until the database is restarted.
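(One way to scan a mongod log for silent gaps like this automatically. A sketch only: it assumes the ISO-8601 timestamp prefix shown above, GNU date, and a hypothetical log path.)

awk '{print $1}' /var/log/mongodb/mongod.log | while read ts; do
  # drop the fractional seconds and offset; offsets are uniform in one log
  cur=$(date -d "${ts%.*}" +%s 2>/dev/null) || continue
  if [ -n "$prev" ] && [ $((cur - prev)) -gt 10 ]; then
    echo "gap of $((cur - prev))s before $ts"
  fi
  prev=$cur
done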

Comment by Kelsey Schubert [ 29/Mar/18 ]

Hi pklimek,

Thank you for reporting this issue. Would you please upload the logs and an archive of the diagnostic.data directory from an affected node? I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

If this diagnostic information is insufficient to identify the root cause, my next recommendation would be to collect gdb stack traces using the following command:

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > gdb_`date +"%Y-%m-%d_%H-%M-%S"`.txt

To speed our investigation, would you please collect gdb stack traces from the unresponsive node the next time this issue occurs and upload the file?
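(If catching the hang by hand is difficult, a small watchdog can run the capture automatically. This is only a sketch, not an official tool: it uses db.isMaster(), which needs no authentication, as a cheap liveness probe, and the 5-second timeout and 60-second poll interval are arbitrary choices.)

while true; do
  if ! timeout 5 mongo --quiet --eval 'db.isMaster()' >/dev/null 2>&1; then
    # mongod stopped answering: dump all thread stacks, as above
    gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' \
      > gdb_`date +"%Y-%m-%d_%H-%M-%S"`.txt
  fi
  sleep 60
done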

Thank you for your help,
Kelsey
