[SERVER-21562] Mongo DB 3.0.6 Hangs and becomes unresponsive to connections Created: 19/Nov/15  Updated: 09/Jan/16  Resolved: 09/Jan/16

Status: Closed
Project: Core Server
Component/s: Admin
Affects Version/s: 3.0.6
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Aish Reygan Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongodb - Copy.log    
Operating System: ALL
Steps To Reproduce:

Start MongoDB up, in about 15 mins, db stops responding or responds slowly.

Participants:

 Description   

A mongo DB server was recently upgraded from 2.6.1 to 3.0.6 on a Windows Server 64 bit. What we see is that every 10 to 15 mins the mongodb database becomes unresponsive.

The mongo clients are mainly PHP mongodb driver and libMongoC driver. We disabled the libMongoC driver and left the PHP mongodb driver and it still caused the same issue.

We see that the following in the log, the full log is also attached. Basically a lot of open connections coming in and not getting serviced or closed.

2015-11-19T12:52:14.603+0100 I NETWORK  [conn44] end connection 127.0.0.1:55794 (16 connections now open)
2015-11-19T12:52:14.603+0100 I NETWORK  [conn57] end connection 127.0.0.1:57191 (15 connections now open)
2015-11-19T12:52:14.603+0100 I NETWORK  [conn59] end connection 127.0.0.1:57192 (15 connections now open)
2015-11-19T12:52:14.603+0100 I NETWORK  [conn45] end connection 127.0.0.1:55795 (16 connections now open)
2015-11-19T12:52:14.603+0100 I NETWORK  [conn47] end connection 127.0.0.1:55926 (16 connections now open)
2015-11-19T12:52:14.604+0100 I NETWORK  [conn48] end connection 127.0.0.1:56166 (11 connections now open)
2015-11-19T12:52:14.603+0100 I NETWORK  [conn46] end connection 127.0.0.1:55799 (14 connections now open)
2015-11-19T12:52:14.604+0100 I NETWORK  [conn58] end connection 127.0.0.1:57193 (14 connections now open)
2015-11-19T12:52:14.603+0100 I NETWORK  [conn53] end connection 127.0.0.1:56195 (15 connections now open)
2015-11-19T12:52:14.604+0100 I NETWORK  [conn54] end connection 127.0.0.1:56196 (11 connections now open)
2015-11-19T12:52:14.604+0100 I NETWORK  [conn49] end connection 127.0.0.1:56174 (6 connections now open)
2015-11-19T12:52:14.604+0100 I NETWORK  [conn51] end connection 127.0.0.1:56193 (5 connections now open)
2015-11-19T12:52:14.604+0100 I NETWORK  [conn52] end connection 127.0.0.1:56194 (5 connections now open)
2015-11-19T12:52:14.604+0100 I NETWORK  [conn60] end connection 127.0.0.1:57303 (3 connections now open)
2015-11-19T12:52:14.605+0100 I NETWORK  [conn56] end connection 127.0.0.1:57190 (2 connections now open)
2015-11-19T12:52:14.605+0100 I NETWORK  [conn55] end connection 127.0.0.1:56239 (1 connection now open)
2015-11-19T12:52:16.979+0100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:57407 #61 (2 connections now open)
2015-11-19T12:52:16.984+0100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:57408 #62 (3 connections now open)
2015-11-19T12:52:16.987+0100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:57409 #63 (4 connections now open)
2015-11-19T12:52:17.030+0100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:57410 #64 (5 connections now open)
2015-11-19T12:52:17.049+0100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:57411 #65 (6 connections now open)
2015-11-19T12:52:17.050+0100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:57412 #66 (7 connections now open)
2015-11-19T12:52:17.119+0100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:57414 #67 (8 connections now open)
2015-11-19T12:52:18.070+0100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:57483 #68 (9 connections now open)
2015-11-19T12:52:18.124+0100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:57490 #69 (10 connections now open)
2015-11-19T12:52:18.193+0100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:57500 #70 (11 connections now open)
2015-11-19T12:52:18.221+0100 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:57504 #71 (12 connections now open)

Switching back from 3.0.6 to 2.6.1 makes it stable again.
Can you please look and tell us if there is anything we can do to address this issue.



 Comments   
Comment by Ramon Fernandez Marina [ 23/Nov/15 ]

The logs indicate that all these connections are being originated from localhost. I couldn't find any server-side changes that may account for this behavior, so you may need to do something at the application level to throttle and/or close the connections to the server. One thing you can try is setting a lower timeout to close idle open cursors (see SERVER-8188 for details) and see if that alleviates the issue, but so far there's not enough evidence of a bug in the server.

If you were able to create a reproducer so we could observe this behavior on this end that would help investigate the issue. Alternatively you can privately and securely upload the application code that triggers this behavior here.

Thanks,
Ramón.

Comment by Aish Reygan [ 19/Nov/15 ]

Hello,

Regarding your questions
> 1. What version of libmongoc and the php driver are you using? If you're not using the latest version of both, it might be useful to confirm that this issue appears with the latest version.

We are using PHP Driver version 1.6.8 running PHP 5.6.12
We were using libmongoc 1.0.

> it might be useful to confirm that this issue appears with the latest version.

Since this issue is reproduceable in a production system, we are unable to switch things too much on the fly. We are trying to see if we can see this behavior on our development machines.

> Have you managed to reproduce this issue with MongoDB 3.0.7?

Not yet, If you believe there are bugs fixed in 3.0.7 that could help address this issue, we can definitely give it a try next time the system is stopped.

> Do you have resource limitations configured that might reduce the number of available connections or open file descriptors?

Not that we are aware of. This is running on Windows Server so ulimit etc might not apply and the connections don't seem excessive (ie 100s not 1000s)

> What kind of activity is your client doing on the database? Do you have a sense if the traffic predominantly write or read heavy? Do you know if there was a change in client activity after the upgrade?

Mostly read heavy, not much writes at that time. No specific difference in client activity.

> Are you upgrading and downgrading instances that have data? Be aware of the downgrade restriction for moving to 2.6 from 3.0. If you have an existing data set, do you know, approximately, how large the data set is?

Thanks for the heads up. We weren't aware of that. Next time we have to downgrade any customer we will downgrade them to 2.6.5. During the downgrade, it asked us to run --upgrade and doing so made it startup with the pre-existing data fine. The data set is not large, most have less than a 100K records, one specific database contained a few million records and that was the database that had a lot of writes to it since those records are kind of log entries.

Thanks for anything else that you can point us towards.

Comment by Sam Kleinman (Inactive) [ 19/Nov/15 ]

Hello,

Thanks for this report. I wanted to ask a few questions to help us understand what's going on in this situation.

  1. What version of libmongoc and the php driver are you using? If you're not using the latest version of both, it might be useful to confirm that this issue appears with the latest version.
  2. Have you managed to reproduce this issue with MongoDB 3.0.7?
  3. Do you have resource limitations configured that might reduce the number of available connections or open file descriptors?
  4. What kind of activity is your client doing on the database? Do you have a sense if the traffic predominantly write or read heavy? Do you know if there was a change in client activity after the upgrade?
  5. Are you upgrading and downgrading instances that have data? Be aware of the downgrade restriction for moving to 2.6 from 3.0. If you have an existing data set, do you know, approximately, how large the data set is?

Thanks for your patience, and I hope we can understand what's going on here.

Regards,
sam

Generated at Thu Feb 08 03:57:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.