[SERVER-29120] mongos connections increase indefinitely on chunk md refresh Created: 10/May/17  Updated: 29/Jan/18  Resolved: 16/Jun/17

Status: Closed
Project: Core Server
Component/s: Networking, Sharding
Affects Version/s: 3.2.12
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Yoni Douek Assignee: Kaloian Manassiev
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File dan1.png     PNG File dan2.png    
Operating System: ALL
Participants:

 Description   

We have a pretty intensive process that stresses our mongod servers. When this process is running, we encounter the following issue: there's a sudden increase in the number of connections mongos opens, and then it starts taking more and more CPU until it becomes totally stuck. Please see the 2 graphs attached.

When this happens, the mongos logs show A LOT of open/close connection messages (to the mongod servers). The trigger for this seems to be a chunk move. When the move is complete, the ChunkManager refreshes, and then the connections start opening and closing endlessly.
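One rough way to quantify the open/close churn described above is to bucket the matching log lines by minute. The following is a minimal sketch, not a MongoDB tool. It assumes each log line starts with an ISO-8601 timestamp. The function name `count_per_minute` and the match substrings `OPENED` / `CLOSED` are hypothetical placeholders; replace the substrings with whatever text the open/close lines actually contain in the mongos log.

```python
from collections import Counter


def count_per_minute(lines, patterns):
    """Count log lines per minute for each named substring pattern.

    lines    -- iterable of log lines assumed to start with an ISO-8601
                timestamp such as "2017-05-10T12:34:56.789+0000"
    patterns -- dict mapping a label to a substring to search for
                (the substrings are placeholders chosen by the caller)
    """
    counts = {name: Counter() for name in patterns}
    for line in lines:
        # Skip lines that do not look like they start with a timestamp.
        if len(line) < 16 or line[10] != "T":
            continue
        minute = line[:16]  # "YYYY-MM-DDTHH:MM"
        for name, needle in patterns.items():
            if needle in line:
                counts[name][minute] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines; real mongos messages will differ.
    sample = [
        "2017-05-10T12:34:01.000+0000 I NETWORK  [x] OPENED placeholder",
        "2017-05-10T12:34:02.000+0000 I NETWORK  [x] CLOSED placeholder",
        "2017-05-10T12:35:10.000+0000 I NETWORK  [x] OPENED placeholder",
    ]
    result = count_per_minute(sample, {"open": "OPENED", "close": "CLOSED"})
    for label, per_minute in result.items():
        for minute, n in sorted(per_minute.items()):
            print(label, minute, n)
```

A sharp rise in the per-minute counts right after a chunk move would line up with the behavior in the attached graphs. Substring matching is used instead of regular expressions so the patterns are easy to adjust once the real message text is known.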

Note that our mongod+mongos run on super strong machines, c4.4xlarge.

Is this known / any workaround?

SERVER-22611 (opened by myself) seems to be related, but it was fixed only in 3.4, and we are using 3.2.12.



 Comments   
Comment by Kaloian Manassiev [ 16/Jun/17 ]

Hi yoni@appsee.com,

Since we haven't heard back I am going to close this ticket 'Cannot Reproduce'. If you are able to get the logs, please reopen it.

Best regards,
-Kal.

Comment by Kaloian Manassiev [ 16/May/17 ]

Hi yoni@appsee.com,

Are you continuing to experience this problem and were you able to obtain the mongos logs? Unfortunately without the logs we have no idea what is going on and even with 3.2.12 logs we might not have sufficient information, but I would like to take a look first.

I am fairly certain it is not related to SERVER-22611, because that bug manifests as a stall on the mongos side due to repeated full collection reloads under a mutex, and there is no way this would cause connections to be closed.

Best regards,
-Kal.

Comment by Yoni Douek [ 11/May/17 ]
  • Random mongos.
  • Ends only on restart.
  • I think it happens AFTER the refresh. The critical section is pretty short so it's hard to know.
  • 5 shards.

Re logs - this time it's not so easy but I'll try to get them. You basically see a lot of connection opens/closes. We won't be able to upgrade to 3.2.13 so easily.

Comment by Kaloian Manassiev [ 11/May/17 ]

Hi yoni@appsee.com,

Would it be possible to attach the log from the mongos which exhibits such an event? Ramon included a secure upload link.

Also I have a couple of questions:

  • Is it always the same mongos to which this is happening?
  • Did this event end on its own or did you have to restart the mongos?
  • When you say that the trigger for this event is the refresh of the chunk manager - do you mean that the connections start to accumulate after the refresh happens or possibly while the shard is in the critical section and operations are blocked on it?
  • How many shards do you have?

Best regards,
-Kal.

Comment by Ramon Fernandez Marina [ 10/May/17 ]

I've preemptively created a secure upload portal for file uploads. I think we'd need to see logs to determine if this is SERVER-22611, but I'm not sure at the moment if 3.2.12 logs will have that information or what logs will be most interesting.

In 3.2.13 we've added a number of logging improvements to help troubleshoot connection related issues, so we may need to ask you to upgrade a mongos to 3.2.13 and provide logs from that node when the behavior you describe appears.

Generated at Thu Feb 08 04:19:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.