[SERVER-7034] timeouts for all connections in migrate critical section Created: 13/Sep/12  Updated: 11/Jul/16  Resolved: 03/Jan/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 2.2.4, 2.3.2

Type: Bug Priority: Major - P3
Reporter: Greg Studer Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-7298 thousands of "waiting till out of cri... Closed
related to SERVER-7472 Replication lag can cause cluster to ... Closed
related to SERVER-7500 Set socket timeout on connection used... Closed
is related to SERVER-7922 All operations blocked on one sharded... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Connections made inside the migrate critical section need timeouts; otherwise blackholed hosts can cause hangs.
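
To illustrate why a missing timeout matters here, the following is a minimal sketch in Python, not the server's C++ code. It assumes a "blackholed" host can be approximated by a local listener that never accepts or replies. The helper names (`start_blackhole`, `read_reply`) and the constant are hypothetical; the 10-second value is taken from the commit message in the comments below.

```python
import socket
import time

# Value named in the fix's commit message; the old default was 0 (no timeout).
CRITICAL_SECTION_TIMEOUT_SECS = 10.0


def start_blackhole():
    """Listen on a local port but never accept or reply (stand-in for a blackholed host)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    return server


def read_reply(address, timeout_secs):
    """Connect and wait for a reply; return None if the peer stays silent past the timeout.

    Passing timeout_secs=None mimics the pre-fix behaviour: recv() would block
    indefinitely, which is the hang described in this ticket.
    """
    with socket.create_connection(address, timeout=timeout_secs) as conn:
        try:
            return conn.recv(1024)
        except socket.timeout:
            return None


if __name__ == "__main__":
    blackhole = start_blackhole()
    started = time.monotonic()
    # A short timeout keeps the demo quick; the fix itself uses 10 seconds.
    reply = read_reply(blackhole.getsockname(), timeout_secs=0.5)
    print("reply:", reply, "elapsed:", round(time.monotonic() - started, 1))
    blackhole.close()
```

With a finite timeout the caller regains control and can release whatever it is holding; with no timeout, the same call would never return while the peer stays silent.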



 Comments   
Comment by auto [ 21/Mar/13 ]

Author: Tad Marshall <tad@10gen.com>, 2012-10-15T20:40:29Z

Message: SERVER-7034 add 10 second timeouts to three connections

Change the timeout on three ScopedDbConnections (made while holding
a critical section) from default of zero (no timeout) to 10 seconds.

Conflicts:

src/mongo/s/d_migrate.cpp
Branch: v2.2
https://github.com/mongodb/mongo/commit/415ccd3c89eed61d8fa87efaa94045c4c8d5ad75

Comment by Kay Agahd [ 27/Nov/12 ]

Hello Eliot and Spencer, I added some new logs and MMS screenshots to the related private JIRA https://jira.mongodb.org/browse/SUPPORT-366 to dispel any doubts about the cause of this bug.
Could you please have a look there and tell us for sure whether this bug has been fixed in version 2.2.2-rc1?
Thank you!

Comment by Klébert Hodin [ 22/Nov/12 ]

Thanks for the details. We'll upgrade to 2.2.2.

Comment by Eliot Horowitz (Inactive) [ 22/Nov/12 ]

agahd and klebert - I think it's highly unlikely this ticket is causing the issues, as it only has an impact if there are network outages or servers crashing, as Spencer said.
It's more likely SERVER-7493 or SERVER-7472, which are fixed for 2.2.2 (currently in 2.2.2-rc1).

Comment by Kay Agahd [ 21/Nov/12 ]

Spencer, my company is in the same situation as Klébert. We encountered at least 3 "waiting of critical section" outages in the last week alone. Fortunately the mongo cluster was still accessible during these incidents; only the affected mongod nodes were inaccessible. While waiting impatiently for a hotfix, we are analyzing mongod's log in real time so that we can restart a node as soon as it enters a "waiting of critical section" state, to avoid risking any downtime. That's just a quick & dirty hack, which is a pity to have to use in production.

I've also already created a private jira where I've uploaded all logs that 10gen asked for:
https://jira.mongodb.org/browse/SUPPORT-366
If you still need more logs or more info to fix this bug asap, please just tell me.
Thanks!

Comment by Spencer Brody (Inactive) [ 21/Nov/12 ]

Klébert, this will be fixed in the upcoming 2.3.1 development release which will roll over into the 2.4 production release. This should only cause an issue if you have a node failure or network connectivity outage in the middle of the critical section of a migration, which should be pretty unlikely since the critical section generally does not last very long. It's surprising to me that you would have hit it multiple times in the last 2 weeks. Have you had multiple node crashes in the last 2 weeks? Has there been a crash every time the cluster has gone unavailable? Or are you having regular network problems? It may be a good idea for you to open a new ticket in our "Community Private" jira project and upload your logs there so we can take a closer look at what went wrong.

Comment by Klébert Hodin [ 21/Nov/12 ]

Any updates on this issue? We experienced it several times in the last 2 weeks.
This bug always leads to full cluster unavailability.

Comment by Daniel Pasette (Inactive) [ 06/Nov/12 ]

Need to re-evaluate timeout for recvChunkCommit.

Comment by auto [ 16/Oct/12 ]

Author: Tad Marshall <tad@10gen.com>, 2012-10-15T13:40:29-07:00

Message: SERVER-7034 add 10 second timeouts to three connections

Change the timeout on three ScopedDbConnections (made while holding
a critical section) from default of zero (no timeout) to 10 seconds.
Branch: master
https://github.com/mongodb/mongo/commit/c35bd13c828582d8f79247a72b76b260f7b1f45b
