[SERVER-7034] timeouts for all connections in migrate critical section Created: 13/Sep/12 Updated: 11/Jul/16 Resolved: 03/Jan/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 2.2.4, 2.3.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Greg Studer | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Done | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Description |
|
Otherwise blackholed hosts can cause hangs. |
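The failure mode described here can be illustrated outside of MongoDB. The following is a minimal Python sketch, not the server's code: it assumes a "blackholed" peer can be approximated by a local listener that completes the TCP handshake but never replies, and the names `start_blackhole_server` and `call_with_timeout` are hypothetical. Without a socket timeout the `recv` call would block indefinitely; with one, the caller gets control back.

```python
import socket


def start_blackhole_server():
    """Listen on a free local port but never accept or answer.

    The kernel completes the TCP handshake from the listen backlog,
    so a client can connect and send, yet no reply ever arrives.
    This only approximates a blackholed host (assumption).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    return server


def call_with_timeout(address, payload, timeout_secs):
    """Send payload and wait for a reply, giving up after timeout_secs.

    Returns the reply bytes, or None if the peer stayed silent.
    """
    with socket.create_connection(address, timeout=timeout_secs) as sock:
        sock.sendall(payload)
        try:
            return sock.recv(1024)
        except socket.timeout:
            # The peer never answered; return instead of hanging.
            return None


if __name__ == "__main__":
    server = start_blackhole_server()
    reply = call_with_timeout(server.getsockname(), b"ping", 0.5)
    print("reply:", reply)
    server.close()
```

The design point is simply that every blocking network call made while a lock or critical section is held needs a bound on how long it may wait.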
| Comments |
| Comment by auto [ 21/Mar/13 ] |
|
Author: Tad Marshall <tad@10gen.com>
Date: 2012-10-15T20:40:29Z
Message: Change the timeout on three ScopedDbConnections (made while holding
Conflicts: src/mongo/s/d_migrate.cpp |
| Comment by Kay Agahd [ 27/Nov/12 ] |
|
Hello Eliot and Spencer, I added some new logs and MMS screenshots to the related private JIRA ticket https://jira.mongodb.org/browse/SUPPORT-366 in order to remove any doubt about the cause of this bug. |
| Comment by Klébert Hodin [ 22/Nov/12 ] |
|
Thanks for the details. We'll upgrade to 2.2.2. |
| Comment by Eliot Horowitz (Inactive) [ 22/Nov/12 ] |
|
agahd and klebert - I think it's highly unlikely that this ticket is causing your issues, since it only has an effect when there are network outages or server crashes, as Spencer said. |
| Comment by Kay Agahd [ 21/Nov/12 ] |
|
Spencer, my company is in the same situation as Klébert. We encountered at least 3 "waiting of critical section" outages in the last week alone. Fortunately, the mongo cluster as a whole remained accessible each time; only the affected mongod nodes were inaccessible. While waiting impatiently for a hotfix, we are analyzing mongod's log in real time so that we can restart a node as soon as it enters a "waiting of critical section" state and avoid the risk of downtime. That's just a quick-and-dirty hack, and it's a pity to have to use it in production. I've also already created a private JIRA ticket where I've uploaded all the logs that 10gen asked for: |
| Comment by Spencer Brody (Inactive) [ 21/Nov/12 ] |
|
Klébert, this will be fixed in the upcoming 2.3.1 development release which will roll over into the 2.4 production release. This should only cause an issue if you have a node failure or network connectivity outage in the middle of the critical section of a migration, which should be pretty unlikely since the critical section generally does not last very long. It's surprising to me that you would have hit it multiple times in the last 2 weeks. Have you had multiple node crashes in the last 2 weeks? Has there been a crash every time the cluster has gone unavailable? Or are you having regular network problems? It may be a good idea for you to open a new ticket in our "Community Private" jira project and upload your logs there so we can take a closer look at what went wrong. |
| Comment by Klébert Hodin [ 21/Nov/12 ] |
|
Any updates on this issue? We have experienced it several times in the last 2 weeks. |
| Comment by Daniel Pasette (Inactive) [ 06/Nov/12 ] |
|
Need to re-evaluate timeout for recvChunkCommit. |
| Comment by auto [ 16/Oct/12 ] |
|
Author: Tad Marshall <tad@10gen.com>
Date: 2012-10-15T13:40:29-07:00
Message: Change the timeout on three ScopedDbConnections (made while holding