[SERVER-7298] thousands of "waiting till out of critical section" Created: 09/Oct/12  Updated: 08/Mar/13  Resolved: 12/Oct/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kay Agahd Assignee: Tad Marshall
Resolution: Incomplete Votes: 0
Labels: crash, replicaset, sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

linux 64 bit


Issue Links:
Related
is related to SERVER-7034 timeouts for all connections in migra... Closed
is related to SERVER-7472 Replication lag can cause cluster to ... Closed
is related to SERVER-7493 Possible for read starvation to cause... Closed
Operating System: Linux
Participants:

 Description   

We are running mongodb v2.2.0 Linux 64 Bit, 3 shards each having 3 nodes.
One node seemed to be dead, so we restarted it. In the logs we found thousands of "waiting till out of critical section" messages, filling up our logs very quickly and making mongodb inaccessible.
What does this error mean?
Here are the logs from the restart, if it matters:
http://pastebin.com/raw.php?i=cZeRJ7NV

This has already happened to other nodes at other times as well.

Could it be related to v2.2.0 or to authentication? We have only been using both for a short time, and we have encountered this error only since then. Should we downgrade or disable authentication?



 Comments   
Comment by Ian Whalen (Inactive) [ 13/Nov/12 ]

Klebert, the issue in question, SERVER-7034, has not yet been resolved (and thus not backported). Please add yourself as a watcher to SERVER-7034 so that you'll receive an update when it is backported.

Comment by Klébert Hodin [ 13/Nov/12 ]

We don't use authentication and had the same issue after upgrading our whole cluster from 2.0.6 to 2.2.1.

Comment by alex giamas [ 14/Oct/12 ]

Thanks all for the answers; I've added SERVER-7034 to my watch list.

Comment by Tad Marshall [ 12/Oct/12 ]

SERVER-7034 is the primary ticket for dealing with this issue. Resolving this one as incomplete for now.

Comment by Kay Agahd [ 12/Oct/12 ]

Yes, OK, please follow up in the support ticket so this one can be closed.

Comment by Tad Marshall [ 12/Oct/12 ]

We don't have any indication at this point that this is related to authentication. We think that the fundamental problem is the lack of a timeout on the connection to the config server, making it possible for a single non-responsive config server to "hang" multiple mongod processes. That issue (SERVER-7034) is scheduled to be fixed for version 2.3.0 and will be backported to the 2.2 series pending testing results.

Alex, you can add yourself as a "watcher" of SERVER-7034 if you want to follow its progress.

agahd, we can follow up in the SUPPORT ticket you created, so we can close this one unless you have more that you want to add here.

Tad

Comment by Kay Agahd [ 12/Oct/12 ]

Alex: on our side it was just a guess that it's related to authentication. Maybe Tad can confirm that.
Do you need any other input from me?

Comment by alex giamas [ 12/Oct/12 ]

Regardless of the solution, could you post whether or not it's related to authentication? "Blackholed hosts" would hint towards a yes, but we need to make sure. If it's not, it would help those of us not using authentication to avoid putting it on our "blocker" list for upgrading.
Yours,
Alex

Comment by Kay Agahd [ 09/Oct/12 ]

Thanks Tad! Your explanation and the related issues reflect what we've experienced. The whole system seemed to be down even though only 1 mongod node was affected.
Do you suggest that we disable authentication or downgrade to an earlier version to avoid this bug?

I've created a private Jira ticket in order to submit our confidential logs to you:
https://jira.mongodb.org/browse/SUPPORT-366

Yes, we are in mms. Our group name is idealo.
https://mms.10gen.com/host/list/4f5f582287d1d86fa8b88186#hosts

Thanks!

Comment by Tad Marshall [ 09/Oct/12 ]

Hi agahd,

This may be related to SERVER-7034. The issue in that ticket is that we do not have a timeout (i.e. the timeout is infinite) for some connections that are made to the config servers, and if such a connection hangs it can block other activity, leading to the "waiting for critical section" messages and an unresponsive server.
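
To illustrate the mechanism described above, here is a minimal Python sketch (purely illustrative, not mongod source code; the function and parameter names are hypothetical) of how a blocking network call with no timeout can wait forever on an unresponsive peer, while a finite timeout turns the hang into an error the caller can handle:

import socket

# Hypothetical illustration, not MongoDB code: a blocking socket with no
# timeout waits on recv() indefinitely if the peer never replies, which is
# the kind of hang described for the mongod -> config server connection.
def query_peer(host, port, request, timeout=None):
    # timeout=None means "wait forever"; a finite value bounds the wait.
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.sendall(request)
        # With timeout=None this call can block forever on an unresponsive
        # peer; with a finite timeout it raises socket.timeout instead, so
        # the caller can fail fast or retry rather than hang.
        return sock.recv(4096)
    finally:
        sock.close()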

Can you post a full log to this ticket so that we can compare symptoms with the cases we have seen?

Are your servers in MMS? Can you post a link?

Tad
