[SERVER-3165] Server locks up on moveChunk Created: 30/May/11  Updated: 12/Jul/16  Resolved: 31/May/11

Status: Closed
Project: Core Server
Component/s: Performance, Sharding
Affects Version/s: 1.8.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jalmari Raippalinna Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

2 shards 3 replicas each. 7 mongos clients each having 8 clients connecting to mongos. All running on ubuntu 10.04.


Attachments: Text File logi25.txt    
Operating System: Linux
Participants:

 Description   

We noticed severe, random lockups on our servers. After investigating the issue for a few days, we found that they happened every time the balancer started moving chunks around.

After the moveChunk request is accepted, the server still accepts connections and queries, but they never seem to complete. Within 15 seconds our servers hit 20000 connections and run out (the servers have ulimit -n 50000) because queries pile up; the mongo shell becomes unresponsive on the primary shard long before that. Sometimes a moveChunk does go through just fine, but given our heavy query volume, I believe the lockup happens when we query the chunk that is currently being moved.

After we kill the primary, a secondary takes over and the environment is stable again until the balancer kicks in.

I've attached a log from the primary shard server (EU), running with -vvvvv, that illustrates the issue.

Disabling the balancer stopped the server lockups, so we are running without it until this is resolved.
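
For reference, a minimal sketch of the workaround, assuming the 1.8-era mechanism of setting `stopped: true` on the `balancer` document in the `config.settings` collection through a mongos. The helper names are hypothetical, and the collection object is assumed to expose a PyMongo-style `update_one` method; this is not taken from the reporter's setup.

```python
from typing import Any, Dict, Tuple


def balancer_toggle_spec(stopped: bool) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build the (filter, update) documents for the balancer settings entry.

    Assumption: the balancer state lives in config.settings under _id "balancer".
    """
    return {"_id": "balancer"}, {"$set": {"stopped": stopped}}


def set_balancer_stopped(settings_collection: Any, stopped: bool) -> None:
    """Apply the toggle to a collection object with a PyMongo-style update_one.

    upsert=True creates the settings document if it does not exist yet.
    """
    flt, update = balancer_toggle_spec(stopped)
    settings_collection.update_one(flt, update, upsert=True)
```

With PyMongo, the collection would typically be something like `client["config"]["settings"]` on a connection made through a mongos; passing `stopped=False` re-enables the balancer.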

We are not sure whether this happened the same way when we were using 1.8.1.



 Comments   
Comment by Eliot Horowitz (Inactive) [ 31/May/11 ]

There was definitely an issue with RC2 that has been fixed.
You can try the nightly build or wait for 1.8.2 this week.

Comment by Jalmari Raippalinna [ 30/May/11 ]

Oh sorry, I had it there and it got dropped while editing. We are now using 1.8.2rc2.

Also, I have a slight feeling that this started when we moved to 1.8.2rc2, but I'm not entirely sure. We did experience signal 11 segfaults before the upgrade, hence the upgrade to rc2.

Our application code is somewhat sensitive to the DB not responding to queries at all, which causes the servers to run out of connections: the DB accepts connections and queries just fine, but they neither time out nor return an error.
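
One way an application could bound this kind of pile-up is with client-side timeouts and a capped connection pool. The sketch below only builds a connection string using the standard `socketTimeoutMS`, `connectTimeoutMS`, and `maxPoolSize` URI options; the function name, host, and values are hypothetical, and whether the driver in use here honors these options is an assumption.

```python
from urllib.parse import urlencode


def build_mongos_uri(host: str, socket_timeout_ms: int,
                     connect_timeout_ms: int, max_pool_size: int) -> str:
    """Return a mongodb:// URI with client-side timeouts and a pool cap.

    The options make a stalled query fail on the client instead of holding
    a connection open indefinitely.
    """
    options = {
        "socketTimeoutMS": socket_timeout_ms,    # give up on a silent socket
        "connectTimeoutMS": connect_timeout_ms,  # give up on a slow connect
        "maxPoolSize": max_pool_size,            # cap connections per client
    }
    return f"mongodb://{host}/?{urlencode(options)}"


# Hypothetical host and values, for illustration only.
uri = build_mongos_uri("mongos.example.net:27017", 5000, 2000, 50)
```

This does not address the server-side lockup itself; it only limits how many connections a single client can tie up while the server is unresponsive.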

Comment by Eliot Horowitz (Inactive) [ 30/May/11 ]

1.8.2 hasn't been released yet, so which specific version are you using?
There were some issues with some of the RCs, but they are fixed as of the latest nightlies.
