[SERVER-3165] Server locks up on moveChunk Created: 30/May/11 Updated: 12/Jul/16 Resolved: 31/May/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, Sharding |
| Affects Version/s: | 1.8.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jalmari Raippalinna | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
2 shards 3 replicas each. 7 mongos clients each having 8 clients connecting to mongos. All running on ubuntu 10.04. |
||
| Attachments: |
|
| Operating System: | Linux |
| Participants: |
| Description |
|
We noticed severe, seemingly random lockups on our servers, and after investigating the issue for a few days we found that they happened every time the balancer started moving chunks around. After a moveChunk request is accepted, the server still accepts connections and queries, but they never seem to complete. After 15 seconds, our servers ran out of 20000 connections (servers have ulimit -n 50000) because queries pile up; however, the mongo shell becomes unresponsive on the primary shard long before that. Sometimes moveChunks do go through just fine, but because of our heavy query volume, I believe we are querying the chunk that is currently being moved when the lockup happens. After killing the primary, the secondary takes over and the environment is stable again until the balancer kicks in again. I've included a log from the primary shard server (EU) running with -vvvvv illustrating the issue. Disabling the balancer stopped the server lockups, so we are running without the balancer until this is resolved. We are not sure whether this happened the same way when we were using 1.8.1, though. |
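For readers who want to apply the same workaround: the ticket does not say how the balancer was disabled, but in MongoDB releases of this era the usual approach was to set a `stopped` flag on the `balancer` document in the `settings` collection of the `config` database, through a mongos. The following is a minimal Python sketch under that assumption. The helper names `stop_balancer` and `balancer_is_stopped` are hypothetical, and `settings` is assumed to be a PyMongo-style collection object exposing `update_one` and `find_one`.

```python
def stop_balancer(settings):
    """Flag the balancer as stopped.

    Assumption: `settings` is the `settings` collection of the `config`
    database, reached through a mongos (not a shard directly).
    """
    settings.update_one(
        {"_id": "balancer"},          # the single balancer settings document
        {"$set": {"stopped": True}},  # ask the balancer not to start new rounds
        upsert=True,                  # create the document if it does not exist yet
    )


def balancer_is_stopped(settings):
    """Return True if the balancer settings document carries stopped=True."""
    doc = settings.find_one({"_id": "balancer"})
    return bool(doc and doc.get("stopped"))
```

With PyMongo this might be called as `stop_balancer(MongoClient("mongos-host")["config"]["settings"])`, where `mongos-host` is a placeholder. Setting the flag only prevents new balancing rounds; it is not expected to interrupt a migration that is already in progress.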
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 31/May/11 ] |
|
There was definitely an issue with RC2 that has been fixed. |
| Comment by Jalmari Raippalinna [ 30/May/11 ] |
|
Oh sorry, I had it there and it got dropped while editing. We are now using 1.8.2rc2. I also have a slight feeling that this started when we moved to 1.8.2rc2, but I'm not entirely sure. We did experience signal 11 segfaults before the upgrade, hence the move to rc2. Our application code is somewhat sensitive to the DB not responding to queries at all, which causes connections to run out on the servers, because the DB accepts connections and queries just fine but they neither time out nor respond with an error. |
| Comment by Eliot Horowitz (Inactive) [ 30/May/11 ] |
|
1.8.2 hasn't been released yet, so which specific version are you using? |