[SERVER-14996] Locks not released if config server goes down / balancer operation and moveChunk command stop working Created: 21/Aug/14 Updated: 21/May/15 Resolved: 21/May/15
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.4.8, 2.4.10, 2.6.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ronan Bohan | Assignee: | Ramon Fernandez Marina |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | balancer, moveChunk, sharding | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: |
Sometimes it may be necessary to run the repro.sh script more than once for it to demonstrate the problem. Typically I'm seeing one lock left in state 2 (sometimes two) when the config server restarts; sometimes (rarely) it's the balancer lock itself, sometimes it's a collection lock. When the balancer lock is stuck it becomes impossible to start or stop the balancer, i.e. sh.stopBalancer() will time out. When a collection lock is stuck, subsequent attempts to move chunks fail, e.g. the following code fails:
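A minimal sketch of the kind of manual chunk move that fails while the collection lock is stuck (the namespace, shard name, and chunk key here are placeholders, not taken from the original repro):

// Run against a mongos; with a stale collection lock in config.locks
// this command errors out instead of migrating the chunk.
db.adminCommand({
    moveChunk: "test.foo",      // placeholder namespace
    find: { _id: 0 },           // any key falling in the chunk to move
    to: "shard0001"             // placeholder destination shard
})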
It results in aborted messages in the mongod.log files and aborted entries in the changelog collection. And while the balancer can start and stop in this state, it too is unable to move chunks, effectively disabling its ability to balance the cluster.
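One way to surface those aborted migrations from the shell, assuming it is connected to a mongos:

// Recent moveChunk entries in the config changelog; the aborted
// attempts appear here alongside any successful migrations.
db.getSiblingDB("config").changelog.find({ what: /^moveChunk/ }).sort({ time: -1 }).limit(10)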
| Participants: | |
| Description |
If the (primary) config server goes down while the balancer is running, it can result in stale locks that are not released even after the config server is brought back up. This can result in a failure to start or stop the balancer and an inability to perform moveChunk commands. Manual intervention is then required to verify that the locks are stale and to release them. There appear to be a few scenarios:
The behavior in 2.4.x vs 2.6.x is a little different:
While my reproduction steps involve taking down the config server, it is also likely that the same problem could occur if there are network issues between the cluster members and the config server.
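For reference, a sketch of the manual intervention mentioned above, run against a mongos once the config server is back up. Whether a given lock is genuinely stale has to be judged case by case, and the lock _id ("balancer") below is only an example:

// List locks still held (state 2) after the config server restart.
db.getSiblingDB("config").locks.find({ state: 2 })

// Once satisfied that no migration is actually in progress, release
// a stale lock by forcing its state back to 0 (unlocked).
db.getSiblingDB("config").locks.update(
    { _id: "balancer", state: 2 },
    { $set: { state: 0 } }
)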