[SERVER-24431] collection lock not released after mongod failure Created: 07/Jun/16 Updated: 08/Aug/16 Resolved: 08/Aug/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.0.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | bob whitehurst | Assignee: | Kelsey Schubert |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Operating System: | ALL |
| Participants: | |
| Description |
|
I found a few existing entries that were similar to this but not quite the same.

Configuration: a user had a runaway process that was inserting far too many documents into a collection. Everything was working properly until we ran out of disk space on one of the shards. When the mongod instance on that shard went down, it was holding a collection lock for a migration. After freeing some disk space and restarting the mongod instance, sh.status() indicated that the balancer was running, but chunks were not being migrated.

After some reading and searching, it appeared that the problem was related to the distributed locks. When I looked at the locks in the config database, I found that two locks were being held (state = 2): one on the balancer and one on a collection. The description on the collection lock indicated that it was a migration lock held by the shard that went down. After setting the lock state to 0 for both of these entries (a sketch of those commands follows the description), the balancer resumed normal operations and started to migrate chunks. I may have also had to restart the mongod or some of the shards, but I am not sure.

It seems like there should be some sort of recovery for the condition where a shard fails while holding a lock.
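The following is a minimal sketch of that inspection and cleanup, run from a legacy mongo shell connected to a mongos. The collection name test.events is a placeholder; use the _id of the stale lock entry found in your own config.locks collection.

```javascript
// Look at the distributed locks held in the config database.
var configDB = db.getSiblingDB("config");

// 1. Find locks that are still held (state: 2). In this scenario there were
//    two: the "balancer" lock and a per-collection migration lock.
configDB.locks.find({ state: 2 }).pretty();

// 2. If the "process"/"who" fields point at the shard that went down, the
//    lock is stale. Force it back to the unlocked state (0), but only while
//    that shard is definitely not running a migration.
configDB.locks.update({ _id: "balancer", state: 2 }, { $set: { state: 0 } });
configDB.locks.update({ _id: "test.events", state: 2 }, { $set: { state: 0 } });
```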
| Comments |
| Comment by Kelsey Schubert [ 08/Aug/16 ] |
|
Hi bmwmaestoso, Thank you for your patience; my first reproduction attempts were unlucky in that the kill signal arrived while no migration lock was held. I have since reproduced this issue and identified that it is currently tracked in the linked duplicate ticket. Thanks again, |
| Comment by bob whitehurst [ 27/Jun/16 ] |
|
Trying to get logs out of our environment is impossible, as this is a secure government facility; it would require a whole review process. I could manually copy the data as long as it isn't overwhelming, but regardless, I don't have these logs anymore.

I don't have a problem if you can't recreate the issue. I know what to check for now and I know how to fix the problem; it would just be nice if it didn't happen. It seems like you could create the condition manually and then see what happens when you start the shard (sketched after this comment). This condition could happen at any time from any action that brings the process down without any kind of controlled shutdown, such as a loss of power, an OOM kill, or a SIGSEGV.
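A hedged sketch of that manual reproduction idea, intended for a disposable test cluster: the namespace test.events, the destination shard name shard0001, and the chunk key { _id: 0 } are placeholders, and the uncontrolled kill has to be done from the operating system rather than the shell.

```javascript
// 1. Stop the balancer so the migration is under manual control.
sh.stopBalancer();

// 2. Start a manual chunk migration; sh.moveChunk() blocks in this shell
//    until the migration finishes or fails.
sh.moveChunk("test.events", { _id: 0 }, "shard0001");

// 3. While step 2 is in flight, kill the donor shard's mongod uncleanly from
//    another terminal (e.g. kill -9 <pid>) to simulate power loss, an OOM
//    kill, or a SIGSEGV.

// 4. Restart the shard, then check whether a stale migration lock remains.
db.getSiblingDB("config").locks.find({ state: 2 }).pretty();
```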
| Comment by Kelsey Schubert [ 17/Jun/16 ] |
|
Hi bmwmaestoso, Thanks for reporting this issue. Unfortunately, I haven't been able to reproduce this issue yet. To help our investigation, would you please attach the logs from the shard that ran out of disk space? Thank you, |