[SERVER-24431] Collection lock not released on mongod failure Created: 07/Jun/16  Updated: 08/Aug/16  Resolved: 08/Aug/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.0.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: bob whitehurst Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-8840 Release distributed locks if shard sh... Closed
Operating System: ALL
Participants:

 Description   

I found a few entries that were similar to this but not quite the same.

Configuration:
5 shards
3 config
1 mongos

A user had a runaway process that was inserting far too many documents into a collection. Everything was working properly until one of the shards ran out of disk space. When the mongod instance on that shard went down, it was holding a collection lock for a migration. After freeing some disk space and restarting the mongod instance, sh.status() indicated that the balancer was running, but chunks were not being migrated.

After doing some reading and searching, it appeared that the problem was related to the locks. When I looked at the locks in the config database, I found that two locks were being held (state = 2): one on the balancer and one on a collection. The description on the collection lock indicated that it was a migration lock held by the shard that went down. After setting the lock state to 0 for both of these entries, the balancer resumed normal operation and started migrating chunks. (I may also have had to restart the mongod on some of the shards, but I am not sure.) The commands I used are sketched below.
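
For reference, the commands were roughly the following (reconstructed from memory; "mydb.mycoll" is a placeholder for the actual lock _id, and the field names are per the 3.0-era config.locks schema):

    // From a mongo shell connected to the mongos
    var cfg = db.getSiblingDB("config");

    // List currently held locks (state: 2 means "held")
    cfg.locks.find({ state: 2 }).pretty();

    // Force-release the balancer lock and the collection's migration lock.
    // Substitute the actual _id reported by the find() above; only do this
    // once the mongod that held the locks is confirmed to be down.
    cfg.locks.update({ _id: "balancer" }, { $set: { state: 0 } });
    cfg.locks.update({ _id: "mydb.mycoll" }, { $set: { state: 0 } });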

Seems like there should be some sort of recovery for the condition where a shard fails while holding a lock.



 Comments   
Comment by Kelsey Schubert [ 08/Aug/16 ]

Hi bmwmaestoso,

Thank you for your patience; my first reproduction attempts were unlucky in that the kill signal arrived while no migration lock was held. I have now reproduced this issue and identified that it is tracked in SERVER-8840. Please note that after 15 minutes these locks should be released automatically, so no additional workaround is required. Feel free to vote for SERVER-8840 and watch it for updates.
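
If you ever want to confirm whether a held lock is stale rather than active, a query along these lines can help (a sketch assuming the 3.0-era config.locks and config.lockpings schema):

    // From a mongo shell connected to a mongos
    var cfg = db.getSiblingDB("config");

    // For each held lock, show when the holding process last pinged;
    // a ping older than roughly 15 minutes makes the lock eligible
    // for takeover.
    cfg.locks.find({ state: 2 }).forEach(function(lock) {
        var ping = cfg.lockpings.findOne({ _id: lock.process });
        print(lock._id + " held by " + lock.process +
              ", last ping: " + (ping ? ping.ping : "none"));
    });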

Thanks again,
Thomas

Comment by bob whitehurst [ 27/Jun/16 ]

Trying to get logs out of our environment is effectively impossible, as this is a secure government facility; it would require a whole review process. I can copy data out manually as long as it isn't overwhelming. Regardless, I don't have those logs anymore. I don't have a problem with it if you can't recreate the issue. I know what to check for now, and I know how to fix the problem. It would just be nice if it didn't happen.

It seems like you could create the condition manually and then see what happens when you restart the shard. This condition could arise any time something brings down the process without any kind of controlled shutdown, such as a loss of power, an OOM kill, or a SIGSEGV.
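
For what it's worth, a rough way to stage the condition on a throwaway cluster might look like this (the namespace and shard name are placeholders):

    // Shell 1, connected to a mongos: start a chunk migration.
    // sh.moveChunk() blocks until the migration completes.
    sh.moveChunk("mydb.mycoll", { _id: 42 }, "shard0001");

    // While it is running, kill -9 the donor shard's mongod from the OS,
    // then restart it and look for a lock left behind in the held state:
    db.getSiblingDB("config").locks.find({ state: 2 }).pretty();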

Comment by Kelsey Schubert [ 17/Jun/16 ]

Hi bmwmaestoso,

Thanks for reporting this issue. Unfortunately, I haven't been able to reproduce this issue yet. To help our investigation, would you please attach the logs of the shard when it ran out of disk space?

Thank you,
Thomas
