[SERVER-8840] Release distributed locks if shard shuts itself down during a migration Created: 04/Mar/13  Updated: 06/Dec/22  Resolved: 12/Jul/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Spencer Brody (Inactive) Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File config.log     File moveChunk-crash.log    
Issue Links:
Duplicate
is duplicated by SERVER-24431 collection lock not release for mongo... Closed
Assigned Teams:
Sharding
Operating System: ALL
Participants:

 Description   

If a shard primary purposely shuts itself down during a migration (usually because of a problem communicating with the config servers while in the critical section), we should release all distributed locks while shutting down so that future migrations aren't blocked after the shard recovers (from RS failover and/or by manually bringing the primary back online).
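For illustration, a rough sketch of the state this leaves behind (field names and state values below follow the 2.x config.locks format from memory and may differ slightly; the namespace is hypothetical): the donor takes a distributed lock keyed by the collection namespace, and if the mongod exits while holding it, the entry stays locked, so later balancer rounds and manual moveChunk calls on that collection cannot acquire it.

    // mongo shell via mongos (or a config server); a sketch, not exact output
    use config
    db.locks.find({ _id: "mydb.mycoll" }).pretty()   // hypothetical namespace
    // A migration that died mid-flight tends to leave something like:
    // { _id: "mydb.mycoll",              // lock keyed by the collection namespace
    //   state: 2,                        // still "locked"; 0 would mean free
    //   process: "shardhost:27017:...",  // the mongod that shut itself down
    //   why: "migrate-{ _id: ... }" }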



 Comments   
Comment by Gregory McKeon (Inactive) [ 12/Jul/18 ]

As of 3.4, the config server now holds the balancer lock.

Comment by Spencer Brody (Inactive) [ 03/Apr/13 ]

Yep, I'm an idiot; it absolutely does time out after 15 minutes, and the lock can then be reclaimed.
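For reference, that takeover is driven by the holder's last ping: the lock document in config.locks names the owning process, and config.lockpings records when that process last pinged; a contender only forces the lock once that ping is older than the timeout (15 minutes in this era of the code). A quick way to see how stale the holder is (a sketch; collection and field names may differ slightly by version, and the namespace is hypothetical):

    use config
    var lock = db.locks.findOne({ _id: "mydb.mycoll", state: { $gt: 0 } })
    if (lock) {
        var ping = db.lockpings.findOne({ _id: lock.process })
        print("held by " + lock.process + ", last ping " +
              ((new Date() - ping.ping) / 60000).toFixed(1) + " minutes ago")
    }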

Per conversation with greg_10gen, modifying the focus of the ticket slightly to focus on releasing the lock when we purposely shut down mid-migration.

Comment by Eliot Horowitz (Inactive) [ 02/Apr/13 ]

All distributed locks time out, so why isn't the regular timeout working in this case?

Comment by Spencer Brody (Inactive) [ 02/Apr/13 ]

Another option for recovery would be for the shards to have a command that indicates whether they have any in-progress migrations, and the balancer could run that command against the shards involved in the last known migration when considering forcing the lock.
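Until such a command exists, a rough manual equivalent (a sketch; the shape of currentOp output varies by version) is to ask each shard involved in the last known migration whether a moveChunk is still in flight before deciding to force the lock:

    // Run against the primary of each shard named in the stale lock's "who"/"why" fields.
    var migrations = db.currentOp().inprog.filter(function (op) {
        return op.query && op.query.moveChunk;   // donor side shows an in-progress moveChunk command
    })
    printjson(migrations)   // an empty array suggests nothing is in flight on this shard
    // The recipient side can additionally be probed with the internal _recvChunkStatus command.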

Comment by John Sumsion [ 02/Apr/13 ]

This thread was useful in recovering from this:

Comment by John Sumsion [ 02/Apr/13 ]

I hit this while pre-splitting chunks when preparing for a mass-import. In my case the moveChunk() failed with an error like:

"exception: DBClientBase::findN: transport error: ip189:27017 ns: admin.$cmd query: { moveChunk: \"eureka.person\", from: \"rs12/ip191:27017,ip214:27017,ip189:27017\", to: \"rs05/ip175:27017,ip240:27017,ip120:27017\", fromShard: \"rs12\", toShard: \"rs05\", min: { _id: \"G3P-VHMM\" }, max: { _id: \"G4L-GY9M\" }, maxChunkSizeBytes: 67108864, shardId: \"eureka.person-_id_\"G3P-VHMM\"\", configdb: \"ip232:27019,ip211:27019,ip129:27019\", secondaryThrottle: false, waitForDelete: false }

After looking in the logs, I saw that ip189 had gone down during the moveChunk after a socket error communicating with one of the config nodes. I'll attach a log from the box that died and one from the config server.

I know this is just one of the possible situations that could cause this, but perhaps it's helpful in fixing this.
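For anyone recovering from the same state, the usual manual fix in this era has been to mark the stale lock unlocked so the balancer and moveChunk can make progress again. This is only a sketch, the namespace is the one from the error above, and it is only safe after confirming on both shards (for example via currentOp) that no migration is actually still running:

    use config
    // DANGER: only do this once you are sure the migration is truly dead on both shards.
    db.locks.update({ _id: "eureka.person", state: { $gt: 0 } }, { $set: { state: 0 } })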
