[SERVER-8840] Release distributed locks if shard shuts itself down during a migration — Created: 04/Mar/13, Updated: 06/Dec/22, Resolved: 12/Jul/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Sharding |
| Operating System: | ALL |
| Participants: | |
| Description |
If a shard primary purposely shuts itself down during a migration (usually because of a problem communicating with the config servers while in the critical section), we should release all distributed locks while shutting down, so that future migrations aren't blocked after the shard recovers (via replica set failover and/or by manually bringing the primary back online).
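On pre-3.4 clusters, where distributed locks live in the config servers' `config.locks` collection, the stuck state can be inspected (and, as a last resort, cleared by hand) once the aborted migration is known to be dead. Below is a minimal PyMongo sketch, assuming a mongos at `localhost:27017` and the `config.locks` document shape of that era (`_id`, `state`, `process`, `who`, `why`); the namespace is a placeholder.

```python
from pymongo import MongoClient

# Connect through a mongos; the config database is reachable from there.
client = MongoClient("mongodb://localhost:27017")  # assumed mongos address
locks = client["config"]["locks"]

# List locks that are still held (state 2 = locked in pre-3.4 versions).
for lock in locks.find({"state": 2}):
    print(lock["_id"], lock.get("process"), lock.get("who"), lock.get("why"))

# Last-resort manual release for a lock whose owner is known to be gone,
# e.g. the collection lock left behind by an aborted moveChunk. Only safe
# once you are sure no migration for that namespace is still running.
ns = "mydb.mycoll"  # hypothetical namespace involved in the failed migration
locks.update_one({"_id": ns, "state": 2}, {"$set": {"state": 0}})
```

Forcing the state back to 0 by hand is exactly the kind of manual step that releasing the locks at shutdown would make unnecessary.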
| Comments |
| Comment by Gregory McKeon (Inactive) [ 12/Jul/18 ] |
As of 3.4, the config server now holds the balancer lock.
| Comment by Spencer Brody (Inactive) [ 03/Apr/13 ] |
Yep, I'm an idiot, it absolutely does time out after 15 minutes and reclaim the lock. Per conversation with greg_10gen, shifting the focus of the ticket slightly to releasing the lock when we purposely shut down mid-migration.
| Comment by Eliot Horowitz (Inactive) [ 02/Apr/13 ] |
All distributed locks time out; why isn't the regular timeout working in this case?
| Comment by Spencer Brody (Inactive) [ 02/Apr/13 ] |
Another option for recovery: give the shards a command that reports whether they have any migrations in progress, and have the balancer run that command against the shards involved in the last known migration before deciding to force the lock.
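A rough sketch of that balancer-side check follows; the `activeMigrations` command name and the shape of its reply are hypothetical, invented here only to illustrate the proposed protocol.

```python
from pymongo import MongoClient

def safe_to_force_lock(shard_hosts, ns):
    """Return True only if none of the shards involved in the last known
    migration of `ns` report an in-progress migration for it.

    `activeMigrations` is a hypothetical command standing in for the
    shard-side check proposed above; it does not exist in the server
    as of this ticket.
    """
    for host in shard_hosts:
        shard = MongoClient(host)
        try:
            reply = shard.admin.command("activeMigrations")  # hypothetical
        except Exception:
            # If a shard cannot be reached, err on the side of caution
            # and do not force the lock.
            return False
        if any(m.get("ns") == ns for m in reply.get("migrations", [])):
            return False
    return True

# The balancer would call this before forcing a stale distributed lock:
# if safe_to_force_lock(["shardA.example.net:27018"], "mydb.mycoll"):
#     ...force/overtake the lock...
```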
| Comment by John Sumsion [ 02/Apr/13 ] |
This thread was useful in recovering from this:
| Comment by John Sumsion [ 02/Apr/13 ] |
I hit this while pre-splitting chunks in preparation for a mass import. In my case the moveChunk() failed with an error like:
After looking in the logs, I saw that ip189 had gone down during the moveChunk after a socket error communicating with one of the config nodes. I'll attach a log from the box that died and one from the config server. I know this is just one of the possible situations that could cause this, but perhaps it's helpful in fixing this.
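For reference, the kind of pre-split loop that runs into this looks roughly like the PyMongo sketch below; the namespace, shard key values, and shard names are placeholders, and the `split` (with `middle`) and `moveChunk` (with `find`/`to`) admin commands are issued through mongos.

```python
from pymongo import MongoClient
from pymongo.errors import OperationFailure

client = MongoClient("mongodb://localhost:27017")   # assumed mongos address
ns = "mydb.mycoll"                                   # hypothetical sharded namespace
shards = ["shard0000", "shard0001"]                  # hypothetical shard names

for i, split_point in enumerate(range(1000, 100000, 1000)):
    try:
        # Pre-split at the chosen shard key value...
        client.admin.command("split", ns, middle={"user_id": split_point})
        # ...and spread the resulting chunks across the shards.
        client.admin.command("moveChunk", ns,
                             find={"user_id": split_point},
                             to=shards[i % len(shards)])
    except OperationFailure as exc:
        # A moveChunk aborted because the donor lost contact with the config
        # servers can leave the distributed lock held, which is the situation
        # described above; surface the error and stop rather than retry blindly.
        print("chunk operation failed at", split_point, ":", exc)
        break
```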