[SERVER-19550] Balancer can cause cascading mongod failures during network partitions Created: 23/Jul/15 Updated: 06/Dec/22 Resolved: 09/Dec/19
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.0.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Travis Thieman | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Done | Votes: | 0 |
| Labels: | ShardingRoughEdges |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Reproduced with MongoDB 3.0.4 (WiredTiger) inside Docker and TCL 6.3 |
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Sharding |
| Operating System: | ALL |
| Steps To Reproduce: | A repro can be achieved using the Docker Compose spec and script attached to this ticket. A public Gist of the same files is available here: https://gist.github.com/thieman/e28d426d3415c30caf38 |
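The attached Compose spec and script are not reproduced inline. As a rough sketch of one way such a partition can be injected from the host (the container and network names below are hypothetical placeholders, not taken from the Gist), assuming Python and the Docker CLI:

```python
# Rough sketch of partition injection via the Docker CLI.
# Container and network names are illustrative, not from the repro Gist.
import subprocess
import time

NETWORK = "mongo_cluster"          # assumed Docker network name
SHARD_PRIMARY = "shard0_primary"   # assumed container name of the donor primary


def partition(container: str, network: str) -> None:
    """Cut the container off from the cluster network."""
    subprocess.run(["docker", "network", "disconnect", network, container],
                   check=True)


def heal(container: str, network: str) -> None:
    """Reattach the container to the cluster network."""
    subprocess.run(["docker", "network", "connect", network, container],
                   check=True)


if __name__ == "__main__":
    # Partition the donor shard primary mid-migration, wait, then heal.
    partition(SHARD_PRIMARY, NETWORK)
    time.sleep(60)  # long enough for the in-flight migration to fail
    heal(SHARD_PRIMARY, NETWORK)
```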
| Participants: | |
| Description |
Under the following conditions, the balancer can cause up to N mongod primary failures, where N is the number of collections currently eligible for balancing.

1. The balancer tries to move a chunk

After step 4, the collection lock will still be held on the collection that was being balanced at the time of the crash. This lock will expire after 24 hours, but during that time the collection is no longer eligible for balancing.

Critically, this failure cannot cascade to multiple nodes if the first config server is also partitioned from the balancing mongos. When that happens, the balancing mongos is unable to release its balancing lock, so balancing cannot continue. This is a good thing, as it prevents cascading primary failures.

Since this failure can only happen while the balancing mongos is still able to read from and write to the config server, it seems that a more intelligent reading of the current lock state by that mongos would allow us to prevent the cascading failure. Perhaps something like this:

1. The balancing mongos realizes the migration has failed

If the balancing mongos did this, it would prevent the cascading failure and give us 24 hours to respond to the issue before any other primaries could die.
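The ticket proposes this check only in prose. A minimal sketch of what the lock-state inspection could look like, assuming direct PyMongo access to the config database (the hostname, helper names, and the skip-list behavior below are illustrative assumptions, not the server's actual implementation; state 2 in config.locks means the lock is granted):

```python
# Sketch of the proposed pre-migration lock check against config.locks.
# Hostnames and helper names are hypothetical; the 24-hour window is the
# lock-expiry period described in this ticket.
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

LOCK_TIMEOUT = timedelta(hours=24)  # collection lock expiry from the report

# tz_aware=True so the lock's "when" timestamp comes back timezone-aware
client = MongoClient("mongodb://config-server:27019", tz_aware=True)
locks = client.config.locks


def collection_lock_held(namespace: str) -> bool:
    """True if the distributed lock for this collection is still granted
    (state 2) and has not yet aged past the expiry window."""
    doc = locks.find_one({"_id": namespace, "state": 2})
    if doc is None:
        return False
    return datetime.now(timezone.utc) - doc["when"] < LOCK_TIMEOUT


def eligible_for_balancing(namespaces):
    """Proposed behavior: skip any collection whose lock is stuck rather
    than retrying its migration and risking another primary."""
    return [ns for ns in namespaces if not collection_lock_held(ns)]
```

Skipping a still-locked collection instead of retrying its migration matches the ticket's goal: the stuck lock then buys the 24-hour window to respond before any other primary can be taken down.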
| Comments |
| Comment by Sheeri Cabral (Inactive) [ 09/Dec/19 ] |
This issue has gone away since the balancer has moved to the config servers.