[SERVER-19550] Balancer can cause cascading mongod failures during network partitions Created: 23/Jul/15  Updated: 06/Dec/22  Resolved: 09/Dec/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.0.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Travis Thieman Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Done Votes: 0
Labels: ShardingRoughEdges
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Reproduced with MongoDB 3.0.4 (WiredTiger) inside Docker and TCL 6.3


Attachments: mongo-cluster.yml, test.sh
Issue Links:
Related
Assigned Teams:
Sharding
Operating System: ALL
Steps To Reproduce:

Repro can be achieved using the Docker Compose spec and script attached to this ticket.

A public Gist of the same files is available here: https://gist.github.com/thieman/e28d426d3415c30caf38

Participants:

 Description   

Under the following conditions, the balancer can cause up to N mongod primary failures, where N is the number of collections currently eligible for balancing.

1. The balancer tries to move a chunk
2. During the moveChunk, the FROM primary enters the critical section
3. During the critical section, the FROM primary is partitioned from the first config server
4. The FROM primary crashes
5. If the balancing mongos is still able to reach the first config server, it releases its balancing lock
6. If step 5 happened, the balancer goes back to step 1 and picks up the next eligible collection

After step 4, the collection lock will still be held on the collection that was being balanced at the time of the crash. This lock will expire after 24 hours, but during that time the collection is no longer eligible for balancing.
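For anyone hitting this, the stuck collection lock is visible in config.locks on the config servers. A minimal pymongo sketch for finding locks that are still held; the connection string is a placeholder:

# Sketch: list distributed locks that are still held. A lock whose holder
# crashed mid-migration stays held until it is taken over or expires.
# The connection string below is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
config = client["config"]

# state 0 = unlocked; any non-zero state means the lock is held or being acquired
for lock in config.locks.find({"state": {"$ne": 0}}):
    print(lock["_id"], "held by", lock.get("who"),
          "since", lock.get("when"), "-", lock.get("why"))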

Critically, this failure cannot cascade to multiple nodes if the first config server is also partitioned from the balancing mongos. When this happens, the balancing mongos is unable to release its balancing lock, so balancing cannot continue. This is a good thing, as it prevents cascading primary failures.
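For reference, the balancing lock the mongos holds is just a document in config.locks with _id "balancer"; an operator can inspect it, and can stop the balancer outright so no further migrations are started while investigating. A minimal pymongo sketch (the connection string is a placeholder):

# Sketch: inspect the balancer lock and disable the balancer.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
config = client["config"]

# The balancer lock is a regular distributed lock document; a non-zero
# state means some mongos currently holds it.
print(config.locks.find_one({"_id": "balancer"}))

# Disable the balancer (equivalent to sh.stopBalancer() in the mongo shell)
# so no new chunk migrations can be scheduled.
config.settings.update_one({"_id": "balancer"},
                           {"$set": {"stopped": True}},
                           upsert=True)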

Since this failure can only happen if the balancing mongos is still able to read from and write to the config server, it seems like a more intelligent reading of the current lock state by that mongos would let us prevent the cascading failure. Perhaps something like this:

1. Balancing mongos realizes the migration has failed
2. Balancing mongos checks the collection lock on the collection it was just trying to migrate
3. If that lock is still held, mongos can reason that something has gone wrong and refuse to relinquish its balancing lock as a precaution

If the balancing mongos did this, it would prevent the cascading failure and give us 24 hours to respond to the issue before any other primaries could die.
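A rough illustration of that check, written as standalone Python against config.locks rather than as the actual mongos code; the namespace and connection string are placeholders:

# Illustration only: the proposed safeguard, expressed as a check against
# config.locks. In reality this logic would live inside the balancing
# mongos, not in an external script.
from pymongo import MongoClient

def should_release_balancer_lock(config_db, namespace):
    """Return False if the per-collection migration lock is still held,
    which suggests the FROM primary died mid-migration."""
    coll_lock = config_db.locks.find_one({"_id": namespace})
    if coll_lock and coll_lock.get("state", 0) != 0:
        # Something went wrong: keep holding the balancer lock so no
        # further migrations (and crashes) can be triggered.
        return False
    return True

client = MongoClient("mongodb://localhost:27017")
config = client["config"]

if not should_release_balancer_lock(config, "test.data"):
    print("collection lock still held after failed migration; "
          "keeping the balancer lock as a precaution")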



 Comments   
Comment by Sheeri Cabral (Inactive) [ 09/Dec/19 ]

This has gone away since the balancer moved to the config servers.
