[SERVER-26116] CSRS leaves the balancer lock unprotected briefly between leaving drain mode and the balancer acquiring it Created: 14/Sep/16  Updated: 19/Nov/16  Resolved: 15/Nov/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.4.0-rc4

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Nathan Myers
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2016-10-31, Sharding 2016-11-21
Participants:

 Description   

Drain mode releases all of the distributed locks held by the config server. The Balancer reacquires the {_id: "balancer"} distlock only after drain mode has been exited.

Theoretically, a 3.2 mongos could take the {_id: "balancer"} distlock in this window. It could hold the distlock briefly for a manual moveChunk, or indefinitely if auto-balancing is enabled. The window could also stay open indefinitely if the config server is holding collection distlocks for migrations it is trying to recover after a failover.



 Comments   
Comment by Githook User [ 15/Nov/16 ]

Author:

Dianna Hohensee <dianna.hohensee@10gen.com> (DiannaHohensee)

Message: SERVER-26116 reacquire the balancer distlock in drain mode during config primary step-up
Branch: master
https://github.com/mongodb/mongo/commit/eacdb58313a1b464e89c44868527fcadc22a67a6

Comment by Kaloian Manassiev [ 02/Nov/16 ]

The proposed way to fix this is to try to acquire the "balancer" distributed lock as the first step after the migration manager enters the recovering state. This acquisition must be done with a local write concern, since at that point the server is still in drain mode.

If this acquisition fails for any reason, there are a couple of cases to be considered:

  • There are no migration documents - in this case the failure is harmless, since there are no active migrations to recover. It most likely happened because a 3.2 mongos (after an upgrade) held a lease on the lock; that lease will expire after 15 minutes, at which point the balancer thread will acquire the lock, guaranteeing that no new migrations occur until then.
  • There are some migration documents - this is the unexpected case and can only happen due to a local write problem, which is a pretty severe condition in itself. In this case, just log a warning and continue recovery, since there is not much that can be done.
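The failure handling above can be sketched as a small decision function. This is an illustrative sketch only, not the actual server code; the names (onStepUpRecovery, MigrationDoc, RecoveryAction) are hypothetical and stand in for the real migration manager internals.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for a config.migrations document.
struct MigrationDoc {
    std::string ns;
};

enum class RecoveryAction {
    Proceed,           // lock acquired; recover active migrations
    HarmlessAbort,     // lock not acquired, nothing to recover
    WarnAndContinue,   // lock not acquired despite pending migrations
};

// Decide how step-up recovery should proceed, given whether the "balancer"
// distlock was acquired (with local write concern) and which migration
// documents exist. Mirrors the two failure cases described above.
RecoveryAction onStepUpRecovery(bool lockAcquired,
                                const std::vector<MigrationDoc>& migrationDocs) {
    if (lockAcquired) {
        // The balancer distlock is now held for the rest of drain mode.
        return RecoveryAction::Proceed;
    }
    if (migrationDocs.empty()) {
        // Harmless: most likely a 3.2 mongos lease on the lock, which will
        // expire after ~15 minutes; the balancer thread acquires it then.
        return RecoveryAction::HarmlessAbort;
    }
    // Unexpected: likely a local write problem. Log a warning and continue
    // recovery, since there is not much else that can be done.
    std::cerr << "warning: failed to acquire balancer distlock with "
              << migrationDocs.size() << " migration document(s) pending\n";
    return RecoveryAction::WarnAndContinue;
}
```

The key design point is that acquisition happens before any migration recovery work, so the window between leaving drain mode and the balancer taking the lock no longer exists.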