[SERVER-19467] Race in sh.startBalancer() Created: 17/Jul/15  Updated: 06/Dec/22  Resolved: 30/Nov/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Won't Fix Votes: 0
Labels: PM229, balancer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-21766 Remove waiting for balancer lock beha... Closed
Assigned Teams:
Sharding
Operating System: ALL
Participants:

 Description   

sh.startBalancer() updates the balancer document in config.settings, then watches for a change in the timestamp of the balancer lock. The sequence is:

  1. setBalancerState to true
  2. Get balancer lock document
  3. Extract the timestamp
  4. assert.soon (up to 30s by default) watching for the timestamp to change

If the balancer manages to start and take the lock between #1 and #2, and starts doing a non-trivial chunk migration, then the timeout will occur, because the timestamp extracted in #3 is already the post-acquisition one and does not change again within the timeout. By contrast, if the balancer is slower and the lock is taken after #2, then this will not happen, even if there is a non-trivial chunk migration.
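The two interleavings can be illustrated with a small, deterministic simulation. This is only a sketch: it uses a logical tick counter instead of real time, and it assumes the lock timestamp changes only when the lock is acquired. The function simulateStartBalancer and its parameters (lockTakenAt, lockReadAt, migrationTicks, timeoutTicks) are hypothetical names invented for illustration, not part of the shell.

```javascript
// Sketch only: models the lock timestamp with a logical clock.
// Assumption: the timestamp changes only when the balancer (re)acquires the lock.
function simulateStartBalancer(opts) {
  const { lockTakenAt, lockReadAt, migrationTicks, timeoutTicks } = opts;
  let ts = 0;          // timestamp stored in the (simulated) lock document
  let seenTs = null;   // timestamp extracted by the shell in step 3
  const nextAcquireAt = lockTakenAt + migrationTicks; // next round, after the migration
  const deadline = lockReadAt + timeoutTicks;         // stand-in for the assert.soon timeout

  for (let tick = 0; tick <= deadline; tick++) {
    // Balancer side: each acquisition writes a new timestamp.
    if (tick === lockTakenAt || tick === nextAcquireAt) {
      ts += 1;
    }
    // Shell side, steps 2 and 3: read the lock document and keep its timestamp.
    if (tick === lockReadAt) {
      seenTs = ts;
      continue;
    }
    // Shell side, step 4: poll until the timestamp differs from the one read.
    if (tick > lockReadAt && ts !== seenTs) {
      return "ok";
    }
  }
  return "timeout";
}

// Balancer takes the lock before the shell reads it, then migrates for a long time.
console.log(simulateStartBalancer(
  { lockTakenAt: 1, lockReadAt: 2, migrationTicks: 100, timeoutTicks: 30 })); // "timeout"

// Balancer takes the lock just after the shell reads it; same long migration.
console.log(simulateStartBalancer(
  { lockTakenAt: 3, lockReadAt: 2, migrationTicks: 100, timeoutTicks: 30 })); // "ok"
```

In the first call the shell's baseline is already the new timestamp, so nothing changes until the migration ends, which is after the deadline. In the second call the acquisition itself is the change the shell is waiting for.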

The impact of this is low, although the apparent failure of sh.startBalancer() (despite the balancer clearly working) is often confusing (and there are other reasons it can happen).

A better approach would be to read the lock document before enabling the balancer, and then pass it through to the (eventual) assert.soon.
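The difference between the two orderings can be sketched with an in-memory stand-in for the config data. Everything here is hypothetical and for illustration only: makeCluster, balancerRound, startCurrentOrder, startProposedOrder and sawChange are invented names, the object shape is not the real config.settings or config.locks schema, and the timestamp is assumed to change only when the lock is acquired.

```javascript
// Sketch only: hypothetical in-memory stand-in, not the real config schema.
function makeCluster() {
  return { balancerEnabled: false, lock: { ts: 0 } };
}

// Simulated balancer round: if enabled, acquire the lock, which writes a new timestamp.
function balancerRound(cluster) {
  if (cluster.balancerEnabled) {
    cluster.lock.ts += 1;
  }
}

// Current order: enable first, then read the baseline timestamp.
// `interleave` runs between the two steps to model the balancer winning the race.
function startCurrentOrder(cluster, interleave) {
  cluster.balancerEnabled = true;   // step 1
  interleave(cluster);              // balancer may take the lock here
  return cluster.lock.ts;           // steps 2 and 3
}

// Proposed order: read the baseline timestamp first, then enable.
function startProposedOrder(cluster, interleave) {
  const baseline = cluster.lock.ts; // read before the balancer can run
  cluster.balancerEnabled = true;
  interleave(cluster);
  return baseline;
}

// Stand-in for one poll of step 4: has the timestamp moved since the baseline?
function sawChange(cluster, baseline) {
  return cluster.lock.ts !== baseline;
}

const a = makeCluster();
const baselineA = startCurrentOrder(a, balancerRound);
console.log(sawChange(a, baselineA)); // false: the acquisition was missed

const b = makeCluster();
const baselineB = startProposedOrder(b, balancerRound);
console.log(sawChange(b, baselineB)); // true: the acquisition is observed
```

With the baseline taken before the balancer is enabled, any acquisition that follows differs from the baseline, so the wait no longer depends on who wins the race.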

sh.stopBalancer() doesn't have this problem, because it waits for the balancer lock state to go to false.

